{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tuning LightGBM Hyperparameters (scikit-learn API)\n", "\n", "This notebook demonstrates how to use FCVOpt to tune LightGBM hyperparameters for a binary classification task with class imbalance.\n", "\n", "LightGBM is a gradient boosting framework that uses histogram-based algorithms for fast training. It exposes a scikit-learn–compatible API (`LGBMClassifier`), which means we can plug it directly into `SklearnCVObj` without any custom wrapper code.\n", "\n", "The notebook follows the same three-step workflow as the introduction:\n", "\n", "```\n", "1. Define a Cross-Validation Objective ← wrap LightGBM + data + metric\n", " ↓\n", "2. Define a Hyperparameter Search Space ← which knobs to tune and over what ranges\n", " ↓\n", "3. Run the Optimizer ← let FCVOpt find the best configuration\n", "```\n", "\n", "**Requirements**: `lightgbm` must be installed (`pip install lightgbm`)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: Set OpenMP threads to 1 to avoid threading conflicts on MacOS\n", "import os\n", "os.environ['OMP_NUM_THREADS'] = '1'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import lightgbm as lgb\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import roc_auc_score\n", "\n", "# FCVOpt imports\n", "from fcvopt.crossvalidation import SklearnCVObj\n", "from fcvopt.optimizers import FCVOpt\n", "from fcvopt.configspace import ConfigurationSpace\n", "from ConfigSpace import Integer, Float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating the Data\n", "\n", "We create a synthetic binary classification dataset with **strong class imbalance** (90% negative, 10% positive) to simulate a realistic scenario where ROC-AUC is a more informative metric than accuracy. 
The dataset has 25 features: only 5 are truly informative, 10 are redundant linear combinations of those, and the remaining 10 are noise." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of features matrix: (2000, 25)\n", "Class distribution: [1796 204]\n" ] } ], "source": [ "# Generate binary classification dataset with class imbalance (90% vs 10%)\n", "X, y = make_classification(\n", " n_samples=2000,\n", " n_features=25,\n", " n_informative=5,\n", " n_redundant=10,\n", " n_classes=2,\n", " n_clusters_per_class=2,\n", " weights=[0.9, 0.1], # imbalanced classes\n", " random_state=23\n", ")\n", "\n", "print(f\"Shape of features matrix: {X.shape}\")\n", "print(f\"Class distribution: {np.bincount(y)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Define the Cross-Validation Objective\n", "\n", "The CV objective bundles together everything needed to evaluate a hyperparameter configuration:\n", "\n", "- **Estimator** — `LGBMClassifier` with binary objective\n", "- **Data** — the features and labels (`X`, `y`)\n", "- **Loss metric** — `1 - ROC-AUC` (we minimize loss, so we convert the AUC we want to maximize)\n", "- **CV scheme** — 10-fold stratified CV to preserve the class imbalance ratio in each fold\n", "\n", "Setting `needs_proba=True` tells `SklearnCVObj` to call `predict_proba` and pass the positive-class probability scores to the loss function, which is what `roc_auc_score` expects." 
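, "\n", "\n", "To make this concrete, evaluating one configuration on one fold amounts to roughly the following. This is an illustrative sketch in plain scikit-learn and LightGBM, not the actual `SklearnCVObj` internals, and the split shown here need not match the folds `SklearnCVObj` generates:\n", "\n", "```python\n", "# Illustrative sketch: evaluate one configuration on a single stratified fold\n", "from sklearn.model_selection import StratifiedKFold\n", "\n", "skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n", "train_idx, test_idx = next(iter(skf.split(X, y)))\n", "\n", "model = lgb.LGBMClassifier(objective='binary', verbosity=-1)\n", "model.fit(X[train_idx], y[train_idx])\n", "\n", "# needs_proba=True corresponds to scoring on positive-class probabilities\n", "proba = model.predict_proba(X[test_idx])[:, 1]\n", "fold_loss = 1 - roc_auc_score(y[test_idx], proba)\n", "```"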
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Configuration space object:\n", " Hyperparameters:\n", " colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale\n", " learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale\n", " min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale\n", " n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale\n", " num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale\n", "\n" ] } ], "source": [ "# Create configuration space for hyperparameter search\n", "config = ConfigurationSpace()\n", "\n", "# Add hyperparameters with appropriate ranges and scales\n", "# Note: LightGBM sklearn API uses 'num_round' -> number of estimators\n", "config.add([\n", " Integer('n_estimators', bounds=(50, 1000), log=True),\n", " Float('learning_rate', bounds=(1e-3, 0.25), log=True),\n", " Integer('num_leaves', bounds=(2, 128), log=True),\n", " Integer('min_data_in_leaf', bounds=(2, 100), log=True),\n", " Float('colsample_bytree', bounds=(0.05, 1), log=True),\n", "])\n", "print(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Define the Hyperparameter Search Space\n", "\n", "We tune five LightGBM hyperparameters that collectively control model capacity, regularization, and feature sampling. 
All parameters are searched on a log scale because their effects are roughly multiplicative.\n", "\n", "| Hyperparameter | Range | Scale | Description |\n", "|---|---|---|---|\n", "| `n_estimators` | [50, 1000] | Log | Number of boosting rounds (trees) |\n", "| `learning_rate` | [1e-3, 0.25] | Log | Shrinkage applied to each tree's contribution |\n", "| `num_leaves` | [2, 128] | Log | Max leaves per tree; controls model complexity |\n", "| `min_data_in_leaf` | [2, 100] | Log | Min samples per leaf; acts as regularization |\n", "| `colsample_bytree` | [0.05, 1.0] | Log | Fraction of features sampled per tree |" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of CV folds: 10\n", "Training samples: 2000\n", "Features: 25\n" ] } ], "source": [ "# Define loss metric: maximize AUC → minimize (1 - AUC)\n", "def auc_loss(y_true, y_pred):\n", " return 1 - roc_auc_score(y_true, y_pred)\n", "\n", "# Create CV objective that wraps the LightGBM classifier\n", "cv_obj = SklearnCVObj(\n", " estimator=lgb.LGBMClassifier(objective=\"binary\", verbosity=-1),\n", " X=X,\n", " y=y,\n", " loss_metric=auc_loss,\n", " needs_proba=True,\n", " task='classification',\n", " n_splits=10, # 10-fold cross-validation\n", " rng_seed=42,\n", " stratified=True # preserve class imbalance ratio across folds\n", ")\n", "\n", "print(f\"Number of CV folds: {cv_obj.cv.get_n_splits()}\")\n", "print(f\"Training samples: {len(cv_obj.y)}\")\n", "print(f\"Features: {cv_obj.X.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Initialize and Run the Optimizer\n", "\n", "With the objective and search space in place, we create an `FCVOpt` instance and run 50 optimization trials. At each trial, FCVOpt:\n", "\n", "1. Uses the hierarchical GP to select the most promising hyperparameter configuration\n", "2. Picks a single CV fold to evaluate (instead of all 10)\n", "3. 
Updates the GP with the new observation and repeats\n", "\n", "This means 50 trials require only ~50 model fits rather than the 500 that full 10-fold CV would demand.\n", "\n", "**Note on acquisition functions**: `'EI'` (Expected Improvement) is not supported with fractional CV, so this notebook uses `'LCB'` (Lower Confidence Bound) instead." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Number of candidates evaluated.....: 50\n", "Single-fold observed loss (best)...: 0.0511111\n", "Estimated full CV loss (best)......: 0.0697211\n", "\n", " Best configuration at termination:\n", " Configuration(values={\n", " 'colsample_bytree': 0.1543087512749,\n", " 'learning_rate': 0.001,\n", " 'min_data_in_leaf': 34,\n", " 'n_estimators': 718,\n", " 'num_leaves': 61,\n", "})\n" ] } ], "source": [ "# Initialize FCVOpt optimizer\n", "optimizer = FCVOpt(\n", " obj=cv_obj.cvloss,\n", " n_folds=cv_obj.cv.get_n_splits(),\n", " config=config,\n", " acq_function='LCB',\n", " tracking_dir='./hpt_opt_runs/',\n", " experiment='lgb_sklearn_tuning',\n", " seed=123\n", ")\n", "\n", "# Run optimization for 50 trials\n", "best_conf = optimizer.optimize(n_trials=50)\n", "\n", "# Close the MLflow run\n", "optimizer.end_run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the Best Configuration on All 10 Folds\n", "\n", "The losses observed during optimization come from single folds, so we confirm the winning configuration with a full 10-fold evaluation." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10-fold CV Loss: 0.0906\n", "10-fold CV ROC-AUC: 0.9094\n" ] } ], "source": [ "# Evaluate the best configuration found by FCVOpt\n", "# Convert loss back to AUC for easier interpretation (loss = 1 - AUC)\n", "best_cv_loss = cv_obj(best_conf)\n", "best_cv_auc = 1 - best_cv_loss\n", "\n", "print(f\"10-fold CV Loss: {best_cv_loss:.4f}\")\n", "print(f\"10-fold CV ROC-AUC: {best_cv_auc:.4f}\")" ] } ], "metadata": { 
"kernelspec": { "display_name": "fcvopt_test (3.10.19)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 4 }