Tuning LightGBM Hyperparameters (scikit-learn API)

This notebook demonstrates how to use FCVOpt to tune LightGBM hyperparameters for a binary classification task with class imbalance.

LightGBM is a gradient boosting framework that uses histogram-based algorithms for fast training. It exposes a scikit-learn–compatible API (LGBMClassifier), which means we can plug it directly into SklearnCVObj without any custom wrapper code.

The notebook follows the same three-step workflow as the introduction:

1. Define a Cross-Validation Objective   ←  wrap LightGBM + data + metric
         ↓
2. Define a Hyperparameter Search Space  ←  which knobs to tune and over what ranges
         ↓
3. Run the Optimizer                     ←  let FCVOpt find the best configuration

Requirements: lightgbm must be installed (pip install lightgbm).

[1]:
# Note: Set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'
[2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# FCVOpt imports
from fcvopt.crossvalidation import SklearnCVObj
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float

Generating the Data

We create a synthetic binary classification dataset with strong class imbalance (90% negative, 10% positive) to simulate a realistic scenario where ROC-AUC is a more informative metric than accuracy. The dataset has 25 features, of which only 5 are truly informative and 10 are linear combinations of those.

[3]:
# Generate binary classification dataset with class imbalance (90% vs 10%)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)

print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
Shape of features matrix: (2000, 25)
Class distribution: [1796  204]

Step 1: Define the Cross-Validation Objective

The CV objective bundles together everything needed to evaluate a hyperparameter configuration:

  • Estimator — LGBMClassifier with the binary objective

  • Data — the features and labels (X, y)

  • Loss metric — 1 - ROC-AUC (FCVOpt minimizes loss, so we turn the AUC we want to maximize into a loss)

  • CV scheme — 10-fold stratified CV to preserve the class imbalance ratio in each fold

Setting needs_proba=True tells SklearnCVObj to call predict_proba and pass the positive-class probability scores to the loss function, which is what roc_auc_score expects.
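To make the loss convention concrete before wiring anything up, here is a toy sketch. The helper `tiny_auc` below is a hypothetical stand-in for `roc_auc_score` (it computes the probability that a random positive outranks a random negative), not part of FCVOpt or LightGBM:

```python
# Toy illustration of the loss convention: we minimize loss = 1 - AUC.
def tiny_auc(y_true, scores):
    # AUC = probability that a random positive outranks a random negative
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_toy = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]    # positives scored above all negatives
inverted = [0.9, 0.8, 0.2, 0.1]   # positives scored below all negatives
print(1 - tiny_auc(y_toy, perfect))   # best possible loss: 0.0
print(1 - tiny_auc(y_toy, inverted))  # worst possible loss: 1.0
```

A perfect ranking gives AUC = 1.0 and hence loss 0.0, so the optimizer's "smaller is better" convention lines up with "higher AUC is better".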

[4]:
# Define loss metric: maximize AUC → minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Create CV objective that wraps the LightGBM classifier
cv_obj = SklearnCVObj(
    estimator=lgb.LGBMClassifier(objective="binary", verbosity=-1),
    X=X,
    y=y,
    loss_metric=auc_loss,
    needs_proba=True,
    task='classification',
    n_splits=10,        # 10-fold cross-validation
    rng_seed=42,
    stratified=True     # preserve class imbalance ratio across folds
)

print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples: {len(cv_obj.y)}")
print(f"Features: {cv_obj.X.shape[1]}")
Number of CV folds: 10
Training samples: 2000
Features: 25

Step 2: Define the Hyperparameter Search Space

We tune five LightGBM hyperparameters that collectively control model capacity, regularization, and feature sampling. All parameters are searched on a log scale because their effects are roughly multiplicative.

| Hyperparameter   | Range        | Scale | Description                                    |
|------------------|--------------|-------|------------------------------------------------|
| n_estimators     | [50, 1000]   | Log   | Number of boosting rounds (trees)              |
| learning_rate    | [1e-3, 0.25] | Log   | Shrinkage applied to each tree’s contribution  |
| num_leaves       | [2, 128]     | Log   | Max leaves per tree; controls model complexity |
| min_data_in_leaf | [2, 100]     | Log   | Min samples per leaf; acts as regularization   |
| colsample_bytree | [0.05, 1.0]  | Log   | Fraction of features sampled per tree          |

[5]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()

# Add hyperparameters with appropriate ranges and scales
# Note: in LightGBM's sklearn API, n_estimators corresponds to the native
# num_iterations parameter (the number of boosting rounds)
config.add([
    Integer('n_estimators', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True),
])
print(config)
Configuration space object:
  Hyperparameters:
    colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
    learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
    min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
    num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
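The log=True flags in the configuration space mean these ranges are sampled log-uniformly, i.e. each multiplicative band of a range is equally likely. A small numpy sketch (my own illustration, independent of FCVOpt) of what that implies for the learning_rate range [1e-3, 0.25]:

```python
import numpy as np

# Log-uniform sampling: draw uniformly in log space, then exponentiate.
# The sub-range below 1e-2 spans log(1e-2 / 1e-3) / log(0.25 / 1e-3) ≈ 0.417
# of the log width, so it should receive roughly 41.7% of the samples,
# even though it covers under 4% of the range on a linear scale.
rng = np.random.default_rng(0)
lo, hi = 1e-3, 0.25
samples = np.exp(rng.uniform(np.log(lo), np.log(hi), size=100_000))
print(f"Fraction of samples below 1e-2: {np.mean(samples < 1e-2):.3f}")
```

This is why log scaling suits parameters whose effect is roughly multiplicative: a linear-uniform search would almost never try learning rates near 1e-3.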

Step 3: Initialize and Run the Optimizer

With the objective and search space in place, we create an FCVOpt instance and run 50 optimization trials. At each trial, FCVOpt:

  1. Uses the hierarchical GP to select the most promising hyperparameter configuration

  2. Picks a single CV fold to evaluate (instead of all 10)

  3. Updates the GP with the new observation and repeats

This means 50 trials require only ~50 model fits rather than the 500 that full 10-fold CV would demand.
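The cost comparison is plain arithmetic, sketched below (these counts are derived from the text above, not FCVOpt output):

```python
# Back-of-the-envelope model-fit budget for 50 trials
n_trials, n_folds = 50, 10
full_cv_fits = n_trials * n_folds  # classic HPO: every trial fits all 10 folds
fcv_fits = n_trials                # fractional CV: one fold per trial
print(full_cv_fits, fcv_fits, full_cv_fits // fcv_fits)  # 500 50 10
```

So fractional CV cuts the fit budget by a factor equal to the number of folds, here 10x.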

Note on acquisition functions: 'EI' (Expected Improvement) is not supported with fractional CV.

[6]:
# Initialize FCVOpt optimizer
optimizer = FCVOpt(
    obj=cv_obj.cvloss,
    n_folds=cv_obj.cv.get_n_splits(),
    config=config,
    acq_function='LCB',
    tracking_dir='./hpt_opt_runs/',
    experiment='lgb_sklearn_tuning',
    seed=123
)

# Run optimization for 50 trials
best_conf = optimizer.optimize(n_trials=50)

# Close the MLflow run
optimizer.end_run()

Number of candidates evaluated.....: 50
Single-fold observed loss (best)...: 0.0511111
Estimated full CV loss (best)......: 0.0697211

 Best configuration at termination:
 Configuration(values={
  'colsample_bytree': 0.1543087512749,
  'learning_rate': 0.001,
  'min_data_in_leaf': 34,
  'n_estimators': 718,
  'num_leaves': 61,
})

Evaluate the Best Configuration on All 10 Folds

As a final check, we run the best configuration through full 10-fold CV by calling the objective directly, then convert the resulting loss back to ROC-AUC.

[7]:
# Evaluate the best configuration found by FCVOpt
# Convert loss back to AUC for easier interpretation (loss = 1 - AUC)
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss

print(f"10-fold CV Loss: {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC: {best_cv_auc:.4f}")
10-fold CV Loss: 0.0906
10-fold CV ROC-AUC: 0.9094