Extending CVObjective for Custom Models
This notebook shows how to extend the CVObjective base class to support models that don’t follow the scikit-learn estimator interface. We use LightGBM’s native Python API as a concrete example.
Requirements: lightgbm must be installed (pip install lightgbm).
When to extend CVObjective
| Scenario | Recommended approach |
|---|---|
| Model inherits from `BaseEstimator` (sklearn-compatible) | Use `SklearnCVObj` |
| Model uses a non-sklearn API (e.g., LightGBM native, PyTorch) | Extend `CVObjective` |
| You need custom training logic (early stopping, callbacks) | Extend `CVObjective` |
The key constraint for SklearnCVObj is sklearn.base.clone: it reconstructs a fresh model instance by calling get_params() and passing the result back to the constructor. Any model that doesn’t implement get_params/set_params (i.e., doesn’t inherit BaseEstimator) will fail at this step.
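To see why this matters, the core of `clone` can be reduced to a small sketch (simplified; the real implementation also handles nested estimators and validation, and `naive_clone` and the two model classes here are illustrative names, not part of any library):

```python
def naive_clone(estimator):
    """Simplified sketch of what sklearn.base.clone does: read the
    constructor parameters back from the instance and rebuild it."""
    params = estimator.get_params()  # raises AttributeError without get_params
    return type(estimator)(**params)

class SklearnStyleModel:
    """Minimal model following the sklearn parameter contract."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def get_params(self, deep=True):
        return {"alpha": self.alpha}

class NativeStyleModel:
    """Model with a non-sklearn constructor: no get_params/set_params."""
    def __init__(self, params):
        self.params = params

fresh = naive_clone(SklearnStyleModel(alpha=0.5))
print(fresh.alpha)  # 0.5 -> fresh instance with the same settings

try:
    naive_clone(NativeStyleModel({"alpha": 0.5}))
except AttributeError:
    print("cloning fails: no get_params")
```

LightGBM's native `Booster` falls into the second category, which is why it needs a custom objective rather than `SklearnCVObj`.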
How it works
CVObjective handles the cross-validation loop—splitting the data, iterating over folds, and aggregating results. The only method you need to implement is fit_and_test, which trains and evaluates the model on a single fold:
class MyCVObjective(CVObjective):
    def fit_and_test(self, params, train_index, test_index) -> float:
        # train on train_index, evaluate on test_index, return a scalar loss
        ...
[1]:
# Note: set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'
[2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
# FCVOpt imports
from fcvopt.crossvalidation import CVObjective
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float
Generating the Data
We use the same synthetic binary classification dataset as in notebook 02—2,000 samples, 25 features (5 informative, 10 redundant), with a 90/10 class split—so the results are directly comparable.
[3]:
# Generate binary classification dataset with class imbalance (90% vs 10%)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)
print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
Shape of features matrix: (2000, 25)
Class distribution: [1796 204]
Step 1: Define the Hyperparameter Search Space
We tune the same five hyperparameters as in notebook 02. The only difference is that LightGBM’s native API uses num_round (not n_estimators) for the number of boosting rounds.
| Hyperparameter | Range | Scale | Description |
|---|---|---|---|
| `num_round` | [50, 1000] | Log | Number of boosting rounds |
| `learning_rate` | [1e-3, 0.25] | Log | Shrinkage applied to each tree’s contribution |
| `num_leaves` | [2, 128] | Log | Max leaves per tree; controls model complexity |
| `min_data_in_leaf` | [2, 100] | Log | Min samples per leaf; acts as regularization |
| `colsample_bytree` | [0.05, 1.0] | Log | Fraction of features sampled per tree |
[4]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()
# Add hyperparameters with appropriate ranges and scales
config.add([
    Integer('num_round', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True)
])
print(config)
Configuration space object:
Hyperparameters:
colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
num_round, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
Step 2: Implement a Custom CV Objective
Subclass CVObjective and implement fit_and_test. The parent class calls this method once per fold during evaluation, passing the train/test indices for that fold. Your implementation should:
1. Slice `self.X` and `self.y` using the provided indices
2. Train the model on the training split
3. Predict on the test split
4. Return a scalar loss (lower is better)
Everything else—iterating over folds, averaging results, handling repeats—is taken care of by the parent class.
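Conceptually, that parent-class loop reduces to the following sketch (simplified; the real `CVObjective` also handles stratification, repeats, and result tracking, and `cv_loop` is a hypothetical name used only for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_loop(objective, X, y, params, n_splits=10, seed=42):
    """Sketch of the CV loop: call fit_and_test once per fold
    and return the mean of the per-fold losses."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    losses = [
        objective.fit_and_test(params, train_idx, test_idx)
        for train_idx, test_idx in cv.split(X)
    ]
    return float(np.mean(losses))
```

Because the loop only ever sees the scalar returned by `fit_and_test`, it is completely agnostic to how the model inside is trained.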
[5]:
class LightGBMCVObj(CVObjective):
    """Custom CVObjective for LightGBM's native Python API."""

    def fit_and_test(self, params, train_index, test_index):
        """Train and evaluate the model on a single CV fold.

        Parameters
        ----------
        params : dict
            Hyperparameter configuration to evaluate.
        train_index : array-like
            Row indices for the training split.
        test_index : array-like
            Row indices for the test split.

        Returns
        -------
        float
            Loss value for this fold (lower is better).
        """
        # Slice data for this fold (supports both DataFrame and ndarray)
        if isinstance(self.X, pd.DataFrame):
            X_train, X_test = self.X.iloc[train_index], self.X.iloc[test_index]
        else:
            X_train, X_test = self.X[train_index], self.X[test_index]
        y_train, y_test = self.y[train_index], self.y[test_index]

        # LightGBM's native API separates num_round from model parameters
        num_round = params.get('num_round', 100)
        lgb_params = {k: v for k, v in params.items() if k != 'num_round'}
        lgb_params.update({'objective': 'binary', 'verbosity': -1, 'seed': self.rng_seed})

        # Train and evaluate
        train_data = lgb.Dataset(X_train, label=y_train)
        bst = lgb.train(lgb_params, train_data, num_round)
        y_pred = bst.predict(X_test)
        return self.loss_metric(y_test, y_pred)
[6]:
# Define loss metric: maximize AUC → minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Instantiate the custom CV objective
cv_obj = LightGBMCVObj(
    X=X,
    y=y,
    loss_metric=auc_loss,
    task='classification',
    n_splits=10,
    stratified=True,
    rng_seed=42
)
print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples: {len(cv_obj.y)}")
print(f"Features: {cv_obj.X.shape[1]}")
Number of CV folds: 10
Training samples: 2000
Features: 25
Sanity check
Before running full optimization, it’s worth verifying that fit_and_test is wired up correctly by calling cv_obj on the default configuration. This runs all 10 folds and returns the mean loss.
[7]:
test_config = config.get_default_configuration()
print("Default configuration:")
print(test_config)
test_loss = cv_obj(dict(test_config))
print(f"\n10-fold CV Loss (1 - AUC): {test_loss:.6f}")
print(f"10-fold CV AUC: {1 - test_loss:.6f}")
Default configuration:
Configuration(values={
'colsample_bytree': 0.22360679775,
'learning_rate': 0.0158113883008,
'min_data_in_leaf': 14,
'num_leaves': 16,
'num_round': 224,
})
10-fold CV Loss (1 - AUC): 0.073268
10-fold CV AUC: 0.926732
Step 3: Initialize and Run the Optimizer
With the custom objective in place, the optimizer setup is identical to the previous notebooks. FCVOpt only interacts with cv_obj.cvloss—it doesn’t care whether the objective uses the sklearn API or a custom implementation.
[8]:
optimizer = FCVOpt(
    obj=cv_obj.cvloss,
    n_folds=cv_obj.cv.get_n_splits(),
    config=config,
    acq_function='LCB',
    tracking_dir='./hpt_opt_runs/',
    experiment='lgb_native_tuning',
    seed=123
)

best_conf = optimizer.optimize(n_trials=50)

# Close the MLflow run
optimizer.end_run()
Number of candidates evaluated.....: 50
Single-fold observed loss (best)...: 0.140556
Estimated full CV loss (best)......: 0.0650603
Best configuration at termination:
Configuration(values={
'colsample_bytree': 0.3594758749201,
'learning_rate': 0.0010175873664,
'min_data_in_leaf': 2,
'num_leaves': 128,
'num_round': 1000,
})
[9]:
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss
print(f"10-fold CV Loss (1 - AUC): {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC: {best_cv_auc:.4f}")
10-fold CV Loss (1 - AUC): 0.0632
10-fold CV ROC-AUC: 0.9368