Extending CVObjective for Custom Models

This notebook shows how to extend the CVObjective base class to support models that don’t follow the scikit-learn estimator interface. We use LightGBM’s native Python API as a concrete example.

Requirements: lightgbm must be installed (pip install lightgbm).

When to extend CVObjective

| Scenario | Recommended approach |
| --- | --- |
| Model inherits from sklearn.base.BaseEstimator (implements get_params, set_params, fit, predict) | Use SklearnCVObj (see notebook 02) |
| Model uses a non-sklearn API (e.g., LightGBM native, PyTorch) | Extend CVObjective |
| You need custom training logic (early stopping, callbacks) | Extend CVObjective |

The key constraint for SklearnCVObj is sklearn.base.clone: it reconstructs a fresh model instance by calling get_params() and passing the result back to the constructor. Any model that doesn’t implement get_params/set_params (i.e., doesn’t inherit BaseEstimator) will fail at this step.
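To see the constraint concretely, here is a minimal sketch (the NativeModel class is hypothetical) of clone rejecting an object that lacks get_params:

```python
from sklearn.base import clone

class NativeModel:
    """Hypothetical model that does not follow the sklearn estimator API."""
    def __init__(self, num_leaves=31):
        self.num_leaves = num_leaves

try:
    clone(NativeModel())  # clone needs get_params() to rebuild the instance
except TypeError as err:
    print(f"clone failed: {err}")
```

clone raises a TypeError explaining that the object does not implement get_params, which is exactly the failure you would hit inside SklearnCVObj.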

How it works

CVObjective handles the cross-validation loop—splitting the data, iterating over folds, and aggregating results. The only method you need to implement is fit_and_test, which trains and evaluates the model on a single fold:

class MyCVObjective(CVObjective):
    def fit_and_test(self, params, train_index, test_index) -> float:
        # train on train_index, evaluate on test_index, return scalar loss
        ...
[1]:
# Note: Set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'
[2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# FCVOpt imports
from fcvopt.crossvalidation import CVObjective
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float

Generating the Data

We use the same synthetic binary classification dataset as in notebook 02—2,000 samples, 25 features (5 informative, 10 redundant), with a 90/10 class split—so the results are directly comparable.

[3]:
# Generate binary classification dataset with class imbalance (90% vs 10%)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)

print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
Shape of features matrix: (2000, 25)
Class distribution: [1796  204]

Step 1: Define the Hyperparameter Search Space

We tune the same five hyperparameters as in notebook 02. The only difference is that LightGBM’s native API uses num_round (not n_estimators) for the number of boosting rounds.

| Hyperparameter | Range | Scale | Description |
| --- | --- | --- | --- |
| num_round | [50, 1000] | Log | Number of boosting rounds |
| learning_rate | [1e-3, 0.25] | Log | Shrinkage applied to each tree's contribution |
| num_leaves | [2, 128] | Log | Max leaves per tree; controls model complexity |
| min_data_in_leaf | [2, 100] | Log | Min samples per leaf; acts as regularization |
| colsample_bytree | [0.05, 1.0] | Log | Fraction of features sampled per tree |

[4]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()

# Add hyperparameters with appropriate ranges and scales
config.add([
    Integer('num_round', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True)
])
print(config)
Configuration space object:
  Hyperparameters:
    colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
    learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
    min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
    num_round, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale

Step 2: Implement a Custom CV Objective

Subclass CVObjective and implement fit_and_test. The parent class calls this method once per fold during evaluation, passing the train/test indices for that fold. Your implementation should:

  1. Slice self.X and self.y using the provided indices

  2. Train the model on the training split

  3. Predict on the test split

  4. Return a scalar loss (lower is better)

Everything else—iterating over folds, averaging results, handling repeats—is taken care of by the parent class.

[5]:
class LightGBMCVObj(CVObjective):
    """Custom CVObjective for LightGBM's native Python API."""

    def fit_and_test(self, params, train_index, test_index):
        """Train and evaluate the model on a single CV fold.

        Parameters
        ----------
        params : dict
            Hyperparameter configuration to evaluate.
        train_index : array-like
            Row indices for the training split.
        test_index : array-like
            Row indices for the test split.

        Returns
        -------
        float
            Loss value for this fold (lower is better).
        """
        # Slice data for this fold (supports both DataFrame and ndarray)
        if isinstance(self.X, pd.DataFrame):
            X_train, X_test = self.X.iloc[train_index], self.X.iloc[test_index]
        else:
            X_train, X_test = self.X[train_index], self.X[test_index]
        y_train, y_test = self.y[train_index], self.y[test_index]

        # LightGBM native API separates num_round from model parameters
        num_round = params.get('num_round', 100)
        lgb_params = {
            k: v for k, v in params.items() if k != 'num_round'
        }
        lgb_params.update({'objective': 'binary', 'verbosity': -1, 'seed': self.rng_seed})

        # Train and evaluate
        train_data = lgb.Dataset(X_train, label=y_train)
        bst = lgb.train(lgb_params, train_data, num_round)
        y_pred = bst.predict(X_test)

        return self.loss_metric(y_test, y_pred)
[6]:
# Define loss metric: maximize AUC → minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Instantiate the custom CV objective
cv_obj = LightGBMCVObj(
    X=X,
    y=y,
    loss_metric=auc_loss,
    task='classification',
    n_splits=10,
    stratified=True,
    rng_seed=42
)

print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples:   {len(cv_obj.y)}")
print(f"Features:           {cv_obj.X.shape[1]}")
Number of CV folds: 10
Training samples:   2000
Features:           25

Sanity check

Before running full optimization, it’s worth verifying that fit_and_test is wired up correctly by calling cv_obj on the default configuration. This runs all 10 folds and returns the mean loss.

[7]:
test_config = config.get_default_configuration()
print("Default configuration:")
print(test_config)

test_loss = cv_obj(dict(test_config))
print(f"\n10-fold CV Loss (1 - AUC): {test_loss:.6f}")
print(f"10-fold CV AUC:            {1 - test_loss:.6f}")
Default configuration:
Configuration(values={
  'colsample_bytree': 0.22360679775,
  'learning_rate': 0.0158113883008,
  'min_data_in_leaf': 14,
  'num_leaves': 16,
  'num_round': 224,
})

10-fold CV Loss (1 - AUC): 0.073268
10-fold CV AUC:            0.926732

Step 3: Initialize and Run the Optimizer

With the custom objective in place, the optimizer setup is identical to the previous notebooks. FCVOpt only interacts with cv_obj.cvloss—it doesn’t care whether the objective uses the sklearn API or a custom implementation.

[8]:
optimizer = FCVOpt(
    obj=cv_obj.cvloss,
    n_folds=cv_obj.cv.get_n_splits(),
    config=config,
    acq_function='LCB',
    tracking_dir='./hpt_opt_runs/',
    experiment='lgb_native_tuning',
    seed=123
)

best_conf = optimizer.optimize(n_trials=50)

# Close the MLflow run
optimizer.end_run()

Number of candidates evaluated.....: 50
Single-fold observed loss (best)...: 0.140556
Estimated full CV loss (best)......: 0.0650603

 Best configuration at termination:
 Configuration(values={
  'colsample_bytree': 0.3594758749201,
  'learning_rate': 0.0010175873664,
  'min_data_in_leaf': 2,
  'num_leaves': 128,
  'num_round': 1000,
})
[9]:
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss

print(f"10-fold CV Loss (1 - AUC): {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC:        {best_cv_auc:.4f}")
10-fold CV Loss (1 - AUC): 0.0632
10-fold CV ROC-AUC:        0.9368