Extending CVObjective for Custom Models
This notebook shows how to extend the CVObjective base class to support models that don’t follow the scikit-learn estimator interface. We use LightGBM’s native Python API as a concrete example.
Requirements: lightgbm must be installed (pip install lightgbm).
When to extend CVObjective
| Scenario | Recommended approach |
|---|---|
| Model inherits from `BaseEstimator` (sklearn-compatible) | Use `SklearnCVObj` |
| Model uses a non-sklearn API (e.g., LightGBM native, PyTorch) | Extend `CVObjective` |
| You need custom training logic (early stopping, callbacks) | Extend `CVObjective` |
The key constraint for SklearnCVObj is sklearn.base.clone: it reconstructs a fresh model instance by calling get_params() and passing the result back to the constructor. Any model that doesn’t implement get_params/set_params (i.e., doesn’t inherit BaseEstimator) will fail at this step.
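To see why this matters, the core of `clone` can be reduced to a small sketch (simplified; the real implementation also handles nested estimators and validation, and `naive_clone` and the two model classes here are illustrative names, not part of any library):

```python
def naive_clone(estimator):
    """Simplified sketch of what sklearn.base.clone does: read the
    constructor parameters back from the instance and rebuild it."""
    params = estimator.get_params()  # raises AttributeError without get_params
    return type(estimator)(**params)

class SklearnStyleModel:
    """Minimal model following the sklearn parameter contract."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def get_params(self, deep=True):
        return {"alpha": self.alpha}

class NativeStyleModel:
    """Model with a non-sklearn constructor: no get_params/set_params."""
    def __init__(self, params):
        self.params = params

fresh = naive_clone(SklearnStyleModel(alpha=0.5))
print(fresh.alpha)  # 0.5 -> fresh instance with the same settings

try:
    naive_clone(NativeStyleModel({"alpha": 0.5}))
except AttributeError:
    print("cloning fails: no get_params")
```

LightGBM's native `Booster` falls into the second category, which is why it needs a custom objective rather than `SklearnCVObj`.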
How it works
CVObjective handles the cross-validation loop—splitting the data, iterating over folds, and aggregating results. The only method you need to implement is fit_and_test, which trains and evaluates the model on a single fold:
class MyCVObjective(CVObjective):
    def fit_and_test(self, params, train_index, test_index) -> float:
        # train on train_index, evaluate on test_index, return a scalar loss
        ...
[1]:
# Note: set OpenMP threads to 1 to avoid threading conflicts on macOS
import os
os.environ['OMP_NUM_THREADS'] = '1'
[2]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
# FCVOpt imports
from fcvopt.crossvalidation import CVObjective
from fcvopt.optimizers import FCVOpt
from fcvopt.configspace import ConfigurationSpace
from ConfigSpace import Integer, Float
Generating the Data
We use the same synthetic binary classification dataset as in notebook 02—2,000 samples, 25 features (5 informative, 10 redundant), with a 90/10 class split—so the results are directly comparable.
[3]:
# Generate binary classification dataset with class imbalance (90% vs 10%)
X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=5,
    n_redundant=10,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9, 0.1],  # imbalanced classes
    random_state=23
)
print(f"Shape of features matrix: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
Shape of features matrix: (2000, 25)
Class distribution: [1796 204]
Step 1: Define the Hyperparameter Search Space
We tune the same five hyperparameters as in notebook 02. The only difference is that LightGBM’s native API uses num_round (not n_estimators) for the number of boosting rounds.
| Hyperparameter | Range | Scale | Description |
|---|---|---|---|
| `num_round` | [50, 1000] | Log | Number of boosting rounds |
| `learning_rate` | [1e-3, 0.25] | Log | Shrinkage applied to each tree’s contribution |
| `num_leaves` | [2, 128] | Log | Max leaves per tree; controls model complexity |
| `min_data_in_leaf` | [2, 100] | Log | Min samples per leaf; acts as regularization |
| `colsample_bytree` | [0.05, 1.0] | Log | Fraction of features sampled per tree |
[4]:
# Create configuration space for hyperparameter search
config = ConfigurationSpace()
# Add hyperparameters with appropriate ranges and scales
config.add([
    Integer('num_round', bounds=(50, 1000), log=True),
    Float('learning_rate', bounds=(1e-3, 0.25), log=True),
    Integer('num_leaves', bounds=(2, 128), log=True),
    Integer('min_data_in_leaf', bounds=(2, 100), log=True),
    Float('colsample_bytree', bounds=(0.05, 1), log=True)
])
print(config)
Configuration space object:
Hyperparameters:
colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale
learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale
min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale
num_round, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale
Step 2: Implement a Custom CV Objective
Subclass CVObjective and implement fit_and_test. The parent class calls this method once per fold during evaluation, passing the train/test indices for that fold. Your implementation should:
1. Slice `self.X` and `self.y` using the provided indices
2. Train the model on the training split
3. Predict on the test split
4. Return a scalar loss (lower is better)
Everything else—iterating over folds, averaging results, handling repeats—is taken care of by the parent class.
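Conceptually, that parent-class loop reduces to the following sketch (simplified; the real `CVObjective` also handles stratification, repeats, and result tracking, and `cv_loop` is a hypothetical name used only for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_loop(objective, X, y, params, n_splits=10, seed=42):
    """Sketch of the CV loop: call fit_and_test once per fold
    and return the mean of the per-fold losses."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    losses = [
        objective.fit_and_test(params, train_idx, test_idx)
        for train_idx, test_idx in cv.split(X)
    ]
    return float(np.mean(losses))
```

Because the loop only ever sees the scalar returned by `fit_and_test`, it is completely agnostic to how the model inside is trained.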
[5]:
class LightGBMCVObj(CVObjective):
    """Custom CVObjective for LightGBM's native Python API."""

    def fit_and_test(self, params, train_index, test_index):
        """Train and evaluate the model on a single CV fold.

        Parameters
        ----------
        params : dict
            Hyperparameter configuration to evaluate.
        train_index : array-like
            Row indices for the training split.
        test_index : array-like
            Row indices for the test split.

        Returns
        -------
        float
            Loss value for this fold (lower is better).
        """
        # Slice data for this fold (supports both DataFrame and ndarray)
        if isinstance(self.X, pd.DataFrame):
            X_train, X_test = self.X.iloc[train_index], self.X.iloc[test_index]
        else:
            X_train, X_test = self.X[train_index], self.X[test_index]
        y_train, y_test = self.y[train_index], self.y[test_index]

        # LightGBM's native API separates num_round from model parameters
        num_round = params.get('num_round', 100)
        lgb_params = {k: v for k, v in params.items() if k != 'num_round'}
        lgb_params.update({'objective': 'binary', 'verbosity': -1, 'seed': self.rng_seed})

        # Train and evaluate
        train_data = lgb.Dataset(X_train, label=y_train)
        bst = lgb.train(lgb_params, train_data, num_round)
        y_pred = bst.predict(X_test)
        return self.loss_metric(y_test, y_pred)
[6]:
# Define loss metric: maximize AUC → minimize (1 - AUC)
def auc_loss(y_true, y_pred):
    return 1 - roc_auc_score(y_true, y_pred)

# Instantiate the custom CV objective
cv_obj = LightGBMCVObj(
    X=X,
    y=y,
    loss_metric=auc_loss,
    task='classification',
    n_splits=10,
    stratified=True,
    rng_seed=42
)
print(f"Number of CV folds: {cv_obj.cv.get_n_splits()}")
print(f"Training samples: {len(cv_obj.y)}")
print(f"Features: {cv_obj.X.shape[1]}")
Number of CV folds: 10
Training samples: 2000
Features: 25
Sanity check
Before running full optimization, it’s worth verifying that fit_and_test is wired up correctly by calling cv_obj on the default configuration. This runs all 10 folds and returns the mean loss.
[7]:
test_config = config.get_default_configuration()
print("Default configuration:")
print(test_config)
test_loss = cv_obj(dict(test_config))
print(f"\n10-fold CV Loss (1 - AUC): {test_loss:.6f}")
print(f"10-fold CV AUC: {1 - test_loss:.6f}")
Default configuration:
Configuration(values={
'colsample_bytree': 0.22360679775,
'learning_rate': 0.0158113883008,
'min_data_in_leaf': 14,
'num_leaves': 16,
'num_round': 224,
})
10-fold CV Loss (1 - AUC): 0.073268
10-fold CV AUC: 0.926732
Step 3: Initialize and Run the Optimizer
With the custom objective in place, the optimizer setup is identical to the previous notebooks. FCVOpt only interacts with cv_obj.cvloss—it doesn’t care whether the objective uses the sklearn API or a custom implementation.
[8]:
optimizer = FCVOpt(
    obj=cv_obj.cvloss,
    n_folds=cv_obj.cv.get_n_splits(),
    config=config,
    acq_function='LCB',
    tracking_dir='./hpt_opt_runs/',
    experiment='lgb_native_tuning',
    seed=123
)

best_conf = optimizer.optimize(n_trials=50)

# Close the MLflow run
optimizer.end_run()
Number of candidates evaluated.....: 50
Single-fold observed loss (best)...: 0.140556
Estimated full CV loss (best)......: 0.0650603
Best configuration at termination:
Configuration(values={
'colsample_bytree': 0.3594758749201,
'learning_rate': 0.0010175873664,
'min_data_in_leaf': 2,
'num_leaves': 128,
'num_round': 1000,
})
[9]:
best_cv_loss = cv_obj(best_conf)
best_cv_auc = 1 - best_cv_loss
print(f"10-fold CV Loss (1 - AUC): {best_cv_loss:.4f}")
print(f"10-fold CV ROC-AUC: {best_cv_auc:.4f}")
10-fold CV Loss (1 - AUC): 0.0632
10-fold CV ROC-AUC: 0.9368