{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extending CVObjective for Custom Models\n", "\n", "This notebook shows how to extend the `CVObjective` base class to support models that don't follow the scikit-learn estimator interface. We use LightGBM's **native Python API** as a concrete example.\n", "\n", "**Requirements**: `lightgbm` must be installed (`pip install lightgbm`).\n", "\n", "## When to extend `CVObjective`\n", "\n", "| Scenario | Recommended approach |\n", "|---|---|\n", "| Model inherits from `sklearn.base.BaseEstimator` (implements `get_params`, `set_params`, `fit`, `predict`) | Use `SklearnCVObj` (see notebook 02) |\n", "| Model uses a non-sklearn API (e.g., LightGBM native, PyTorch) | Extend `CVObjective` |\n", "| You need custom training logic (early stopping, callbacks) | Extend `CVObjective` |\n", "\n", "The key constraint for `SklearnCVObj` is `sklearn.base.clone`: it reconstructs a fresh model instance by calling `get_params()` and passing the result back to the constructor. Any model that doesn't implement `get_params`/`set_params` (i.e., doesn't inherit `BaseEstimator`) will fail at this step.\n", "\n", "## How it works\n", "\n", "`CVObjective` handles the cross-validation loop—splitting the data, iterating over folds, and aggregating results. 
The only method you need to implement is `fit_and_test`, which trains and evaluates the model on a **single fold**:\n", "\n", "```python\n", "class MyCVObjective(CVObjective):\n", "    def fit_and_test(self, params, train_index, test_index) -> float:\n", "        # train on train_index, evaluate on test_index, return scalar loss\n", "        ...\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: Set OpenMP threads to 1 to avoid threading conflicts on macOS\n", "import os\n", "os.environ['OMP_NUM_THREADS'] = '1'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import lightgbm as lgb\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import roc_auc_score\n", "\n", "# FCVOpt imports\n", "from fcvopt.crossvalidation import CVObjective\n", "from fcvopt.optimizers import FCVOpt\n", "from fcvopt.configspace import ConfigurationSpace\n", "from ConfigSpace import Integer, Float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating the Data\n", "\n", "We use the same synthetic binary classification dataset as in notebook 02—2,000 samples, 25 features (5 informative, 10 redundant), with a 90/10 class split—so the results are directly comparable." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of features matrix: (2000, 25)\n", "Class distribution: [1796 204]\n" ] } ], "source": [ "# Generate binary classification dataset with class imbalance (90% vs 10%)\n", "X, y = make_classification(\n", " n_samples=2000,\n", " n_features=25,\n", " n_informative=5,\n", " n_redundant=10,\n", " n_classes=2,\n", " n_clusters_per_class=2,\n", " weights=[0.9, 0.1], # imbalanced classes\n", " random_state=23\n", ")\n", "\n", "print(f\"Shape of features matrix: {X.shape}\")\n", "print(f\"Class distribution: {np.bincount(y)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Define the Hyperparameter Search Space\n", "\n", "We tune the same five hyperparameters as in notebook 02. The only difference is that LightGBM's native API uses `num_round` (not `n_estimators`) for the number of boosting rounds.\n", "\n", "| Hyperparameter | Range | Scale | Description |\n", "|---|---|---|---|\n", "| `num_round` | [50, 1000] | Log | Number of boosting rounds |\n", "| `learning_rate` | [1e-3, 0.25] | Log | Shrinkage applied to each tree's contribution |\n", "| `num_leaves` | [2, 128] | Log | Max leaves per tree; controls model complexity |\n", "| `min_data_in_leaf` | [2, 100] | Log | Min samples per leaf; acts as regularization |\n", "| `colsample_bytree` | [0.05, 1.0] | Log | Fraction of features sampled per tree |" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Configuration space object:\n", " Hyperparameters:\n", " colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale\n", " learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale\n", " min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale\n", " num_leaves, Type: UniformInteger, 
Range: [2, 128], Default: 16, on log-scale\n", " num_round, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale\n", "\n" ] } ], "source": [ "# Create configuration space for hyperparameter search\n", "config = ConfigurationSpace()\n", "\n", "# Add hyperparameters with appropriate ranges and scales\n", "config.add([\n", " Integer('num_round', bounds=(50, 1000), log=True),\n", " Float('learning_rate', bounds=(1e-3, 0.25), log=True),\n", " Integer('num_leaves', bounds=(2, 128), log=True),\n", " Integer('min_data_in_leaf', bounds=(2, 100), log=True),\n", " Float('colsample_bytree', bounds=(0.05, 1), log=True)\n", "])\n", "print(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Implement a Custom CV Objective\n", "\n", "Subclass `CVObjective` and implement `fit_and_test`. The parent class calls this method once per fold during evaluation, passing the train/test indices for that fold. Your implementation should:\n", "\n", "1. Slice `self.X` and `self.y` using the provided indices\n", "2. Train the model on the training split\n", "3. Predict on the test split\n", "4. Return a scalar loss (lower is better)\n", "\n", "Everything else—iterating over folds, averaging results, handling repeats—is taken care of by the parent class." 
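, "\n", "\n", "To make the `sklearn.base.clone` constraint from the introduction concrete: `clone` raises a `TypeError` for any object that doesn't implement `get_params`. The `NativeBooster` class below is a hypothetical stand-in for a non-sklearn model wrapper (it is not part of LightGBM):\n", "\n", "```python\n", "from sklearn.base import clone\n", "\n", "class NativeBooster:  # no get_params/set_params, so clone will fail\n", "    def __init__(self, num_leaves=31):\n", "        self.num_leaves = num_leaves\n", "\n", "try:\n", "    clone(NativeBooster())\n", "except TypeError as err:\n", "    print(err)  # ...does not seem to be a scikit-learn estimator...\n", "```"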
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class LightGBMCVObj(CVObjective):\n", " \"\"\"Custom CVObjective for LightGBM's native Python API.\"\"\"\n", "\n", " def fit_and_test(self, params, train_index, test_index):\n", " \"\"\"Train and evaluate the model on a single CV fold.\n", "\n", " Parameters\n", " ----------\n", " params : dict\n", " Hyperparameter configuration to evaluate.\n", " train_index : array-like\n", " Row indices for the training split.\n", " test_index : array-like\n", " Row indices for the test split.\n", "\n", " Returns\n", " -------\n", " float\n", " Loss value for this fold (lower is better).\n", " \"\"\"\n", " # Slice data for this fold (supports both DataFrame and ndarray)\n", " if isinstance(self.X, pd.DataFrame):\n", " X_train, X_test = self.X.iloc[train_index], self.X.iloc[test_index]\n", " else:\n", " X_train, X_test = self.X[train_index], self.X[test_index]\n", " y_train, y_test = self.y[train_index], self.y[test_index]\n", "\n", " # LightGBM native API separates num_round from model parameters\n", " num_round = params.get('num_round', 100)\n", " lgb_params = {\n", " k: v for k, v in params.items() if k != 'num_round'\n", " }\n", " lgb_params.update({'objective': 'binary', 'verbosity': -1, 'seed': self.rng_seed})\n", "\n", " # Train and evaluate\n", " train_data = lgb.Dataset(X_train, label=y_train)\n", " bst = lgb.train(lgb_params, train_data, num_round)\n", " y_pred = bst.predict(X_test)\n", "\n", " return self.loss_metric(y_test, y_pred)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of CV folds: 10\n", "Training samples: 2000\n", "Features: 25\n" ] } ], "source": [ "# Define loss metric: maximize AUC → minimize (1 - AUC)\n", "def auc_loss(y_true, y_pred):\n", " return 1 - roc_auc_score(y_true, y_pred)\n", "\n", "# Instantiate the custom CV objective\n", "cv_obj = LightGBMCVObj(\n", " 
X=X,\n", " y=y,\n", " loss_metric=auc_loss,\n", " task='classification',\n", " n_splits=10,\n", " stratified=True,\n", " rng_seed=42\n", ")\n", "\n", "print(f\"Number of CV folds: {cv_obj.cv.get_n_splits()}\")\n", "print(f\"Training samples: {len(cv_obj.y)}\")\n", "print(f\"Features: {cv_obj.X.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sanity check\n", "\n", "Before running full optimization, it's worth verifying that `fit_and_test` is wired up correctly by calling `cv_obj` on the default configuration. This runs all 10 folds and returns the mean loss." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default configuration:\n", "Configuration(values={\n", " 'colsample_bytree': 0.22360679775,\n", " 'learning_rate': 0.0158113883008,\n", " 'min_data_in_leaf': 14,\n", " 'num_leaves': 16,\n", " 'num_round': 224,\n", "})\n", "\n", "10-fold CV Loss (1 - AUC): 0.073268\n", "10-fold CV AUC: 0.926732\n" ] } ], "source": [ "test_config = config.get_default_configuration()\n", "print(\"Default configuration:\")\n", "print(test_config)\n", "\n", "test_loss = cv_obj(dict(test_config))\n", "print(f\"\\n10-fold CV Loss (1 - AUC): {test_loss:.6f}\")\n", "print(f\"10-fold CV AUC: {1 - test_loss:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Initialize and Run the Optimizer\n", "\n", "With the custom objective in place, the optimizer setup is identical to the previous notebooks. `FCVOpt` only interacts with `cv_obj.cvloss`—it doesn't care whether the objective uses the sklearn API or a custom implementation." 
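, "\n", "\n", "One payoff of this decoupling is that porting the recipe to another non-sklearn backend only requires a new `fit_and_test`. As a sketch (not run in this notebook), an XGBoost-native variant could look like the following; the search space would also need XGBoost's parameter names (e.g. `max_leaves`, `min_child_weight`), and the sketch assumes `self.X` is an ndarray as in this notebook:\n", "\n", "```python\n", "import xgboost as xgb\n", "\n", "class XGBoostCVObj(CVObjective):\n", "    def fit_and_test(self, params, train_index, test_index):\n", "        X_train, X_test = self.X[train_index], self.X[test_index]\n", "        y_train, y_test = self.y[train_index], self.y[test_index]\n", "\n", "        # Same split as the LightGBM version: boosting rounds vs. model params\n", "        num_round = params.get('num_round', 100)\n", "        xgb_params = {k: v for k, v in params.items() if k != 'num_round'}\n", "        xgb_params.update({'objective': 'binary:logistic', 'seed': self.rng_seed})\n", "\n", "        dtrain = xgb.DMatrix(X_train, label=y_train)\n", "        bst = xgb.train(xgb_params, dtrain, num_boost_round=num_round)\n", "        y_pred = bst.predict(xgb.DMatrix(X_test))\n", "        return self.loss_metric(y_test, y_pred)\n", "```"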
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Number of candidates evaluated.....: 50\n", "Single-fold observed loss (best)...: 0.140556\n", "Estimated full CV loss (best)......: 0.0650603\n", "\n", " Best configuration at termination:\n", " Configuration(values={\n", " 'colsample_bytree': 0.3594758749201,\n", " 'learning_rate': 0.0010175873664,\n", " 'min_data_in_leaf': 2,\n", " 'num_leaves': 128,\n", " 'num_round': 1000,\n", "})\n" ] } ], "source": [ "optimizer = FCVOpt(\n", " obj=cv_obj.cvloss,\n", " n_folds=cv_obj.cv.get_n_splits(),\n", " config=config,\n", " acq_function='LCB',\n", " tracking_dir='./hpt_opt_runs/',\n", " experiment='lgb_native_tuning',\n", " seed=123\n", ")\n", "\n", "best_conf = optimizer.optimize(n_trials=50)\n", "\n", "# Close the MLflow run\n", "optimizer.end_run()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10-fold CV Loss (1 - AUC): 0.0632\n", "10-fold CV ROC-AUC: 0.9368\n" ] } ], "source": [ "best_cv_loss = cv_obj(best_conf)\n", "best_cv_auc = 1 - best_cv_loss\n", "\n", "print(f\"10-fold CV Loss (1 - AUC): {best_cv_loss:.4f}\")\n", "print(f\"10-fold CV ROC-AUC: {best_cv_auc:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fcvopt_test (3.10.19)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 4 }