{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tuning LightGBM Hyperparameters (scikit-learn API)\n", "\n", "This notebook demonstrates how to use FCVOpt to tune LightGBM hyperparameters for a binary classification task with class imbalance.\n", "\n", "LightGBM is a gradient boosting framework that uses histogram-based algorithms for fast training. It exposes a scikit-learn–compatible API (`LGBMClassifier`), which means we can plug it directly into `SklearnCVObj` without any custom wrapper code.\n", "\n", "The notebook follows the same three-step workflow as the introduction:\n", "\n", "```\n", "1. Define a Cross-Validation Objective ← wrap LightGBM + data + metric\n", " ↓\n", "2. Define a Hyperparameter Search Space ← which knobs to tune and over what ranges\n", " ↓\n", "3. Run the Optimizer ← let FCVOpt find the best configuration\n", "```\n", "\n", "**Requirements**: `lightgbm` must be installed (`pip install lightgbm`)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Note: Set OpenMP threads to 1 to avoid threading conflicts on MacOS\n", "import os\n", "os.environ['OMP_NUM_THREADS'] = '1'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import lightgbm as lgb\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import roc_auc_score\n", "\n", "# FCVOpt imports\n", "from fcvopt.crossvalidation import SklearnCVObj\n", "from fcvopt.optimizers import FCVOpt\n", "from fcvopt.configspace import ConfigurationSpace\n", "from ConfigSpace import Integer, Float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating the Data\n", "\n", "We create a synthetic binary classification dataset with **strong class imbalance** (90% negative, 10% positive) to simulate a realistic scenario where ROC-AUC is a more informative metric than accuracy. 
The dataset has 25 features: only 5 are truly informative, 10 are redundant linear combinations of those, and the remaining 10 are noise." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of features matrix: (2000, 25)\n", "Class distribution: [1796 204]\n" ] } ], "source": [ "# Generate binary classification dataset with class imbalance (90% vs 10%)\n", "X, y = make_classification(\n", " n_samples=2000,\n", " n_features=25,\n", " n_informative=5,\n", " n_redundant=10,\n", " n_classes=2,\n", " n_clusters_per_class=2,\n", " weights=[0.9, 0.1], # imbalanced classes\n", " random_state=23\n", ")\n", "\n", "print(f\"Shape of features matrix: {X.shape}\")\n", "print(f\"Class distribution: {np.bincount(y)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Define the Cross-Validation Objective\n", "\n", "The CV objective bundles together everything needed to evaluate a hyperparameter configuration:\n", "\n", "- **Estimator** — `LGBMClassifier` with binary objective\n", "- **Data** — the features and labels (`X`, `y`)\n", "- **Loss metric** — `1 - ROC-AUC` (we minimize loss, so we convert the AUC we want to maximize)\n", "- **CV scheme** — 10-fold stratified CV to preserve the class imbalance ratio in each fold\n", "\n", "Setting `needs_proba=True` tells `SklearnCVObj` to call `predict_proba` and pass the positive-class probability scores to the loss function, which is what `roc_auc_score` expects." 
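, "\n", "\n", "To make this concrete, evaluating one configuration on one fold amounts to roughly the following. This is an illustrative sketch in plain scikit-learn and LightGBM, not the actual `SklearnCVObj` internals, and the split shown here need not match the folds `SklearnCVObj` generates:\n", "\n", "```python\n", "# Illustrative sketch: evaluate one configuration on a single stratified fold\n", "from sklearn.model_selection import StratifiedKFold\n", "\n", "skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n", "train_idx, test_idx = next(iter(skf.split(X, y)))\n", "\n", "model = lgb.LGBMClassifier(objective='binary', verbosity=-1)\n", "model.fit(X[train_idx], y[train_idx])\n", "\n", "# needs_proba=True corresponds to scoring on positive-class probabilities\n", "proba = model.predict_proba(X[test_idx])[:, 1]\n", "fold_loss = 1 - roc_auc_score(y[test_idx], proba)\n", "```"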
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Configuration space object:\n", " Hyperparameters:\n", " colsample_bytree, Type: UniformFloat, Range: [0.05, 1.0], Default: 0.22360679775, on log-scale\n", " learning_rate, Type: UniformFloat, Range: [0.001, 0.25], Default: 0.0158113883008, on log-scale\n", " min_data_in_leaf, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale\n", " n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale\n", " num_leaves, Type: UniformInteger, Range: [2, 128], Default: 16, on log-scale\n", "\n" ] } ], "source": [ "# Create configuration space for hyperparameter search\n", "config = ConfigurationSpace()\n", "\n", "# Add hyperparameters with appropriate ranges and scales\n", "# Note: LightGBM sklearn API uses 'num_round' -> number of estimators\n", "config.add([\n", " Integer('n_estimators', bounds=(50, 1000), log=True),\n", " Float('learning_rate', bounds=(1e-3, 0.25), log=True),\n", " Integer('num_leaves', bounds=(2, 128), log=True),\n", " Integer('min_data_in_leaf', bounds=(2, 100), log=True),\n", " Float('colsample_bytree', bounds=(0.05, 1), log=True),\n", "])\n", "print(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Define the Hyperparameter Search Space\n", "\n", "We tune five LightGBM hyperparameters that collectively control model capacity, regularization, and feature sampling. 
All parameters are searched on a log scale because their effects are roughly multiplicative.\n", "\n", "| Hyperparameter | Range | Scale | Description |\n", "|---|---|---|---|\n", "| `n_estimators` | [50, 1000] | Log | Number of boosting rounds (trees) |\n", "| `learning_rate` | [1e-3, 0.25] | Log | Shrinkage applied to each tree's contribution |\n", "| `num_leaves` | [2, 128] | Log | Max leaves per tree; controls model complexity |\n", "| `min_data_in_leaf` | [2, 100] | Log | Min samples per leaf; acts as regularization |\n", "| `colsample_bytree` | [0.05, 1.0] | Log | Fraction of features sampled per tree |" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of CV folds: 10\n", "Training samples: 2000\n", "Features: 25\n" ] } ], "source": [ "# Define loss metric: maximize AUC → minimize (1 - AUC)\n", "def auc_loss(y_true, y_pred):\n", " return 1 - roc_auc_score(y_true, y_pred)\n", "\n", "# Create CV objective that wraps the LightGBM classifier\n", "cv_obj = SklearnCVObj(\n", " estimator=lgb.LGBMClassifier(objective=\"binary\", verbosity=-1),\n", " X=X,\n", " y=y,\n", " loss_metric=auc_loss,\n", " needs_proba=True,\n", " task='classification',\n", " n_splits=10, # 10-fold cross-validation\n", " rng_seed=42,\n", " stratified=True # preserve class imbalance ratio across folds\n", ")\n", "\n", "print(f\"Number of CV folds: {cv_obj.cv.get_n_splits()}\")\n", "print(f\"Training samples: {len(cv_obj.y)}\")\n", "print(f\"Features: {cv_obj.X.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Initialize and Run the Optimizer\n", "\n", "With the objective and search space in place, we create an `FCVOpt` instance and run 50 optimization trials. At each trial, FCVOpt:\n", "\n", "1. Uses the hierarchical GP to select the most promising hyperparameter configuration\n", "2. Picks a single CV fold to evaluate (instead of all 10)\n", "3. 
Updates the GP with the new observation and repeats\n", "\n", "This means 50 trials require only ~50 model fits rather than the 500 that full 10-fold CV would demand.\n", "\n", "**Note on acquisition functions**: `'EI'` (Expected Improvement) is not supported with fractional CV, so this notebook uses `'LCB'` (Lower Confidence Bound) instead." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Number of candidates evaluated.....: 50\n", "Single-fold observed loss (best)...: 0.0511111\n", "Estimated full CV loss (best)......: 0.0697211\n", "\n", " Best configuration at termination:\n", " Configuration(values={\n", " 'colsample_bytree': 0.1543087512749,\n", " 'learning_rate': 0.001,\n", " 'min_data_in_leaf': 34,\n", " 'n_estimators': 718,\n", " 'num_leaves': 61,\n", "})\n" ] } ], "source": [ "# Initialize FCVOpt optimizer\n", "optimizer = FCVOpt(\n", " obj=cv_obj.cvloss,\n", " n_folds=cv_obj.cv.get_n_splits(),\n", " config=config,\n", " acq_function='LCB',\n", " tracking_dir='./hpt_opt_runs/',\n", " experiment='lgb_sklearn_tuning',\n", " seed=123\n", ")\n", "\n", "# Run optimization for 50 trials\n", "best_conf = optimizer.optimize(n_trials=50)\n", "\n", "# Close the MLflow run\n", "optimizer.end_run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the Best Configuration on All 10 Folds\n", "\n", "The losses observed during optimization come from single folds, so we confirm the winning configuration with a full 10-fold evaluation." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10-fold CV Loss: 0.0906\n", "10-fold CV ROC-AUC: 0.9094\n" ] } ], "source": [ "# Evaluate the best configuration found by FCVOpt\n", "# Convert loss back to AUC for easier interpretation (loss = 1 - AUC)\n", "best_cv_loss = cv_obj(best_conf)\n", "best_cv_auc = 1 - best_cv_loss\n", "\n", "print(f\"10-fold CV Loss: {best_cv_loss:.4f}\")\n", "print(f\"10-fold CV ROC-AUC: {best_cv_auc:.4f}\")" ] } ], "metadata": { 
"kernelspec": { "display_name": "fcvopt_test (3.10.19)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 4 }