{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to FCVOpt\n", "\n", "This notebook walks through the FCVOpt API for efficient hyperparameter optimization using **fractional cross-validation**. We tune a Random Forest classifier on a synthetic dataset to illustrate the core concepts and workflow.\n", "\n", "## What is FCVOpt?\n", "\n", "FCVOpt addresses a fundamental tension in hyperparameter optimization: K-fold cross-validation is more reliable than a single train-test split, but fitting K models per configuration makes optimization prohibitively expensive.\n", "\n", "**The key insight** is that CV folds are not independent—configurations that perform well on one fold tend to perform well on others. FCVOpt exploits this structure via a **hierarchical Gaussian process (HGP)** that jointly models performance across all folds. This allows the optimizer to evaluate just a single fold per configuration while still reasoning about full K-fold performance, yielding substantial speedups with little loss in quality.\n", "\n", "In contrast, standard Bayesian optimization with K-fold CV requires all K folds to be evaluated at each candidate configuration before a decision can be made." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import make_classification\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import zero_one_loss\n", "\n", "from fcvopt.optimizers import FCVOpt\n", "from fcvopt.crossvalidation import SklearnCVObj\n", "from fcvopt.configspace import ConfigurationSpace\n", "from ConfigSpace import Integer, Float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating the Data\n", "\n", "We generate a synthetic binary classification dataset with 2,000 samples and 50 features, of which only 10 are truly informative and 25 are linear combinations of those. A 10% label noise rate (`flip_y=0.1`) makes the task non-trivial." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of features matrix: (1500, 50)\n", "Class distribution: [761 739]\n" ] } ], "source": [ "# Generate sample classification data\n", "X, y = make_classification(\n", " n_samples=1500, \n", " n_features=50, \n", " n_informative=10,\n", " n_redundant=25,\n", " n_classes=2,\n", " flip_y=0.1,\n", " random_state=42\n", ")\n", "\n", "print(f\"Shape of features matrix: {X.shape}\")\n", "print(f\"Class distribution: {np.bincount(y)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The FCVOpt API\n", "\n", "FCVOpt follows a simple three-step workflow:\n", "\n", "```\n", "1. Define a Cross-Validation Objective ← what to evaluate and how\n", " ↓\n", "2. Define a Hyperparameter Search Space ← what to optimize over\n", " ↓\n", "3. Run the Optimizer ← find the best configuration\n", "```\n", "\n", "Each step is covered in detail below." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Define the Cross-Validation Objective\n", "\n", "The CV objective bundles together everything needed to evaluate a hyperparameter configuration:\n", "\n", "- **Estimator** — the model to tune (`RandomForestClassifier`)\n", "- **Data** — the features and labels (`X`, `y`)\n", "- **Loss metric** — the quantity to minimize (misclassification rate)\n", "- **CV scheme** — how to split the data (10-fold stratified CV)\n", "\n", "For scikit-learn–compatible estimators, FCVOpt provides `SklearnCVObj` as a convenient wrapper. Under the hood, calling `cv_obj.cvloss(params)` fits and evaluates the model on each fold and returns the average loss—this is the function the optimizer will minimize." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created CV objective with 10 folds\n" ] } ], "source": [ "# Create CV objective for Random Forest\n", "cv_obj = SklearnCVObj(\n", " estimator=RandomForestClassifier(random_state=42),\n", " X=X, y=y,\n", " loss_metric=zero_one_loss, # Minimize misclassification rate\n", " task='classification',\n", " n_splits=10, \n", " rng_seed=42\n", ")\n", "\n", "print(f\"Created CV objective with {cv_obj.cv.get_n_splits()} folds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Define the Hyperparameter Search Space\n", "\n", "The configuration space declares which hyperparameters to tune and their valid ranges. 
We use log-scale bounds for all parameters since their effects are roughly multiplicative—e.g., increasing the number of trees by 50 matters more at 50 than at 500.\n", "\n", "| Hyperparameter | Range | Scale | Description |\n", "|---|---|---|---|\n", "| `n_estimators` | [50, 1000] | Log | Number of trees in the forest |\n", "| `max_depth` | [1, 15] | Log | Maximum depth of each tree |\n", "| `max_features` | [0.01, 1.0] | Log | Fraction of features considered at each split |\n", "| `min_samples_split` | [2, 200] | Log | Minimum samples required to split a node |\n", "\n", "FCVOpt's `ConfigurationSpace` extends the standard ConfigSpace with utilities for Latin Hypercube sampling and conversion between named configurations and numeric arrays used by the GP model." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Configuration space object:\n", " Hyperparameters:\n", " max_depth, Type: UniformInteger, Range: [1, 15], Default: 4, on log-scale\n", " max_features, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.1, on log-scale\n", " min_samples_split, Type: UniformInteger, Range: [2, 200], Default: 20, on log-scale\n", " n_estimators, Type: UniformInteger, Range: [50, 1000], Default: 224, on log-scale\n", "\n" ] } ], "source": [ "# Define hyperparameter search space\n", "config = ConfigurationSpace()\n", "config.add([\n", " Integer('n_estimators', bounds=(50, 1000), log=True),\n", " Integer('max_depth', bounds=(1, 15), log=True),\n", " Float('max_features', bounds=(0.01, 1.0), log=True),\n", " Integer('min_samples_split', bounds=(2, 200), log=True)\n", "])\n", "\n", "print(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Initialize and Run the Optimizer\n", "\n", "With the objective and search space defined, we can create an `FCVOpt` instance and run the optimization loop. 
The key constructor arguments are:\n", "\n", "| Argument | Description |\n", "|---|---|\n", "| `obj` | The callable loss function to minimize (`cv_obj.cvloss`) |\n", "| `n_folds` | Total number of CV folds (must match the objective) |\n", "| `config` | The hyperparameter search space |\n", "| `acq_function` | Acquisition function: `'LCB'` (Lower Confidence Bound) or `'KG'` (Knowledge Gradient) |\n", "| `tracking_dir` | Local directory for MLflow logs (see below) |\n", "| `experiment` | Name for this optimization run in MLflow |\n", "| `seed` | Random seed for reproducibility |\n", "\n", "**Choosing an acquisition function**: `'LCB'` is fast and strikes a good balance between exploration and exploitation. `'KG'` (Knowledge Gradient) often finds better configurations but is more computationally expensive per iteration.\n", "\n", "### Experiment Tracking with MLflow\n", "\n", "[MLflow](https://mlflow.org) is an open-source library for tracking machine learning experiments. FCVOpt uses it to automatically record everything that happens during optimization—so you can inspect, compare, and resume runs without any extra bookkeeping code.\n", "\n", "At each iteration, FCVOpt logs to MLflow:\n", "- **Metrics** (indexed by iteration): incumbent observed loss (`f_inc_obs`), estimated loss from the GP (`f_inc_est`), GP fitting time, and acquisition optimization time\n", "- **Artifacts**: a per-iteration JSON snapshot with the candidate and incumbent configurations, and periodic checkpoints of the GP model weights (`.pth` files)\n", "- **Parameters & tags**: acquisition function, seed, batch size, and other run settings\n", "\n", "There are two ways to tell FCVOpt where to write these logs:\n", "\n", "| Option | When to use | Example |\n", "|---|---|---|\n", "| `tracking_dir` | Local logging to a directory on disk | `tracking_dir='./hp_opt_runs/'` |\n", "| `tracking_uri` | Remote MLflow server, or an explicit `file:` URI | `tracking_uri='http://localhost:5000'` |\n", "\n", 
"Only one of the two should be provided. If neither is given, logs are written to `./mlruns/` in the current directory.\n", "\n", "Once a run is complete (or even mid-run), you can browse all logged data with the MLflow UI:\n", "\n", "```bash\n", "mlflow ui --backend-store-uri ./hp_opt_runs/\n", "```\n", "\n", "This opens a browser dashboard where you can plot metrics over iterations, compare different runs side by side, and download artifacts. You can also restore a previous optimizer state directly from a logged run using `FCVOpt.restore_from_mlflow()`.\n", "\n", "---\n", "\n", "We run 50 trials below. Each trial selects a hyperparameter configuration via the acquisition function, evaluates it on a **single** held-out fold chosen" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Number of candidates evaluated.....: 50\n", "Single-fold observed loss (best)...: 0.146667\n", "Estimated full CV loss (best)......: 0.129033\n", "\n", " Best configuration at termination:\n", " Configuration(values={\n", " 'max_depth': 15,\n", " 'max_features': 0.3571846673984,\n", " 'min_samples_split': 6,\n", " 'n_estimators': 460,\n", "})\n" ] } ], "source": [ "# Initialize FCVOpt optimizer\n", "optimizer = FCVOpt(\n", " obj=cv_obj.cvloss,\n", " n_folds=cv_obj.cv.get_n_splits(),\n", " config=config,\n", " acq_function='LCB', # Lower Confidence Bound acquisition\n", " tracking_dir='./hpt_opt_runs/', # MLflow tracking directory\n", " experiment='rf_tuning_example',\n", " seed=123\n", ")\n", "\n", "# run for 50 trials \n", "best_conf = optimizer.optimize(n_trials=50)\n", "\n", "# end run\n", "optimizer.end_run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating and Deploying the Best Configuration\n", "\n", "After optimization, `best_conf` holds the best configuration found. 
The end-of-run summary prints two loss values:\n", "\n", "- **Single-fold observed loss** — the raw loss measured on whichever single held-out fold was evaluated for the best configuration. Because it reflects only one fold (and the best one observed), it is a noisy, optimistically biased estimate of the true CV performance.\n", "- **Estimated full CV loss** — the HGP's prediction of what the full K-fold CV loss would be. It becomes more accurate as more trials accumulate observations across folds.\n", "\n", "To get a more reliable estimate of generalization performance, we call `cv_obj(best_conf)`, which evaluates the configuration on **all** 10 folds and returns their average." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 10-fold CV Misclassification Rate....:0.124667\n" ] } ], "source": [ "# Evaluate best configuration\n", "best_cv_mcr = cv_obj(best_conf)\n", "print(f\" 10-fold CV Misclassification Rate....:{best_cv_mcr:.6f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train the final model\n", "\n", "Finally, we retrain on the full dataset using the best hyperparameters. This final model is what you would deploy or use for inference." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Construct the model with the best hyperparameters found\n", "best_model = cv_obj.construct_model(dict(best_conf))\n", "\n", "# Train it on the full dataset\n", "_ = best_model.fit(X, y)" ] } ], "metadata": { "kernelspec": { "display_name": "fcvopt_test (3.10.19)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 4 }