The HistGradientBoosting Algorithm
Back when I was doing data competitions, people jokingly called CatBoost, XGBoost, and LightGBM the "three musketeers" of competitions. These days HistGradientBoosting has joined their ranks and shows up in nearly every competition. For example, the second-place model in the Steel Plate Defect Prediction competition used HistGradientBoostingClassifier.
Compared with GradientBoostingClassifier and GradientBoostingRegressor, the histogram-based HistGradientBoosting estimators work differently:
HistGradientBoosting first bins the input samples X into integer-valued bins (typically 256 bins). This drastically reduces the number of split points to consider and lets the algorithm build trees on integer-based data structures (histograms) instead of repeatedly sorting continuous feature values.
import numpy as np
import optuna
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# X, y, targets_bin, and cv (e.g. a StratifiedKFold instance) are assumed
# to be defined earlier in the notebook.

RETRAIN_HGBC_MODEL = False

def objective(trial):
    # Define hyperparameters to tune
    param = {
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1),
        'max_iter': trial.suggest_int('max_iter', 100, 2500),  # Equivalent to n_estimators
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'l2_regularization': trial.suggest_float('l2_regularization', 1e-8, 10.0, log=True),  # Equivalent to reg_lambda
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 20, 300),
        'max_bins': trial.suggest_int('max_bins', 25, 255),
    }
    auc_scores = []
    for train_idx, valid_idx in cv.split(X, y):
        X_train_fold, X_valid_fold = X.iloc[train_idx], X.iloc[valid_idx]
        y_train_fold, y_valid_fold = y.iloc[train_idx], y.iloc[valid_idx]
        # Create and fit the model
        model = HistGradientBoostingClassifier(**param)
        model.fit(X_train_fold, y_train_fold)
        # Predict class probabilities
        y_prob = model.predict_proba(X_valid_fold)
        # Compute the AUC for each class and take the average
        # (column 0 is dropped to skip the "no defect" class)
        average_auc = roc_auc_score(targets_bin.iloc[valid_idx], y_prob[:, 1:], multi_class="ovr", average="macro")
        auc_scores.append(average_auc)
    # Return the average AUC score across all folds
    return np.mean(auc_scores)

if RETRAIN_HGBC_MODEL:
    # Example usage with Optuna
    hgbc_study = optuna.create_study(direction='maximize', study_name="HistGradientBoostingClassifier_model_training")
    hgbc_study.optimize(objective, n_trials=200)  # Adjust the number of trials as necessary
    # Output the optimization results
    print(f"Best trial average AUC: {hgbc_study.best_value:.4f}")
    print(hgbc_study.best_params)
    for key, value in hgbc_study.best_params.items():
        print(f"{key}: {value}")
In any scenario where models such as LightGBM apply, HistGradientBoosting can be used directly as a drop-in replacement.