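A Practical Guide to Supervised Learning

If you are just getting started with machine learning, supervised learning is the first area to master. Put simply, supervised learning is learning "with a teacher": we have input data and the corresponding correct answers, and the model learns from these pairs how to predict the output from the input. This article walks through the core supervised learning algorithms, from principles to practice, covering the topics interviewers like to ask about and that you will actually use in real projects.

Logistic Regression: Regression in Name, Classification in Practice

Many people are confused the first time they see the name "logistic regression": if the name says "regression", why is it used for classification? It is a fair question. Logistic regression really does have regression at its core, since it fits a linear function of the features to the log-odds of the positive class, but its main use is solving classification problems.

The Magic of the Sigmoid Function

At the heart of logistic regression is the sigmoid function: σ(z) = 1/(1+e^(-z)). Its graph is an S-shaped curve, and its output always lies between 0 and 1. You can think of it as a "probability converter": any real number passed through it becomes a valid probability.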
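
As a minimal sketch of the formula above, the function below evaluates the sigmoid for a single Python float. The name `sigmoid` and the two-branch form are choices made here for illustration; the branch for negative inputs only avoids overflow in `exp` and is mathematically the same expression.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable sigmoid: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp() never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Note the symmetry σ(-z) = 1 - σ(z): an input of 0 maps to exactly 0.5, which is why 0.5 is the natural default decision threshold.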

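When we use logistic regression for binary classification, the model outputs a probability between 0 and 1. For example, when deciding whether an email is spam, an output of 0.8 means "there is an 80% chance this is spam". We then choose a threshold (0.5 by default): outputs at or above the threshold are assigned to the positive class, and outputs below it to the negative class.

Why Logistic Regression Matters

You might ask: with deep learning being so powerful, why bother learning an "ancient" algorithm like logistic regression? There are several reasons. First, it is simple and efficient, trains quickly, and makes a good baseline model. Second, it outputs probabilities and its coefficients are easy to interpret, which makes it popular in areas such as financial risk control and medical diagnosis. Third, it is a foundation for understanding neural networks: a single neuron with a sigmoid activation is exactly a logistic regression model. Sigmoid was the classic activation function before piecewise-linear functions such as ReLU and LeakyReLU (which are different functions, not variants of sigmoid) became the usual choice for hidden layers.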

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (regularized logistic regression is sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
lr_model = LogisticRegression(
    penalty='l2',           # L2 regularization
    C=1.0,                  # inverse of regularization strength; smaller means stronger
    solver='lbfgs',         # optimization algorithm
    max_iter=1000,          # maximum number of iterations
    random_state=42
)

lr_model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = lr_model.predict(X_test_scaled)
y_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))

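Decision Trees: Models Humans Can Read

Logistic regression can be interpreted through its coefficients, but a decision tree is more transparent still: its decision process is fully visible. You can draw the whole tree, and anyone can follow how the model reaches a prediction.

Information Gain and Gini Impurity

The core question in building a decision tree is how to choose the best feature to split on. There are two main criteria: information gain (based on entropy) and Gini impurity.

Entropy measures the uncertainty in the data: H = -∑p_i·log2(p_i), where p_i is the fraction of samples in class i. When the data is perfectly pure (all samples belong to one class), entropy is 0; when the classes are uniformly distributed, entropy reaches its maximum. Information gain is the entropy of the parent node minus the sample-weighted average entropy of its child nodes. We want a split to increase purity, that is, to lower entropy.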
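
The entropy formula and the definition of information gain can be sketched in a few lines of plain Python. The helper names `entropy` and `information_gain` are made up for this illustration, and the split is assumed to be binary (a left and a right child).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
print(entropy(parent))                                   # balanced two-class node
print(information_gain(parent, [0, 0], [1, 1]))          # split that separates the classes
print(information_gain(parent, [0, 1], [0, 1]))          # split that separates nothing
```

A split that separates the classes perfectly recovers the full parent entropy as gain, while a split that leaves both children as mixed as the parent gains nothing.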

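Gini impurity is more direct: Gini = 1 - ∑p_i². It is the probability that two samples drawn at random (with replacement) from the data belong to different classes, or equivalently the probability of mislabeling a random sample when labels are assigned at random according to the class distribution. The smaller the Gini impurity, the purer the data.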
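
A small sketch of the Gini formula, assuming the labels of one node are given as a plain Python list (the function name `gini` is chosen here for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 0, 0]))   # pure node
print(gini([0, 0, 0, 1]))   # mostly one class
print(gini([0, 0, 1, 1]))   # evenly mixed two-class node
```

For two classes the impurity peaks at 0.5 when the node is evenly mixed and falls to 0 as the node becomes pure.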

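In practice, scikit-learn's DecisionTreeClassifier uses Gini impurity by default. It is slightly cheaper to compute than entropy because it needs no logarithm, and the two criteria usually give similar results.

The Fatal Weakness of Decision Trees

Decision trees are highly interpretable, but they have a serious flaw: they overfit easily. A fully grown tree can reach 100% accuracy on the training data (as long as no identical inputs carry different labels) yet often performs poorly on new data. This is the overfitting problem.

The remedy is pruning. Pre-pruning restricts the tree while it grows, for example by limiting the maximum depth, the minimum number of samples per node, or the minimum information gain required for a split. Post-pruning lets the tree grow fully first and then removes, from the bottom up, branches whose gain is not significant.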

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
 
# Compare trees of different depths to show overfitting
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
 
for idx, max_depth in enumerate([1, 5, 15]):
    tree_clf = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_split=5,
        min_samples_leaf=2,
        random_state=42
    )
    tree_clf.fit(X_train, y_train)
    
    train_acc = tree_clf.score(X_train, y_train)
    test_acc = tree_clf.score(X_test, y_test)
    
    plot_tree(tree_clf, ax=axes[idx], filled=True, 
              class_names=['Class 0', 'Class 1'],
              feature_names=[f'Feature {i}' for i in range(X.shape[1])],
              max_depth=3)
    axes[idx].set_title(f'Max Depth={max_depth}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}')
 
plt.tight_layout()
plt.savefig('decision_tree_comparison.png', dpi=150)
plt.show()

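Random Forests: The Wisdom of the Crowd

A single decision tree overfits easily. What if we build many trees and let them vote on the result? That is the core idea of a random forest.

Bootstrap and Feature Subspaces

The "random" in random forest shows up in two places. The first is bootstrap sampling: the training data for each tree is n samples drawn at random with replacement from the original dataset of n samples, so each tree sees slightly different data, which increases the diversity of the model.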
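
To make bootstrap sampling concrete, here is a small sketch using only the standard library. The helper name `bootstrap_indices` and the dataset size are assumptions for illustration. Because each draw misses a given sample with probability (1 - 1/n), roughly (1 - 1/n)^n ≈ 1/e ≈ 36.8% of the samples are never drawn for a given tree; these are the "out-of-bag" samples.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)
n = 10_000
idx = bootstrap_indices(n, rng)

in_bag = set(idx)                       # distinct samples this tree would see
oob_fraction = 1 - len(in_bag) / n      # samples this tree never sees

print(f"distinct in-bag samples: {len(in_bag)} of {n}")
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

The out-of-bag samples act as a built-in validation set for each tree, which is what scikit-learn's `oob_score=True` option relies on.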

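The second is the feature subspace: at each node split, instead of searching all features for the best split, the algorithm randomly picks a subset of features (for example sqrt(n_features) of them) and finds the best split among those. This further increases the differences between trees.

At prediction time, every tree predicts independently, and the forest takes a majority vote (classification) or an average (regression) to get the final result. Averaging many decorrelated trees mainly reduces the variance of the individual trees, while their bias stays roughly the same, and this noticeably improves generalization.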
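
The two aggregation rules can be sketched as follows, assuming each tree's prediction is already available as a list entry (the function names are hypothetical). Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so this is an illustration of the idea, not of that library's internals.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Classification: return the most frequent class label.
    Ties are broken in favor of the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(predictions)

tree_labels = [1, 0, 1, 1, 0]           # five trees voting on one sample
tree_values = [2.0, 4.0, 6.0]           # three trees predicting a number

print(majority_vote(tree_labels))
print(average_prediction(tree_values))
```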

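Key Random Forest Parameters

n_estimators is easy to understand: it is the number of trees. More is generally better, but with diminishing returns. 100 to 500 trees is usually enough; with large datasets or many features, you can go up to 1000.

max_depth controls the maximum depth of each tree. With no limit, a tree keeps splitting until every leaf is pure or has too few samples to split further (fewer than min_samples_split). Setting a reasonable depth (for example 10 to 20) usually works well.

min_samples_split and min_samples_leaf control the minimum number of samples needed to split a node and the minimum number of samples in a leaf, respectively. Larger values help prevent overfitting, but values that are too large can cause underfitting.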

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
import seaborn as sns
 
# Train the random forest
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',      # number of features considered at each split
    bootstrap=True,
    oob_score=True,           # evaluate on out-of-bag samples
    random_state=42,
    n_jobs=-1                 # use all CPU cores
)
 
rf_model.fit(X_train, y_train)
 
# Feature importance analysis
feature_importance = rf_model.feature_importances_
feature_names = [f'Feature {i}' for i in range(X.shape[1])]

# Take the 10 most important features
top_n = 10
top_indices = np.argsort(feature_importance)[-top_n:]
top_features = [feature_names[i] for i in top_indices]
top_importance = feature_importance[top_indices]
 
plt.figure(figsize=(10, 6))
plt.barh(top_features, top_importance)
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.tight_layout()
plt.savefig('rf_feature_importance.png', dpi=150)
plt.show()
 
# Model evaluation
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]
 
print(f"OOB Score: {rf_model.oob_score_:.4f}")
print(f"Test AUC-ROC: {roc_auc_score(y_test, y_pred_proba_rf):.4f}")
 
# Confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Random Forest')
plt.savefig('rf_confusion_matrix.png', dpi=150)

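XGBoost and LightGBM: Kaggle Competition Workhorses

If random forest is the classic of ensemble learning, XGBoost and LightGBM are the next step: they also combine many decision trees, but through boosting rather than bagging. Both have become near-standard in Kaggle competitions and show up in almost every competition on structured (tabular) data.

The Core Idea of Gradient Boosting

XGBoost stands for eXtreme Gradient Boosting, and its core idea is gradient boosting. Unlike a random forest, whose trees are independent and can be built in parallel, gradient boosting builds trees sequentially: each new tree is there to correct the errors of all the trees before it.

Concretely, we first train one tree on the data, and its predictions will contain some errors. We then compute the negative gradient of the loss with respect to the current predictions (for squared error this is just the residual) and train a second tree with those values as its target, so the second tree focuses on correcting the first tree's errors. The third tree fits the residuals left by the first two trees combined, and so on. The final prediction is the sum of all the trees' predictions, each usually scaled by a learning rate.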
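
The loop described above can be sketched for squared-error regression on a single feature. This is a toy illustration under stated assumptions, not how XGBoost or LightGBM are implemented: the trees are depth-1 "stumps", and the names `fit_stump` and `gradient_boost` are made up for this example.

```python
def fit_stump(x, r):
    """Fit a depth-1 regression tree to targets r on a 1-D feature x."""
    best = None
    for t in sorted(set(x))[:-1]:                      # candidate thresholds
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    base = sum(y) / len(y)                             # start from the mean
    pred = [base] * len(y)
    trees, history = [], []
    for _ in range(n_trees):
        # Residuals = negative gradient of the loss 1/2 * (y - pred)^2
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))

    def predict(xi):
        return base + learning_rate * sum(tree(xi) for tree in trees)

    return predict, history

x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
predict, history = gradient_boost(x, y)
print(f"training MSE after 1 tree: {history[0]:.3f}, after 50 trees: {history[-1]:.3f}")
```

Each stump only nudges the prediction by a fraction (`learning_rate`) of its fitted residual, which is why many small trees are needed and why the training error shrinks gradually.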

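What Sets XGBoost Apart

XGBoost improves on traditional gradient boosted trees mainly through three techniques: first, it approximates the objective with a second-order Taylor expansion, which is more precise than the traditional first-order approach; second, it handles missing values and sparse data automatically; third, it uses a block storage structure with pre-sorted features, which speeds up split finding.

Another important feature of XGBoost is regularization. It adds regularization terms directly to the objective function: a penalty γ on the number of leaves and an L2 penalty λ on the leaf weights (an L1 penalty α on the leaf weights is also available). These correspond to the gamma, reg_lambda, and reg_alpha parameters, and they make XGBoost less prone to overfitting by construction.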
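
To see how these penalties enter the algorithm, the sketch below evaluates the closed-form leaf weight and split gain that follow from the second-order objective, with G and H denoting the sums of first and second derivatives of the loss over the samples in a node. The function names are hypothetical, and the L1 term (α) is assumed to be zero to keep the formulas short.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)

def split_gain(G_L, H_L, G_R, H_R, reg_lambda, gamma):
    """Loss reduction from splitting a node into left and right children:
    1/2 * [G_L^2/(H_L+l) + G_R^2/(H_R+l) - (G_L+G_R)^2/(H_L+H_R+l)] - gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# A node whose left and right halves pull in opposite directions
print(leaf_weight(G=-4.0, H=2.0, reg_lambda=1.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=0.1))
```

A larger λ shrinks every leaf weight toward zero, and a split is only worth making when its gain stays positive after subtracting γ, which is how gamma acts as a per-leaf penalty.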

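import xgboost as xgb
from xgboost import XGBClassifier

# Train XGBoost.
# early_stopping_rounds is set on the constructor (XGBoost >= 1.6); passing it to
# fit() is deprecated and removed in later releases. use_label_encoder is also
# deprecated and no longer needed.
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,            # fraction of samples used per tree
    colsample_bytree=0.8,     # fraction of features used per tree
    reg_alpha=0.1,            # L1 regularization on leaf weights
    reg_lambda=1.0,           # L2 regularization on leaf weights
    gamma=0.1,                # minimum loss reduction required to split
    min_child_weight=1,       # minimum sum of instance weights (hessian) in a child
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=30, # stop if there is no improvement for 30 rounds
    random_state=42,
    n_jobs=-1
)

# Train with early stopping.
# For brevity the test set doubles as the validation set here; in a real project,
# hold out a separate validation set so the test score stays unbiased.
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=50                # log every 50 rounds
)

# Predict
y_pred_xgb = xgb_model.predict(X_test)
y_pred_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]

print(f"Best Iteration: {xgb_model.best_iteration}")
print(f"Test AUC-ROC: {roc_auc_score(y_test, y_pred_proba_xgb):.4f}")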

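LightGBM: Faster Gradient Boosting

LightGBM is an efficient open-source gradient boosting framework from Microsoft. Its main strengths are very fast training and low memory use, which make it especially useful on large datasets.

LightGBM uses a histogram-based algorithm to find approximate best split points: it discretizes continuous feature values into a number of bins and only looks for split points at bin boundaries, which greatly reduces computation. LightGBM also uses gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) for further speedups.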
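
The histogram idea can be sketched for a single feature as follows. This is a simplified illustration, not LightGBM's actual implementation: it assumes equal-width bins, a squared-error style split score, and a made-up function name `best_histogram_split`.

```python
def best_histogram_split(values, gradients, n_bins=8):
    """Pick the bin boundary that best separates the gradients of one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins              # assumes hi > lo
    # One pass over the data: per-bin gradient sum and sample count
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum into the last bin
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(values)
    best_score, best_threshold = float("-inf"), None
    left_g, left_n = 0.0, 0
    # Scan only the n_bins - 1 boundaries instead of every distinct value
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_g = total_g - left_g
        score = left_g ** 2 / left_n + right_g ** 2 / right_n
        if score > best_score:
            best_score, best_threshold = score, lo + (b + 1) * width
    return best_threshold

values = [i / 10 for i in range(100)]                    # 0.0, 0.1, ..., 9.9
gradients = [-1.0 if v < 5 else 1.0 for v in values]     # sign flips at 5
threshold = best_histogram_split(values, gradients)
print(f"chosen threshold: {threshold:.4f}")
```

With 100 distinct values an exact search would test 99 thresholds; the histogram version tests only 7 bin boundaries, at the cost of a threshold that is close to, but not exactly at, the true change point.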

import lightgbm as lgb

# LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,              # number of leaves; controls model complexity
    'max_depth': -1,               # no depth limit
    'learning_rate': 0.05,
    'subsample': 0.8,
    'subsample_freq': 1,           # row subsampling only takes effect when this is > 0
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'min_child_samples': 20,
    'random_state': 42,
    'n_jobs': -1,
    'verbose': -1
}

# Create the datasets
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_test = lgb.Dataset(X_test, label=y_test, reference=lgb_train)

# Train (as above, the test set doubles as the validation set for brevity)
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    num_boost_round=500,           # native-API equivalent of n_estimators
    valid_sets=[lgb_train, lgb_test],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=50)
    ]
)

# Predict (the native Booster returns positive-class probabilities directly)
y_pred_proba_lgb = lgb_model.predict(X_test)
print(f"Test AUC-ROC: {roc_auc_score(y_test, y_pred_proba_lgb):.4f}")

# Plot feature importance
lgb.plot_importance(lgb_model, max_num_features=10, figsize=(10, 6))
plt.title('LightGBM Feature Importance')
plt.savefig('lgb_feature_importance.png', dpi=150)

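A Unified View of Classification and Regression

You may have noticed that most of the algorithms above can be used for both classification and regression: decision trees, random forests, XGBoost, and LightGBM all come in classification and regression versions, and logistic regression has linear regression as its regression counterpart. Understanding this unified view is important.

The essential difference between classification and regression is the form of the output: classification outputs discrete class labels, regression outputs continuous values. Mathematically, though, the two are not that different: both learn a mapping function from input to output.

In terms of output, classification models usually output probabilities (through sigmoid or softmax), while regression models output raw values directly. In terms of loss functions, classification typically uses cross-entropy, and regression uses MSE or MAE. In terms of evaluation metrics, classification uses accuracy, AUC, and F1, while regression uses MSE, RMSE, and R².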
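
The loss functions mentioned here are short enough to write out directly. The sketch below uses plain Python lists and hypothetical function names; the clipping constant `eps` is an assumption added to keep the logarithm finite when a predicted probability is exactly 0 or 1.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] for labels in {0, 1}."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)       # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

labels = [1, 0, 1]
print(binary_cross_entropy(labels, [0.9, 0.1, 0.8]))   # confident and correct
print(binary_cross_entropy(labels, [0.1, 0.9, 0.2]))   # confident and wrong
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Cross-entropy punishes confident wrong probabilities far more heavily than hesitant ones, while MSE penalizes large numeric errors more strongly than MAE does.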

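In real projects, classification and regression are often interchangeable. Predicting house prices is a regression task, but if you bucket prices into "high", "medium", and "low", it becomes a classification task. Conversely, a ranking problem can sometimes be treated as regression (predict a score) or as classification (predict whether the user clicks).

Handling Class Imbalance

Real-world datasets are rarely perfectly balanced. In fraud detection, fraudulent transactions may make up only 0.1% of the data; in disease screening, patients may be only 5%. Without any countermeasure, the model tends to predict the majority class.

Sampling Strategies

The most direct way to handle class imbalance is sampling: either add minority-class samples (oversampling) or remove majority-class samples (undersampling).

SMOTE (Synthetic Minority Over-sampling Technique) is the classic oversampling method. Its idea is to generate new samples by interpolating between minority-class samples. Specifically, for a minority-class sample, it randomly picks one of that sample's k nearest minority-class neighbors and then takes a random point on the line segment between the two as a new sample.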
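
The interpolation step can be sketched in a few lines. This is a toy version for intuition, not the imbalanced-learn implementation: the names `k_nearest` and `smote_like_sample` are hypothetical, distances are Euclidean, and neighbors are found by brute force.

```python
import math
import random

def k_nearest(x, minority, k):
    """The k minority samples closest to x (excluding x itself)."""
    others = [m for m in minority if m is not x]
    return sorted(others, key=lambda m: math.dist(x, m))[:k]

def smote_like_sample(x, minority, k, rng):
    """Create one synthetic point on the segment between x and a random neighbor."""
    neighbor = rng.choice(k_nearest(x, minority, k))
    lam = rng.random()                      # position along the segment, in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)
new_point = smote_like_sample(minority[0], minority, k=2, rng=rng)
print(new_point)
```

Because the synthetic point always lies between two real minority samples, it stays inside the region the minority class already occupies instead of duplicating an existing sample exactly.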

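Undersampling does the opposite: it keeps only a subset of the majority class. The simplest approach is random undersampling, but that throws away a lot of valuable information. More targeted methods are Edited Nearest Neighbors (ENN) and Tomek Links, which clean the data by removing majority-class samples that are noisy or sit right on the class boundary, rather than discarding samples at random.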

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline

# Create an imbalanced dataset
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.9, 0.1],    # 90% class 0, 10% class 1
    random_state=42
)

print(f"Original class distribution: {np.bincount(y_imbalanced)}")

# SMOTE oversampling
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_imbalanced, y_imbalanced)
print(f"Class distribution after SMOTE: {np.bincount(y_smote)}")

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_imbalanced, y_imbalanced)
print(f"Class distribution after undersampling: {np.bincount(y_rus)}")

# SMOTE combined with Tomek Links cleaning
smote_tomek = SMOTETomek(random_state=42)
X_combined, y_combined = smote_tomek.fit_resample(X_imbalanced, y_imbalanced)
print(f"Class distribution after combined sampling: {np.bincount(y_combined)}")

# Compare strategies on the imbalanced dataset.
# Resampling runs inside an imblearn Pipeline so it is applied to the training
# folds only; resampling before cross-validation would leak synthetic samples
# into the validation folds and inflate the scores.
from sklearn.model_selection import cross_val_score

strategies = {
    'Original data': None,
    'SMOTE': SMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'Undersampling': RandomUnderSampler(random_state=42),
    'SMOTE+Tomek': SMOTETomek(random_state=42)
}

results = {}
for name, sampler in strategies.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    model = clf if sampler is None else ImbPipeline([('sampler', sampler), ('clf', clf)])
    scores = cross_val_score(model, X_imbalanced, y_imbalanced, cv=5, scoring='f1')
    results[name] = scores.mean()
    print(f"{name}: F1={scores.mean():.4f} (+/- {scores.std():.4f})")

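Threshold Tuning

Besides changing the data itself, we can also adjust the decision threshold. The default threshold of 0.5 is often not optimal when classes are imbalanced. We can plot the precision-recall curve or the ROC curve to find a better threshold.

A more direct approach is to pick the threshold that maximizes a chosen metric, such as the F1 score.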

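from sklearn.metrics import precision_recall_curve, f1_score

# Compute precision and recall at every candidate threshold.
# This reuses the test set for brevity; in practice, choose the threshold on a
# separate validation set, otherwise the reported F1 is optimistically biased.
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba_rf)

# Find the threshold that maximizes F1
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
optimal_idx = np.argmax(f1_scores[:-1])  # the last point (precision=1, recall=0) has no threshold
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.4f}")
print(f"Best F1 score: {f1_scores[optimal_idx]:.4f}")

# Predict with the optimal threshold
y_pred_optimal = (y_pred_proba_rf >= optimal_threshold).astype(int)
print(f"F1 after threshold tuning: {f1_score(y_test, y_pred_optimal):.4f}")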

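Overfitting and Regularization

Overfitting is one of the core challenges in machine learning: the model performs well on the training data but poorly on new data. In essence, the model has learned the noise in the training data, not just the real patterns.

Regularization: Constraining Model Complexity

Regularization is the main tool against overfitting. Its core idea is to penalize model complexity: the more complex the model, the higher the price it pays. This forces the model to balance fitting the data against staying simple.

L1 regularization (Lasso) adds ‖w‖₁ = ∑|w_i| to the loss function. Its hallmark is that it can drive some weights exactly to zero, which amounts to feature selection. If you have 100 features but only 20 are really useful, L1 regularization tends to push the weights of the unimportant features to 0. L1 is especially suited to high-dimensional sparse data, such as bag-of-words features in text classification.

L2 regularization (Ridge) adds ‖w‖₂² = ∑w_i² to the loss function. It does not make weights exactly zero, but it shrinks them. L2 is well suited to highly correlated features, such as multiple highly correlated gene expression features.

ElasticNet combines the strengths of L1 and L2: loss = MSE + α·‖w‖₁ + β·‖w‖₂². It can perform feature selection while staying stable when features are correlated.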
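
The penalties themselves, and the reason L1 produces exact zeros while L2 only shrinks, can be illustrated with a one-dimensional sketch. The function names are hypothetical. `soft_threshold(v, t)` is the minimizer of ½(w - v)² + t·|w|, and `ridge_shrink(v, t)` is the minimizer of ½(w - v)² + t·w², where v stands for the weight the unpenalized fit would choose.

```python
def l1_penalty(w):
    return sum(abs(wi) for wi in w)

def l2_penalty(w):
    return sum(wi * wi for wi in w)

def elastic_net_penalty(w, alpha, beta):
    """alpha * ||w||_1 + beta * ||w||_2^2"""
    return alpha * l1_penalty(w) + beta * l2_penalty(w)

def soft_threshold(v, t):
    """L1 solution: move v toward zero by t, and snap to exactly 0 if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ridge_shrink(v, t):
    """L2 solution: scale v down, never reaching exactly 0 unless v is 0."""
    return v / (1.0 + 2.0 * t)

w = [1.0, -2.0, 3.0]
print(l1_penalty(w), l2_penalty(w), elastic_net_penalty(w, alpha=0.5, beta=0.1))
for v in (0.3, 2.0):
    print(f"v={v}: L1 -> {soft_threshold(v, 0.5)}, L2 -> {ridge_shrink(v, 0.5)}")
```

A small weight such as 0.3 is set to exactly zero by the L1 rule but merely halved by the L2 rule, which is the mechanism behind Lasso's feature selection.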

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
 
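# Compare the effect of different regularization types.
# For illustration only: these linear regressors are fit to the 0/1 class labels
# of the earlier dataset, so score() below reports R², not accuracy.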
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
 
# Ridge regression (L2)
ridge_scores = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha, random_state=42)
    ridge.fit(X_train, y_train)
    ridge_scores.append(ridge.score(X_test, y_test))
 
# Lasso regression (L1)
lasso_scores = []
non_zero_features = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, random_state=42, max_iter=10000)
    lasso.fit(X_train, y_train)
    lasso_scores.append(lasso.score(X_test, y_test))
    non_zero_features.append(np.sum(lasso.coef_ != 0))
 
# ElasticNet regression
elasticnet_scores = []
for alpha in alphas:
    en = ElasticNet(alpha=alpha, l1_ratio=0.5, random_state=42, max_iter=10000)
    en.fit(X_train, y_train)
    elasticnet_scores.append(en.score(X_test, y_test))
 
# Plot the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
axes[0].semilogx(alphas, ridge_scores, 'b-o', label='Ridge (L2)')
axes[0].semilogx(alphas, lasso_scores, 'r-o', label='Lasso (L1)')
axes[0].semilogx(alphas, elasticnet_scores, 'g-o', label='ElasticNet')
axes[0].set_xlabel('Alpha')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Regularization Comparison')
axes[0].legend()
axes[0].grid(True)
 
axes[1].semilogx(alphas, non_zero_features, 'r-o')
axes[1].set_xlabel('Alpha')
axes[1].set_ylabel('Non-zero Features')
axes[1].set_title('L1 Feature Selection Effect')
axes[1].grid(True)
 
plt.tight_layout()
plt.savefig('regularization_comparison.png', dpi=150)

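Dropout: Regularization for Neural Networks

Dropout is one of the most important regularization techniques for neural networks. It works simply: during training, each neuron is "switched off" (its output set to 0) with probability p. Each step therefore trains a "thinned" network, and different combinations of neurons form different sub-networks. At prediction time all neurons take part. To keep the expected activations consistent, the original formulation multiplies outputs by (1-p) at prediction time; modern frameworks such as PyTorch use "inverted dropout" instead, scaling the surviving activations by 1/(1-p) during training so that nothing needs to change at prediction time.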
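
A framework-free sketch of inverted dropout on a list of activations may make the scaling clearer. The function name `dropout` and its arguments are assumptions for illustration; real frameworks apply the same idea to whole tensors.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training
    and scale the survivors by 1/(1-p); do nothing at prediction time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 100_000

train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False, rng=rng)

print(f"training mean: {sum(train_out) / len(train_out):.3f}")   # close to the original mean of 1.0
print(f"evaluation mean: {sum(eval_out) / len(eval_out):.3f}")
```

Roughly half of the training activations are zero and the rest are doubled, so the average stays near the original value and the next layer sees inputs on the same scale in both modes.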

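Why does dropout work? It prevents excessive co-adaptation between features. Imagine feature A always appears together with feature B: the network may come to rely too heavily on that combination. Dropout forces each neuron to learn useful features on its own rather than depend on the presence of other neurons.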

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
 
# Define a neural network with Dropout
class DropoutNet(nn.Module):
    def __init__(self, input_dim, dropout_rate=0.5):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(dropout_rate),  # randomly zeroes activations during training
            
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            
            nn.Linear(64, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        # Dropout is toggled by model.train() / model.eval(), not inside forward()
        return self.network(x)
 
# Handling Dropout in the training loop
def train_with_dropout(model, train_loader, test_loader, epochs=50):
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    train_losses = []
    test_losses = []
    
    for epoch in range(epochs):
        model.train()  # enable Dropout
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        
        train_losses.append(train_loss / len(train_loader))
        
        model.eval()  # disable Dropout for evaluation
        with torch.no_grad():
            test_loss = 0
            for X_batch, y_batch in test_loader:
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                test_loss += loss.item()
            test_losses.append(test_loss / len(test_loader))
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Train Loss={train_losses[-1]:.4f}, Test Loss={test_losses[-1]:.4f}")
    
    return train_losses, test_losses

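Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV

Model performance depends heavily on the choice of hyperparameters. Tuning by hand is time-consuming and easily misses good combinations. GridSearchCV and RandomizedSearchCV are two automatic tuning strategies.

GridSearchCV tries every parameter combination, so the computation explodes when the parameter space is large. For example, 3 parameters with 4 values each means 4³=64 combinations, and with 5-fold cross-validation that is 64×5=320 model fits. Its advantage is full coverage: it will not miss the best combination within the grid.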
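
The arithmetic above is easy to check by enumerating a grid yourself. The parameter names and values below are placeholders chosen for illustration; only the counting matters.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.7, 0.8, 0.9],
}
cv_folds = 5

# Every combination of one value per parameter (the Cartesian product)
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_fits} model fits")
print("first combination:", combinations[0])
```

Adding a fourth parameter with 4 values would multiply the count by 4 again, which is why exhaustive grids become impractical quickly.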

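RandomizedSearchCV samples a fixed number of points at random from the parameter space. It is particularly useful when the parameter space is large and each fit is relatively fast. Intuitively random search seems worse than grid search, but studies (notably Bergstra and Bengio, 2012) found that random search often finds results as good as or better than grid search in far less time, largely because only a few hyperparameters usually matter and random search tries more distinct values of each.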

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform

# GridSearchCV example
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

# This grid has 3×4×4×3×3=432 combinations (2160 fits with cv=5), which is a lot of training.
# In practice, start with a coarse grid and then refine it.
# use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here.
grid_search = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='auc'),
    param_grid_xgb,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

print("Starting GridSearchCV...")
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# RandomizedSearchCV example, suited to large parameter spaces
param_dist_xgb = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 12),
    'learning_rate': uniform(0.01, 0.29),  # uniform(loc, scale) covers [0.01, 0.30]
    'subsample': uniform(0.6, 0.35),       # [0.6, 0.95]
    'colsample_bytree': uniform(0.6, 0.35),
    'min_child_weight': randint(1, 10),
    'gamma': uniform(0, 0.5)
}

random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42, eval_metric='auc'),
    param_dist_xgb,
    n_iter=100,            # sample 100 random combinations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

print("\nStarting RandomizedSearchCV...")
random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")

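Hands-On: A Complete Pipeline

Finally, let's work through a complete example that ties the pieces above together. It covers data preprocessing, training and comparing several models, and model ensembling (voting, and stacking with internal cross-validation), with every model evaluated on a held-out test set.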

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')
 
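# Simulate a realistic scenario: binary classification for credit risk
# Assume we have both numeric and categorical features
np.random.seed(42)
n_samples = 5000

# Generate the data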
data = {
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.lognormal(10.5, 0.8, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'employment_years': np.random.exponential(5, n_samples),
    'loan_amount': np.random.lognormal(9, 1, n_samples),
    'loan_purpose': np.random.choice(['home', 'car', 'education', 'business', 'other'], n_samples),
    'education': np.random.choice(['high_school', 'bachelor', 'master', 'phd'], n_samples),
    'default': np.zeros(n_samples, dtype=int)
}
 
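# Generate the target variable (depends on several features).
# The coefficients are arbitrary and only create a learnable signal;
# their signs are not meant to reflect real credit risk.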
prob = 1/(1 + np.exp(-(
    -3 
    + 0.03 * data['income']/1000 
    - 0.01 * data['loan_amount']/10000
    + 0.005 * (data['credit_score'] - 600)
    - 0.2 * (data['loan_amount'] / data['income'])
    - 0.1 * (data['loan_purpose'] == 'business').astype(int)
    + 0.1 * (data['education'] == 'phd').astype(int)
)))
data['default'] = (np.random.random(n_samples) < prob).astype(int)
 
# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame(data)

# Define the feature groups
numeric_features = ['age', 'income', 'credit_score', 'employment_years', 'loan_amount']
categorical_features = ['loan_purpose', 'education']

# Split into training and test sets
X = df.drop('default', axis=1)
y = df['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Build the preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
 
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
 
# Define several base models
lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
 
rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42, n_jobs=-1))
])
 
xgb_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                                   random_state=42, eval_metric='auc'))
])
 
lgb_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', lgb.LGBMClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, 
                                       random_state=42, verbose=-1))
])
 
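# Voting ensemble
voting_clf = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('rf', rf),
        ('xgb', xgb_pipe),
        ('lgb', lgb_pipe)
    ],
    voting='soft'  # average predicted probabilities; usually works better than hard voting
)

# Stacking ensemble, with logistic regression as the meta-learner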
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', rf),
        ('xgb', xgb_pipe),
        ('lgb', lgb_pipe)
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    n_jobs=-1
)
 
# Train and evaluate
models = {
    'Logistic Regression': lr,
    'Random Forest': rf,
    'XGBoost': xgb_pipe,
    'LightGBM': lgb_pipe,
    'Voting Ensemble': voting_clf,
    'Stacking Ensemble': stacking_clf
}
 
results = {}
for name, model in models.items():
    print(f"\n{'='*50}")
    print(f"Training: {name}")
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    auc = roc_auc_score(y_test, y_pred_proba)
    results[name] = {'auc': auc, 'model': model}
    print(f"AUC-ROC: {auc:.4f}")
 
# Pick the best model
best_model_name = max(results, key=lambda x: results[x]['auc'])
print(f"\n{'='*50}")
print(f"Best Model: {best_model_name} with AUC={results[best_model_name]['auc']:.4f}")

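Summary

Supervised learning is the bedrock of machine learning, and this article has covered its most essential topics. From logistic regression to decision trees, from random forests to XGBoost and LightGBM, every algorithm has scenarios where it fits best.

In real work, my advice is to start with simple models (logistic regression, decision trees), set up a solid evaluation framework and pipeline, and then gradually try more complex models. Remember that a more complex model is not automatically better; the best model is the one that fits the characteristics of the data and the needs of the business.

Techniques such as handling class imbalance, regularization, and hyperparameter tuning can take your model from "usable" to "good" at critical moments. But what matters most is still understanding the data and engineering good features: garbage in, garbage out, and even the best algorithm cannot rescue bad data.

Hands-on practice is the best way to learn machine learning. Find some real datasets, run this code, change the parameters, and see for yourself how different choices affect the results. Happy learning!

This article is part of the Practical Machine Learning Guide series and covers the core algorithms and practical techniques of supervised learning.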
