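Model Evaluation and Cross-Validation in Practice

Model evaluation is an easily overlooked but critical part of machine learning. Many beginners train a model, glance at accuracy, and stop there, which is far from enough. You need to understand why a metric matters, which metric fits which scenario, and how to evaluate a model correctly. If evaluation is done poorly, any later optimization or hyperparameter tuning is built on sand.

This article starts with dataset splitting, then systematically covers evaluation metrics for classification and regression, takes a closer look at the bias-variance decomposition, and ends with a hands-on code walkthrough of a complete evaluation workflow.

Dataset Splitting: Training, Validation, and Test Sets

The first question in model evaluation is how to split the data. Many people go straight to a 70:30 train/test split and run the model. That is barely adequate on small datasets, and when you have plenty of data there are better options.

Basic Splitting Strategy

The basic approach splits the data three ways: a training set, a validation set, and a test set. The training set fits the model, the validation set is used for hyperparameter tuning and model selection, and the test set gives the final performance estimate. Keep the test set locked away until the very end, and use it only once.

Why go to this much trouble? To guard against overfitting and data leakage. If you evaluate on the training data, of course the model looks good: it has already seen those samples. The validation set tells you how the model behaves on "new" data so you can make sound tuning decisions. The test set must stay fully isolated because it stands in for the data the model will meet in the future.

For modest amounts of data (a few thousand to tens of thousands of rows), common ratios are 70:15:15 or 80:10:10. The more data you have, the smaller the validation and test fractions can be; with millions of rows, 1% each may be enough.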

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

# Basic split: hold out the test set first
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then carve the validation set out of the remaining data
# (0.15 of the remaining 80% is 12% of the full dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15, random_state=42, stratify=y_temp
)

print(f"Training set size: {len(X_train)} ({len(X_train)/len(X):.1%})")
print(f"Validation set size: {len(X_val)} ({len(X_val)/len(X):.1%})")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X):.1%})")
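
One detail of a two-step split is easy to miss: the second `test_size` is a fraction of what remains after the test set is removed, not of the full dataset. With `test_size=0.2` followed by `test_size=0.15`, the result is roughly 68/12/20 rather than 70/15/15. The helper below is a small sketch (the name `second_split_fraction` is made up for this illustration and is not part of scikit-learn) that converts target overall fractions into the value to pass to the second `train_test_split` call.

```python
def second_split_fraction(val_fraction: float, test_fraction: float) -> float:
    """Return the test_size for the second split so that the validation set
    ends up being `val_fraction` of the whole dataset.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    if not 0 < test_fraction < 1 or not 0 < val_fraction < 1 - test_fraction:
        raise ValueError("fractions must be positive and sum to less than 1")
    remaining = 1.0 - test_fraction  # share of the data left after the test split
    return val_fraction / remaining


# Target 70/15/15: hold out 15% for test, then take 15% of the total for validation.
frac = second_split_fraction(val_fraction=0.15, test_fraction=0.15)
print(f"second test_size = {frac:.4f}")
```

Passing the returned value as `test_size` in the second call gives a validation set of the intended overall size, up to rounding.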

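Stratified Sampling: Preserving the Class Distribution

For classification problems, stratified sampling is a must-have technique. A plain random split can leave some classes badly represented. With 1,000 samples of which 100 are positive and 900 negative, a random split can leave the test set with only a handful of positives (say 8), and a result based on so few positives is not representative.

The idea of stratified sampling is to draw samples proportionally within each class. If positives make up 10% of the data, they also make up 10% of the training, validation, and test sets.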

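from sklearn.model_selection import StratifiedKFold

# Check the effect of stratified sampling
print("\nOriginal class distribution:")
print(f"  Positive: {np.sum(y == 1)} ({np.mean(y):.1%})")
print(f"  Negative: {np.sum(y == 0)} ({1-np.mean(y):.1%})")

print("\nClass distribution after the stratified split:")
print(f"  Training set - positive: {np.sum(y_train == 1)} ({np.mean(y_train):.1%})")
print(f"  Validation set - positive: {np.sum(y_val == 1)} ({np.mean(y_val):.1%})")
print(f"  Test set - positive: {np.sum(y_test == 1)} ({np.mean(y_test):.1%})")

# What can go wrong without stratification
X_bad, X_bad_test, y_bad, y_bad_test = train_test_split(
    X, y, test_size=0.2, random_state=123  # try a different random seed
)
print(f"\nNon-stratified split - test set positives: {np.sum(y_bad_test == 1)} ({np.mean(y_bad_test):.1%})")
# This synthetic dataset is roughly balanced, so the drift is small here.
# With a rare positive class, an unstratified test set can end up with very few positives, or none.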
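
To make the idea of sampling within each class concrete, here is a minimal standard-library sketch of a stratified split. The function `stratified_split` is a hypothetical illustration, not how scikit-learn implements `stratify=`, which handles rounding and edge cases more carefully.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into (train, test) so each class keeps its proportion.

    Illustrative sketch only; scikit-learn's `stratify=` option is more careful
    about rounding and very small classes.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 100 positives and 900 negatives, as in the example in the text.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_positives = sum(labels[i] for i in test_idx)
print(len(test_idx), test_positives)
```

Because the test count is computed per class, the 20% test set always receives 20% of the positives, whatever the random seed.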

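Cross-Validation: A More Reliable Way to Evaluate

A single split is subject to chance: if the test set happens to contain a few oddball samples, the estimate is off. Cross-validation averages over several splits, which makes the estimate considerably more stable.

K-Fold Cross-Validation

K-fold cross-validation is the most widely used variant. Split the data into K parts (say 5); in each round, train on K-1 parts and validate on the remaining one, then average the K scores.

Choosing K involves a trade-off. With a small K, each model trains on less data, so the performance estimate tends to be pessimistic. With a large K, each validation fold is small, so per-fold scores are noisy, and the cost grows because K models must be trained. K=5 and K=10 are the usual rules of thumb: 10-fold trains on more data per round at a somewhat higher computational cost.

Leave-one-out is the special case K=N, where each round holds out a single sample. It sounds ideal, but it requires N training runs, so it is only practical on small datasets.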
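
The mechanics of K-fold are simple enough to write out by hand. The sketch below generates the train and validation indices for each fold without shuffling; `k_fold_indices` is an illustrative name, and in practice scikit-learn's `KFold` does this job.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds.

    Minimal sketch without shuffling; fold sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, val in folds:
    print(f"train={train} val={val}")

# Leave-one-out is the k == n_samples case: every validation fold has one sample.
loo = list(k_fold_indices(n_samples=4, k=4))
```

Every sample lands in exactly one validation fold, which is why the K scores together cover the whole dataset.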

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier  # used in the model comparison below
import matplotlib.pyplot as plt

# Basic K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Score for each fold
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"K-fold cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Compare different models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

results = {}
for name, clf in models.items():
    # For a classifier, an integer cv means stratified K-fold without shuffling
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    results[name] = scores
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Visualize the cross-validation results
plt.figure(figsize=(10, 6))
plt.boxplot([results[name] for name in results])
# Set tick labels via xticks, which works regardless of Matplotlib version
plt.xticks(range(1, len(results) + 1), list(results.keys()))
plt.ylabel('Accuracy')
plt.title('Cross-Validation Results Comparison')
plt.grid(True, alpha=0.3)
plt.savefig('cv_comparison.png', dpi=150)

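Stratified K-Fold: Keeping Classes Balanced

Stratified K-fold adds one safeguard on top of K-fold: each fold keeps the same class proportions as the full dataset. This matters most for imbalanced data.

Imagine 1,000 samples with 990 negatives and 10 positives. With plain 5-fold, some fold may contain only one positive, or none, and its score is meaningless. Stratified 5-fold puts exactly 2 positives in every fold.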

# Compare plain K-fold with stratified K-fold
print("\n===== Plain K-fold vs. stratified K-fold =====\n")

for fold_idx, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X, y)):
    y_fold = y[val_idx]
    print(f"Plain K-fold, fold {fold_idx+1}: positive ratio = {y_fold.mean():.3f}")

print()

for fold_idx, (train_idx, val_idx) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    y_fold = y[val_idx]
    print(f"Stratified K-fold, fold {fold_idx+1}: positive ratio = {y_fold.mean():.3f}")

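Repeated Stratified Cross-Validation

For a more robust estimate, use repeated cross-validation: run K-fold several times with different random partitions and average the results. For example, 5 folds repeated 10 times gives 50 scores, and their mean depends much less on any single random partition.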

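from sklearn.model_selection import RepeatedStratifiedKFold

# Repeated stratified K-fold: 5 folds x 10 repeats = 50 evaluations
rskf = RepeatedStratifiedKFold(
    n_splits=5,
    n_repeats=10,
    random_state=42
)

scores_repeated = cross_val_score(model, X, y, cv=rskf, scoring='accuracy')
print("Repeated cross-validation (5 folds x 10 repeats)")
print(f"Mean score: {scores_repeated.mean():.4f}")
print(f"Standard deviation: {scores_repeated.std():.4f}")
# mean +/- 1.96*std describes the spread of individual fold scores.
# It is not a confidence interval for the mean: fold scores share
# training data, so they are correlated.
print(f"Approx. 95% range of fold scores: [{scores_repeated.mean() - 1.96*scores_repeated.std():.4f}, "
      f"{scores_repeated.mean() + 1.96*scores_repeated.std():.4f}]")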

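Classification Metrics in Detail

Now for the core material: evaluation metrics for classification. Accuracy alone is not enough, because it can be badly misleading when classes are imbalanced.

Confusion Matrix: The Foundation

The confusion matrix is the starting point for evaluating a classifier. It sorts predictions into four groups:

TP (True Positive): actually positive, predicted positive
TN (True Negative): actually negative, predicted negative
FP (False Positive): actually negative, predicted positive
FN (False Negative): actually positive, predicted negative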
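
The four counts are easy to compute by hand, and doing so shows why accuracy misleads on imbalanced data. The following sketch uses only the standard library; `confusion_counts` is an illustrative helper, and the "always predict negative" model is a made-up worst case.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0/1."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, tn, fp, fn


# 990 negatives, 10 positives, and a "model" that always predicts negative.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

The model scores 99% accuracy while finding none of the positives, which is exactly the failure that the confusion matrix makes visible.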

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns

# Train the model and predict
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw-count confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')
axes[0].set_title('Confusion Matrix (Counts)')

# Row-normalized confusion matrix (the diagonal is the recall of each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=axes[1])
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')
axes[1].set_title('Confusion Matrix (Normalized by Row)')

plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)

# Derive the metrics from the confusion matrix
# (for binary labels 0/1, ravel() returns TN, FP, FN, TP in that order)
TN, FP, FN, TP = cm.ravel()
print(f"\nDetailed counts:")
print(f"TN: {TN}, FP: {FP}")
print(f"FN: {FN}, TP: {TP}")
print(f"Accuracy: {(TP+TN)/(TP+TN+FP+FN):.4f}")

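Accuracy, Precision, and Recall

These three are the most basic metrics. Accuracy, as you already know, is the fraction of predictions that are correct. Under class imbalance it can fool you.

Precision answers the question: of the samples predicted positive, how many really are positive? Formula: TP/(TP+FP). High precision means few false positives; the model would rather miss a positive than raise a false alarm. In spam filtering (with spam as the positive class), high precision means legitimate mail is rarely flagged as spam.

Recall answers the question: of all actual positives, how many did the model find? Formula: TP/(TP+FN). High recall means few false negatives; the model tries to catch every positive. In disease screening, high recall means few patients are missed.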

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

print("\n===== Basic classification metrics =====")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")

# The precision-recall trade-off:
# lowering the threshold raises recall but usually lowers precision
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
precisions = []
recalls = []
accuracies = []

for threshold in thresholds:
    y_pred_thresh = (y_pred_proba >= threshold).astype(int)
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_thresh))
    accuracies.append(accuracy_score(y_test, y_pred_thresh))

plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, 'b-o', linewidth=2, markersize=8)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Trade-off')
plt.grid(True, alpha=0.3)

# Annotate a few selected thresholds
for threshold in [0.3, 0.5, 0.7]:
    idx = thresholds.index(threshold)
    plt.annotate(f'θ={threshold}',
                (recalls[idx], precisions[idx]),
                textcoords="offset points",
                xytext=(10, 10),
                arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

plt.savefig('precision_recall_tradeoff.png', dpi=150)

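F1 Score: The Harmonic Mean of Precision and Recall

The F1 score is the harmonic mean of precision and recall: 2×(Precision×Recall)/(Precision+Recall). The harmonic mean is pulled toward the smaller value, so if either one is low, F1 is low too.

F1 is a common choice for imbalanced datasets, but it has two limitations. First, it ignores true negatives (TN), so it says nothing about how well the model rules out negatives. Second, it weights precision and recall equally, whereas real applications often care more about one of them.

The F-beta score lets you adjust the weighting: F_beta = (1+beta²)×(P×R)/(beta²×P+R). With beta>1 recall counts more; with beta<1 precision counts more. For example, beta=0.5 suits a search engine, where precision matters more, and beta=2 suits disease screening, where recall matters more.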
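
The effect of beta is easiest to see on a concrete pair of values. The sketch below implements the F-beta formula exactly as written above for an assumed model with precision 0.5 and recall 1.0; the function name `f_beta` and the example numbers are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# An assumed model with low precision but perfect recall.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta}: {f_beta(p, r, beta):.4f}")
```

As beta grows, the score moves toward the recall of 1.0; as beta shrinks, it moves toward the precision of 0.5.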

from sklearn.metrics import fbeta_score, f1_score

# F-beta scores for several beta values
betas = [0.25, 0.5, 1, 2, 4]
for beta in betas:
    fbeta = fbeta_score(y_test, y_pred, beta=beta)
    print(f"F{beta}: {fbeta:.4f}")

# Per-class report (the same function also handles multiclass problems)
from sklearn.metrics import classification_report

print("\n===== Classification report =====")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

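AUC-ROC: Measuring How Well a Classifier Separates Classes

AUC-ROC (Area Under the ROC Curve) is one of the most widely used metrics for binary classification. The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the true positive rate (TPR, the same as recall) and the false positive rate (FPR = 1 - specificity) as the threshold varies.

The horizontal axis is FPR and the vertical axis is TPR. A random classifier traces the diagonal; a perfect classifier's curve passes through the top-left corner (FPR=0, TPR=1). AUC is the area under the curve and ranges from 0 to 1: 1 means perfect separation, 0.5 means random guessing, and values below 0.5 mean the ranking is worse than random.

AUC has two advantages. First, it does not depend on a threshold, so it reflects the model's overall ability to separate the classes. Second, it does not depend on the class ratio, because it compares how samples are ranked rather than their absolute predicted values. The flip side, covered in the section on PR curves vs. ROC curves, is that it can look optimistic when positives are very rare.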
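
The ranking interpretation can be checked directly: AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below computes it by brute force on made-up scores; `auc_by_ranking` is an illustrative helper, not a library function.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. O(n_pos * n_neg), fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
y_true = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
auc_value = auc_by_ranking(y_true, scores)
print(f"AUC = {auc_value:.4f}")

# Duplicating every negative changes the class ratio but not the AUC.
y_more_neg = y_true + [0, 0, 0, 0]
scores_more_neg = scores + [0.7, 0.3, 0.2, 0.1]
auc_more_neg = auc_by_ranking(y_more_neg, scores_more_neg)
print(f"AUC with doubled negatives = {auc_more_neg:.4f}")
```

Both values are identical, which illustrates why AUC is unaffected by the class ratio: only the relative order of positives and negatives matters.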

from sklearn.metrics import roc_curve, auc, roc_auc_score
 
# Compute the ROC curve
fpr, tpr, thresholds_roc = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
 
plt.figure(figsize=(10, 8))
 
# Main ROC curve
plt.subplot(2, 2, 1)
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')
plt.fill_between(fpr, tpr, alpha=0.3)
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
 
# Annotate key thresholds (using the nearest threshold available on the curve)
key_thresholds = [0.3, 0.5, 0.7]
for thresh in key_thresholds:
    idx = np.argmin(np.abs(thresholds_roc - thresh))
    plt.annotate(f'θ={thresh}', 
                (fpr[idx], tpr[idx]),
                textcoords="offset points", 
                xytext=(-30, -20),
                arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
 
# ROC curves for different models
plt.subplot(2, 2, 2)
 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
 
models_compare = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
 
for name, clf in models_compare.items():
    clf.fit(X_train, y_train)
    if hasattr(clf, 'predict_proba'):
        y_proba = clf.predict_proba(X_test)[:, 1]
    else:
        y_proba = clf.decision_function(X_test)
    
    fpr_m, tpr_m, _ = roc_curve(y_test, y_proba)
    auc_m = auc(fpr_m, tpr_m)
    plt.plot(fpr_m, tpr_m, linewidth=2, label=f'{name} (AUC={auc_m:.3f})')
 
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
 
# Precision-Recall curve
plt.subplot(2, 2, 3)
from sklearn.metrics import precision_recall_curve, average_precision_score
 
precision_pr, recall_pr, thresholds_pr = precision_recall_curve(y_test, y_pred_proba)
ap = average_precision_score(y_test, y_pred_proba)
 
plt.plot(recall_pr, precision_pr, 'b-', linewidth=2, label=f'PR (AP = {ap:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
 
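# Summary panel: when positives are rare, the PR curve is more sensitive than ROC
plt.subplot(2, 2, 4)
plt.axis('off')  # this panel holds text only
plt.text(0.5, 0.5,
         f'AUC-ROC: {roc_auc:.4f}\n'
         f'AP: {ap:.4f}\n\n'
         'AUC-ROC: threshold-independent\n'
         'AP: focuses on positive-class quality',
         ha='center', va='center', fontsize=14,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))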
 
plt.tight_layout()
plt.savefig('roc_pr_curves.png', dpi=150)
 
print(f"\n===== Key metrics =====")
print(f"AUC-ROC: {roc_auc:.4f}")
print(f"Average Precision (AP): {ap:.4f}")

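PR Curves vs. ROC Curves

The difference between PR curves and ROC curves deserves special attention. Many people rely on ROC-AUC alone, but on heavily imbalanced data the ROC curve can paint too rosy a picture.

Suppose a dataset has 10,000 rows, only 100 of them positive. At some threshold the model catches 50 of the 100 positives while wrongly flagging 1% of the 9,900 negatives, which is 99 rows. On the ROC curve this is the point (FPR=0.01, TPR=0.5), and a model that reaches such points can still post a high ROC-AUC, because the false positive rate is measured against the huge pool of negatives. Precision, however, is only 50/(50+99), about 34%, and recall is only 50%, which may be unacceptable in practice. The PR curve and its AP (Average Precision) summary expose this; ROC-AUC largely hides it.

PR curves are the better fit when positives are rare (fraud detection, disease screening) or when false alarms are costly (a recommender whose "you might like" turns into pushing irrelevant items).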
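
The arithmetic behind this effect fits in a few lines. The sketch below takes one assumed ROC operating point (50% recall at a 1% false positive rate) and shows the precision it implies on a balanced dataset and on a 1% positive dataset; the function name and the operating point are assumptions for illustration.

```python
def roc_point_and_precision(n_pos, n_neg, tpr, fpr):
    """Return (TP, FP, precision) implied by a single ROC operating point."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return tp, fp, precision


# Assumed operating point: 50% recall at a 1% false positive rate.
rows = {}
for n_pos, n_neg in [(5000, 5000), (100, 9900)]:
    tp, fp, precision = roc_point_and_precision(n_pos, n_neg, tpr=0.5, fpr=0.01)
    rows[(n_pos, n_neg)] = (tp, fp, precision)
    print(f"pos={n_pos:5d} neg={n_neg:5d} TP={tp:.0f} FP={fp:.0f} precision={precision:.3f}")
```

The same point on the ROC curve corresponds to very different precision once positives become rare, which is the information a PR curve adds.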

# ROC vs. PR on imbalanced data
print("\n===== Imbalanced data test =====\n")

# Create data with a 9:1 class imbalance
X_imba, y_imba = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% negative, 10% positive
    random_state=42
)

X_train_imba, X_test_imba, y_train_imba, y_test_imba = train_test_split(
    X_imba, y_imba, test_size=0.3, random_state=42, stratify=y_imba
)

# Train the model
rf_imba = RandomForestClassifier(n_estimators=100, random_state=42)
rf_imba.fit(X_train_imba, y_train_imba)
y_pred_imba = rf_imba.predict(X_test_imba)
y_proba_imba = rf_imba.predict_proba(X_test_imba)[:, 1]

# Compute the metrics
print(f"Positive ratio: {y_test_imba.mean():.2%}")
print(f"Accuracy: {accuracy_score(y_test_imba, y_pred_imba):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_test_imba, y_proba_imba):.4f}")
print(f"Average Precision: {average_precision_score(y_test_imba, y_proba_imba):.4f}")
print(f"\nConfusion matrix:")
print(confusion_matrix(y_test_imba, y_pred_imba))
print(f"\nPositive-class recall: {recall_score(y_test_imba, y_pred_imba):.4f}")
print(f"Positive-class precision: {precision_score(y_test_imba, y_pred_imba):.4f}")

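Regression Metrics in Detail

With classification covered, on to regression. Regression predicts continuous values, so it is evaluated quite differently.

MSE, RMSE, MAE

MSE (Mean Squared Error) is the most common regression metric: MSE = (1/n)∑(y_i - ŷ_i)². It penalizes large errors more heavily: doubling an error quadruples its contribution. That is an advantage when large deviations are especially costly, and a drawback when the data contain outliers.

RMSE (Root Mean Squared Error) is the square root of MSE. It is in the same units as the target, which makes it easier to interpret. For house prices, an RMSE of 20,000 means the typical prediction error is on the order of 20,000, with large errors weighted more heavily than in a plain average.

MAE (Mean Absolute Error): MAE = (1/n)∑|y_i - ŷ_i|. It weights all errors linearly, so a few very large errors do not make it spike. MAE is more robust on data with outliers.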
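
A tiny hand-made example shows how differently the three metrics react to one large error. The functions below are plain standard-library versions of the formulas above, and the numbers are invented for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 9.0, 11.0]
y_pred_small = [11.0, 11.0, 10.0, 10.0]    # every error has size 1
y_pred_outlier = [11.0, 11.0, 10.0, 21.0]  # one error has size 10

for name, y_pred in [("small errors", y_pred_small), ("one outlier", y_pred_outlier)]:
    print(f"{name}: MSE={mse(y_true, y_pred):.2f} "
          f"RMSE={rmse(y_true, y_pred):.2f} MAE={mae(y_true, y_pred):.2f}")
```

A single error of size 10 raises MAE from 1 to 3.25 but MSE from 1 to 25.75, which is the sensitivity to outliers described above.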

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
import pandas as pd

# Generate regression data
np.random.seed(42)
X_reg = np.random.randn(1000, 5)
y_reg = 3*X_reg[:, 0] + 2*X_reg[:, 1] - X_reg[:, 2] + np.random.randn(1000)*0.5

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train the model
lr = LinearRegression()
lr.fit(X_train_reg, y_train_reg)
y_pred_reg = lr.predict(X_test_reg)

# Compute the regression metrics
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print("===== Regression metrics =====")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

# Interpreting R²: the share of the target's variance that the model explains
print(f"\nR² interpretation: the model explains {r2*y_test_reg.var():.4f} of the total variance {y_test_reg.var():.4f}")

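MAPE: Percentage Error

MAPE (Mean Absolute Percentage Error) is another common regression metric that expresses the error as a percentage: MAPE = (1/n)∑|y_i - ŷ_i|/|y_i|×100%. Because it is scale-free, it can be compared across datasets.

MAPE has one serious problem: it blows up when a true value is zero or close to zero. Use it only when all true values are well away from zero, for example prices or incomes.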
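
A short sketch makes the near-zero problem visible. The `mape` function below is an illustrative standard-library version that returns a percentage and refuses zero targets; scikit-learn offers `mean_absolute_percentage_error`, which returns a fraction rather than a percentage. The input values are invented.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent.

    Raises ValueError if any true value is zero, where MAPE is undefined.
    """
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Both predictions are off by 10% of the true value.
well_scaled = mape([100.0, 200.0], [110.0, 180.0])
print(f"MAPE, targets far from zero: {well_scaled:.1f}%")

# One true value is close to zero: a small absolute error dominates the result.
near_zero = mape([0.01, 200.0], [1.0, 180.0])
print(f"MAPE, one target near zero: {near_zero:.1f}%")
```

In the second case an absolute error of less than 1 on a target of 0.01 contributes 9,900% on its own, which swamps every other sample.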

# Add outliers to test how robust each metric is
np.random.seed(123)
y_test_with_outliers = y_test_reg.copy()
outlier_indices = np.random.choice(len(y_test_with_outliers), size=5, replace=False)
y_test_with_outliers[outlier_indices] *= 10  # scale 5 true values by 10 to create outliers

y_pred_outliers = y_pred_reg.copy()

# Compute the metrics
mse_out = mean_squared_error(y_test_with_outliers, y_pred_outliers)
mae_out = mean_absolute_error(y_test_with_outliers, y_pred_outliers)

print("\n===== Outlier impact test =====")
print(f"Clean data    - MSE: {mse:.4f}, MAE: {mae:.4f}")
print(f"With outliers - MSE: {mse_out:.4f} (↑{((mse_out/mse)-1)*100:.0f}%), MAE: {mae_out:.4f} (↑{((mae_out/mae)-1)*100:.0f}%)")
print("Conclusion: MSE is more sensitive to outliers; MAE is more robust")

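R²: Explanatory Power

R² (the coefficient of determination) measures how much of the target's variance the model explains. R²=0.8 means 80% of the variance is explained. That sounds good, but R² alone does not tell you how large the gap between predictions and true values is.

R² can be negative. If the model does worse than always predicting the mean, R² drops below zero. This happens with a badly mis-specified model or seriously flawed data.

Adjusted R² accounts for the number of features. On the training data, plain R² never decreases when a feature is added to a linear model, even a useless one. Adjusted R² penalizes features that do not help, which makes it a more useful guide when comparing feature sets.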
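
The text does not spell out the formula, so here is a sketch using the standard definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The function name and the example numbers are assumptions for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical comparison: 40 extra features buy only a small gain in plain R².
small_model = adjusted_r2(r2=0.80, n_samples=101, n_features=10)
large_model = adjusted_r2(r2=0.82, n_samples=101, n_features=50)
print(f"10 features: R²=0.80 -> adjusted R²={small_model:.4f}")
print(f"50 features: R²=0.82 -> adjusted R²={large_model:.4f}")
```

The larger model wins on plain R² but loses clearly on adjusted R², because the penalty for the extra features outweighs the small gain.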

# Visualize predictions vs. true values
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
 
# Scatter plot
axes[0].scatter(y_test_reg, y_pred_reg, alpha=0.5)
axes[0].plot([y_test_reg.min(), y_test_reg.max()], 
             [y_test_reg.min(), y_test_reg.max()], 'r--', linewidth=2)
axes[0].set_xlabel('True Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title(f'Predicted vs True (R²={r2:.3f})')
 
# Residual distribution
residuals = y_test_reg - y_pred_reg
axes[1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
 
# Absolute error vs. true values
axes[2].scatter(y_test_reg, np.abs(residuals), alpha=0.5)
axes[2].set_xlabel('True Values')
axes[2].set_ylabel('Absolute Error')
axes[2].set_title('Error vs True Values')
 
plt.tight_layout()
plt.savefig('regression_evaluation.png', dpi=150)

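Bias-Variance Decomposition: Understanding Model Error

Understanding the bias-variance decomposition is key to getting good at machine learning. It explains why a model can do well on the training set and poorly on the test set, and how to balance bias against variance.

The Mathematical Decomposition

For squared-error loss, a model's generalization error splits into three parts:

Bias: the gap between the expected prediction and the true value. High bias means the model is too simple to capture the pattern in the data (underfitting).
Variance: how much the prediction changes across different training sets. High variance means the model is too complex and overly sensitive to small changes in the training data (overfitting).
Noise: the random error inherent in the data, a floor that cannot be removed.

Generalization error = Bias² + Variance + Noise

Bias and variance tend to move in opposite directions: simple models have high bias and low variance (underfitting), while complex models have low bias and high variance (overfitting). Much of machine learning is finding the best point in this trade-off.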
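
The decomposition can be checked numerically on a toy problem. The sketch below estimates a known mean with a deliberately shrunk estimator, `shrink * sample_mean`, and measures bias² and variance over many simulated training sets. Everything here (the estimator, the parameter values, the function name) is an assumption chosen for illustration. Because the error is measured against the true mean rather than a noisy observation, the noise term does not appear.

```python
import random
import statistics


def simulate_estimator(shrink, true_mean=2.0, noise_sd=1.0, n=10, n_trials=5000, seed=0):
    """Estimate bias^2, variance and MSE of `shrink * sample_mean` by simulation."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    bias_sq = (statistics.fmean(estimates) - true_mean) ** 2
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
    return bias_sq, variance, mse


results = {shrink: simulate_estimator(shrink) for shrink in (1.0, 0.8, 0.5)}
for shrink, (bias_sq, variance, mse) in results.items():
    print(f"shrink={shrink}: bias^2={bias_sq:.4f} variance={variance:.4f} "
          f"sum={bias_sq + variance:.4f} mse={mse:.4f}")
```

Stronger shrinkage lowers the variance and raises the bias, and in every row bias² plus variance matches the measured mean squared error.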

# Demonstrate the bias-variance trade-off
from sklearn.tree import DecisionTreeRegressor
 
def plot_bias_variance_tradeoff():
    """Show the bias and variance of models with different complexity."""
    np.random.seed(42)
    
    # The true function is nonlinear
    def true_function(x):
        return np.sin(x * 2 * np.pi)
    
    n_samples = 100
    x_train_all = np.sort(np.random.uniform(0, 1, n_samples))
    y_train_all = true_function(x_train_all) + np.random.normal(0, 0.3, n_samples)
    
    depths = [1, 2, 3, 5, 10, 20]  # tree depth stands in for model complexity
    n_simulations = 100
    
    fig, axes = plt.subplots(2, len(depths), figsize=(20, 8))
    
    # Row 1: fits of trees with different depths across resampled training sets
    # Row 2: estimated bias^2 and variance for each depth
    
    # Evaluation grid shared by all depths and simulations
    x_range = np.linspace(0, 1, 200).reshape(-1, 1)
    
    predictions_all = []
    
    for idx, depth in enumerate(depths):
        predictions = []
        
        for sim in range(n_simulations):
            # Draw a bootstrap sample (with replacement) as this run's training set
            choice = np.random.choice(n_samples, size=50, replace=True)
            x_sim = x_train_all[choice]
            y_sim = y_train_all[choice]
            
            tree = DecisionTreeRegressor(max_depth=depth, random_state=sim)
            tree.fit(x_sim.reshape(-1, 1), y_sim)
            
            # Predict over the whole x range
            pred = tree.predict(x_range)
            predictions.append(pred)
        
        predictions = np.array(predictions)
        predictions_all.append(predictions)
        
        # Plot predictions from several simulations (shows the variance)
        axes[0, idx].plot(x_range, true_function(x_range), 'g-', linewidth=2, label='True Function')
        for p in predictions[:10]:  # draw only 10 curves
            axes[0, idx].plot(x_range, p, 'b-', alpha=0.3)
        
        # Plot the mean prediction and a ±2σ band
        mean_pred = predictions.mean(axis=0)
        std_pred = predictions.std(axis=0)
        axes[0, idx].fill_between(x_range.flatten(), 
                                  mean_pred - 2*std_pred, 
                                  mean_pred + 2*std_pred, 
                                  alpha=0.2, color='red', label='±2σ')
        axes[0, idx].plot(x_range, mean_pred, 'r-', linewidth=2, label='Mean Prediction')
        
        axes[0, idx].set_title(f'Depth={depth}')
        axes[0, idx].set_ylim(-2, 2)
        
        # Compute bias^2 and variance, averaged over the x grid
        true_values = true_function(x_range.flatten())
        bias_sq = (mean_pred - true_values) ** 2
        variance = predictions.var(axis=0)
        
        axes[1, idx].bar(['Bias²', 'Variance'], 
                         [bias_sq.mean(), variance.mean()],
                         color=['blue', 'orange'])
        axes[1, idx].set_title(f'Bias²={bias_sq.mean():.3f}, Var={variance.mean():.3f}')
    
    axes[0, 0].set_ylabel('Prediction')
    axes[1, 0].set_ylabel('Error Components')
    
    plt.suptitle('Bias-Variance Tradeoff in Decision Trees', fontsize=14)
    plt.tight_layout()
    plt.savefig('bias_variance_tradeoff.png', dpi=150)
 
plot_bias_variance_tradeoff()

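Learning Curves: Diagnosing Overfitting and Underfitting

Learning curves are a good tool for diagnosing model problems. They show how the training score and the validation score change as the training set grows.

Signs of overfitting: the training score is high and stable while the validation score is clearly lower, leaving a large gap between the two. Adding data usually narrows the gap as the validation score climbs.

Signs of underfitting: both the training and validation scores are low, and they converge to the same value as data is added.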

from sklearn.model_selection import learning_curve
 
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5, n_jobs=-1):
    """Plot a learning curve."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    train_sizes = np.linspace(0.1, 1.0, 10)
    
    train_sizes_abs, train_scores, test_scores = learning_curve(
        estimator, X, y, 
        train_sizes=train_sizes,
        cv=cv,
        n_jobs=n_jobs,
        scoring='accuracy',
        random_state=42
    )
    
    train_scores_mean = train_scores.mean(axis=1)
    train_scores_std = train_scores.std(axis=1)
    test_scores_mean = test_scores.mean(axis=1)
    test_scores_std = test_scores.std(axis=1)
    
    # Learning curve
    axes[0].fill_between(train_sizes_abs, 
                         train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color='blue')
    axes[0].fill_between(train_sizes_abs, 
                         test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color='orange')
    axes[0].plot(train_sizes_abs, train_scores_mean, 'o-', color='blue', label='Training Score')
    axes[0].plot(train_sizes_abs, test_scores_mean, 'o-', color='orange', label='Validation Score')
    
    axes[0].set_xlabel('Training Examples')
    axes[0].set_ylabel('Score')
    axes[0].set_title(title)
    axes[0].legend(loc='best')
    axes[0].grid(True, alpha=0.3)
    
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    
    # axes[1] is left empty here; it can hold a validation curve
    # (the effect of one hyperparameter on performance)
    return fig, axes
 
# Compare learning curves across models
# (the labels in parentheses are the behavior one would expect, not a measured result)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
 
models_compare = [
    ('Logistic Regression (underfitting)', LogisticRegression(max_iter=500, random_state=42)),
    ('Decision Tree (overfitting)', DecisionTreeClassifier(random_state=42)),
    ('Random Forest (balanced)', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
]
 
train_sizes = np.linspace(0.1, 1.0, 8)
 
for ax, (title, model) in zip(axes, models_compare):
    train_sizes_abs, train_scores, test_scores = learning_curve(
        model, X, y, 
        train_sizes=train_sizes,
        cv=5,
        n_jobs=-1,
        scoring='accuracy',
        random_state=42
    )
    
    train_scores_mean = train_scores.mean(axis=1)
    train_scores_std = train_scores.std(axis=1)
    test_scores_mean = test_scores.mean(axis=1)
    test_scores_std = test_scores.std(axis=1)
    
    ax.fill_between(train_sizes_abs, 
                    train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha=0.1, color='blue')
    ax.fill_between(train_sizes_abs, 
                    test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha=0.1, color='orange')
    ax.plot(train_sizes_abs, train_scores_mean, 'o-', color='blue', label='Training')
    ax.plot(train_sizes_abs, test_scores_mean, 'o-', color='orange', label='Validation')
    
    ax.set_xlabel('Training Examples')
    ax.set_ylabel('Score')
    ax.set_title(title)
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0.5, 1.1)
 
plt.tight_layout()
plt.savefig('learning_curves_comparison.png', dpi=150)

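Validation Curves: A Compass for Tuning

A validation curve shows how one hyperparameter affects model performance. Used together with learning curves, it lets you diagnose problems more precisely.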

from sklearn.model_selection import validation_curve
 
# Validation curve for the decision tree's max_depth
max_depth_range = range(1, 30)
 
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name='max_depth',
    param_range=max_depth_range,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
 
train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
 
plt.figure(figsize=(12, 5))
 
plt.fill_between(max_depth_range, 
                 train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color='blue')
plt.fill_between(max_depth_range, 
                 test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color='orange')
plt.plot(max_depth_range, train_scores_mean, 'o-', color='blue', label='Training')
plt.plot(max_depth_range, test_scores_mean, 'o-', color='orange', label='Validation')
 
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.title('Validation Curve for Decision Tree (max_depth)')
plt.legend()
plt.grid(True, alpha=0.3)
 
# Mark the best depth
best_depth = max_depth_range[np.argmax(test_scores_mean)]
best_score = test_scores_mean.max()
plt.axvline(x=best_depth, color='r', linestyle='--', alpha=0.5)
plt.annotate(f'Best: depth={best_depth}\nScore={best_score:.3f}',
            xy=(best_depth, best_score),
            xytext=(best_depth+3, best_score-0.02),
            arrowprops=dict(arrowstyle='->', color='red'))
 
plt.savefig('validation_curve_decision_tree.png', dpi=150)
 
print(f"Best max_depth: {best_depth}")

A Complete Evaluation Pipeline in Practice

To finish, here is a complete evaluation exercise that mimics the workflow of a real project.
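
The evaluation report in this section includes the Brier score (via scikit-learn's `brier_score_loss`), which the article has not introduced. For binary labels it is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. The sketch below is a standard-library version with invented probabilities.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # correct and confident
hedging = [0.6, 0.4, 0.6, 0.4]    # correct at a 0.5 threshold, but barely
constant = [0.5, 0.5, 0.5, 0.5]   # uninformative baseline

for name, probs in [("confident", confident), ("hedging", hedging), ("constant", constant)]:
    print(f"{name}: Brier = {brier_score(y_true, probs):.3f}")
```

The first two sets of probabilities give identical hard predictions at a 0.5 threshold, yet the Brier score separates them, which is why it is a useful complement to threshold-based metrics.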

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, confusion_matrix, 
                             roc_auc_score, precision_recall_curve,
                             average_precision_score, brier_score_loss,
                             log_loss)
import pandas as pd
 
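# Build the evaluation report function
def comprehensive_evaluation(y_true, y_pred, y_pred_proba, model_name, threshold=0.5):
    """Build a comprehensive evaluation report for a binary classifier.

    `threshold` is only recorded in the report; `y_pred` must already have
    been produced at that threshold.
    """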
    
    report = {
        'Model': model_name,
        'Threshold': threshold,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
        'AUC-ROC': roc_auc_score(y_true, y_pred_proba),
        'Average Precision': average_precision_score(y_true, y_pred_proba),
        'Brier Score': brier_score_loss(y_true, y_pred_proba)
    }
    
    return report
 
# Full evaluation across several models
# XGBoost is a third-party package (pip install xgboost); import it before use
from xgboost import XGBClassifier
 
print("=" * 60)
print("           Comprehensive Model Evaluation Report")
print("=" * 60)
 
models_eval = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42, n_jobs=-1),
    # use_label_encoder is omitted: it is deprecated in recent XGBoost releases
    'XGBoost': XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42,
                              eval_metric='auc')
}
 
results = []
for name, model in models_eval.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    result = comprehensive_evaluation(y_test, y_pred, y_proba, name)
    results.append(result)
 
# Build the comparison table
results_df = pd.DataFrame(results)
results_df = results_df.set_index('Model')
print("\n" + results_df.to_string())
 
# Find the best model for each metric
print("\n" + "=" * 60)
print("           Best Model per Metric")
print("=" * 60)
best_models = {}
for col in results_df.columns:
    if col != 'Threshold':
        # Brier Score is an error measure, so lower is better
        if col == 'Brier Score':
            best_model = results_df[col].idxmin()
            best_value = results_df[col].min()
        else:
            best_model = results_df[col].idxmax()
            best_value = results_df[col].max()
        best_models[col] = (best_model, best_value)
        print(f"{col:20s}: {best_model:20s} ({best_value:.4f})")
 
# Composite score (weighted average)
print("\n" + "=" * 60)
print("           Composite Score (Weighted)")
print("=" * 60)
 
# Assign a weight to each metric (adjust these to the business goal)
weights = {
    'Accuracy': 0.1,
    'Precision': 0.15,
    'Recall': 0.2,
    'F1': 0.2,
    'AUC-ROC': 0.2,
    'Average Precision': 0.1,
    'Brier Score': 0.05  # lower is better for Brier Score
}
 
# Min-max scale each metric to 0-1 across models, inverting Brier Score.
# Note: with only a few models, min-max scaling maps the worst model to 0 and
# the best to 1 on every metric, however small the real difference is.
normalized_scores = results_df.copy()
for col, weight in weights.items():
    if col == 'Brier Score':
        normalized_scores[col] = 1 - (results_df[col] - results_df[col].min()) / (results_df[col].max() - results_df[col].min() + 1e-10)
    else:
        normalized_scores[col] = (results_df[col] - results_df[col].min()) / (results_df[col].max() - results_df[col].min() + 1e-10)
 
# Compute the weighted composite score
normalized_scores['Weighted Score'] = sum(
    normalized_scores[col] * weight for col, weight in weights.items()
)
 
for model in normalized_scores.index:
    score = normalized_scores.loc[model, 'Weighted Score']
    print(f"{model:25s}: {score:.4f}")
# Final visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
 
# 1. Radar chart comparing the metrics
# Names in `categories` must match the columns of results_df;
# 'Average Precision' is shortened to 'AP' on the chart only.
categories = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC-ROC', 'Average Precision']
category_labels = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC-ROC', 'AP']
N = len(categories)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]
 
# Swap the Cartesian axes in the top-left slot for a polar one
axes[0, 0].remove()
ax_radar = fig.add_subplot(2, 2, 1, projection='polar')
colors = plt.cm.Set2(np.linspace(0, 1, len(models_eval)))
 
for (name, model), color in zip(models_eval.items(), colors):
    values = [results_df.loc[name, cat] for cat in categories]
    values += values[:1]
    ax_radar.plot(angles, values, 'o-', linewidth=2, label=name, color=color)
    ax_radar.fill(angles, values, alpha=0.1, color=color)
 
ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(category_labels)
ax_radar.set_ylim(0.5, 1.0)
ax_radar.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
ax_radar.set_title('Model Performance Radar')
 
# 2. ROC curve comparison
ax_roc = axes[0, 1]
for (name, model), color in zip(models_eval.items(), colors):
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    ax_roc.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={roc_auc:.3f})', color=color)
 
ax_roc.plot([0, 1], [0, 1], 'k--')
ax_roc.set_xlabel('False Positive Rate')
ax_roc.set_ylabel('True Positive Rate')
ax_roc.set_title('ROC Curves')
ax_roc.legend(loc='lower right')
ax_roc.grid(True, alpha=0.3)
 
# 3. PR curve comparison
ax_pr = axes[1, 0]
for (name, model), color in zip(models_eval.items(), colors):
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    ax_pr.plot(recall, precision, linewidth=2, label=f'{name} (AP={ap:.3f})', color=color)
 
ax_pr.set_xlabel('Recall')
ax_pr.set_ylabel('Precision')
ax_pr.set_title('Precision-Recall Curves')
ax_pr.legend(loc='lower left')
ax_pr.grid(True, alpha=0.3)
 
# 4. Bar chart of the composite score
ax_score = axes[1, 1]
model_names = normalized_scores.index.tolist()
scores = normalized_scores['Weighted Score'].values
bars = ax_score.barh(model_names, scores, color=colors)
ax_score.set_xlabel('Weighted Score')
ax_score.set_title('Overall Performance Score')
ax_score.set_xlim(0, 1)
 
# Label each bar with its score
for bar, score in zip(bars, scores):
    ax_score.text(score + 0.01, bar.get_y() + bar.get_height()/2, 
                  f'{score:.3f}', va='center')
 
plt.tight_layout()
plt.savefig('model_evaluation_summary.png', dpi=150)
 
print("\nEvaluation complete. Figures saved.")

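Summary

Model evaluation is a critical but often underrated part of machine learning. This article has covered the whole chain: data splitting and cross-validation, classification and regression metrics, the bias-variance decomposition, and learning curves.

A few key points to remember:

First, use stratified splits for classification so that class proportions stay consistent. Be extra careful with small datasets, where a single split can be heavily affected by chance.

Second, do not judge a classifier by accuracy alone. AUC-ROC is a good single-number summary of how well a classifier separates the classes, the PR curve is more sensitive on imbalanced data, and the F1 score is the harmonic mean of precision and recall.

Third, understand the bias-variance decomposition. Underfitting calls for more model capacity; overfitting calls for more data, less complexity, or regularization.

Fourth, learning curves and validation curves are effective diagnostic tools. A large gap points to overfitting; low scores on both sides point to underfitting.

Finally, in real projects the choice of metric must match the business goal. Disease screening leans toward recall, recommender systems lean toward precision, and risk-control models need a balance between the two.

This article is part of a hands-on machine learning guide series and covers the core concepts and practical techniques of model evaluation and cross-validation.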
