# Chapter 10: K-Nearest Neighbors

The K-Nearest Neighbors algorithm (KNN) is one of the simplest and most intuitive algorithms in machine learning. It is built on the idea that "birds of a feather flock together": it makes predictions by finding the K most similar neighbors of a query point.

## 10.1 What Is the K-Nearest Neighbors Algorithm?

KNN is an instance-based learning method, also known as "lazy learning", because it builds no explicit model during training; all computation is deferred until prediction time.

### 10.1.1 Core Ideas

- Similarity assumption: similar samples have similar labels
- Locality principle: nearby samples matter more than distant ones
- Non-parametric method: no assumptions are made about the data distribution
### 10.1.2 Algorithm Steps

1. Store the training data: keep every training sample
2. Compute distances: measure the distance between the test sample and every training sample
3. Find the K nearest neighbors: select the K samples with the smallest distances
4. Make a prediction (a from-scratch sketch of these steps follows this list):
   - Classification: majority vote
   - Regression: mean or weighted mean
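To make the four steps concrete, here is a minimal from-scratch sketch in plain NumPy. The function name and the toy data are illustrative, not part of scikit-learn:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Steps 1-2: keep the data as-is and compute Euclidean distances to every sample
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two small clusters around (0, 0) and (5, 5)
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))  # -> 0
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3))  # -> 1
```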
### 10.1.3 Strengths of KNN

- Simple and intuitive: the logic is easy to follow
- No training phase: there are no parameters to learn
- Flexible: handles multi-class problems naturally
- Locally sensitive: can capture local patterns

### 10.1.4 Weaknesses of KNN

- High prediction cost: every prediction computes distances to all training samples
- Large storage footprint: the entire training set must be kept
- Sensitive to dimensionality: performance degrades on high-dimensional data
- Sensitive to noise: outliers can distort predictions
## 10.2 Setting Up the Environment and Data

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Fix the random seed for reproducibility
np.random.seed(42)

# Plot styling
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # optional CJK-capable font; safe to drop for ASCII-only labels
plt.rcParams['axes.unicode_minus'] = False
```

## 10.3 Distance Metrics
### 10.3.1 Common Distance Metrics

At the heart of KNN is the distance computation between samples. Commonly used metrics include:
```python
def demonstrate_distance_metrics():
    """Demonstrate how the common distance metrics are computed."""
    # Two example points
    point1 = np.array([1, 2])
    point2 = np.array([4, 6])
    print("Coordinates of the two points:")
    print(f"Point 1: {point1}")
    print(f"Point 2: {point2}")
    print()

    # 1. Euclidean distance (L2)
    euclidean_dist = np.sqrt(np.sum((point1 - point2)**2))
    print(f"Euclidean distance: {euclidean_dist:.4f}")

    # 2. Manhattan distance (L1)
    manhattan_dist = np.sum(np.abs(point1 - point2))
    print(f"Manhattan distance: {manhattan_dist:.4f}")

    # 3. Chebyshev distance (L-infinity)
    chebyshev_dist = np.max(np.abs(point1 - point2))
    print(f"Chebyshev distance: {chebyshev_dist:.4f}")

    # 4. Minkowski distance (the general form)
    def minkowski_distance(p1, p2, p):
        return np.sum(np.abs(p1 - p2)**p)**(1/p)

    print(f"Minkowski distance (p=1): {minkowski_distance(point1, point2, 1):.4f}")
    print(f"Minkowski distance (p=2): {minkowski_distance(point1, point2, 2):.4f}")
    print(f"Minkowski distance (p=3): {minkowski_distance(point1, point2, 3):.4f}")

demonstrate_distance_metrics()
```
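As a sanity check, the same values are available ready-made from `scipy.spatial.distance` (SciPy is already a dependency of scikit-learn):

```python
from scipy.spatial.distance import euclidean, cityblock, chebyshev, minkowski

p1, p2 = np.array([1, 2]), np.array([4, 6])
print(euclidean(p1, p2))       # 5.0
print(cityblock(p1, p2))       # 7 (Manhattan)
print(chebyshev(p1, p2))       # 4
print(minkowski(p1, p2, p=3))  # ~4.4979
```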
### 10.3.2 Visualizing the Distance Metrics

```python
def visualize_distance_metrics():
    """Plot equidistance contours for the different distance metrics."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Build a grid around the origin
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)

    # Euclidean distance: circular contours
    euclidean = np.sqrt(X**2 + Y**2)
    axes[0].contour(X, Y, euclidean, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[0].set_title('Euclidean distance contours')

    # Manhattan distance: diamond-shaped contours
    manhattan = np.abs(X) + np.abs(Y)
    axes[1].contour(X, Y, manhattan, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[1].set_title('Manhattan distance contours')

    # Chebyshev distance: square contours
    chebyshev = np.maximum(np.abs(X), np.abs(Y))
    axes[2].contour(X, Y, chebyshev, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[2].set_title('Chebyshev distance contours')

    # Shared decoration: axis labels, grid, and the center point
    for ax in axes:
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.grid(True, alpha=0.3)
        ax.plot(0, 0, 'ko', markersize=8)

    plt.tight_layout()
    plt.show()

visualize_distance_metrics()
```

## 10.4 KNN Classification
### 10.4.1 Basic KNN Classification

```python
# Create a two-class dataset
X_class, y_class = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Create and fit the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Predict labels and class probabilities
y_pred = knn_classifier.predict(X_test)
y_pred_proba = knn_classifier.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN classification accuracy: {accuracy:.4f}")
print("\nDetailed classification report:")
print(classification_report(y_test, y_pred))

# Visualize the data
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
colors = ['red', 'blue']
for i, color in enumerate(colors):
    mask = y_train == i
    plt.scatter(X_train[mask, 0], X_train[mask, 1],
                c=color, label=f'Class {i}', alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training data distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data with predictions
plt.subplot(1, 2, 2)
for i, color in enumerate(colors):
    mask = y_test == i
    plt.scatter(X_test[mask, 0], X_test[mask, 1],
                c=color, label=f'True class {i}', alpha=0.7, s=100)

# Highlight the misclassified points
wrong_predictions = y_test != y_pred
plt.scatter(X_test[wrong_predictions, 0], X_test[wrong_predictions, 1],
            facecolors='none', edgecolors='black', s=200, linewidths=2,
            label='Misclassified')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test results')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### 10.4.2 Visualizing Decision Boundaries
```python
def plot_knn_decision_boundary(X, y, k_values, title="KNN decision boundaries"):
    """Plot KNN decision boundaries for several values of K."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(title, fontsize=16)

    for i, k in enumerate(k_values):
        row, col = i // 2, i % 2

        # Fit KNN for this K
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X, y)

        # Build an evaluation grid covering the data
        h = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # Predict every grid point
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Shade the decision regions
        axes[row, col].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')

        # Overlay the data points
        colors = ['red', 'blue']
        for j, color in enumerate(colors):
            mask = y == j
            axes[row, col].scatter(X[mask, 0], X[mask, 1],
                                   c=color, alpha=0.7, s=50)

        axes[row, col].set_title(f'K = {k}')
        axes[row, col].set_xlabel('Feature 1')
        axes[row, col].set_ylabel('Feature 2')
        axes[row, col].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Compare decision boundaries for different values of K
k_values = [1, 3, 5, 15]
plot_knn_decision_boundary(X_train, y_train, k_values, "KNN decision boundaries for different K")
```

### 10.4.3 Analyzing the Effect of K
```python
# How does the choice of K affect performance?
k_range = range(1, 31)
train_scores = []
test_scores = []

print("Effect of K on KNN performance:")
print("K\tTrain accuracy\tTest accuracy")
print("-" * 35)

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_pred = knn.predict(X_train)
    test_pred = knn.predict(X_test)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    if k <= 10 or k % 5 == 0:
        print(f"{k}\t{train_acc:.4f}\t\t{test_acc:.4f}")

# Plot the effect of K
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Train accuracy', linewidth=2, markersize=6)
plt.plot(k_range, test_scores, 's-', label='Test accuracy', linewidth=2, markersize=6)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.title('Effect of K on KNN performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 31, 2))
plt.show()

# Best K as measured on the test set
best_k = k_range[np.argmax(test_scores)]
best_test_score = max(test_scores)
print(f"\nBest K: {best_k}")
print(f"Best test accuracy: {best_test_score:.4f}")
```

## 10.5 The Importance of Feature Scaling
### 10.5.1 Comparing Scaled and Unscaled Features
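Before running the experiment, a quick back-of-the-envelope check (with made-up numbers) shows why scaling matters for any distance-based method: when one feature spans [0, 1] and another spans [0, 1000], the squared differences of the large-scale feature dominate the Euclidean distance almost completely:

```python
a = np.array([0.1, 100.0])  # (feature1, feature2)
b = np.array([0.9, 300.0])
diff = a - b
print(diff ** 2)                   # [0.64, 40000.0] -> feature2 dominates
print(np.sqrt(np.sum(diff ** 2)))  # ~200.0016, feature1 barely contributes
```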
```python
# Create data whose features live on very different scales
np.random.seed(42)
n_samples = 200

# Feature 1: small range [0, 1]
feature1 = np.random.uniform(0, 1, n_samples)
# Feature 2: large range [0, 1000]
feature2 = np.random.uniform(0, 1000, n_samples)

# Labels depend on a combination of both features
labels = ((feature1 > 0.5) & (feature2 > 500)).astype(int)

X_scale = np.column_stack([feature1, feature2])
y_scale = labels

print("Feature statistics before scaling:")
print(f"Feature 1 range: [{X_scale[:, 0].min():.3f}, {X_scale[:, 0].max():.3f}]")
print(f"Feature 2 range: [{X_scale[:, 1].min():.3f}, {X_scale[:, 1].max():.3f}]")

# Split the data
X_train_scale, X_test_scale, y_train_scale, y_test_scale = train_test_split(
    X_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)

# 1. KNN without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train_scale, y_train_scale)
y_pred_no_scale = knn_no_scale.predict(X_test_scale)
acc_no_scale = accuracy_score(y_test_scale, y_pred_no_scale)

# 2. KNN after standardization
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train_scale)
X_test_std = scaler_std.transform(X_test_scale)
knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train_scale)
y_pred_std = knn_std.predict(X_test_std)
acc_std = accuracy_score(y_test_scale, y_pred_std)

# 3. KNN after min-max scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train_scale)
X_test_minmax = scaler_minmax.transform(X_test_scale)
knn_minmax = KNeighborsClassifier(n_neighbors=5)
knn_minmax.fit(X_train_minmax, y_train_scale)
y_pred_minmax = knn_minmax.predict(X_test_minmax)
acc_minmax = accuracy_score(y_test_scale, y_pred_minmax)

print("\nEffect of feature scaling on KNN:")
print(f"No scaling: {acc_no_scale:.4f}")
print(f"Standardization: {acc_std:.4f}")
print(f"Min-max scaling: {acc_minmax:.4f}")

# Visualize the effect of scaling
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Effect of feature scaling on KNN', fontsize=16)

# Raw data
axes[0, 0].scatter(X_train_scale[y_train_scale==0, 0], X_train_scale[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 0].scatter(X_train_scale[y_train_scale==1, 0], X_train_scale[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 0].set_title(f'Raw data (accuracy: {acc_no_scale:.3f})')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
axes[0, 0].legend()

# Standardized data
axes[0, 1].scatter(X_train_std[y_train_scale==0, 0], X_train_std[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 1].scatter(X_train_std[y_train_scale==1, 0], X_train_std[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 1].set_title(f'Standardized data (accuracy: {acc_std:.3f})')
axes[0, 1].set_xlabel('Feature 1 (standardized)')
axes[0, 1].set_ylabel('Feature 2 (standardized)')
axes[0, 1].legend()

# Min-max scaled data
axes[1, 0].scatter(X_train_minmax[y_train_scale==0, 0], X_train_minmax[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[1, 0].scatter(X_train_minmax[y_train_scale==1, 0], X_train_minmax[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[1, 0].set_title(f'Min-max scaled (accuracy: {acc_minmax:.3f})')
axes[1, 0].set_xlabel('Feature 1 (scaled)')
axes[1, 0].set_ylabel('Feature 2 (scaled)')
axes[1, 0].legend()

# Bar chart comparing the three approaches
methods = ['No scaling', 'Standardized', 'Min-max']
accuracies = [acc_no_scale, acc_std, acc_minmax]
axes[1, 1].bar(methods, accuracies, color=['red', 'green', 'blue'], alpha=0.7)
axes[1, 1].set_title('Accuracy by scaling method')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_ylim(0, 1)

# Annotate the bars with their values
for i, acc in enumerate(accuracies):
    axes[1, 1].text(i, acc + 0.02, f'{acc:.3f}', ha='center')

plt.tight_layout()
plt.show()
```

## 10.6 KNN Regression

### 10.6.1 Basic KNN Regression
```python
# Create regression data
X_reg, y_reg = make_regression(
    n_samples=200,
    n_features=1,
    noise=10,
    random_state=42
)

# Add a mild nonlinearity
X_reg = X_reg.flatten()
y_reg = y_reg + 0.1 * X_reg**2

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg.reshape(-1, 1), y_reg, test_size=0.2, random_state=42
)

# Compare KNN regression for several values of K
k_values_reg = [1, 3, 5, 10, 20]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('KNN regression for different K', fontsize=16)

from sklearn.metrics import mean_squared_error, r2_score

print("KNN regression performance by K:")
print("K\tR² score\tRMSE")
print("-" * 30)

for i, k in enumerate(k_values_reg):
    row, col = i // 3, i % 3

    # Fit the KNN regressor
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)

    # Predict and evaluate
    y_pred_reg = knn_reg.predict(X_test_reg)
    r2 = r2_score(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
    print(f"{k}\t{r2:.4f}\t\t{rmse:.2f}")

    # Visualize
    axes[row, col].scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    axes[row, col].scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)

    # Prediction curve
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot = knn_reg.predict(X_plot)
    axes[row, col].plot(X_plot, y_plot, color='red', linewidth=2, label='KNN prediction')

    axes[row, col].set_title(f'K={k}, R²={r2:.3f}, RMSE={rmse:.1f}')
    axes[row, col].set_xlabel('X')
    axes[row, col].set_ylabel('y')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

# Remove the unused subplot
if len(k_values_reg) < 6:
    axes[1, 2].remove()

plt.tight_layout()
plt.show()
```

### 10.6.2 Distance-Weighted KNN Regression
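With `weights='distance'`, each neighbor's contribution is weighted by the inverse of its distance, so closer neighbors count for more (scikit-learn additionally special-cases zero distances, i.e. exact matches). A minimal sketch of the weighted average; the numbers are illustrative:

```python
# Distances and target values of a query point's k nearest neighbors
dists = np.array([0.5, 1.0, 2.0])
y_neigh = np.array([10.0, 20.0, 30.0])

w = 1.0 / dists                         # inverse-distance weights
print(np.sum(w * y_neigh) / np.sum(w))  # weighted prediction: ~15.71
print(y_neigh.mean())                   # uniform prediction: 20.0
```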
```python
def compare_weighted_knn():
    """Compare uniform and distance-weighted KNN regression."""
    # Reuse the regression data from above
    k = 5

    # Uniform weights
    knn_uniform = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train_reg, y_train_reg)
    y_pred_uniform = knn_uniform.predict(X_test_reg)

    # Distance weights
    knn_distance = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train_reg, y_train_reg)
    y_pred_distance = knn_distance.predict(X_test_reg)

    # Evaluate both
    r2_uniform = r2_score(y_test_reg, y_pred_uniform)
    r2_distance = r2_score(y_test_reg, y_pred_distance)
    rmse_uniform = np.sqrt(mean_squared_error(y_test_reg, y_pred_uniform))
    rmse_distance = np.sqrt(mean_squared_error(y_test_reg, y_pred_distance))

    print(f"Effect of the weighting scheme on KNN regression (K={k}):")
    print(f"Uniform KNN: R²={r2_uniform:.4f}, RMSE={rmse_uniform:.2f}")
    print(f"Distance-weighted: R²={r2_distance:.4f}, RMSE={rmse_distance:.2f}")

    # Visual comparison
    plt.figure(figsize=(15, 5))
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot_uniform = knn_uniform.predict(X_plot)
    y_plot_distance = knn_distance.predict(X_plot)

    # Uniform KNN
    plt.subplot(1, 3, 1)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Uniform KNN')
    plt.title(f'Uniform KNN (R²={r2_uniform:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Distance-weighted KNN
    plt.subplot(1, 3, 2)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title(f'Distance-weighted KNN (R²={r2_distance:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Both methods side by side
    plt.subplot(1, 3, 3)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.4, label='Training data', s=20, color='gray')
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Uniform KNN', linestyle='--')
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title('Both methods compared')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

compare_weighted_knn()
```

## 10.7 The Effect of the Distance Metric

### 10.7.1 Comparing Distance Metrics
```python
# Compare how the distance metric affects KNN, using the iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Standardize the features
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Distance metrics to compare
distance_metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
metric_results = {}

print("Effect of the distance metric on KNN classification:")
print("Metric\t\tAccuracy\tCV score")
print("-" * 50)

for metric in distance_metrics:
    if metric == 'minkowski':
        # The Minkowski metric needs the p parameter
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)

    knn_metric.fit(X_train_iris_scaled, y_train_iris)
    y_pred_metric = knn_metric.predict(X_test_iris_scaled)
    accuracy_metric = accuracy_score(y_test_iris, y_pred_metric)

    # Cross-validate with scaling done inside each fold, consistent with the test setup
    cv_pipeline = Pipeline([('scaler', StandardScaler()), ('knn', knn_metric)])
    cv_scores = cross_val_score(cv_pipeline, X_iris, y_iris, cv=5)
    cv_mean = np.mean(cv_scores)

    metric_results[metric] = {'accuracy': accuracy_metric, 'cv_score': cv_mean}
    print(f"{metric}\t\t{accuracy_metric:.4f}\t\t{cv_mean:.4f}")

# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
metrics = list(metric_results.keys())
accuracies = [metric_results[m]['accuracy'] for m in metrics]
cv_scores = [metric_results[m]['cv_score'] for m in metrics]

# Test accuracy
axes[0].bar(metrics, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Test accuracy by distance metric')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.8, 1.0)

# Cross-validation score
axes[1].bar(metrics, cv_scores, color='lightgreen', alpha=0.7)
axes[1].set_title('Cross-validation score by distance metric')
axes[1].set_ylabel('CV score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()
```

## 10.8 Hyperparameter Tuning

### 10.8.1 Grid Search Optimization
```python
# Tune the KNN hyperparameters with grid search
param_grid = {
    'n_neighbors': range(1, 21),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2, 3]  # only used by the minkowski metric
}

knn_grid = KNeighborsClassifier()

print("Running a grid search over the KNN hyperparameters...")
grid_search = GridSearchCV(
    knn_grid,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train_iris_scaled, y_train_iris)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_iris_scaled)
test_accuracy = accuracy_score(y_test_iris, y_pred_best)
print(f"Test accuracy: {test_accuracy:.4f}")

# Detailed evaluation
print("\nDetailed evaluation of the best KNN model:")
print(classification_report(y_test_iris, y_pred_best,
                            target_names=iris.target_names))
```

### 10.8.2 Learning Curve Analysis
```python
from sklearn.model_selection import learning_curve

def plot_learning_curve_knn(estimator, X, y, title="KNN learning curve"):
    """Plot the learning curve of a KNN estimator."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')
    plt.xlabel('Number of training samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Learning curve of the best KNN model; wrap it in a pipeline so each
# CV fold is scaled the same way the model was tuned
best_knn_pipeline = Pipeline([('scaler', StandardScaler()), ('knn', best_knn)])
plot_learning_curve_knn(best_knn_pipeline, X_iris, y_iris, "Learning curve of the best KNN model")
```

## 10.9 Practical Applications

### 10.9.1 Handwritten Digit Recognition
```python
# Load the handwritten digits dataset
from sklearn.datasets import load_digits

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print("Handwritten digits dataset:")
print(f"Samples: {X_digits.shape[0]}")
print(f"Features: {X_digits.shape[1]}")
print(f"Classes: {len(np.unique(y_digits))}")

# Show one sample of each digit
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('Handwritten digit samples', fontsize=16)

for i in range(10):
    row, col = i // 5, i % 5
    # First sample of this digit
    idx = np.where(y_digits == i)[0][0]
    digit_image = X_digits[idx].reshape(8, 8)
    axes[row, col].imshow(digit_image, cmap='gray')
    axes[row, col].set_title(f'Digit {i}')
    axes[row, col].axis('off')

plt.tight_layout()
plt.show()

# Split the data
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Standardize the features
scaler_digits = StandardScaler()
X_train_digits_scaled = scaler_digits.fit_transform(X_train_digits)
X_test_digits_scaled = scaler_digits.transform(X_test_digits)

# Train the KNN classifier
knn_digits = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_digits.fit(X_train_digits_scaled, y_train_digits)

# Predict and evaluate
y_pred_digits = knn_digits.predict(X_test_digits_scaled)
accuracy_digits = accuracy_score(y_test_digits, y_pred_digits)
print(f"\nKNN accuracy on digit recognition: {accuracy_digits:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test_digits, y_pred_digits))

# Confusion matrix
cm_digits = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(10, 8))
sns.heatmap(cm_digits, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.title('Digit recognition confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Inspect the misclassified samples
wrong_predictions = y_test_digits != y_pred_digits
wrong_indices = np.where(wrong_predictions)[0]

if len(wrong_indices) > 0:
    print(f"\nMisclassified samples: {len(wrong_indices)}")

    # Show some of them
    fig, axes = plt.subplots(2, 5, figsize=(12, 6))
    fig.suptitle('Misclassified samples', fontsize=16)

    for i in range(min(10, len(wrong_indices))):
        row, col = i // 5, i % 5
        idx = wrong_indices[i]
        digit_image = X_test_digits[idx].reshape(8, 8)
        true_label = y_test_digits[idx]
        pred_label = y_pred_digits[idx]
        axes[row, col].imshow(digit_image, cmap='gray')
        axes[row, col].set_title(f'True: {true_label}, predicted: {pred_label}')
        axes[row, col].axis('off')

    # Remove any unused subplots
    for i in range(min(10, len(wrong_indices)), 10):
        row, col = i // 5, i % 5
        axes[row, col].remove()

    plt.tight_layout()
    plt.show()
```

### 10.9.2 Recommender System Basics
```python
def create_recommendation_system():
    """Build a simple KNN-based recommender."""
    # Create a user-item rating matrix
    np.random.seed(42)
    n_users = 100
    n_items = 20

    # Start from an empty (sparse) rating matrix
    ratings = np.zeros((n_users, n_items))

    # Each user rates a random subset of items
    for user in range(n_users):
        # 5-15 ratings per user
        n_ratings = np.random.randint(5, 16)
        items_to_rate = np.random.choice(n_items, n_ratings, replace=False)
        for item in items_to_rate:
            # Ratings from 1 to 5
            ratings[user, item] = np.random.randint(1, 6)

    print("User-item rating matrix:")
    print(f"Users: {n_users}")
    print(f"Items: {n_items}")
    print(f"Rating density: {np.count_nonzero(ratings) / (n_users * n_items):.2%}")

    # Use KNN to find similar users; drop users without any ratings
    user_profiles = ratings[np.sum(ratings, axis=1) > 0]

    # Cosine distance works well for rating vectors. Only the kneighbors()
    # lookup is needed here, so unsupervised NearestNeighbors would also do;
    # for this demo a classifier with dummy "user type" labels is used.
    knn_recommender = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        algorithm='brute'  # brute force is required for the cosine metric
    )
    user_types = np.random.randint(0, 3, len(user_profiles))
    knn_recommender.fit(user_profiles, user_types)

    # Recommend for a new user who has rated a few items
    new_user_ratings = np.zeros(n_items)
    rated_items = [0, 2, 5, 8, 12]
    new_user_ratings[rated_items] = [4, 5, 3, 4, 5]

    print("\nRatings of the new user:")
    for item in rated_items:
        print(f"Item {item}: {new_user_ratings[item]}")

    # Find the most similar users
    distances, indices = knn_recommender.kneighbors([new_user_ratings], n_neighbors=5)

    print("\nThe 5 most similar users:")
    for dist, idx in zip(distances[0], indices[0]):
        print(f"User {idx}: distance = {dist:.3f}")

    # Score items by the similar users' mean ratings
    similar_users = user_profiles[indices[0]]
    recommendation_scores = np.mean(similar_users, axis=0)

    # Exclude items the new user has already rated
    recommendation_scores[rated_items] = 0

    # Top recommendations
    recommended_items = np.argsort(recommendation_scores)[::-1][:5]

    print("\nRecommended items:")
    for i, item in enumerate(recommended_items):
        if recommendation_scores[item] > 0:
            print(f"{i+1}. Item {item}: score = {recommendation_scores[item]:.2f}")

    return ratings, recommendation_scores

ratings_matrix, rec_scores = create_recommendation_system()

# Visualize part of the rating matrix
plt.figure(figsize=(12, 8))
plt.imshow(ratings_matrix[:20, :], cmap='YlOrRd', aspect='auto')
plt.colorbar(label='Rating')
plt.title('User-item rating matrix (first 20 users)')
plt.xlabel('Item ID')
plt.ylabel('User ID')
plt.show()
```

## 10.10 Optimizing KNN

### 10.10.1 The Curse of Dimensionality
```python
# Demonstrate the curse of dimensionality.
# Note: `time` is imported at module level because later sections use it too.
import time

def demonstrate_curse_of_dimensionality():
    """Show how dimensionality affects KNN performance."""
    dimensions = [2, 5, 10, 20, 50, 100]
    accuracies = []

    print("The curse of dimensionality and KNN:")
    print("Dims\tSamples\tAccuracy\tFit+predict time")
    print("-" * 45)

    for dim in dimensions:
        # Data with the given dimensionality
        X_dim, y_dim = make_classification(
            n_samples=500,
            n_features=dim,
            n_informative=min(dim, 10),
            n_redundant=0,
            n_clusters_per_class=1,
            random_state=42
        )

        # Split
        X_train_dim, X_test_dim, y_train_dim, y_test_dim = train_test_split(
            X_dim, y_dim, test_size=0.2, random_state=42, stratify=y_dim
        )

        # Standardize
        scaler_dim = StandardScaler()
        X_train_dim_scaled = scaler_dim.fit_transform(X_train_dim)
        X_test_dim_scaled = scaler_dim.transform(X_test_dim)

        # Fit and predict, timing both together
        start_time = time.time()
        knn_dim = KNeighborsClassifier(n_neighbors=5)
        knn_dim.fit(X_train_dim_scaled, y_train_dim)
        y_pred_dim = knn_dim.predict(X_test_dim_scaled)
        train_time = time.time() - start_time

        # Evaluate
        accuracy_dim = accuracy_score(y_test_dim, y_pred_dim)
        accuracies.append(accuracy_dim)
        print(f"{dim}\t{len(X_dim)}\t{accuracy_dim:.4f}\t\t{train_time:.4f}s")

    # Plot accuracy vs. dimensionality
    plt.figure(figsize=(10, 6))
    plt.plot(dimensions, accuracies, 'o-', linewidth=2, markersize=8)
    plt.xlabel('Number of features')
    plt.ylabel('Accuracy')
    plt.title('Effect of dimensionality on KNN performance')
    plt.grid(True, alpha=0.3)
    plt.show()

    return dimensions, accuracies

dims, accs = demonstrate_curse_of_dimensionality()
```
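Part of the explanation for this accuracy drop is distance concentration: as the dimensionality of random data grows, the nearest and farthest neighbors of a point become nearly equidistant, so "nearest" carries less and less information. A small illustration (the exact numbers will vary with the seed):

```python
rng = np.random.default_rng(42)
for dim in [2, 10, 100, 1000]:
    X = rng.random((500, dim))
    # Distances from the first point to all the others
    d = np.sqrt(np.sum((X[1:] - X[0]) ** 2, axis=1))
    # The min/max ratio climbs toward 1 as dimensionality grows
    print(f"dim={dim:5d}  min/max distance ratio: {d.min() / d.max():.3f}")
```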
### 10.10.2 Feature Selection

```python
# Improve KNN with univariate feature selection
from sklearn.feature_selection import SelectKBest, f_classif

# A higher-dimensional dataset
X_high_dim, y_high_dim = make_classification(
    n_samples=1000,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    random_state=42
)

X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_high_dim, y_high_dim, test_size=0.2, random_state=42, stratify=y_high_dim
)

# Standardize
scaler_hd = StandardScaler()
X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
X_test_hd_scaled = scaler_hd.transform(X_test_hd)

# Vary the number of selected features
k_features = [5, 10, 20, 30, 50, 100]
feature_selection_results = {}

print("Effect of feature selection on KNN:")
print("Features\tAccuracy\tFit+predict time")
print("-" * 35)

for k in k_features:
    if k < X_high_dim.shape[1]:
        # Keep the k best features by ANOVA F-score
        selector = SelectKBest(f_classif, k=k)
        X_train_selected = selector.fit_transform(X_train_hd_scaled, y_train_hd)
        X_test_selected = selector.transform(X_test_hd_scaled)
    else:
        X_train_selected = X_train_hd_scaled
        X_test_selected = X_test_hd_scaled

    # Fit and predict, timing both together
    start_time = time.time()
    knn_fs = KNeighborsClassifier(n_neighbors=5)
    knn_fs.fit(X_train_selected, y_train_hd)
    y_pred_fs = knn_fs.predict(X_test_selected)
    train_time = time.time() - start_time

    accuracy_fs = accuracy_score(y_test_hd, y_pred_fs)
    feature_selection_results[k] = {'accuracy': accuracy_fs, 'time': train_time}
    print(f"{k}\t{accuracy_fs:.4f}\t\t{train_time:.4f}s")

# Visualize the effect of feature selection
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
features = list(feature_selection_results.keys())
accuracies_fs = [feature_selection_results[k]['accuracy'] for k in features]
times_fs = [feature_selection_results[k]['time'] for k in features]

# Accuracy
axes[0].plot(features, accuracies_fs, 'o-', linewidth=2, markersize=8, color='blue')
axes[0].set_xlabel('Number of features')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs. number of features')
axes[0].grid(True, alpha=0.3)

# Timing
axes[1].plot(features, times_fs, 's-', linewidth=2, markersize=8, color='red')
axes[1].set_xlabel('Number of features')
axes[1].set_ylabel('Fit+predict time (s)')
axes[1].set_title('Time vs. number of features')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### 10.10.3 Speeding Up Neighbor Search
```python
# Compare scikit-learn's neighbor-search algorithms. All of them (ball_tree,
# kd_tree, brute) return exact neighbors; truly approximate nearest-neighbor
# search requires external libraries such as FAISS or Annoy.
def compare_search_algorithms():
    """Compare the speed and accuracy of the neighbor-search algorithms."""
    # A larger dataset
    X_large, y_large = make_classification(
        n_samples=5000,
        n_features=50,
        n_informative=30,
        random_state=42
    )

    # Standardize
    scaler_large = StandardScaler()
    X_large_scaled = scaler_large.fit_transform(X_large)

    # Split
    X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
        X_large_scaled, y_large, test_size=0.2, random_state=42
    )

    # Search algorithms to compare
    algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
    algorithm_results = {}

    print("Performance of the search algorithms:")
    print("Algorithm\tFit time\tPredict time\tAccuracy")
    print("-" * 50)

    for algorithm in algorithms:
        # Fit time (index construction)
        start_time = time.time()
        knn_alg = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
        knn_alg.fit(X_train_large, y_train_large)
        train_time = time.time() - start_time

        # Prediction time (the actual neighbor search)
        start_time = time.time()
        y_pred_alg = knn_alg.predict(X_test_large)
        pred_time = time.time() - start_time

        # Accuracy
        accuracy_alg = accuracy_score(y_test_large, y_pred_alg)

        algorithm_results[algorithm] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg
        }
        print(f"{algorithm}\t\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}")

    # Visualize the comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    algs = list(algorithm_results.keys())
    train_times = [algorithm_results[alg]['train_time'] for alg in algs]
    pred_times = [algorithm_results[alg]['pred_time'] for alg in algs]
    accuracies_alg = [algorithm_results[alg]['accuracy'] for alg in algs]

    # Fit time
    axes[0].bar(algs, train_times, color='skyblue', alpha=0.7)
    axes[0].set_title('Fit time')
    axes[0].set_ylabel('Time (s)')
    axes[0].tick_params(axis='x', rotation=45)

    # Prediction time
    axes[1].bar(algs, pred_times, color='lightgreen', alpha=0.7)
    axes[1].set_title('Prediction time')
    axes[1].set_ylabel('Time (s)')
    axes[1].tick_params(axis='x', rotation=45)

    # Accuracy
    axes[2].bar(algs, accuracies_alg, color='lightcoral', alpha=0.7)
    axes[2].set_title('Accuracy')
    axes[2].set_ylabel('Accuracy')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(0.8, 1.0)

    plt.tight_layout()
    plt.show()

    return algorithm_results

alg_results = compare_search_algorithms()
```

## 10.11 KNN vs. Other Algorithms

### 10.11.1 A Comprehensive Performance Comparison
```python
# Compare KNN against other classifiers on the wine dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def comprehensive_algorithm_comparison():
    """Compare KNN with several other classifiers."""
    wine = load_wine()
    X_wine, y_wine = wine.data, wine.target

    X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
        X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
    )

    # Standardization matters for the distance- and margin-based models
    scaler_wine = StandardScaler()
    X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
    X_test_wine_scaled = scaler_wine.transform(X_test_wine)

    # The contenders
    algorithms = {
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Naive Bayes': GaussianNB()
    }

    results = {}

    print("Overall performance comparison:")
    print("Algorithm\t\tFit time\tPredict time\tAccuracy\tCV score")
    print("-" * 65)

    for name, algorithm in algorithms.items():
        # Models that rely on distances or margins get the scaled features
        needs_scaling = name in ['KNN', 'SVM', 'Logistic Regression']
        X_train_alg = X_train_wine_scaled if needs_scaling else X_train_wine
        X_test_alg = X_test_wine_scaled if needs_scaling else X_test_wine

        # Fit time
        start_time = time.time()
        algorithm.fit(X_train_alg, y_train_wine)
        train_time = time.time() - start_time

        # Prediction time
        start_time = time.time()
        y_pred_alg = algorithm.predict(X_test_alg)
        pred_time = time.time() - start_time

        # Accuracy
        accuracy_alg = accuracy_score(y_test_wine, y_pred_alg)

        # Cross-validate, scaling inside each fold to avoid leakage
        if needs_scaling:
            cv_estimator = Pipeline([('scaler', StandardScaler()), ('model', algorithm)])
        else:
            cv_estimator = algorithm
        cv_scores = cross_val_score(cv_estimator, X_wine, y_wine, cv=5)
        cv_mean = np.mean(cv_scores)

        results[name] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg,
            'cv_score': cv_mean
        }
        print(f"{name}\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}\t\t{cv_mean:.4f}")

    return results

comparison_results = comprehensive_algorithm_comparison()

# Visualize the comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Overall algorithm comparison', fontsize=16)

names = list(comparison_results.keys())
train_times = [comparison_results[name]['train_time'] for name in names]
pred_times = [comparison_results[name]['pred_time'] for name in names]
accuracies = [comparison_results[name]['accuracy'] for name in names]
cv_scores = [comparison_results[name]['cv_score'] for name in names]

# Fit time
axes[0, 0].bar(names, train_times, color='skyblue', alpha=0.7)
axes[0, 0].set_title('Fit time')
axes[0, 0].set_ylabel('Time (s)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Prediction time
axes[0, 1].bar(names, pred_times, color='lightgreen', alpha=0.7)
axes[0, 1].set_title('Prediction time')
axes[0, 1].set_ylabel('Time (s)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Test accuracy
axes[1, 0].bar(names, accuracies, color='lightcoral', alpha=0.7)
axes[1, 0].set_title('Test accuracy')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].set_ylim(0.8, 1.0)

# Cross-validation score
axes[1, 1].bar(names, cv_scores, color='gold', alpha=0.7)
axes[1, 1].set_title('Cross-validation score')
axes[1, 1].set_ylabel('CV score')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()
```

## 10.12 Exercises
### Exercise 1: Basic KNN

1. Train a KNN classifier on the iris dataset
2. Compare the effect of different K values (1, 3, 5, 7, 9) on performance
3. Visualize the decision boundary (pick two features)

### Exercise 2: Comparing Distance Metrics

1. Create a two-dimensional classification dataset
2. Compare the Euclidean, Manhattan, and Chebyshev distances
3. Analyze which distance metric suits your data best

### Exercise 3: Feature Scaling Experiment

1. Create a dataset whose features live on different scales
2. Compare KNN performance with standardization, min-max scaling, and no scaling
3. Explain why feature scaling matters so much for KNN

### Exercise 4: KNN Regression

1. Train a KNN regressor on the California housing dataset (`fetch_california_housing`; the Boston dataset was removed from recent scikit-learn releases)
2. Compare different K values and weighting schemes
3. Compare the results against linear regression

### Exercise 5: Handling High-Dimensional Data

1. Create a high-dimensional dataset (more than 50 features)
2. Use feature selection to improve KNN performance
3. Analyze how the curse of dimensionality affects KNN
## 10.13 Summary

In this chapter we studied the K-Nearest Neighbors algorithm in depth:

### Core Concepts

- **Instance-based learning**: predictions come from similar samples
- **Distance metrics**: Euclidean, Manhattan, Chebyshev, and others
- **Choosing K**: the key parameter balancing bias and variance

### Main Techniques

- **KNN classification**: majority vote
- **KNN regression**: mean or weighted-mean prediction
- **Distance weighting**: closer neighbors get larger weights
- **Search algorithms**: brute force, KD-tree, ball tree

### Practical Skills

- **Feature scaling**: why standardization and normalization matter
- **Hyperparameter tuning**: K, distance metric, weighting scheme
- **Performance optimization**: feature selection, search-algorithm choice
- **Real applications**: recommender systems, image recognition

### Key Takeaways

- KNN is simple and intuitive, with no training procedure
- It is sensitive to feature scaling and the choice of distance metric
- Prediction is expensive, so it scales poorly to large datasets
- It works well on low-dimensional data; high-dimensional data needs feature selection

### Strengths and Weaknesses

**Strengths**:
- Simple to understand and implement
- No assumptions about the data distribution
- Handles multi-class problems naturally
- Captures local patterns

**Weaknesses**:
- Heavy computation and storage costs
- Vulnerable to the curse of dimensionality
- Sensitive to noise and outliers
- Requires a good choice of K and distance metric
## 10.14 Practical Advice

### When to Use KNN

- **Small to medium-sized datasets**
- **Nonlinear decision boundaries**
- **A simple baseline model is needed**
- **Problems where local patterns matter**

### When to Avoid KNN

- **Large datasets**
- **High-dimensional data (without feature selection)**
- **Real-time prediction requirements**
- **Tight storage constraints**

### Best Practices

1. **Always scale the features**
2. **Choose K with cross-validation** (see the pipeline sketch after this list)
3. **Consider feature selection or dimensionality reduction**
4. **Pick the distance metric to match the data**
5. **Consider approximate methods for very large datasets**
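Best practices 1-3 compose naturally with scikit-learn's `Pipeline`, which keeps the scaler inside each cross-validation fold so no test-fold information leaks into the fit. A minimal, self-contained sketch; the parameter ranges are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and KNN bundled into one estimator
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Tune K and the weighting scheme with cross-validation
param_grid = {'knn__n_neighbors': range(1, 21),
              'knn__weights': ['uniform', 'distance']}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.4f}")
```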
## 10.15 What's Next

You now have a solid grasp of K-Nearest Neighbors, an intuitive yet powerful algorithm. In the next chapter, [clustering analysis](./sklearn-clustering.md), we enter the world of unsupervised learning and learn how to discover hidden patterns in data without any labels.
---

**Chapter checkpoint**:

- ✅ Understood the basic principle of KNN and its algorithm steps
- ✅ Learned how to choose and apply different distance metrics
- ✅ Implemented KNN classification and regression
- ✅ Saw why feature scaling is essential for KNN
- ✅ Learned how to tune KNN hyperparameters
- ✅ Can handle the curse of dimensionality and optimize performance
- ✅ Understood KNN's strengths and limitations in real applications