
# Chapter 10: K-Nearest Neighbors

The K-Nearest Neighbors algorithm (KNN) is one of the simplest and most intuitive algorithms in machine learning. It is built on the idea that "birds of a feather flock together": it makes predictions by finding the K most similar neighbors of a query point.

## 10.1 What Is the K-Nearest Neighbors Algorithm?

KNN is an instance-based learning method, also known as "lazy learning", because it builds no explicit model during the training phase; all of the computation is deferred until prediction time.

### 10.1.1 Core Ideas

- **Similarity assumption**: similar samples tend to have similar labels
- **Locality principle**: nearby samples matter more than distant ones
- **Non-parametric method**: no assumptions are made about the data distribution

### 10.1.2 Algorithm Steps

1. **Store the training data**: keep every training sample
2. **Compute distances**: measure the distance from the test sample to all training samples
3. **Find the K nearest neighbors**: select the K samples with the smallest distances
4. **Make a prediction** (a minimal from-scratch sketch follows this list)
   - Classification: majority vote
   - Regression: mean or weighted mean
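
Below is a minimal from-scratch sketch of these four steps, for intuition only; the helper `knn_predict` and the demo arrays are illustrative, and the rest of the chapter uses scikit-learn's optimized implementation instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify a single query point with plain (unweighted) KNN."""
    # Step 2: Euclidean distance from the query to every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority vote among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated clusters
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))  # -> 0
```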

### 10.1.3 Strengths of KNN

- **Simple and intuitive**: the algorithm's logic is easy to follow
- **No training phase**: there are no parameters to learn
- **Flexible**: handles multi-class problems naturally
- **Locally sensitive**: can capture local patterns

### 10.1.4 Weaknesses of KNN

- **High prediction cost**: every prediction computes distances to all training samples
- **High storage cost**: all training data must be kept
- **Sensitive to dimensionality**: performance degrades on high-dimensional data
- **Sensitive to noise**: outliers can distort predictions

## 10.2 Setting Up the Environment and Data

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import time  # used for the timing comparisons in later sections
import warnings
warnings.filterwarnings('ignore')

# Set a random seed for reproducibility
np.random.seed(42)

# Plotting style
plt.style.use('seaborn-v0_8')
```

## 10.3 Distance Metrics

### 10.3.1 Common Distance Metrics

The core of KNN is computing distances between samples. Commonly used metrics include the Euclidean, Manhattan, Chebyshev, and Minkowski distances.
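
For two points $x, y \in \mathbb{R}^n$ they are defined as:

- **Euclidean (L2)**: $d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
- **Manhattan (L1)**: $d(x, y) = \sum_{i=1}^{n}|x_i - y_i|$
- **Chebyshev ($L_\infty$)**: $d(x, y) = \max_i |x_i - y_i|$
- **Minkowski (general form, $p \ge 1$)**: $d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$, which gives Manhattan at $p = 1$, Euclidean at $p = 2$, and Chebyshev in the limit $p \to \infty$

The code below computes each of these for a pair of example points: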

```python
def demonstrate_distance_metrics():
    """Demonstrate how different distance metrics are computed."""

    # Two example points
    point1 = np.array([1, 2])
    point2 = np.array([4, 6])

    print("Coordinates of the two points:")
    print(f"Point 1: {point1}")
    print(f"Point 2: {point2}")
    print()

    # 1. Euclidean distance (L2)
    euclidean_dist = np.sqrt(np.sum((point1 - point2)**2))
    print(f"Euclidean distance: {euclidean_dist:.4f}")

    # 2. Manhattan distance (L1)
    manhattan_dist = np.sum(np.abs(point1 - point2))
    print(f"Manhattan distance: {manhattan_dist:.4f}")

    # 3. Chebyshev distance (L∞)
    chebyshev_dist = np.max(np.abs(point1 - point2))
    print(f"Chebyshev distance: {chebyshev_dist:.4f}")

    # 4. Minkowski distance (general form)
    def minkowski_distance(p1, p2, p):
        return np.sum(np.abs(p1 - p2)**p)**(1/p)

    print(f"Minkowski distance (p=1): {minkowski_distance(point1, point2, 1):.4f}")
    print(f"Minkowski distance (p=2): {minkowski_distance(point1, point2, 2):.4f}")
    print(f"Minkowski distance (p=3): {minkowski_distance(point1, point2, 3):.4f}")

demonstrate_distance_metrics()
```

### 10.3.2 Visualizing the Distance Metrics

```python
def visualize_distance_metrics():
    """Plot the iso-distance contours of different distance metrics."""

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Grid around the origin
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)

    # Euclidean distance: contours are circles
    euclidean = np.sqrt(X**2 + Y**2)
    axes[0].contour(X, Y, euclidean, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[0].set_title('Euclidean iso-distance contours')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].grid(True, alpha=0.3)
    axes[0].plot(0, 0, 'ko', markersize=8)

    # Manhattan distance: contours are diamonds
    manhattan = np.abs(X) + np.abs(Y)
    axes[1].contour(X, Y, manhattan, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[1].set_title('Manhattan iso-distance contours')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].grid(True, alpha=0.3)
    axes[1].plot(0, 0, 'ko', markersize=8)

    # Chebyshev distance: contours are squares
    chebyshev = np.maximum(np.abs(X), np.abs(Y))
    axes[2].contour(X, Y, chebyshev, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[2].set_title('Chebyshev iso-distance contours')
    axes[2].set_xlabel('X')
    axes[2].set_ylabel('Y')
    axes[2].grid(True, alpha=0.3)
    axes[2].plot(0, 0, 'ko', markersize=8)

    plt.tight_layout()
    plt.show()

visualize_distance_metrics()
```

## 10.4 KNN Classification

### 10.4.1 Basic KNN Classification

```python
# Create binary classification data
X_class, y_class = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Create the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Predict (predict_proba returns the fraction of the K neighbors in each class)
y_pred = knn_classifier.predict(X_test)
y_pred_proba = knn_classifier.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN classification accuracy: {accuracy:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test, y_pred))

# Visualize the data distribution
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
colors = ['red', 'blue']
for i, color in enumerate(colors):
    mask = y_train == i
    plt.scatter(X_train[mask, 0], X_train[mask, 1],
                c=color, label=f'Class {i}', alpha=0.7)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training data distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data and predictions
plt.subplot(1, 2, 2)
for i, color in enumerate(colors):
    mask = y_test == i
    plt.scatter(X_test[mask, 0], X_test[mask, 1],
                c=color, label=f'True class {i}', alpha=0.7, s=100)

# Mark the misclassified points
wrong_predictions = y_test != y_pred
plt.scatter(X_test[wrong_predictions, 0], X_test[wrong_predictions, 1],
            facecolors='none', edgecolors='black', s=200, linewidths=2,
            label='Misclassified')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test results')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### 10.4.2 Visualizing the Decision Boundary

```python
def plot_knn_decision_boundary(X, y, k_values, title="KNN decision boundaries"):
    """Plot KNN decision boundaries for several values of K."""

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(title, fontsize=16)

    for i, k in enumerate(k_values):
        row = i // 2
        col = i % 2

        # Train KNN
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X, y)

        # Build a grid covering the data
        h = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # Predict every grid point
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Draw the decision regions
        axes[row, col].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')

        # Draw the data points
        colors = ['red', 'blue']
        for j, color in enumerate(colors):
            mask = y == j
            axes[row, col].scatter(X[mask, 0], X[mask, 1],
                                   c=color, alpha=0.7, s=50)

        axes[row, col].set_title(f'K = {k}')
        axes[row, col].set_xlabel('Feature 1')
        axes[row, col].set_ylabel('Feature 2')
        axes[row, col].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Compare decision boundaries for different K values
k_values = [1, 3, 5, 15]
plot_knn_decision_boundary(X_train, y_train, k_values, "KNN decision boundaries for different K values")
```

### 10.4.3 Analyzing the Effect of K

```python
# Analyze how different K values affect performance
k_range = range(1, 31)
train_scores = []
test_scores = []

print("Effect of K on KNN performance:")
print("K\tTrain accuracy\tTest accuracy")
print("-" * 35)

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    train_pred = knn.predict(X_train)
    test_pred = knn.predict(X_test)

    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)

    train_scores.append(train_acc)
    test_scores.append(test_acc)

    if k <= 10 or k % 5 == 0:
        print(f"{k}\t{train_acc:.4f}\t\t{test_acc:.4f}")

# Visualize the effect of K
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Train accuracy', linewidth=2, markersize=6)
plt.plot(k_range, test_scores, 's-', label='Test accuracy', linewidth=2, markersize=6)

plt.xlabel('K')
plt.ylabel('Accuracy')
plt.title('Effect of K on KNN performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 31, 2))
plt.show()

# Find the best K (here on the test set; in practice use cross-validation, see 10.8)
best_k = k_range[np.argmax(test_scores)]
best_test_score = max(test_scores)
print(f"\nBest K: {best_k}")
print(f"Best test accuracy: {best_test_score:.4f}")
```

## 10.5 The Importance of Feature Scaling

### 10.5.1 Comparing Scaled and Unscaled Features

```python
# Create data whose features have very different scales
np.random.seed(42)
n_samples = 200

# Feature 1: small range [0, 1]
feature1 = np.random.uniform(0, 1, n_samples)

# Feature 2: large range [0, 1000]
feature2 = np.random.uniform(0, 1000, n_samples)

# Labels based on a combination of both features
labels = ((feature1 > 0.5) & (feature2 > 500)).astype(int)

X_scale = np.column_stack([feature1, feature2])
y_scale = labels

print("Statistics before scaling:")
print(f"Feature 1 range: [{X_scale[:, 0].min():.3f}, {X_scale[:, 0].max():.3f}]")
print(f"Feature 2 range: [{X_scale[:, 1].min():.3f}, {X_scale[:, 1].max():.3f}]")

# Split the data
X_train_scale, X_test_scale, y_train_scale, y_test_scale = train_test_split(
    X_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)

# 1. KNN without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train_scale, y_train_scale)
y_pred_no_scale = knn_no_scale.predict(X_test_scale)
acc_no_scale = accuracy_score(y_test_scale, y_pred_no_scale)

# 2. KNN after standardization
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train_scale)
X_test_std = scaler_std.transform(X_test_scale)

knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train_scale)
y_pred_std = knn_std.predict(X_test_std)
acc_std = accuracy_score(y_test_scale, y_pred_std)

# 3. KNN after min-max scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train_scale)
X_test_minmax = scaler_minmax.transform(X_test_scale)

knn_minmax = KNeighborsClassifier(n_neighbors=5)
knn_minmax.fit(X_train_minmax, y_train_scale)
y_pred_minmax = knn_minmax.predict(X_test_minmax)
acc_minmax = accuracy_score(y_test_scale, y_pred_minmax)

print("\nEffect of feature scaling on KNN performance:")
print(f"No scaling: {acc_no_scale:.4f}")
print(f"Standardization: {acc_std:.4f}")
print(f"Min-max scaling: {acc_minmax:.4f}")

# Visualize the effect of scaling
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Effect of feature scaling on KNN', fontsize=16)

# Raw data
axes[0, 0].scatter(X_train_scale[y_train_scale==0, 0], X_train_scale[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 0].scatter(X_train_scale[y_train_scale==1, 0], X_train_scale[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 0].set_title(f'Raw data (accuracy: {acc_no_scale:.3f})')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
axes[0, 0].legend()

# Standardized data
axes[0, 1].scatter(X_train_std[y_train_scale==0, 0], X_train_std[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 1].scatter(X_train_std[y_train_scale==1, 0], X_train_std[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 1].set_title(f'Standardized data (accuracy: {acc_std:.3f})')
axes[0, 1].set_xlabel('Feature 1 (standardized)')
axes[0, 1].set_ylabel('Feature 2 (standardized)')
axes[0, 1].legend()

# Min-max scaled data
axes[1, 0].scatter(X_train_minmax[y_train_scale==0, 0], X_train_minmax[y_train_scale==0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[1, 0].scatter(X_train_minmax[y_train_scale==1, 0], X_train_minmax[y_train_scale==1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[1, 0].set_title(f'Min-max scaled (accuracy: {acc_minmax:.3f})')
axes[1, 0].set_xlabel('Feature 1 (scaled)')
axes[1, 0].set_ylabel('Feature 2 (scaled)')
axes[1, 0].legend()

# Bar chart comparing the methods
methods = ['No scaling', 'Standardization', 'Min-max']
accuracies = [acc_no_scale, acc_std, acc_minmax]
axes[1, 1].bar(methods, accuracies, color=['red', 'green', 'blue'], alpha=0.7)
axes[1, 1].set_title('Scaling method comparison')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_ylim(0, 1)

# Value labels on the bars
for i, acc in enumerate(accuracies):
    axes[1, 1].text(i, acc + 0.02, f'{acc:.3f}', ha='center')

plt.tight_layout()
plt.show()
```

## 10.6 KNN Regression

### 10.6.1 Basic KNN Regression

```python
from sklearn.metrics import mean_squared_error, r2_score

# Create regression data
X_reg, y_reg = make_regression(
    n_samples=200,
    n_features=1,
    noise=10,
    random_state=42
)

# Add some nonlinearity
X_reg = X_reg.flatten()
y_reg = y_reg + 0.1 * X_reg**2

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg.reshape(-1, 1), y_reg, test_size=0.2, random_state=42
)

# Compare KNN regression for different K values
k_values_reg = [1, 3, 5, 10, 20]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('KNN regression for different K values', fontsize=16)

print("KNN regression performance for different K values:")
print("K\tR² score\tRMSE")
print("-" * 30)

for i, k in enumerate(k_values_reg):
    row = i // 3
    col = i % 3

    # Train the KNN regressor
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)

    # Predict
    y_pred_reg = knn_reg.predict(X_test_reg)

    # Evaluate
    r2 = r2_score(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))

    print(f"{k}\t{r2:.4f}\t\t{rmse:.2f}")

    # Visualize
    axes[row, col].scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    axes[row, col].scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)

    # Plot the prediction curve
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot = knn_reg.predict(X_plot)
    axes[row, col].plot(X_plot, y_plot, color='red', linewidth=2, label='KNN prediction')

    axes[row, col].set_title(f'K={k}, R²={r2:.3f}, RMSE={rmse:.1f}')
    axes[row, col].set_xlabel('X')
    axes[row, col].set_ylabel('y')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

# Remove the unused subplot
if len(k_values_reg) < 6:
    axes[1, 2].remove()

plt.tight_layout()
plt.show()
```

### 10.6.2 Weighted KNN Regression
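
With `weights='distance'`, each neighbor contributes in proportion to the inverse of its distance (scikit-learn's inverse-distance weighting), so closer neighbors dominate the prediction. For the K nearest neighbors with distances $d_i$ and target values $y_i$:

$$\hat{y} = \frac{\sum_{i=1}^{K} w_i\, y_i}{\sum_{i=1}^{K} w_i}, \qquad w_i = \frac{1}{d_i}$$

With `weights='uniform'` (the default), all $w_i = 1$ and the prediction is the plain mean.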

```python
# Compare uniform and distance-weighted KNN
def compare_weighted_knn():
    """Compare uniform and distance-weighted KNN regression."""

    # Use the same regression data
    k = 5

    # Uniform (unweighted) KNN regression
    knn_uniform = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train_reg, y_train_reg)
    y_pred_uniform = knn_uniform.predict(X_test_reg)

    # Distance-weighted KNN regression
    knn_distance = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train_reg, y_train_reg)
    y_pred_distance = knn_distance.predict(X_test_reg)

    # Evaluate
    r2_uniform = r2_score(y_test_reg, y_pred_uniform)
    r2_distance = r2_score(y_test_reg, y_pred_distance)
    rmse_uniform = np.sqrt(mean_squared_error(y_test_reg, y_pred_uniform))
    rmse_distance = np.sqrt(mean_squared_error(y_test_reg, y_pred_distance))

    print(f"Effect of the weighting scheme on KNN regression (K={k}):")
    print(f"Uniform KNN:       R²={r2_uniform:.4f}, RMSE={rmse_uniform:.2f}")
    print(f"Distance-weighted: R²={r2_distance:.4f}, RMSE={rmse_distance:.2f}")

    # Visual comparison
    plt.figure(figsize=(15, 5))

    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot_uniform = knn_uniform.predict(X_plot)
    y_plot_distance = knn_distance.predict(X_plot)

    # Uniform KNN
    plt.subplot(1, 3, 1)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Uniform KNN')
    plt.title(f'Uniform KNN (R²={r2_uniform:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Distance-weighted KNN
    plt.subplot(1, 3, 2)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test data', s=20)
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title(f'Distance-weighted KNN (R²={r2_distance:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Side-by-side comparison
    plt.subplot(1, 3, 3)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.4, label='Training data', s=20, color='gray')
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Uniform KNN', linestyle='--')
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title('Comparison of the two schemes')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

compare_weighted_knn()
```

## 10.7 Effect of the Distance Metric

### 10.7.1 Comparing Distance Metrics

```python
# Compare the effect of different distance metrics on KNN,
# using the iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Standardize the data
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Distance metrics to compare
distance_metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
metric_results = {}

print("Effect of the distance metric on KNN classification:")
print("Metric\t\tAccuracy\tCV score")
print("-" * 50)

for metric in distance_metrics:
    if metric == 'minkowski':
        # The Minkowski distance needs the p parameter (p=3 here, to differ from L1/L2)
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)

    knn_metric.fit(X_train_iris_scaled, y_train_iris)
    y_pred_metric = knn_metric.predict(X_test_iris_scaled)

    accuracy_metric = accuracy_score(y_test_iris, y_pred_metric)
    # Cross-validate with scaling refit inside each fold to avoid leakage
    cv_pipe = Pipeline([('scaler', StandardScaler()), ('knn', knn_metric)])
    cv_scores = cross_val_score(cv_pipe, X_iris, y_iris, cv=5)
    cv_mean = np.mean(cv_scores)

    metric_results[metric] = {'accuracy': accuracy_metric, 'cv_score': cv_mean}

    print(f"{metric}\t\t{accuracy_metric:.4f}\t\t{cv_mean:.4f}")

# Visualize the effect of the distance metric
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

metrics = list(metric_results.keys())
accuracies = [metric_results[m]['accuracy'] for m in metrics]
cv_scores = [metric_results[m]['cv_score'] for m in metrics]

# Test accuracy
axes[0].bar(metrics, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Test accuracy by distance metric')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.8, 1.0)

# Cross-validation scores
axes[1].bar(metrics, cv_scores, color='lightgreen', alpha=0.7)
axes[1].set_title('Cross-validation score by distance metric')
axes[1].set_ylabel('CV score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()
```

## 10.8 Hyperparameter Tuning

### 10.8.1 Grid Search Optimization

```python
# Tune KNN hyperparameters with grid search
param_grid = {
    'n_neighbors': range(1, 21),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2, 3]  # only used by the minkowski metric
}

# Create the KNN classifier
knn_grid = KNeighborsClassifier()

# Grid search
print("Running KNN hyperparameter grid search...")
grid_search = GridSearchCV(
    knn_grid,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_iris_scaled, y_train_iris)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_iris_scaled)
test_accuracy = accuracy_score(y_test_iris, y_pred_best)
print(f"Test accuracy: {test_accuracy:.4f}")

# Detailed evaluation
print("\nDetailed evaluation of the best KNN model:")
print(classification_report(y_test_iris, y_pred_best,
                            target_names=iris.target_names))
```
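
One caveat with the search above: the scaler was fitted once on the full training split, so each CV fold reuses scaling statistics computed partly from its validation portion. A cleaner pattern, shown in this sketch, is to bundle scaling and KNN into the `Pipeline` imported earlier so the scaler is refit inside every fold; the `knn__` prefix routes grid parameters to the pipeline step:

```python
# Sketch: scaling inside the CV loop via a Pipeline (raw features go in)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

pipe_grid = GridSearchCV(
    pipe,
    {'knn__n_neighbors': range(1, 21), 'knn__weights': ['uniform', 'distance']},
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
pipe_grid.fit(X_train_iris, y_train_iris)

print(f"Best pipeline parameters: {pipe_grid.best_params_}")
print(f"Best cross-validation score: {pipe_grid.best_score_:.4f}")
```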

### 10.8.2 Learning Curve Analysis

```python
from sklearn.model_selection import learning_curve

def plot_learning_curve_knn(estimator, X, y, title="KNN learning curve"):
    """Plot the learning curve for a KNN estimator."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')

    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')

    plt.xlabel('Number of training samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot the learning curve of the best KNN model
plot_learning_curve_knn(best_knn, X_iris, y_iris, "Learning curve of the best KNN model")
```

## 10.9 Practical Applications

### 10.9.1 Handwritten Digit Recognition

```python
from sklearn.datasets import load_digits

# Load the handwritten digits dataset (8×8 images)
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print("Handwritten digits dataset:")
print(f"Samples: {X_digits.shape[0]}")
print(f"Features: {X_digits.shape[1]}")
print(f"Classes: {len(np.unique(y_digits))}")

# Show some digit samples
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('Handwritten digit samples', fontsize=16)

for i in range(10):
    row = i // 5
    col = i % 5

    # First sample of each digit
    idx = np.where(y_digits == i)[0][0]
    digit_image = X_digits[idx].reshape(8, 8)

    axes[row, col].imshow(digit_image, cmap='gray')
    axes[row, col].set_title(f'Digit {i}')
    axes[row, col].axis('off')

plt.tight_layout()
plt.show()

# Split the data
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Standardize the features
scaler_digits = StandardScaler()
X_train_digits_scaled = scaler_digits.fit_transform(X_train_digits)
X_test_digits_scaled = scaler_digits.transform(X_test_digits)

# Train the KNN classifier
knn_digits = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_digits.fit(X_train_digits_scaled, y_train_digits)

# Predict
y_pred_digits = knn_digits.predict(X_test_digits_scaled)
accuracy_digits = accuracy_score(y_test_digits, y_pred_digits)

print(f"\nKNN accuracy on handwritten digits: {accuracy_digits:.4f}")

# Detailed classification report
print("\nDetailed classification report:")
print(classification_report(y_test_digits, y_pred_digits))

# Confusion matrix
cm_digits = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(10, 8))
sns.heatmap(cm_digits, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.title('Confusion matrix for digit recognition')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Inspect the misclassified samples
wrong_predictions = y_test_digits != y_pred_digits
wrong_indices = np.where(wrong_predictions)[0]

if len(wrong_indices) > 0:
    print(f"\nMisclassified samples: {len(wrong_indices)}")

    # Show some of the misclassified samples
    fig, axes = plt.subplots(2, 5, figsize=(12, 6))
    fig.suptitle('Misclassified samples', fontsize=16)

    for i in range(min(10, len(wrong_indices))):
        row = i // 5
        col = i % 5

        idx = wrong_indices[i]
        digit_image = X_test_digits[idx].reshape(8, 8)
        true_label = y_test_digits[idx]
        pred_label = y_pred_digits[idx]

        axes[row, col].imshow(digit_image, cmap='gray')
        axes[row, col].set_title(f'True: {true_label}, Predicted: {pred_label}')
        axes[row, col].axis('off')

    # Remove the unused subplots
    for i in range(min(10, len(wrong_indices)), 10):
        row = i // 5
        col = i % 5
        axes[row, col].remove()

    plt.tight_layout()
    plt.show()
```

### 10.9.2 Recommendation System Basics

```python
from sklearn.neighbors import NearestNeighbors

# Build a simple KNN-based recommender
def create_recommendation_system():
    """Create a simple KNN-based recommendation system."""

    # Create a user-item rating matrix
    np.random.seed(42)
    n_users = 100
    n_items = 20

    # Generate a sparse rating matrix
    ratings = np.zeros((n_users, n_items))

    # Each user rates a random subset of items
    for user in range(n_users):
        # Each user rates 5-15 items
        n_ratings = np.random.randint(5, 16)
        items_to_rate = np.random.choice(n_items, n_ratings, replace=False)

        for item in items_to_rate:
            # Ratings from 1 to 5
            ratings[user, item] = np.random.randint(1, 6)

    print("User-item rating matrix:")
    print(f"Users: {n_users}")
    print(f"Items: {n_items}")
    print(f"Rating density: {np.count_nonzero(ratings) / (n_users * n_items):.2%}")

    # Use KNN to find similar users; drop users with no ratings
    user_profiles = ratings[np.sum(ratings, axis=1) > 0]

    # We only need neighbor lookup (no labels), so the unsupervised
    # NearestNeighbors class is the right tool; the cosine metric
    # requires brute-force search
    knn_recommender = NearestNeighbors(
        n_neighbors=5,
        metric='cosine',
        algorithm='brute'
    )
    knn_recommender.fit(user_profiles)

    # Recommend for a new user
    new_user_ratings = np.zeros(n_items)
    # The new user has rated a few items
    rated_items = [0, 2, 5, 8, 12]
    new_user_ratings[rated_items] = [4, 5, 3, 4, 5]

    print("\nNew user's ratings:")
    for item in rated_items:
        print(f"Item {item}: {new_user_ratings[item]}")

    # Find similar users
    distances, indices = knn_recommender.kneighbors([new_user_ratings], n_neighbors=5)

    print("\nTop 5 most similar users:")
    for dist, idx in zip(distances[0], indices[0]):
        print(f"User {idx}: distance = {dist:.3f}")

    # Recommend items based on the similar users
    similar_users = user_profiles[indices[0]]

    # Recommendation score = mean rating among the similar users
    recommendation_scores = np.mean(similar_users, axis=0)

    # Exclude items the new user has already rated
    recommendation_scores[rated_items] = 0

    # Pick the top recommendations
    recommended_items = np.argsort(recommendation_scores)[::-1][:5]

    print("\nRecommended items:")
    for i, item in enumerate(recommended_items):
        if recommendation_scores[item] > 0:
            print(f"{i+1}. Item {item}: score = {recommendation_scores[item]:.2f}")

    return ratings, recommendation_scores

ratings_matrix, rec_scores = create_recommendation_system()

# Visualize the rating matrix
plt.figure(figsize=(12, 8))
plt.imshow(ratings_matrix[:20, :], cmap='YlOrRd', aspect='auto')
plt.colorbar(label='Rating')
plt.title('User-item rating matrix (first 20 users)')
plt.xlabel('Item ID')
plt.ylabel('User ID')
plt.show()
```

## 10.10 KNN Optimization Techniques

### 10.10.1 The Curse of Dimensionality

```python
# Demonstrate the effect of the curse of dimensionality on KNN
def demonstrate_curse_of_dimensionality():
    """Show how KNN performance degrades as dimensionality grows."""

    dimensions = [2, 5, 10, 20, 50, 100]
    accuracies = []

    print("Effect of the curse of dimensionality on KNN:")
    print("Dim\tSamples\tAccuracy\tTime")
    print("-" * 45)

    for dim in dimensions:
        # Create data of the given dimensionality
        X_dim, y_dim = make_classification(
            n_samples=500,
            n_features=dim,
            n_informative=min(dim, 10),
            n_redundant=0,
            n_clusters_per_class=1,
            random_state=42
        )

        # Split the data
        X_train_dim, X_test_dim, y_train_dim, y_test_dim = train_test_split(
            X_dim, y_dim, test_size=0.2, random_state=42, stratify=y_dim
        )

        # Standardize
        scaler_dim = StandardScaler()
        X_train_dim_scaled = scaler_dim.fit_transform(X_train_dim)
        X_test_dim_scaled = scaler_dim.transform(X_test_dim)

        # Train KNN (the timing covers fit + predict, since KNN defers its work to prediction)
        start_time = time.time()
        knn_dim = KNeighborsClassifier(n_neighbors=5)
        knn_dim.fit(X_train_dim_scaled, y_train_dim)

        # Predict
        y_pred_dim = knn_dim.predict(X_test_dim_scaled)
        train_time = time.time() - start_time

        # Evaluate
        accuracy_dim = accuracy_score(y_test_dim, y_pred_dim)
        accuracies.append(accuracy_dim)

        print(f"{dim}\t{len(X_dim)}\t{accuracy_dim:.4f}\t\t{train_time:.4f}s")

    # Visualize accuracy vs. dimensionality
    plt.figure(figsize=(10, 6))
    plt.plot(dimensions, accuracies, 'o-', linewidth=2, markersize=8)
    plt.xlabel('Number of features')
    plt.ylabel('Accuracy')
    plt.title('Effect of the curse of dimensionality on KNN')
    plt.grid(True, alpha=0.3)
    plt.show()

    return dimensions, accuracies

dims, accs = demonstrate_curse_of_dimensionality()
```

### 10.10.2 Feature Selection

```python
# Use feature selection to improve KNN performance
from sklearn.feature_selection import SelectKBest, f_classif

# Build a high-dimensional dataset
X_high_dim, y_high_dim = make_classification(
    n_samples=1000,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    random_state=42
)

X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_high_dim, y_high_dim, test_size=0.2, random_state=42, stratify=y_high_dim
)

# Standardize
scaler_hd = StandardScaler()
X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
X_test_hd_scaled = scaler_hd.transform(X_test_hd)

# Compare different numbers of selected features
k_features = [5, 10, 20, 30, 50, 100]
feature_selection_results = {}

print("Effect of feature selection on KNN performance:")
print("Features\tAccuracy\tTime")
print("-" * 35)

for k in k_features:
    if k < X_high_dim.shape[1]:
        # Select the k best features by ANOVA F-score
        selector = SelectKBest(f_classif, k=k)
        X_train_selected = selector.fit_transform(X_train_hd_scaled, y_train_hd)
        X_test_selected = selector.transform(X_test_hd_scaled)
    else:
        X_train_selected = X_train_hd_scaled
        X_test_selected = X_test_hd_scaled

    # Train KNN and time fit + predict
    start_time = time.time()
    knn_fs = KNeighborsClassifier(n_neighbors=5)
    knn_fs.fit(X_train_selected, y_train_hd)

    y_pred_fs = knn_fs.predict(X_test_selected)
    train_time = time.time() - start_time

    accuracy_fs = accuracy_score(y_test_hd, y_pred_fs)
    feature_selection_results[k] = {'accuracy': accuracy_fs, 'time': train_time}

    print(f"{k}\t{accuracy_fs:.4f}\t\t{train_time:.4f}s")

# Visualize the effect of feature selection
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

features = list(feature_selection_results.keys())
accuracies_fs = [feature_selection_results[k]['accuracy'] for k in features]
times_fs = [feature_selection_results[k]['time'] for k in features]

# Accuracy
axes[0].plot(features, accuracies_fs, 'o-', linewidth=2, markersize=8, color='blue')
axes[0].set_xlabel('Number of features')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs. number of features')
axes[0].grid(True, alpha=0.3)

# Runtime
axes[1].plot(features, times_fs, 's-', linewidth=2, markersize=8, color='red')
axes[1].set_xlabel('Number of features')
axes[1].set_ylabel('Time (s)')
axes[1].set_title('Runtime vs. number of features')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

### 10.10.3 Neighbor Search Algorithms

```python
# Compare the neighbor-search structures available in scikit-learn.
# Note: ball_tree, kd_tree, and brute all return exact neighbors and differ
# only in speed; true approximate nearest-neighbor search requires external
# libraries such as FAISS or Annoy.

def compare_search_algorithms():
    """Compare the performance of different neighbor-search algorithms."""

    # Create a larger dataset
    X_large, y_large = make_classification(
        n_samples=5000,
        n_features=50,
        n_informative=30,
        random_state=42
    )

    # Standardize
    scaler_large = StandardScaler()
    X_large_scaled = scaler_large.fit_transform(X_large)

    # Split the data
    X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
        X_large_scaled, y_large, test_size=0.2, random_state=42
    )

    # Search algorithms to compare
    algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
    algorithm_results = {}

    print("Performance of different search algorithms:")
    print("Algorithm\tFit time\tPredict time\tAccuracy")
    print("-" * 50)

    for algorithm in algorithms:
        # Fit time (building the tree, or just storing the data for brute force)
        start_time = time.time()
        knn_alg = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
        knn_alg.fit(X_train_large, y_train_large)
        train_time = time.time() - start_time

        # Predict time
        start_time = time.time()
        y_pred_alg = knn_alg.predict(X_test_large)
        pred_time = time.time() - start_time

        # Accuracy
        accuracy_alg = accuracy_score(y_test_large, y_pred_alg)

        algorithm_results[algorithm] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg
        }

        print(f"{algorithm}\t\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}")

    # Visualize the comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    algs = list(algorithm_results.keys())
    train_times = [algorithm_results[alg]['train_time'] for alg in algs]
    pred_times = [algorithm_results[alg]['pred_time'] for alg in algs]
    accuracies_alg = [algorithm_results[alg]['accuracy'] for alg in algs]

    # Fit time
    axes[0].bar(algs, train_times, color='skyblue', alpha=0.7)
    axes[0].set_title('Fit time')
    axes[0].set_ylabel('Time (s)')
    axes[0].tick_params(axis='x', rotation=45)

    # Predict time
    axes[1].bar(algs, pred_times, color='lightgreen', alpha=0.7)
    axes[1].set_title('Predict time')
    axes[1].set_ylabel('Time (s)')
    axes[1].tick_params(axis='x', rotation=45)

    # Accuracy
    axes[2].bar(algs, accuracies_alg, color='lightcoral', alpha=0.7)
    axes[2].set_title('Accuracy')
    axes[2].set_ylabel('Accuracy')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(0.8, 1.0)

    plt.tight_layout()
    plt.show()

    return algorithm_results

alg_results = compare_search_algorithms()
```

## 10.11 KNN vs. Other Algorithms

### 10.11.1 Overall Performance Comparison

```python
# Compare KNN with other algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def comprehensive_algorithm_comparison():
    """Compare KNN with several other classifiers."""

    # Use the wine dataset
    wine = load_wine()
    X_wine, y_wine = wine.data, wine.target

    X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
        X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
    )

    # Standardize (important for some algorithms)
    scaler_wine = StandardScaler()
    X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
    X_test_wine_scaled = scaler_wine.transform(X_test_wine)

    # Define the algorithms
    algorithms = {
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Naive Bayes': GaussianNB()
    }

    results = {}

    print("Overall algorithm comparison:")
    print("Algorithm\t\tFit time\tPredict time\tAccuracy\tCV score")
    print("-" * 65)

    for name, algorithm in algorithms.items():
        # Scale-sensitive algorithms get standardized inputs
        if name in ['KNN', 'SVM', 'Logistic Regression']:
            X_train_alg = X_train_wine_scaled
            X_test_alg = X_test_wine_scaled
            # For CV, refit the scaler inside each fold via a Pipeline
            cv_estimator = Pipeline([('scaler', StandardScaler()), ('model', algorithm)])
        else:
            X_train_alg = X_train_wine
            X_test_alg = X_test_wine
            cv_estimator = algorithm

        # Fit time
        start_time = time.time()
        algorithm.fit(X_train_alg, y_train_wine)
        train_time = time.time() - start_time

        # Predict time
        start_time = time.time()
        y_pred_alg = algorithm.predict(X_test_alg)
        pred_time = time.time() - start_time

        # Accuracy
        accuracy_alg = accuracy_score(y_test_wine, y_pred_alg)

        # Cross-validation
        cv_scores = cross_val_score(cv_estimator, X_wine, y_wine, cv=5)
        cv_mean = np.mean(cv_scores)

        results[name] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg,
            'cv_score': cv_mean
        }

        print(f"{name}\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}\t\t{cv_mean:.4f}")

    return results

comparison_results = comprehensive_algorithm_comparison()

# Visualize the comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Overall algorithm comparison', fontsize=16)

names = list(comparison_results.keys())
train_times = [comparison_results[name]['train_time'] for name in names]
pred_times = [comparison_results[name]['pred_time'] for name in names]
accuracies = [comparison_results[name]['accuracy'] for name in names]
cv_scores = [comparison_results[name]['cv_score'] for name in names]

# Fit time
axes[0, 0].bar(names, train_times, color='skyblue', alpha=0.7)
axes[0, 0].set_title('Fit time')
axes[0, 0].set_ylabel('Time (s)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Predict time
axes[0, 1].bar(names, pred_times, color='lightgreen', alpha=0.7)
axes[0, 1].set_title('Predict time')
axes[0, 1].set_ylabel('Time (s)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Test accuracy
axes[1, 0].bar(names, accuracies, color='lightcoral', alpha=0.7)
axes[1, 0].set_title('Test accuracy')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].set_ylim(0.8, 1.0)

# Cross-validation scores
axes[1, 1].bar(names, cv_scores, color='gold', alpha=0.7)
axes[1, 1].set_title('Cross-validation score')
axes[1, 1].set_ylabel('CV score')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()
```

## 10.12 Exercises

### Exercise 1: Basic KNN
1. Train a KNN classifier on the iris dataset
2. Compare the effect of different K values (1, 3, 5, 7, 9) on performance
3. Visualize the decision boundary (pick two features)

### Exercise 2: Comparing Distance Metrics
1. Create a two-dimensional classification dataset
2. Compare the Euclidean, Manhattan, and Chebyshev distances
3. Analyze which distance metric suits your data best

### Exercise 3: Feature Scaling Experiment
1. Create a dataset whose features have very different scales
2. Compare KNN performance with standardization, min-max scaling, and no scaling
3. Explain why feature scaling matters so much for KNN

### Exercise 4: KNN Regression
1. Train a KNN regressor on the California housing dataset (fetch_california_housing; the Boston housing dataset has been removed from recent scikit-learn releases)
2. Compare different K values and weighting schemes
3. Compare the performance with linear regression

### Exercise 5: Handling High-Dimensional Data
1. Create a high-dimensional dataset (more than 50 features)
2. Use feature selection to improve KNN performance
3. Analyze the impact of the curse of dimensionality on KNN

## 10.13 Summary

In this chapter we studied the K-nearest neighbors algorithm in depth:

### Core Concepts
- **Instance-based learning**: predictions come from similar samples
- **Distance metrics**: Euclidean, Manhattan, Chebyshev, and others
- **Choice of K**: the key parameter balancing bias and variance

### Main Techniques
- **KNN classification**: majority-vote decisions
- **KNN regression**: mean or weighted-mean predictions
- **Distance weighting**: closer neighbors receive larger weights
- **Search algorithms**: brute force, KD-tree, ball tree

### Practical Skills
- **Feature scaling**: the importance of standardization and normalization
- **Hyperparameter tuning**: K, distance metric, weighting scheme
- **Performance optimization**: feature selection, search-algorithm choice
- **Real applications**: recommender systems, image recognition

### Key Takeaways
- KNN is simple and intuitive, with no training phase
- It is sensitive to feature scaling and the choice of distance metric
- Its prediction cost is high, making it unsuitable for large-scale data
- It works well on low-dimensional data; high-dimensional data calls for feature selection

### Strengths and Weaknesses

**Strengths**
- Simple to understand and implement
- No assumptions about the data distribution
- Handles multi-class problems naturally
- Captures local patterns

**Weaknesses**
- High computational and storage costs
- Sensitive to the curse of dimensionality
- Sensitive to noise and outliers
- Requires choosing a suitable K and distance metric

## 10.14 Usage Recommendations

### When to Use KNN
- **Small to medium-sized datasets**
- **Nonlinear decision boundaries**
- **When a simple baseline model is needed**
- **Problems where local patterns matter**

### When to Avoid KNN
- **Large-scale datasets**
- **High-dimensional data (without feature selection)**
- **Real-time prediction requirements**
- **Limited storage**

### Best Practices
1. **Always scale the features**
2. **Select K with cross-validation** (see the sketch after this list)
3. **Consider feature selection or dimensionality reduction**
4. **Choose the distance metric to match the data**
5. **Consider approximate methods for large datasets**
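
As a compact illustration of practices 1-3, the sketch below (self-contained, using the iris dataset purely for concreteness) selects K by cross-validating a scaled pipeline:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Mean 5-fold CV accuracy for each candidate K, with scaling refit per fold
cv_means = []
for k in range(1, 21):
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=k))])
    cv_means.append(cross_val_score(pipe, X, y, cv=5).mean())

best_k = int(np.argmax(cv_means)) + 1  # the range starts at K=1
print(f"Best K by 5-fold CV: {best_k} (accuracy {max(cv_means):.4f})")
```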

## 10.15 Next Steps

You now have a solid grasp of K-nearest neighbors, an intuitive yet powerful algorithm. In the next chapter, [Clustering](./sklearn-clustering.md), we enter the world of unsupervised learning and learn how to discover hidden patterns in data without labels.

---

**Chapter Recap**
- ✅ Understood the basic principle and steps of KNN
- ✅ Learned how to choose and apply different distance metrics
- ✅ Implemented KNN classification and regression
- ✅ Saw why feature scaling matters for KNN
- ✅ Learned how to tune KNN hyperparameters
- ✅ Can handle the curse of dimensionality and optimize performance
- ✅ Understood KNN's strengths and limitations in practice
