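Chapter 4: Linear Regression in Depth

Linear regression is one of the most basic and most important algorithms in machine learning. It is easy to understand, and it lays the groundwork for more complex algorithms. This chapter covers the principles, implementation, and application of linear regression.

4.1 What Is Linear Regression?

Linear regression is a supervised learning algorithm for predicting continuous values. It assumes a linear relationship between the target variable and the feature variables.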

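4.1.1 Mathematical Formulation

Simple linear regression (one feature):

y = β₀ + β₁x + ε

Multiple linear regression (several features):

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

where:

y: the target variable (dependent variable)
x: the feature variables (independent variables)
β₀: the intercept (bias term)
β₁, β₂, ..., βₙ: the regression coefficients (weights)
ε: the error term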
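
As a minimal sketch of how these coefficients can be estimated: ordinary least squares chooses the β values that minimize the sum of squared residuals. The example below assumes small synthetic data and uses NumPy's `np.linalg.lstsq`; the names `X_design`, `true_beta`, and `beta_hat` are illustrative and do not appear elsewhere in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_beta = np.array([1.0, 2.0, -3.0])  # [β0, β1, β2]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so that β0 is estimated together with the weights
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution: minimizes ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated [β0, β1, β2]:", np.round(beta_hat, 3))
```

With low noise the estimates should land close to the values used to generate the data.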

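4.1.2 Core Assumptions

Linear regression rests on the following assumptions:

Linearity: the target variable is a linear function of the features
Independence: the observations are independent of one another
Homoscedasticity: the error term has constant variance
Normality: the error term is normally distributed (this matters mainly for confidence intervals and hypothesis tests, not for computing the fit itself)
No perfect multicollinearity: no feature is an exact linear combination of the others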
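
One way to probe the multicollinearity assumption is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 − R²). The sketch below is a plain NumPy version under the assumption of a small dense feature matrix; `variance_inflation_factors` is a hypothetical helper written for this example, not a scikit-learn function. A commonly cited rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual @ residual / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.05 * rng.normal(size=300)  # nearly a copy of `a`

vif = variance_inflation_factors(np.column_stack([a, b, c]))
print(np.round(vif, 1))
```

Here the first and third columns should show very large values because each almost determines the other, while the independent second column should stay near 1.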

4.2 Preparing the Data and Environment

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

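# Set the random seed for reproducibility
np.random.seed(42)

# Configure the plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK-capable font, needed only if labels contain Chinese text
plt.rcParams['axes.unicode_minus'] = False    # keep minus signs rendering correctly with that font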

4.3 Simple Linear Regression

4.3.1 Generating Example Data

python

# Generate simple linear data
def generate_simple_data(n_samples=100, noise=10):
    """Generate data for simple linear regression."""
    np.random.seed(42)
    X = np.random.uniform(0, 100, n_samples)
    y = 2 * X + 10 + np.random.normal(0, noise, n_samples)  # y = 2x + 10 + noise
    return X.reshape(-1, 1), y

# Generate the data
X_simple, y_simple = generate_simple_data(100, 15)

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.6, color='blue')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Simple linear regression data')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Data shapes: X={X_simple.shape}, y={y_simple.shape}")
4.3.2 Training a Simple Linear Regression Model

python

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Create and train the model
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

# Inspect the model parameters
print("Model parameters:")
print(f"Intercept (β₀): {model_simple.intercept_:.4f}")
print(f"Slope (β₁): {model_simple.coef_[0]:.4f}")
print("True parameters: intercept=10, slope=2")

# Make predictions
y_pred_train = model_simple.predict(X_train)
y_pred_test = model_simple.predict(X_test)

# Visualize the results
plt.figure(figsize=(12, 5))

# Training set
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, alpha=0.6, label='Training data')
plt.plot(X_train, y_pred_train, color='red', linewidth=2, label='Fitted line')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Fit on the training set')
plt.legend()
plt.grid(True, alpha=0.3)

# Test set
plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, alpha=0.6, label='Test data', color='green')
plt.plot(X_test, y_pred_test, color='red', linewidth=2, label='Predicted line')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Predictions on the test set')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
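
For a single feature, the least-squares slope and intercept also have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below assumes data of the same form as above (y = 2x + 10 plus noise) but draws its own sample with NumPy's `default_rng`, so the numbers will not match the model output exactly; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=100)
y = 2 * x + 10 + rng.normal(0, 15, size=100)  # y = 2x + 10 + noise

# Closed-form least-squares estimates for a single feature
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

The same two numbers are what a degree-1 `np.polyfit` returns for this data, which makes it a convenient cross-check.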

4.3.3 Model Evaluation

python

# Compute evaluation metrics
def evaluate_regression_model(y_true, y_pred, model_name="Model"):
    """Compute and print evaluation metrics for a regression model."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"{model_name} evaluation results:")
    print(f"Mean squared error (MSE): {mse:.4f}")
    print(f"Root mean squared error (RMSE): {rmse:.4f}")
    print(f"Mean absolute error (MAE): {mae:.4f}")
    print(f"Coefficient of determination (R²): {r2:.4f}")
    print("-" * 40)

    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}

# Evaluate performance on the training and test sets
train_metrics = evaluate_regression_model(y_train, y_pred_train, "Training set")
test_metrics = evaluate_regression_model(y_test, y_pred_test, "Test set")
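
To make the definitions behind these metrics concrete, here is a small NumPy sketch that computes them directly from the errors. `regression_metrics` is a hypothetical helper for illustration only; the four-point example data is made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)             # average squared error
    rmse = np.sqrt(mse)                    # same units as the target
    mae = np.mean(np.abs(errors))          # average absolute error
    ss_res = np.sum(errors ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot             # share of variance explained
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

metrics = regression_metrics([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(metrics)
```

Because MSE squares each error, a single large miss moves it far more than it moves MAE, which is one reason to report both.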

4.4 Multiple Linear Regression

4.4.1 Generating a Multi-Feature Dataset

python

# Create a more complex synthetic dataset
X_multi, y_multi = make_regression(
    n_samples=500,
    n_features=5,
    n_informative=3,
    noise=10,
    random_state=42
)

# Create feature names
feature_names = [f'feature_{i+1}' for i in range(X_multi.shape[1])]

# Convert to a DataFrame for easier analysis
df_multi = pd.DataFrame(X_multi, columns=feature_names)
df_multi['target'] = y_multi

print("Multiple regression dataset info:")
df_multi.info()
print("\nSummary statistics:")
print(df_multi.describe())

4.4.2 Exploratory Data Analysis

python

# Correlation analysis
plt.figure(figsize=(10, 8))
correlation_matrix = df_multi.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature correlation heatmap')
plt.tight_layout()
plt.show()

# Relationship between each feature and the target
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Features vs. target', fontsize=16)

for i, feature in enumerate(feature_names):
    row, col = divmod(i, 3)
    axes[row, col].scatter(df_multi[feature], df_multi['target'], alpha=0.6)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('target')
    axes[row, col].set_title(f'{feature} vs. target')

    # Add a least-squares trend line
    z = np.polyfit(df_multi[feature], df_multi['target'], 1)
    p = np.poly1d(z)
    axes[row, col].plot(df_multi[feature], p(df_multi[feature]), "r--", alpha=0.8)

# Remove the unused subplot
if len(feature_names) < 6:
    axes[1, 2].remove()

plt.tight_layout()
plt.show()

4.4.3 Training a Multiple Linear Regression Model

python

# Prepare the data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_multi)
X_test_scaled = scaler.transform(X_test_multi)

# Train the model
model_multi = LinearRegression()
model_multi.fit(X_train_scaled, y_train_multi)

# Inspect the model parameters
print("Multiple linear regression parameters:")
print(f"Intercept: {model_multi.intercept_:.4f}")
print("Regression coefficients:")
for i, coef in enumerate(model_multi.coef_):
    print(f"  {feature_names[i]}: {coef:.4f}")

# Visualize feature importance
# (the coefficients are comparable because the features were standardized)
plt.figure(figsize=(10, 6))
feature_importance = np.abs(model_multi.coef_)
plt.barh(feature_names, feature_importance)
plt.xlabel('Absolute coefficient value')
plt.title('Feature importance (based on regression coefficients)')
plt.tight_layout()
plt.show()
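
Coefficients fitted on standardized features are expressed per standard deviation of each feature. If coefficients in the original units are needed, they can be recovered from the scaling statistics. The sketch below shows the conversion with plain NumPy on made-up data; it assumes the same population standard deviation (ddof=0) that `StandardScaler` uses, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales (illustrative data)
X = np.column_stack([rng.normal(50, 10, 300), rng.normal(0, 0.1, 300)])
y = 3.0 + 0.5 * X[:, 0] + 20.0 * X[:, 1] + rng.normal(0, 0.5, 300)

# Standardize, then fit by least squares with an intercept column
mean, scale = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / scale
design = np.column_stack([np.ones(len(y)), X_scaled])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept_scaled, coef_scaled = params[0], params[1:]

# Convert back to the original units of each feature
coef_original = coef_scaled / scale
intercept_original = intercept_scaled - np.sum(coef_scaled * mean / scale)

print("standardized coefficients:", np.round(coef_scaled, 3))
print("original-unit coefficients:", np.round(coef_original, 3))
```

Both parameterizations produce identical predictions; only the interpretation of each coefficient changes.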

4.4.4 Prediction and Evaluation

python

# Make predictions
y_pred_train_multi = model_multi.predict(X_train_scaled)
y_pred_test_multi = model_multi.predict(X_test_scaled)

# Evaluate model performance
print("Multiple linear regression evaluation:")
train_metrics_multi = evaluate_regression_model(y_train_multi, y_pred_train_multi, "Training set")
test_metrics_multi = evaluate_regression_model(y_test_multi, y_pred_test_multi, "Test set")

# Plot predicted vs. actual values
plt.figure(figsize=(12, 5))

# Training set
plt.subplot(1, 2, 1)
plt.scatter(y_train_multi, y_pred_train_multi, alpha=0.6)
plt.plot([y_train_multi.min(), y_train_multi.max()],
         [y_train_multi.min(), y_train_multi.max()], 'r--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title(f'Training set: actual vs. predicted (R² = {train_metrics_multi["R2"]:.3f})')
plt.grid(True, alpha=0.3)

# Test set
plt.subplot(1, 2, 2)
plt.scatter(y_test_multi, y_pred_test_multi, alpha=0.6, color='green')
plt.plot([y_test_multi.min(), y_test_multi.max()],
         [y_test_multi.min(), y_test_multi.max()], 'r--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title(f'Test set: actual vs. predicted (R² = {test_metrics_multi["R2"]:.3f})')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.5 Residual Analysis

4.5.1 Residual Plots

python

from scipy import stats

# Compute the residuals
residuals_train = y_train_multi - y_pred_train_multi
residuals_test = y_test_multi - y_pred_test_multi

# Residual diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Residual analysis', fontsize=16)

# Residuals vs. predicted values
axes[0, 0].scatter(y_pred_train_multi, residuals_train, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Residual')
axes[0, 0].set_title('Residuals vs. predicted values')
axes[0, 0].grid(True, alpha=0.3)

# Histogram of residuals
axes[0, 1].hist(residuals_train, bins=30, alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Residual')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Residual distribution')
axes[0, 1].grid(True, alpha=0.3)

# Q-Q plot (visual check of normality)
stats.probplot(residuals_train, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q plot of residuals')
axes[1, 0].grid(True, alpha=0.3)

# Standardized residuals, with reference lines at ±2
standardized_residuals = residuals_train / np.std(residuals_train)
axes[1, 1].scatter(y_pred_train_multi, standardized_residuals, alpha=0.6)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].axhline(y=2, color='r', linestyle=':', alpha=0.7)
axes[1, 1].axhline(y=-2, color='r', linestyle=':', alpha=0.7)
axes[1, 1].set_xlabel('Predicted')
axes[1, 1].set_ylabel('Standardized residual')
axes[1, 1].set_title('Standardized residuals')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.5.2 Residual Statistics

python

# Statistical analysis of the residuals
def analyze_residuals(residuals, name="Residuals"):
    """Summarize the statistical properties of the residuals."""
    print(f"{name} statistics:")
    print(f"Mean: {np.mean(residuals):.6f}")
    print(f"Standard deviation: {np.std(residuals):.4f}")
    print(f"Skewness: {stats.skew(residuals):.4f}")
    print(f"Excess kurtosis: {stats.kurtosis(residuals):.4f}")

    # Normality test
    shapiro_stat, shapiro_p = stats.shapiro(residuals)
    print(f"Shapiro-Wilk normality test: statistic={shapiro_stat:.4f}, p-value={shapiro_p:.4f}")

    if shapiro_p > 0.05:
        print("✅ No evidence against normality (not rejected at the 5% level)")
    else:
        print("❌ Normality is rejected at the 5% level")

    print("-" * 50)

analyze_residuals(residuals_train, "Training-set residuals")
analyze_residuals(residuals_test, "Test-set residuals")
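
The checks above target normality and constant variance. For the independence assumption, one common summary is the Durbin-Watson statistic, which compares successive residuals: values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation. It is only meaningful when the observations have a natural order, such as time. The sketch below is a plain NumPy version on simulated residuals; `durbin_watson` here is a helper defined for this example.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=500)

# AR(1)-style residuals: each value carries over 0.8 of the previous one
correlated = np.empty(500)
correlated[0] = independent[0]
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent residuals: DW = {dw_independent:.2f}")
print(f"correlated residuals:  DW = {dw_correlated:.2f}")
```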

4.6 Regularized Linear Regression

4.6.1 Ridge Regression (L2 Regularization)

python

# Ridge regression
from sklearn.linear_model import RidgeCV

# Choose the best alpha by cross-validation
ridge_alphas = np.logspace(-3, 2, 50)
ridge_model = RidgeCV(alphas=ridge_alphas, cv=5)
ridge_model.fit(X_train_scaled, y_train_multi)

print(f"Best alpha for Ridge: {ridge_model.alpha_:.4f}")

# Predict and evaluate
y_pred_ridge = ridge_model.predict(X_test_scaled)
ridge_metrics = evaluate_regression_model(y_test_multi, y_pred_ridge, "Ridge regression")
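
To see what the L2 penalty does, it can help to solve the ridge problem directly. The sketch below minimizes ‖y − Xw‖² + α‖w‖² on centered data (so the intercept is not penalized) using the closed form w = (XᵀX + αI)⁻¹Xᵀy. `ridge_closed_form` is a hypothetical helper and the data is synthetic; this illustrates the idea and is not a description of how scikit-learn solves the problem internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y_c - X_c w||^2 + alpha * ||w||^2 on centered data."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X.shape[1]
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    weights = np.linalg.solve(gram, X_centered.T @ y_centered)
    intercept = y.mean() - X.mean(axis=0) @ weights
    return weights, intercept

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=100)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    w, b = ridge_closed_form(X, y, alpha)
    print(f"alpha={alpha:>5}: ||w|| = {np.linalg.norm(w):.3f}")
```

With alpha = 0 this reduces to ordinary least squares, and the coefficient norm shrinks as alpha grows, although individual coefficients generally do not become exactly zero.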

4.6.2 Lasso Regression (L1 Regularization)

python

# Lasso regression
from sklearn.linear_model import LassoCV

# Choose the best alpha by cross-validation
lasso_alphas = np.logspace(-4, 1, 50)
lasso_model = LassoCV(alphas=lasso_alphas, cv=5, random_state=42)
lasso_model.fit(X_train_scaled, y_train_multi)

print(f"Best alpha for Lasso: {lasso_model.alpha_:.4f}")

# Predict and evaluate
y_pred_lasso = lasso_model.predict(X_test_scaled)
lasso_metrics = evaluate_regression_model(y_test_multi, y_pred_lasso, "Lasso regression")

# Inspect the feature selection result
print("Lasso coefficients:")
for i, coef in enumerate(lasso_model.coef_):
    print(f"  {feature_names[i]}: {coef:.4f}")

# Count the nonzero coefficients
non_zero_coefs = np.sum(lasso_model.coef_ != 0)
print(f"Nonzero coefficients: {non_zero_coefs}/{len(feature_names)}")
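
Why the L1 penalty produces exact zeros while L2 does not can be seen in the special case of an orthonormal design (XᵀX = I). There, minimizing ½‖y − Xβ‖² + λ‖β‖₁ amounts to soft-thresholding each least-squares coefficient, whereas minimizing ½‖y − Xβ‖² + (λ/2)‖β‖² simply rescales every coefficient by 1/(1 + λ). The sketch below assumes that simplified setting and made-up coefficient values; `soft_threshold` is a helper defined for the example.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within ±threshold become 0."""
    z = np.asarray(z, dtype=float)
    # Adding 0.0 turns any -0.0 into 0.0 for cleaner printing
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0) + 0.0

ols_coefs = np.array([3.0, -0.5, 1.0, -2.5])  # hypothetical least-squares coefficients
penalty = 1.0

lasso_like = soft_threshold(ols_coefs, penalty)  # L1: small coefficients snap to zero
ridge_like = ols_coefs / (1.0 + penalty)         # L2: uniform shrinkage, no zeros

print("L1 (soft threshold):", lasso_like)
print("L2 (rescaling):     ", ridge_like)
```

In general designs the Lasso solution is found iteratively rather than in one step, but the same thresholding behaviour is what removes features.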

4.6.3 ElasticNet Regression (L1 + L2 Regularization)

python

# ElasticNet regression
from sklearn.linear_model import ElasticNetCV

# Choose the best parameters by cross-validation
elasticnet_model = ElasticNetCV(
    alphas=np.logspace(-4, 1, 20),
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
    cv=5,
    random_state=42
)
elasticnet_model.fit(X_train_scaled, y_train_multi)

print(f"Best alpha for ElasticNet: {elasticnet_model.alpha_:.4f}")
print(f"Best l1_ratio for ElasticNet: {elasticnet_model.l1_ratio_:.4f}")

# Predict and evaluate
y_pred_elasticnet = elasticnet_model.predict(X_test_scaled)
elasticnet_metrics = evaluate_regression_model(y_test_multi, y_pred_elasticnet, "ElasticNet regression")
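
The two parameters tuned above control how the L1 and L2 penalties are mixed. In the objective documented for scikit-learn's `ElasticNet`, the penalty term is alpha · l1_ratio · ‖w‖₁ + 0.5 · alpha · (1 − l1_ratio) · ‖w‖₂². The sketch below just evaluates that expression for a made-up weight vector; `elastic_net_penalty` is a helper written for this illustration.

```python
import numpy as np

def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term: alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||_2^2."""
    weights = np.asarray(weights, dtype=float)
    l1_term = alpha * l1_ratio * np.sum(np.abs(weights))
    l2_term = 0.5 * alpha * (1.0 - l1_ratio) * np.sum(weights ** 2)
    return l1_term + l2_term

w = np.array([2.0, -1.0, 0.0])  # hypothetical weights
for ratio in [0.0, 0.5, 1.0]:
    value = elastic_net_penalty(w, alpha=0.1, l1_ratio=ratio)
    print(f"l1_ratio={ratio}: penalty={value:.3f}")
```

Setting l1_ratio to 1 leaves only the L1 term (the Lasso penalty), and setting it to 0 leaves only the L2 term.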

4.6.4 Comparing Regularization Methods

python

# Compare the different regularization methods
models = {
    'Linear regression': model_multi,
    'Ridge': ridge_model,
    'Lasso': lasso_model,
    'ElasticNet': elasticnet_model
}

# Compare coefficients
plt.figure(figsize=(12, 8))
x_pos = np.arange(len(feature_names))
width = 0.2

for i, (name, model) in enumerate(models.items()):
    plt.bar(x_pos + i * width, model.coef_, width, label=name, alpha=0.8)

plt.xlabel('Feature')
plt.ylabel('Regression coefficient')
plt.title('Coefficients under different regularization methods')
plt.xticks(x_pos + width * 1.5, feature_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Compare performance on the test set
performance_comparison = pd.DataFrame({
    'Linear regression': test_metrics_multi,
    'Ridge': ridge_metrics,
    'Lasso': lasso_metrics,
    'ElasticNet': elasticnet_metrics
}).T

print("Model performance comparison:")
print(performance_comparison.round(4))

# Visualize the performance comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# R² comparison
performance_comparison['R2'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('R² comparison')
axes[0].set_ylabel('R²')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)

# RMSE comparison
performance_comparison['RMSE'].plot(kind='bar', ax=axes[1], color='lightcoral')
axes[1].set_title('RMSE comparison')
axes[1].set_ylabel('RMSE')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.7 Polynomial Regression

4.7.1 Generating Nonlinear Data

python

# Generate nonlinear data
def generate_polynomial_data(n_samples=100):
    """Generate data from a cubic polynomial plus noise."""
    np.random.seed(42)
    X = np.random.uniform(-2, 2, n_samples)
    y = 0.5 * X**3 - 2 * X**2 + X + 1 + np.random.normal(0, 0.5, n_samples)
    return X.reshape(-1, 1), y

X_poly, y_poly = generate_polynomial_data(150)

# Visualize the nonlinear data
plt.figure(figsize=(10, 6))
plt.scatter(X_poly, y_poly, alpha=0.6, color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Nonlinear data')
plt.grid(True, alpha=0.3)
plt.show()

4.7.2 Polynomial Feature Transformation

python

# Split the data
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y_poly, test_size=0.2, random_state=42
)

# Compare polynomial regressions of different degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

plt.figure(figsize=(15, 10))

for i, degree in enumerate(degrees):
    # Create the polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly_features = poly_features.fit_transform(X_train_poly)
    X_test_poly_features = poly_features.transform(X_test_poly)

    # Train the model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly_features, y_train_poly)

    # Predict
    y_pred_poly = poly_model.predict(X_test_poly_features)

    # Evaluate
    r2 = r2_score(y_test_poly, y_pred_poly)
    rmse = np.sqrt(mean_squared_error(y_test_poly, y_pred_poly))

    poly_results[degree] = {'R2': r2, 'RMSE': rmse}

    # Visualize
    plt.subplot(2, 3, i + 1)

    # Plot the data points
    plt.scatter(X_train_poly, y_train_poly, alpha=0.6, label='Training data')
    plt.scatter(X_test_poly, y_test_poly, alpha=0.6, color='green', label='Test data')

    # Plot the fitted curve
    X_plot = np.linspace(X_poly.min(), X_poly.max(), 100).reshape(-1, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = poly_model.predict(X_plot_poly)
    plt.plot(X_plot, y_plot, color='red', linewidth=2, label=f'Polynomial fit (degree={degree})')

    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'degree={degree}, R²={r2:.3f}, RMSE={rmse:.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance summary
poly_df = pd.DataFrame(poly_results).T
print("Polynomial regression performance comparison:")
print(poly_df.round(4))
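
Polynomial regression is still linear regression: the model is linear in the coefficients, and only the inputs are expanded. For a single feature the expanded matrix is just the columns 1, x, x², ..., which can be built by hand with `np.vander`. The sketch below assumes the same cubic used to generate the data above, drawn with its own random generator, and fits it by least squares; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=150)
# Same cubic as the generator above: 0.5x^3 - 2x^2 + x + 1 + noise
y = 0.5 * x**3 - 2 * x**2 + x + 1 + rng.normal(0, 0.5, size=150)

degree = 3
# Columns are [1, x, x^2, x^3]
X_poly_manual = np.vander(x, N=degree + 1, increasing=True)

# An ordinary linear least-squares fit on the expanded columns
coefs, *_ = np.linalg.lstsq(X_poly_manual, y, rcond=None)
print("estimated [c0, c1, c2, c3]:", np.round(coefs, 2))
```

The estimates should land near the generating coefficients 1, 1, −2, and 0.5, up to noise.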

4.7.3 Regularized Polynomial Regression

python

# Use Ridge regularization to guard against overfitting
degree = 5  # use a high-degree polynomial
poly_features = PolynomialFeatures(degree=degree)
X_train_poly_high = poly_features.fit_transform(X_train_poly)
X_test_poly_high = poly_features.transform(X_test_poly)

print(f"Number of polynomial features: {X_train_poly_high.shape[1]}")

# Compare different regularization strengths
alphas = [0, 0.1, 1, 10, 100]
plt.figure(figsize=(15, 10))

for i, alpha in enumerate(alphas):
    if alpha == 0:
        model = LinearRegression()
        model_name = "No regularization"
    else:
        model = Ridge(alpha=alpha)
        model_name = f"Ridge (α={alpha})"

    model.fit(X_train_poly_high, y_train_poly)

    # Predict
    y_pred = model.predict(X_test_poly_high)
    r2 = r2_score(y_test_poly, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test_poly, y_pred))

    # Visualize
    plt.subplot(2, 3, i + 1)
    plt.scatter(X_train_poly, y_train_poly, alpha=0.6, label='Training data')
    plt.scatter(X_test_poly, y_test_poly, alpha=0.6, color='green', label='Test data')

    # Plot the fitted curve
    X_plot = np.linspace(X_poly.min(), X_poly.max(), 100).reshape(-1, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    plt.plot(X_plot, y_plot, color='red', linewidth=2, label=model_name)

    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'{model_name}\nR²={r2:.3f}, RMSE={rmse:.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
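
One caveat with penalizing raw polynomial terms is that they live on very different scales (for x in [−2, 2], x⁵ ranges up to 32), so a single alpha does not penalize them evenly. A possible refinement, sketched below under the assumption of the same cubic data-generating process, is to chain the expansion, a `StandardScaler`, and `Ridge` in one pipeline and score it with cross-validation. The names `X_demo`, `y_demo`, and `poly_ridge` are illustrative, and alpha=1.0 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(150, 1))
y_demo = (0.5 * X_demo[:, 0]**3 - 2 * X_demo[:, 0]**2 + X_demo[:, 0] + 1
          + rng.normal(0, 0.5, size=150))

# Expand to degree-5 terms, put them on a common scale, then apply the L2 penalty
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# The whole pipeline is refit inside every fold, so the scaling statistics
# never see the held-out data
cv_scores = cross_val_score(poly_ridge, X_demo, y_demo, cv=5, scoring="r2")
print(f"mean CV R²: {cv_scores.mean():.3f}")
```

Wrapping the steps in a pipeline also means a single `fit` or `predict` call applies them in the right order.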

4.8 Cross-Validation and Model Selection

4.8.1 Learning Curves

python

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, title="Learning curve"):
    """Plot a learning curve."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='r2'
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')

    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')

    plt.xlabel('Number of training samples')
    plt.ylabel('R² score')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot learning curves for different models
plot_learning_curve(LinearRegression(), X_train_scaled, y_train_multi, "Learning curve: linear regression")
plot_learning_curve(Ridge(alpha=1.0), X_train_scaled, y_train_multi, "Learning curve: Ridge regression")
4.8.2 Validation Curves

python

from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, X, y, param_name, param_range, title="Validation curve"):
    """Plot a validation curve."""
    train_scores, val_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=5, scoring='r2', n_jobs=-1
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.semilogx(param_range, train_mean, 'o-', color='blue', label='Training score')
    plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')

    plt.semilogx(param_range, val_mean, 'o-', color='red', label='Validation score')
    plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')

    plt.xlabel(param_name)
    plt.ylabel('R² score')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Validation curve for Ridge regression
alpha_range = np.logspace(-4, 2, 20)
plot_validation_curve(
    Ridge(), X_train_scaled, y_train_multi,
    'alpha', alpha_range, "Validation curve: Ridge regression"
)
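
Both curves rely on k-fold cross-validation (`cv=5`). As a rough sketch of what happens inside, the NumPy code below splits the data into five folds, fits ordinary least squares on four of them, and scores R² on the fifth, rotating through all folds. `k_fold_r2` is a hypothetical helper and the data is synthetic; unlike scikit-learn's default integer `cv` for regressors, this version shuffles the indices first.

```python
import numpy as np

def k_fold_r2(X, y, n_splits=5, seed=0):
    """Return the validation R² of a least-squares fit for each fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_splits)
    scores = []
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Fit on the training folds (intercept column added by hand)
        design_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        params, *_ = np.linalg.lstsq(design_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        design_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        pred = design_val @ params
        ss_res = np.sum((y[val_idx] - pred) ** 2)
        ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

scores = k_fold_r2(X, y)
print("fold R² scores:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The spread across folds is what the shaded bands in the learning and validation curves represent.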

4.9 A Practical Case Study

4.9.1 House Price Prediction

python

# Create a synthetic house price dataset
def create_house_price_dataset():
    """Create a synthetic dataset for house price prediction."""
    np.random.seed(42)
    n_samples = 1000

    # Generate the features
    area = np.random.normal(150, 50, n_samples)  # floor area (m²)
    bedrooms = np.random.poisson(3, n_samples)  # number of bedrooms
    age = np.random.exponential(10, n_samples)  # age of the house (years)
    distance_to_center = np.random.exponential(5, n_samples)  # distance to the city center (km)

    # Generate the target (price)
    price = (
        area * 500 +  # effect of area
        bedrooms * 10000 +  # effect of bedroom count
        -age * 1000 +  # older houses are cheaper
        -distance_to_center * 2000 +  # houses farther out are cheaper
        np.random.normal(0, 20000, n_samples)  # noise
    )

    # Clip prices at a floor of 50,000 so that none are negative
    price = np.maximum(price, 50000)

    data = pd.DataFrame({
        'area': area,
        'bedrooms': bedrooms,
        'age': age,
        'distance_to_center': distance_to_center,
        'price': price
    })

    return data

# Create the dataset
house_data = create_house_price_dataset()

print("House price dataset summary:")
print(house_data.describe())

# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('House price dataset: features vs. price', fontsize=16)

features = ['area', 'bedrooms', 'age', 'distance_to_center']
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2

    axes[row, col].scatter(house_data[feature], house_data['price'], alpha=0.6)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('price')
    axes[row, col].set_title(f'{feature} vs. price')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.9.2 Building the House Price Model

python

# Prepare the data
X_house = house_data[features]
y_house = house_data['price']

# Split the data
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42
)

# Build a pipeline for preprocessing and modeling
house_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=1.0))
])

# Train the model
house_pipeline.fit(X_train_house, y_train_house)

# Predict
y_pred_house = house_pipeline.predict(X_test_house)

# Evaluate
house_metrics = evaluate_regression_model(y_test_house, y_pred_house, "House price model")

# Feature importance (absolute coefficients on standardized features)
regressor = house_pipeline.named_steps['regressor']
feature_importance = np.abs(regressor.coef_)

plt.figure(figsize=(10, 6))
plt.barh(features, feature_importance)
plt.xlabel('Absolute coefficient value')
plt.title('Feature importance in the house price model')
plt.tight_layout()
plt.show()

# Example predictions
sample_houses = pd.DataFrame({
    'area': [120, 200, 80],
    'bedrooms': [2, 4, 1],
    'age': [5, 15, 25],
    'distance_to_center': [3, 8, 1]
})

predicted_prices = house_pipeline.predict(sample_houses)

print("Example house price predictions:")
for i, (_, house) in enumerate(sample_houses.iterrows()):
    print(f"House {i+1}:")
    print(f"  Area: {house['area']} m², bedrooms: {house['bedrooms']}, age: {house['age']} years, distance: {house['distance_to_center']} km")
    print(f"  Predicted price: ¥{predicted_prices[i]:,.0f}")
    print()

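4.10 Exercises

Exercise 1: Basic Linear Regression

Use make_regression to generate a noisy dataset
Train a linear regression model and evaluate its performance
Check whether the residuals satisfy the normality assumption

Exercise 2: Feature Engineering

Create a dataset that contains categorical features
Handle the categorical features with one-hot encoding
Compare model performance before and after the encoding

Exercise 3: Comparing Regularization Methods

Generate a high-dimensional dataset (more features than samples)
Compare the performance of linear regression, Ridge, Lasso, and ElasticNet
Analyze the feature selection behavior of each regularization method

Exercise 4: Polynomial Regression

Generate a complex nonlinear dataset
Fit the data with polynomial regressions of different degrees
Use cross-validation to choose the best polynomial degree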

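4.11 Summary

This chapter covered linear regression from several angles:

Core Concepts

Principles of linear regression: assumptions and mathematical formulation
Model evaluation: MSE, RMSE, MAE, R², and related metrics
Residual analysis: checking whether the model assumptions hold

Main Techniques

Simple linear regression: prediction from a single feature
Multiple linear regression: prediction from several features
Regularization methods: Ridge, Lasso, ElasticNet
Polynomial regression: handling nonlinear relationships

Practical Skills

Data preprocessing: standardization and feature engineering
Model selection: cross-validation, learning curves, and validation curves
Performance evaluation: using several evaluation metrics together
Interpreting results: reading coefficients and feature importance

Key Takeaways

Linear regression is a foundation for understanding machine learning
Regularization helps prevent overfitting, and the L1-based methods (Lasso, ElasticNet) can also perform feature selection
Residual analysis is an important tool for validating a model
Feature engineering has a significant effect on model performance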

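4.12 Next Steps

You now have the core knowledge of linear regression. The next chapter, Logistic Regression in Practice, turns to classification problems and introduces logistic regression as a classification algorithm.

Chapter checklist:

✅ Understand the mathematical formulation and assumptions of linear regression
✅ Implement simple and multiple linear regression
✅ Use regularization to guard against overfitting
✅ Use polynomial regression for nonlinear relationships
✅ Apply model evaluation and residual analysis
✅ Build a complete regression prediction workflow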
