统计分析

SciPy 的 scipy.stats 模块是 Python 中最全面的统计分析工具包之一。它提供了大量的概率分布、统计函数和假设检验方法，是数据科学和统计分析的重要工具。本章将详细介绍如何使用这些功能进行各种统计分析。

scipy.stats 模块概述

scipy.stats 模块包含了以下主要功能：

80+ 种连续和离散概率分布
描述性统计函数
假设检验方法
相关性和回归分析
核密度估计
统计距离计算

python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# 设置绘图样式
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# 查看 stats 模块的主要功能
print("scipy.stats 主要功能:")
functions = [attr for attr in dir(stats) if not attr.startswith('_')]
print(f"总共 {len(functions)} 个函数和类")
print("部分函数:", functions[:15])

概率分布

1. 连续分布

正态分布 (Normal Distribution)

python

# 正态分布是最重要的连续分布
mu, sigma = 0, 1  # 标准正态分布

# 创建正态分布对象
norm_dist = stats.norm(loc=mu, scale=sigma)

# 概率密度函数 (PDF)
x = np.linspace(-4, 4, 1000)
pdf_values = norm_dist.pdf(x)

# 累积分布函数 (CDF)
cdf_values = norm_dist.cdf(x)

# 分位数函数 (PPF - Percent Point Function)
quantiles = [0.025, 0.25, 0.5, 0.75, 0.975]
ppf_values = norm_dist.ppf(quantiles)

print(f"正态分布 N({mu}, {sigma}²):")
print(f"分位数: {dict(zip(quantiles, ppf_values))}")

# 随机数生成
random_samples = norm_dist.rvs(size=1000, random_state=42)

# 可视化
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# PDF
axes[0, 0].plot(x, pdf_values, 'b-', linewidth=2, label='PDF')
axes[0, 0].fill_between(x, pdf_values, alpha=0.3)
axes[0, 0].set_title('概率密度函数 (PDF)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('密度')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# CDF
axes[0, 1].plot(x, cdf_values, 'r-', linewidth=2, label='CDF')
axes[0, 1].set_title('累积分布函数 (CDF)')
axes[0, 1].set_xlabel('x')
axes[0, 1].set_ylabel('概率')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 随机样本直方图
axes[1, 0].hist(random_samples, bins=50, density=True, alpha=0.7, color='green')
axes[1, 0].plot(x, pdf_values, 'b-', linewidth=2, label='理论 PDF')
axes[1, 0].set_title('随机样本分布')
axes[1, 0].set_xlabel('x')
axes[1, 0].set_ylabel('密度')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Q-Q 图
stats.probplot(random_samples, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q 图')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

其他重要的连续分布

python

# t 分布
t_dist = stats.t(df=5)  # 自由度为 5

# 卡方分布
chi2_dist = stats.chi2(df=3)  # 自由度为 3

# F 分布
f_dist = stats.f(dfn=2, dfd=10)  # 分子自由度 2，分母自由度 10

# 指数分布
expon_dist = stats.expon(scale=2)  # 尺度参数为 2

# 伽马分布
gamma_dist = stats.gamma(a=2, scale=1)  # 形状参数 2，尺度参数 1

# 贝塔分布
beta_dist = stats.beta(a=2, b=5)  # 参数 a=2, b=5

# 均匀分布
uniform_dist = stats.uniform(loc=0, scale=1)  # [0, 1] 上的均匀分布

# 比较不同分布的 PDF
x_range = np.linspace(0, 5, 1000)

plt.figure(figsize=(15, 10))

# 绘制各种分布
distributions = {
    '正态分布 N(1,0.5²)': stats.norm(1, 0.5),
    't分布 (df=5)': t_dist,
    '指数分布 (λ=0.5)': expon_dist,
    '伽马分布 (α=2,β=1)': gamma_dist,
    '卡方分布 (df=3)': chi2_dist
}

for i, (name, dist) in enumerate(distributions.items()):
    plt.subplot(2, 3, i+1)
    
    if name.startswith('贝塔'):
        x_plot = np.linspace(0, 1, 1000)
    else:
        x_plot = x_range
    
    pdf_vals = dist.pdf(x_plot)
    plt.plot(x_plot, pdf_vals, linewidth=2, label=name)
    plt.fill_between(x_plot, pdf_vals, alpha=0.3)
    plt.title(name)
    plt.xlabel('x')
    plt.ylabel('密度')
    plt.grid(True, alpha=0.3)
    plt.legend()

plt.tight_layout()
plt.show()

# 分布的统计量
print("\n各分布的统计量:")
for name, dist in distributions.items():
    mean = dist.mean()
    var = dist.var()
    std = dist.std()
    skew = dist.stats(moments='s')
    kurt = dist.stats(moments='k')
    
    print(f"{name}:")
    print(f"  均值: {mean:.3f}, 方差: {var:.3f}, 标准差: {std:.3f}")
    print(f"  偏度: {skew:.3f}, 峰度: {kurt:.3f}")
    print()

2. 离散分布

python

# 二项分布
binom_dist = stats.binom(n=20, p=0.3)

# 泊松分布
poisson_dist = stats.poisson(mu=3)

# 几何分布
geom_dist = stats.geom(p=0.2)

# 负二项分布
nbinom_dist = stats.nbinom(n=5, p=0.3)

# 超几何分布
hypergeom_dist = stats.hypergeom(M=50, n=10, N=15)

# 可视化离散分布
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 二项分布
k_binom = np.arange(0, 21)
pmf_binom = binom_dist.pmf(k_binom)
axes[0, 0].bar(k_binom, pmf_binom, alpha=0.7)
axes[0, 0].set_title('二项分布 B(20, 0.3)')
axes[0, 0].set_xlabel('k')
axes[0, 0].set_ylabel('P(X=k)')
axes[0, 0].grid(True, alpha=0.3)

# 泊松分布
k_poisson = np.arange(0, 15)
pmf_poisson = poisson_dist.pmf(k_poisson)
axes[0, 1].bar(k_poisson, pmf_poisson, alpha=0.7, color='orange')
axes[0, 1].set_title('泊松分布 Pois(3)')
axes[0, 1].set_xlabel('k')
axes[0, 1].set_ylabel('P(X=k)')
axes[0, 1].grid(True, alpha=0.3)

# 几何分布
k_geom = np.arange(1, 21)
pmf_geom = geom_dist.pmf(k_geom)
axes[0, 2].bar(k_geom, pmf_geom, alpha=0.7, color='green')
axes[0, 2].set_title('几何分布 Geom(0.2)')
axes[0, 2].set_xlabel('k')
axes[0, 2].set_ylabel('P(X=k)')
axes[0, 2].grid(True, alpha=0.3)

# 负二项分布
k_nbinom = np.arange(0, 30)
pmf_nbinom = nbinom_dist.pmf(k_nbinom)
axes[1, 0].bar(k_nbinom, pmf_nbinom, alpha=0.7, color='red')
axes[1, 0].set_title('负二项分布 NB(5, 0.3)')
axes[1, 0].set_xlabel('k')
axes[1, 0].set_ylabel('P(X=k)')
axes[1, 0].grid(True, alpha=0.3)

# 超几何分布
k_hypergeom = np.arange(0, 11)
pmf_hypergeom = hypergeom_dist.pmf(k_hypergeom)
axes[1, 1].bar(k_hypergeom, pmf_hypergeom, alpha=0.7, color='purple')
axes[1, 1].set_title('超几何分布 H(50,10,15)')
axes[1, 1].set_xlabel('k')
axes[1, 1].set_ylabel('P(X=k)')
axes[1, 1].grid(True, alpha=0.3)

# 比较二项分布和泊松分布（当 n 大，p 小时）
n_large, p_small = 100, 0.03
binom_approx = stats.binom(n=n_large, p=p_small)
poisson_approx = stats.poisson(mu=n_large * p_small)

k_compare = np.arange(0, 15)
pmf_binom_approx = binom_approx.pmf(k_compare)
pmf_poisson_approx = poisson_approx.pmf(k_compare)

axes[1, 2].bar(k_compare - 0.2, pmf_binom_approx, width=0.4, alpha=0.7, 
               label='二项分布 B(100,0.03)')
axes[1, 2].bar(k_compare + 0.2, pmf_poisson_approx, width=0.4, alpha=0.7, 
               label='泊松分布 Pois(3)')
axes[1, 2].set_title('二项分布的泊松近似')
axes[1, 2].set_xlabel('k')
axes[1, 2].set_ylabel('P(X=k)')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 离散分布的统计量
discrete_distributions = {
    '二项分布 B(20,0.3)': binom_dist,
    '泊松分布 Pois(3)': poisson_dist,
    '几何分布 Geom(0.2)': geom_dist,
    '负二项分布 NB(5,0.3)': nbinom_dist
}

print("\n离散分布的统计量:")
for name, dist in discrete_distributions.items():
    mean = dist.mean()
    var = dist.var()
    std = dist.std()
    
    print(f"{name}:")
    print(f"  均值: {mean:.3f}, 方差: {var:.3f}, 标准差: {std:.3f}")
    print()

描述性统计

1. 基本统计量

python

# 生成示例数据
np.random.seed(42)
data1 = np.random.normal(100, 15, 1000)  # 正态分布数据
data2 = np.random.exponential(2, 1000)   # 指数分布数据
data3 = np.random.uniform(0, 10, 1000)   # 均匀分布数据

# 计算各种统计量
def describe_data(data, name):
    print(f"\n{name} 的描述性统计:")
    print(f"样本大小: {len(data)}")
    print(f"均值: {np.mean(data):.3f}")
    print(f"中位数: {np.median(data):.3f}")
    print(f"众数: {stats.mode(data, keepdims=True)[0][0]:.3f}")
    print(f"标准差: {np.std(data, ddof=1):.3f}")
    print(f"方差: {np.var(data, ddof=1):.3f}")
    print(f"偏度: {stats.skew(data):.3f}")
    print(f"峰度: {stats.kurtosis(data):.3f}")
    print(f"最小值: {np.min(data):.3f}")
    print(f"最大值: {np.max(data):.3f}")
    print(f"极差: {np.ptp(data):.3f}")
    
    # 分位数
    percentiles = [5, 25, 50, 75, 95]
    quantile_values = np.percentile(data, percentiles)
    print(f"分位数: {dict(zip(percentiles, quantile_values))}")
    
    # 四分位距
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    print(f"四分位距 (IQR): {iqr:.3f}")
    
    # 变异系数
    cv = np.std(data, ddof=1) / np.mean(data)
    print(f"变异系数: {cv:.3f}")
    
    return {
        'mean': np.mean(data),
        'median': np.median(data),
        'std': np.std(data, ddof=1),
        'skew': stats.skew(data),
        'kurtosis': stats.kurtosis(data)
    }

# 分析三组数据
stats1 = describe_data(data1, "正态分布数据")
stats2 = describe_data(data2, "指数分布数据")
stats3 = describe_data(data3, "均匀分布数据")

# 使用 scipy.stats.describe 函数
print("\n使用 scipy.stats.describe:")
for i, (data, name) in enumerate([(data1, "正态分布"), (data2, "指数分布"), (data3, "均匀分布")]):
    desc = stats.describe(data)
    print(f"\n{name}:")
    print(f"  样本数: {desc.nobs}")
    print(f"  最小值-最大值: ({desc.minmax[0]:.3f}, {desc.minmax[1]:.3f})")
    print(f"  均值: {desc.mean:.3f}")
    print(f"  方差: {desc.variance:.3f}")
    print(f"  偏度: {desc.skewness:.3f}")
    print(f"  峰度: {desc.kurtosis:.3f}")

2. 数据可视化

python

# 综合可视化分析
fig, axes = plt.subplots(3, 4, figsize=(16, 12))

datasets = [(data1, "正态分布"), (data2, "指数分布"), (data3, "均匀分布")]

for i, (data, name) in enumerate(datasets):
    # 直方图
    axes[i, 0].hist(data, bins=50, density=True, alpha=0.7, edgecolor='black')
    axes[i, 0].set_title(f'{name} - 直方图')
    axes[i, 0].set_ylabel('密度')
    axes[i, 0].grid(True, alpha=0.3)
    
    # 箱线图
    axes[i, 1].boxplot(data, vert=True)
    axes[i, 1].set_title(f'{name} - 箱线图')
    axes[i, 1].grid(True, alpha=0.3)
    
    # Q-Q 图（与正态分布比较）
    stats.probplot(data, dist="norm", plot=axes[i, 2])
    axes[i, 2].set_title(f'{name} - Q-Q图')
    axes[i, 2].grid(True, alpha=0.3)
    
    # 经验累积分布函数
    sorted_data = np.sort(data)
    y_values = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    axes[i, 3].plot(sorted_data, y_values, linewidth=2)
    axes[i, 3].set_title(f'{name} - 经验CDF')
    axes[i, 3].set_xlabel('值')
    axes[i, 3].set_ylabel('累积概率')
    axes[i, 3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 核密度估计
plt.figure(figsize=(12, 4))

for i, (data, name) in enumerate(datasets):
    plt.subplot(1, 3, i+1)
    
    # 直方图
    plt.hist(data, bins=50, density=True, alpha=0.5, label='直方图')
    
    # 核密度估计
    kde = stats.gaussian_kde(data)
    x_range = np.linspace(data.min(), data.max(), 200)
    kde_values = kde(x_range)
    plt.plot(x_range, kde_values, linewidth=2, label='核密度估计')
    
    plt.title(f'{name}')
    plt.xlabel('值')
    plt.ylabel('密度')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

假设检验

1. 单样本检验

python

# 单样本 t 检验
# H0: μ = μ0 vs H1: μ ≠ μ0
np.random.seed(42)
sample = np.random.normal(105, 15, 50)  # 真实均值为 105
mu0 = 100  # 假设均值

t_stat, p_value = stats.ttest_1samp(sample, mu0)

print("单样本 t 检验:")
print(f"样本均值: {np.mean(sample):.3f}")
print(f"假设均值: {mu0}")
print(f"t 统计量: {t_stat:.3f}")
print(f"p 值: {p_value:.6f}")
print(f"结论: {'拒绝' if p_value < 0.05 else '不拒绝'}原假设 (α=0.05)")

# 单样本 Wilcoxon 符号秩检验（非参数）
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(sample - mu0)
print(f"\nWilcoxon 符号秩检验:")
print(f"统计量: {wilcoxon_stat}")
print(f"p 值: {wilcoxon_p:.6f}")
print(f"结论: {'拒绝' if wilcoxon_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

# 正态性检验
# Shapiro-Wilk 检验
shapiro_stat, shapiro_p = stats.shapiro(sample)
print(f"\nShapiro-Wilk 正态性检验:")
print(f"统计量: {shapiro_stat:.6f}")
print(f"p 值: {shapiro_p:.6f}")
print(f"结论: 数据{'不'}符合正态分布 (α=0.05)" if shapiro_p < 0.05 else "结论: 数据符合正态分布 (α=0.05)")

# Kolmogorov-Smirnov 检验
ks_stat, ks_p = stats.kstest(sample, 'norm', args=(np.mean(sample), np.std(sample, ddof=1)))
print(f"\nKolmogorov-Smirnov 正态性检验:")
print(f"统计量: {ks_stat:.6f}")
print(f"p 值: {ks_p:.6f}")
print(f"结论: 数据{'不'}符合正态分布 (α=0.05)" if ks_p < 0.05 else "结论: 数据符合正态分布 (α=0.05)")

# Anderson-Darling 检验
ad_result = stats.anderson(sample, dist='norm')
print(f"\nAnderson-Darling 正态性检验:")
print(f"统计量: {ad_result.statistic:.6f}")
print(f"临界值: {ad_result.critical_values}")
print(f"显著性水平: {ad_result.significance_level}")

2. 双样本检验

python

# 生成两组样本
np.random.seed(42)
group1 = np.random.normal(100, 15, 50)
group2 = np.random.normal(105, 15, 45)

# 独立样本 t 检验
# 假设方差相等
t_stat_equal, p_equal = stats.ttest_ind(group1, group2, equal_var=True)

# 不假设方差相等（Welch's t-test）
t_stat_unequal, p_unequal = stats.ttest_ind(group1, group2, equal_var=False)

print("独立样本 t 检验:")
print(f"组1均值: {np.mean(group1):.3f}, 组2均值: {np.mean(group2):.3f}")
print(f"\n假设方差相等:")
print(f"  t 统计量: {t_stat_equal:.3f}")
print(f"  p 值: {p_equal:.6f}")
print(f"\n不假设方差相等 (Welch's t-test):")
print(f"  t 统计量: {t_stat_unequal:.3f}")
print(f"  p 值: {p_unequal:.6f}")

# 方差齐性检验
# Levene 检验
levene_stat, levene_p = stats.levene(group1, group2)
print(f"\nLevene 方差齐性检验:")
print(f"统计量: {levene_stat:.3f}")
print(f"p 值: {levene_p:.6f}")
print(f"结论: 方差{'不'}相等 (α=0.05)" if levene_p < 0.05 else "结论: 方差相等 (α=0.05)")

# Bartlett 检验（假设正态性）
bartlett_stat, bartlett_p = stats.bartlett(group1, group2)
print(f"\nBartlett 方差齐性检验:")
print(f"统计量: {bartlett_stat:.3f}")
print(f"p 值: {bartlett_p:.6f}")

# 非参数检验
# Mann-Whitney U 检验
mw_stat, mw_p = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(f"\nMann-Whitney U 检验:")
print(f"统计量: {mw_stat}")
print(f"p 值: {mw_p:.6f}")
print(f"结论: {'拒绝' if mw_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

# Kolmogorov-Smirnov 双样本检验
ks2_stat, ks2_p = stats.ks_2samp(group1, group2)
print(f"\nKolmogorov-Smirnov 双样本检验:")
print(f"统计量: {ks2_stat:.6f}")
print(f"p 值: {ks2_p:.6f}")

3. 多样本检验

python

# 生成多组样本
np.random.seed(42)
group_a = np.random.normal(100, 15, 30)
group_b = np.random.normal(105, 15, 35)
group_c = np.random.normal(110, 15, 25)
group_d = np.random.normal(103, 15, 40)

# 单因素方差分析 (One-way ANOVA)
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c, group_d)

print("单因素方差分析 (One-way ANOVA):")
print(f"F 统计量: {f_stat:.3f}")
print(f"p 值: {anova_p:.6f}")
print(f"结论: {'拒绝' if anova_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

# 非参数版本：Kruskal-Wallis 检验
kw_stat, kw_p = stats.kruskal(group_a, group_b, group_c, group_d)
print(f"\nKruskal-Wallis 检验:")
print(f"统计量: {kw_stat:.3f}")
print(f"p 值: {kw_p:.6f}")
print(f"结论: {'拒绝' if kw_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

# 事后检验（如果 ANOVA 显著）
if anova_p < 0.05:
    print("\n进行事后检验...")
    
    # 合并所有数据
    all_data = np.concatenate([group_a, group_b, group_c, group_d])
    groups = (['A'] * len(group_a) + ['B'] * len(group_b) + 
              ['C'] * len(group_c) + ['D'] * len(group_d))
    
    # Tukey HSD 检验需要使用其他库，这里展示成对 t 检验
    from itertools import combinations
    
    group_data = {'A': group_a, 'B': group_b, 'C': group_c, 'D': group_d}
    group_names = list(group_data.keys())
    
    print("成对 t 检验 (Bonferroni 校正):")
    n_comparisons = len(list(combinations(group_names, 2)))
    alpha_corrected = 0.05 / n_comparisons
    
    for name1, name2 in combinations(group_names, 2):
        t_stat, p_val = stats.ttest_ind(group_data[name1], group_data[name2])
        significant = p_val < alpha_corrected
        print(f"  {name1} vs {name2}: t={t_stat:.3f}, p={p_val:.6f}, 显著: {significant}")

# 可视化多组比较
plt.figure(figsize=(12, 8))

# 箱线图
plt.subplot(2, 2, 1)
data_for_box = [group_a, group_b, group_c, group_d]
labels = ['Group A', 'Group B', 'Group C', 'Group D']
plt.boxplot(data_for_box, labels=labels)
plt.title('多组数据箱线图')
plt.ylabel('值')
plt.grid(True, alpha=0.3)

# 小提琴图
plt.subplot(2, 2, 2)
positions = [1, 2, 3, 4]
parts = plt.violinplot(data_for_box, positions=positions)
plt.xticks(positions, labels)
plt.title('多组数据小提琴图')
plt.ylabel('值')
plt.grid(True, alpha=0.3)

# 均值和置信区间
plt.subplot(2, 2, 3)
means = [np.mean(data) for data in data_for_box]
stds = [np.std(data, ddof=1) for data in data_for_box]
ns = [len(data) for data in data_for_box]
ses = [std / np.sqrt(n) for std, n in zip(stds, ns)]
cis = [1.96 * se for se in ses]  # 95% 置信区间

plt.errorbar(range(1, 5), means, yerr=cis, fmt='o-', capsize=5, capthick=2)
plt.xticks(range(1, 5), labels)
plt.title('均值和95%置信区间')
plt.ylabel('均值')
plt.grid(True, alpha=0.3)

# 密度图
plt.subplot(2, 2, 4)
for i, (data, label) in enumerate(zip(data_for_box, labels)):
    plt.hist(data, bins=20, alpha=0.5, label=label, density=True)
plt.title('多组数据密度分布')
plt.xlabel('值')
plt.ylabel('密度')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4. 卡方检验

python

# 拟合优度检验
# 检验骰子是否公平
observed_dice = np.array([18, 22, 16, 25, 12, 7])  # 观察到的频数
expected_dice = np.array([100/6] * 6)  # 期望频数（公平骰子）

chi2_stat, chi2_p = stats.chisquare(observed_dice, expected_dice)

print("卡方拟合优度检验（骰子公平性）:")
print(f"观察频数: {observed_dice}")
print(f"期望频数: {expected_dice}")
print(f"卡方统计量: {chi2_stat:.3f}")
print(f"p 值: {chi2_p:.6f}")
print(f"结论: 骰子{'不'}公平 (α=0.05)" if chi2_p < 0.05 else "结论: 骰子公平 (α=0.05)")

# 独立性检验
# 性别与产品偏好的关联性
contingency_table = np.array([[20, 30, 25],   # 男性对产品A、B、C的偏好
                              [25, 20, 30]])  # 女性对产品A、B、C的偏好

chi2_indep, p_indep, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\n卡方独立性检验（性别与产品偏好）:")
print(f"观察频数表:\n{contingency_table}")
print(f"期望频数表:\n{expected}")
print(f"卡方统计量: {chi2_indep:.3f}")
print(f"自由度: {dof}")
print(f"p 值: {p_indep:.6f}")
print(f"结论: 性别与产品偏好{'有'}关联 (α=0.05)" if p_indep < 0.05 else "结论: 性别与产品偏好无关联 (α=0.05)")

# Cramér's V（关联强度）
n = np.sum(contingency_table)
cramers_v = np.sqrt(chi2_indep / (n * (min(contingency_table.shape) - 1)))
print(f"Cramér's V: {cramers_v:.3f}")

# Fisher 精确检验（2x2 表）
table_2x2 = np.array([[10, 15], [5, 20]])
odds_ratio, fisher_p = stats.fisher_exact(table_2x2)

print(f"\nFisher 精确检验:")
print(f"2x2 表:\n{table_2x2}")
print(f"比值比: {odds_ratio:.3f}")
print(f"p 值: {fisher_p:.6f}")
print(f"结论: {'拒绝' if fisher_p < 0.05 else '不拒绝'}独立性假设 (α=0.05)")

回归分析

1. 简单线性回归

python

# 生成回归数据
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 2 + 3 * x + np.random.normal(0, 2, n)  # y = 2 + 3x + ε

# 线性回归
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("简单线性回归结果:")
print(f"斜率: {slope:.3f} ± {std_err:.3f}")
print(f"截距: {intercept:.3f}")
print(f"相关系数: {r_value:.3f}")
print(f"决定系数 R²: {r_value**2:.3f}")
print(f"p 值: {p_value:.6f}")
print(f"标准误差: {std_err:.3f}")

# 预测
y_pred = slope * x + intercept
residuals = y - y_pred

# 计算置信区间和预测区间
from scipy import stats as scipy_stats

# 计算统计量
n = len(x)
mean_x = np.mean(x)
ss_xx = np.sum((x - mean_x)**2)
ss_res = np.sum(residuals**2)
mse = ss_res / (n - 2)

# t 值（95% 置信区间）
t_val = scipy_stats.t.ppf(0.975, n - 2)

# 置信区间和预测区间
x_new = np.linspace(x.min(), x.max(), 100)
y_new = slope * x_new + intercept

# 置信区间（均值的置信区间）
se_mean = np.sqrt(mse * (1/n + (x_new - mean_x)**2 / ss_xx))
ci_lower = y_new - t_val * se_mean
ci_upper = y_new + t_val * se_mean

# 预测区间（个别值的预测区间）
se_pred = np.sqrt(mse * (1 + 1/n + (x_new - mean_x)**2 / ss_xx))
pi_lower = y_new - t_val * se_pred
pi_upper = y_new + t_val * se_pred

# 可视化
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 回归图
axes[0, 0].scatter(x, y, alpha=0.6, label='数据点')
axes[0, 0].plot(x_new, y_new, 'r-', linewidth=2, label='回归线')
axes[0, 0].fill_between(x_new, ci_lower, ci_upper, alpha=0.3, label='95% 置信区间')
axes[0, 0].fill_between(x_new, pi_lower, pi_upper, alpha=0.2, label='95% 预测区间')
axes[0, 0].set_xlabel('X')
axes[0, 0].set_ylabel('Y')
axes[0, 0].set_title('线性回归')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 残差图
axes[0, 1].scatter(y_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('预测值')
axes[0, 1].set_ylabel('残差')
axes[0, 1].set_title('残差图')
axes[0, 1].grid(True, alpha=0.3)

# 残差的正态性检验
axes[1, 0].hist(residuals, bins=20, density=True, alpha=0.7, edgecolor='black')
# 叠加正态分布
resid_mean, resid_std = np.mean(residuals), np.std(residuals)
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
y_norm = scipy_stats.norm.pdf(x_norm, resid_mean, resid_std)
axes[1, 0].plot(x_norm, y_norm, 'r-', linewidth=2, label='正态分布')
axes[1, 0].set_xlabel('残差')
axes[1, 0].set_ylabel('密度')
axes[1, 0].set_title('残差分布')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Q-Q 图
scipy_stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('残差 Q-Q 图')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 残差分析
print(f"\n残差分析:")
print(f"残差均值: {np.mean(residuals):.6f}")
print(f"残差标准差: {np.std(residuals):.3f}")

# 残差正态性检验
shapiro_stat, shapiro_p = scipy_stats.shapiro(residuals)
print(f"Shapiro-Wilk 检验: 统计量={shapiro_stat:.6f}, p值={shapiro_p:.6f}")

# Durbin-Watson 检验（自相关性）
def durbin_watson(residuals):
    diff = np.diff(residuals)
    return np.sum(diff**2) / np.sum(residuals**2)

dw_stat = durbin_watson(residuals)
print(f"Durbin-Watson 统计量: {dw_stat:.3f}")
print(f"自相关性: {'存在' if dw_stat < 1.5 or dw_stat > 2.5 else '不存在'}")

2. 多元线性回归

python

# 使用 sklearn 进行多元回归（scipy.stats 主要用于统计检验）
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# 生成多元回归数据
np.random.seed(42)
n = 200
X = np.random.normal(0, 1, (n, 3))  # 3 个预测变量
beta = np.array([2, -1.5, 3])  # 真实系数
y = 5 + X @ beta + np.random.normal(0, 2, n)  # y = 5 + 2x1 - 1.5x2 + 3x3 + ε

# 拟合模型
model = LinearRegression()
model.fit(X, y)

# 预测
y_pred = model.predict(X)
residuals = y - y_pred

print("多元线性回归结果:")
print(f"截距: {model.intercept_:.3f}")
print(f"系数: {model.coef_}")
print(f"真实系数: {beta}")
print(f"R²: {r2_score(y, y_pred):.3f}")
print(f"均方误差: {mean_squared_error(y, y_pred):.3f}")

# 系数的显著性检验（需要手动计算）
# 计算标准误差
X_with_intercept = np.column_stack([np.ones(n), X])
XTX_inv = np.linalg.inv(X_with_intercept.T @ X_with_intercept)
mse = np.sum(residuals**2) / (n - X.shape[1] - 1)
se_coeffs = np.sqrt(np.diag(XTX_inv) * mse)

# t 统计量
coeffs_with_intercept = np.array([model.intercept_] + list(model.coef_))
t_stats = coeffs_with_intercept / se_coeffs
p_values = 2 * (1 - scipy_stats.t.cdf(np.abs(t_stats), n - X.shape[1] - 1))

print(f"\n系数显著性检验:")
coeff_names = ['截距', 'X1', 'X2', 'X3']
for i, name in enumerate(coeff_names):
    print(f"{name}: 系数={coeffs_with_intercept[i]:.3f}, "
          f"标准误={se_coeffs[i]:.3f}, t={t_stats[i]:.3f}, p={p_values[i]:.6f}")

# 模型整体显著性检验（F 检验）
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - np.mean(y))**2)
ss_reg = ss_tot - ss_res

df_reg = X.shape[1]
df_res = n - X.shape[1] - 1

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res

f_stat = ms_reg / ms_res
f_p_value = 1 - scipy_stats.f.cdf(f_stat, df_reg, df_res)

print(f"\n模型整体显著性检验:")
print(f"F 统计量: {f_stat:.3f}")
print(f"p 值: {f_p_value:.6f}")
print(f"结论: 模型{'显著' if f_p_value < 0.05 else '不显著'} (α=0.05)")

非参数统计

1. 符号检验和秩检验

python

# 符号检验
np.random.seed(42)
data = np.random.normal(2, 1, 50)  # 真实中位数为 2
median_0 = 0  # 假设中位数

# 计算符号
signs = np.sign(data - median_0)
positive_signs = np.sum(signs > 0)
negative_signs = np.sum(signs < 0)
zero_signs = np.sum(signs == 0)

print("符号检验:")
print(f"正号数量: {positive_signs}")
print(f"负号数量: {negative_signs}")
print(f"零的数量: {zero_signs}")

# 使用二项检验
n_nonzero = positive_signs + negative_signs
binom_p = scipy_stats.binom_test(positive_signs, n_nonzero, p=0.5)
print(f"二项检验 p 值: {binom_p:.6f}")
print(f"结论: {'拒绝' if binom_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

# Wilcoxon 符号秩检验
wilcoxon_stat, wilcoxon_p = scipy_stats.wilcoxon(data - median_0)
print(f"\nWilcoxon 符号秩检验:")
print(f"统计量: {wilcoxon_stat}")
print(f"p 值: {wilcoxon_p:.6f}")
print(f"结论: {'拒绝' if wilcoxon_p < 0.05 else '不拒绝'}原假设 (α=0.05)")

2. 游程检验

python

# 游程检验（随机性检验）
def runs_test(data, cutoff=None):
    """
    游程检验
    """
    if cutoff is None:
        cutoff = np.median(data)
    
    # 转换为二进制序列
    binary_seq = (data > cutoff).astype(int)
    
    # 计算游程
    runs = []
    current_run = 1
    
    for i in range(1, len(binary_seq)):
        if binary_seq[i] == binary_seq[i-1]:
            current_run += 1
        else:
            runs.append(current_run)
            current_run = 1
    runs.append(current_run)
    
    # 游程数量
    n_runs = len(runs)
    n1 = np.sum(binary_seq)
    n2 = len(binary_seq) - n1
    
    # 期望游程数和方差
    expected_runs = (2 * n1 * n2) / (n1 + n2) + 1
    var_runs = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2)**2 * (n1 + n2 - 1))
    
    # Z 统计量
    z_stat = (n_runs - expected_runs) / np.sqrt(var_runs)
    p_value = 2 * (1 - scipy_stats.norm.cdf(abs(z_stat)))
    
    return n_runs, expected_runs, z_stat, p_value

# 测试随机性
np.random.seed(42)
random_data = np.random.normal(0, 1, 100)
trend_data = np.linspace(-2, 2, 100) + np.random.normal(0, 0.1, 100)

print("游程检验（随机性检验）:")
for data, name in [(random_data, "随机数据"), (trend_data, "趋势数据")]:
    n_runs, expected, z_stat, p_val = runs_test(data)
    print(f"\n{name}:")
    print(f"  观察游程数: {n_runs}")
    print(f"  期望游程数: {expected:.2f}")
    print(f"  Z 统计量: {z_stat:.3f}")
    print(f"  p 值: {p_val:.6f}")
    print(f"  结论: 数据{'不'}随机 (α=0.05)" if p_val < 0.05 else "  结论: 数据随机 (α=0.05)")

# 可视化
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for i, (data, name) in enumerate([(random_data, "随机数据"), (trend_data, "趋势数据")]):
    # 时间序列图
    axes[i, 0].plot(data)
    axes[i, 0].axhline(y=np.median(data), color='r', linestyle='--', label='中位数')
    axes[i, 0].set_title(f'{name} - 时间序列')
    axes[i, 0].set_xlabel('时间')
    axes[i, 0].set_ylabel('值')
    axes[i, 0].legend()
    axes[i, 0].grid(True, alpha=0.3)
    
    # 游程图
    binary_seq = (data > np.median(data)).astype(int)
    axes[i, 1].plot(binary_seq, 'o-', markersize=3)
    axes[i, 1].set_title(f'{name} - 游程图')
    axes[i, 1].set_xlabel('时间')
    axes[i, 1].set_ylabel('高于中位数 (1) / 低于中位数 (0)')
    axes[i, 1].set_ylim(-0.1, 1.1)
    axes[i, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

核密度估计

python

# 核密度估计
np.random.seed(42)

# 生成混合分布数据
data1 = np.random.normal(-2, 0.5, 300)
data2 = np.random.normal(2, 0.8, 200)
mixed_data = np.concatenate([data1, data2])

# 使用不同带宽的核密度估计
bandwidths = [0.1, 0.3, 0.5, 1.0]
x_range = np.linspace(mixed_data.min() - 1, mixed_data.max() + 1, 1000)

plt.figure(figsize=(15, 10))

for i, bw in enumerate(bandwidths):
    plt.subplot(2, 2, i+1)
    
    # 直方图
    plt.hist(mixed_data, bins=30, density=True, alpha=0.5, color='lightblue', 
             edgecolor='black', label='直方图')
    
    # 核密度估计
    kde = scipy_stats.gaussian_kde(mixed_data, bw_method=bw)
    kde_values = kde(x_range)
    plt.plot(x_range, kde_values, linewidth=2, label=f'KDE (带宽={bw})')
    
    # 真实密度（理论）
    true_density = (0.6 * scipy_stats.norm.pdf(x_range, -2, 0.5) + 
                   0.4 * scipy_stats.norm.pdf(x_range, 2, 0.8))
    plt.plot(x_range, true_density, 'r--', linewidth=2, label='真实密度')
    
    plt.title(f'核密度估计 (带宽 = {bw})')
    plt.xlabel('值')
    plt.ylabel('密度')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 自动带宽选择
kde_auto = scipy_stats.gaussian_kde(mixed_data)
print(f"自动选择的带宽: {kde_auto.factor:.4f}")

# 不同核函数的比较（需要使用 sklearn）
from sklearn.neighbors import KernelDensity

kernels = ['gaussian', 'tophat', 'epanechnikov', 'exponential']

plt.figure(figsize=(12, 8))

for i, kernel in enumerate(kernels):
    plt.subplot(2, 2, i+1)
    
    # 直方图
    plt.hist(mixed_data, bins=30, density=True, alpha=0.5, color='lightblue', 
             edgecolor='black', label='直方图')
    
    # 核密度估计
    kde_sk = KernelDensity(kernel=kernel, bandwidth=0.5)
    kde_sk.fit(mixed_data.reshape(-1, 1))
    log_density = kde_sk.score_samples(x_range.reshape(-1, 1))
    density = np.exp(log_density)
    
    plt.plot(x_range, density, linewidth=2, label=f'{kernel} 核')
    
    plt.title(f'{kernel.capitalize()} 核密度估计')
    plt.xlabel('值')
    plt.ylabel('密度')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

实际应用案例

案例1：质量控制分析

python

# 质量控制数据分析
np.random.seed(42)

# 模拟生产数据
control_data = np.random.normal(100, 2, 200)  # 正常生产
defective_data = np.random.normal(95, 3, 50)  # 异常生产
all_data = np.concatenate([control_data, defective_data])
labels = ['正常'] * 200 + ['异常'] * 50

print("质量控制统计分析:")
print(f"总样本数: {len(all_data)}")
print(f"正常样本: {len(control_data)}, 异常样本: {len(defective_data)}")

# 描述性统计
print(f"\n正常生产数据:")
print(f"  均值: {np.mean(control_data):.3f}")
print(f"  标准差: {np.std(control_data, ddof=1):.3f}")
print(f"  范围: [{np.min(control_data):.3f}, {np.max(control_data):.3f}]")

print(f"\n异常生产数据:")
print(f"  均值: {np.mean(defective_data):.3f}")
print(f"  标准差: {np.std(defective_data, ddof=1):.3f}")
print(f"  范围: [{np.min(defective_data):.3f}, {np.max(defective_data):.3f}]")

# 假设检验
t_stat, p_value = scipy_stats.ttest_ind(control_data, defective_data)
print(f"\n独立样本 t 检验:")
print(f"t 统计量: {t_stat:.3f}")
print(f"p 值: {p_value:.6f}")
print(f"结论: 两组数据均值{'有'}显著差异 (α=0.05)" if p_value < 0.05 else "结论: 两组数据均值无显著差异 (α=0.05)")

# 控制图
control_mean = np.mean(control_data)
control_std = np.std(control_data, ddof=1)
ucl = control_mean + 3 * control_std  # 上控制限
lcl = control_mean - 3 * control_std  # 下控制限

plt.figure(figsize=(15, 10))

# 控制图
plt.subplot(2, 2, 1)
plt.plot(range(len(all_data)), all_data, 'o-', markersize=3, alpha=0.7)
plt.axhline(y=control_mean, color='g', linestyle='-', label='中心线')
plt.axhline(y=ucl, color='r', linestyle='--', label='上控制限 (UCL)')
plt.axhline(y=lcl, color='r', linestyle='--', label='下控制限 (LCL)')
plt.axvline(x=200, color='orange', linestyle=':', label='异常开始')
plt.title('质量控制图')
plt.xlabel('样本序号')
plt.ylabel('质量指标')
plt.legend()
plt.grid(True, alpha=0.3)

# 直方图比较
plt.subplot(2, 2, 2)
plt.hist(control_data, bins=30, alpha=0.7, label='正常', density=True)
plt.hist(defective_data, bins=20, alpha=0.7, label='异常', density=True)
plt.title('质量分布比较')
plt.xlabel('质量指标')
plt.ylabel('密度')
plt.legend()
plt.grid(True, alpha=0.3)

# 箱线图
plt.subplot(2, 2, 3)
data_for_box = [control_data, defective_data]
plt.boxplot(data_for_box, labels=['正常', '异常'])
plt.title('质量指标箱线图')
plt.ylabel('质量指标')
plt.grid(True, alpha=0.3)

# 累积分布函数
plt.subplot(2, 2, 4)
sorted_control = np.sort(control_data)
sorted_defective = np.sort(defective_data)
y_control = np.arange(1, len(sorted_control) + 1) / len(sorted_control)
y_defective = np.arange(1, len(sorted_defective) + 1) / len(sorted_defective)

plt.plot(sorted_control, y_control, label='正常', linewidth=2)
plt.plot(sorted_defective, y_defective, label='异常', linewidth=2)
plt.title('累积分布函数比较')
plt.xlabel('质量指标')
plt.ylabel('累积概率')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

案例2：A/B测试分析

python

# A/B 测试数据分析
np.random.seed(42)

# 模拟 A/B 测试数据
n_a, n_b = 1000, 1200
conversion_rate_a = 0.12  # A 组转化率
conversion_rate_b = 0.15  # B 组转化率

# 生成转化数据
conversions_a = np.random.binomial(1, conversion_rate_a, n_a)
conversions_b = np.random.binomial(1, conversion_rate_b, n_b)

# 计算统计量
conv_a = np.sum(conversions_a)
conv_b = np.sum(conversions_b)
rate_a = conv_a / n_a
rate_b = conv_b / n_b

print("A/B 测试结果分析:")
print(f"A 组: {conv_a}/{n_a} = {rate_a:.4f} ({rate_a*100:.2f}%)")
print(f"B 组: {conv_b}/{n_b} = {rate_b:.4f} ({rate_b*100:.2f}%)")
print(f"提升: {(rate_b - rate_a)/rate_a*100:.2f}%")

# 比例检验
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([conv_a, conv_b])
nobs = np.array([n_a, n_b])

z_stat, p_value = proportions_ztest(counts, nobs)
print(f"\n比例检验:")
print(f"Z 统计量: {z_stat:.3f}")
print(f"p 值: {p_value:.6f}")
print(f"结论: B组转化率{'显著'}高于A组 (α=0.05)" if p_value < 0.05 else "结论: B组转化率与A组无显著差异 (α=0.05)")

# 置信区间
from statsmodels.stats.proportion import proportion_confint

ci_a = proportion_confint(conv_a, n_a, alpha=0.05)
ci_b = proportion_confint(conv_b, n_b, alpha=0.05)

print(f"\n95% 置信区间:")
print(f"A 组: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
print(f"B 组: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")

# 效应量（Cohen's h）
def cohens_h(p1, p2):
    return 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))

effect_size = cohens_h(rate_b, rate_a)
print(f"\n效应量 (Cohen's h): {effect_size:.4f}")

# 功效分析
from statsmodels.stats.power import ttest_power

power = ttest_power(effect_size, n_a, alpha=0.05)
print(f"统计功效: {power:.4f}")

# 可视化
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 转化率比较
rates = [rate_a, rate_b]
errors = [ci_a[1] - rate_a, ci_b[1] - rate_b]
axes[0, 0].bar(['A组', 'B组'], rates, yerr=errors, capsize=5, alpha=0.7)
axes[0, 0].set_title('转化率比较')
axes[0, 0].set_ylabel('转化率')
axes[0, 0].grid(True, alpha=0.3)

# 转化数分布
axes[0, 1].bar(['A组', 'B组'], [conv_a, conv_b], alpha=0.7)
axes[0, 1].set_title('转化数比较')
axes[0, 1].set_ylabel('转化数')
for i, v in enumerate([conv_a, conv_b]):
    axes[0, 1].text(i, v + 5, str(v), ha='center')
axes[0, 1].grid(True, alpha=0.3)

# 置信区间可视化
x_pos = [0, 1]
axes[1, 0].errorbar(x_pos, rates, 
                   yerr=[[rate_a - ci_a[0], rate_b - ci_b[0]], 
                         [ci_a[1] - rate_a, ci_b[1] - rate_b]], 
                   fmt='o', capsize=10, capthick=2, markersize=8)
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(['A组', 'B组'])
axes[1, 0].set_title('转化率及95%置信区间')
axes[1, 0].set_ylabel('转化率')
axes[1, 0].grid(True, alpha=0.3)

# 样本量与功效关系
sample_sizes = np.arange(100, 2000, 50)
powers = [ttest_power(effect_size, n, alpha=0.05) for n in sample_sizes]
axes[1, 1].plot(sample_sizes, powers, linewidth=2)
axes[1, 1].axhline(y=0.8, color='r', linestyle='--', label='功效=0.8')
axes[1, 1].axvline(x=n_a, color='g', linestyle=':', label=f'当前样本量={n_a}')
axes[1, 1].set_title('样本量与统计功效关系')
axes[1, 1].set_xlabel('样本量')
axes[1, 1].set_ylabel('统计功效')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

小结

本章详细介绍了 SciPy 统计分析模块的核心功能：

概率分布：掌握了连续分布和离散分布的使用方法
描述性统计：学会计算各种统计量和数据可视化
假设检验：理解单样本、双样本和多样本检验方法
相关性分析：掌握线性和非线性相关性的度量方法
回归分析：学会简单和多元线性回归的实现
非参数统计：了解不依赖分布假设的统计方法
核密度估计：掌握数据分布的非参数估计方法
实际应用：通过质量控制和A/B测试案例学习实际应用

练习题

使用正态分布生成1000个样本，进行Shapiro-Wilk正态性检验
比较两组数据的均值差异，选择合适的检验方法
对多组数据进行方差分析，并进行事后检验
计算两个变量的Pearson和Spearman相关系数，解释差异
使用核密度估计分析混合分布数据的特征
设计一个A/B测试分析流程，包括样本量计算和结果解释
实现一个质量控制图，识别异常数据点
使用游程检验分析时间序列数据的随机性

统计分析

scipy.stats 模块概述

概率分布

1. 连续分布

正态分布 (Normal Distribution)

其他重要的连续分布

2. 离散分布

描述性统计

1. 基本统计量

2. 数据可视化

假设检验

1. 单样本检验

2. 双样本检验

3. 多样本检验

4. 卡方检验

相关性分析

1. 线性相关

2. 多变量相关性

回归分析

1. 简单线性回归

2. 多元线性回归

非参数统计

1. 符号检验和秩检验

2. 游程检验

核密度估计

实际应用案例

案例1：质量控制分析

案例2：A/B测试分析

小结

练习题

统计分析 ​

scipy.stats 模块概述 ​

概率分布 ​

1. 连续分布 ​

正态分布 (Normal Distribution) ​

其他重要的连续分布 ​

2. 离散分布 ​

描述性统计 ​

1. 基本统计量 ​

2. 数据可视化 ​

假设检验 ​

1. 单样本检验 ​

2. 双样本检验 ​

3. 多样本检验 ​

4. 卡方检验 ​

相关性分析 ​

1. 线性相关 ​

2. 多变量相关性 ​

回归分析 ​

1. 简单线性回归 ​

2. 多元线性回归 ​

非参数统计 ​

1. 符号检验和秩检验 ​

2. 游程检验 ​

核密度估计 ​

实际应用案例 ​

案例1：质量控制分析 ​

案例2：A/B测试分析 ​

小结 ​

练习题 ​

统计分析

scipy.stats 模块概述

概率分布

1. 连续分布

正态分布 (Normal Distribution)

其他重要的连续分布

2. 离散分布

描述性统计

1. 基本统计量

2. 数据可视化

假设检验

1. 单样本检验

2. 双样本检验

3. 多样本检验

4. 卡方检验

相关性分析

1. 线性相关

2. 多变量相关性

回归分析

1. 简单线性回归

2. 多元线性回归

非参数统计

1. 符号检验和秩检验

2. 游程检验

核密度估计

实际应用案例

案例1：质量控制分析

案例2：A/B测试分析

小结

练习题