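Probability and Statistics in Machine Learning: 20 Python Examples

Reposted by 大数据文摘 (Big Data Digest) with permission from 机器学习算法与Python实战.

Probability theory and statistics play a central role in data science and machine learning. Python, a powerful and flexible programming language, offers a rich set of libraries and tools for putting these concepts to work. This article walks through 20 Python examples that show how probability and statistics are applied in practice.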

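1. Basic Probability Calculation

Let's start with a simple coin-flipping experiment:

import random

def coin_flip(n):
    # Simulate n independent flips of a fair coin
    return [random.choice(['H', 'T']) for _ in range(n)]

flips = coin_flip(1000)
# Relative frequency of heads, an empirical estimate of P(heads)
probability_head = flips.count('H') / len(flips)

print(f"Probability of getting heads: {probability_head:.2f}")

This example simulates 1,000 coin flips and uses the relative frequency of heads as an estimate of the probability of heads. The result should be close to the true value of 0.5.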
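To illustrate why the relative frequency is a reasonable estimate, the following sketch repeats the experiment for increasing numbers of flips. The helper name `head_frequency`, the fixed seed, and the chosen sample sizes are assumptions made for illustration; they are not part of the original example.

```python
import random

def head_frequency(n, rng):
    """Fraction of heads in n simulated flips of a fair coin."""
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

rng = random.Random(42)  # fixed seed (assumption) so the run is reproducible
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
frequencies = {n: head_frequency(n, rng) for n in sample_sizes}

for n, freq in frequencies.items():
    print(f"n = {n:>6}: frequency of heads = {freq:.4f}")
```

As the number of flips grows, the printed frequencies tend to settle near 0.5, which is the law of large numbers at work.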

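2. Descriptive Statistics

Use NumPy and Pandas to compute some basic descriptive statistics:

import numpy as np
import pandas as pd

# 1,000 draws from the standard normal distribution N(0, 1)
data = np.random.normal(0, 1, 1000)
df = pd.DataFrame(data, columns=['values'])

print(df.describe())

This example generates 1,000 random numbers from the standard normal distribution and reports the count, mean, standard deviation, minimum, quartiles, and maximum.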

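3. Probability Distributions

Use SciPy to plot the probability density function of the normal distribution:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
# PDF of the normal distribution with mean 0 and standard deviation 1
plt.plot(x, stats.norm.pdf(x, 0, 1))
plt.title("Standard Normal Distribution")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.show()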
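Beyond plotting the density, it is often useful to turn it into probabilities and quantiles. The sketch below uses the standard library's `statistics.NormalDist` as a dependency-free stand-in for `scipy.stats.norm`; that substitution and the variable names are choices made here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within one and two standard deviations of the mean
within_one_sigma = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)
within_two_sigma = standard_normal.cdf(2.0) - standard_normal.cdf(-2.0)

# Quantile (inverse CDF): the point below which 97.5% of the mass lies
z_975 = standard_normal.inv_cdf(0.975)

# Height of the density curve at its peak
pdf_at_zero = standard_normal.pdf(0.0)

print(f"P(|X| <= 1) = {within_one_sigma:.4f}")
print(f"P(|X| <= 2) = {within_two_sigma:.4f}")
print(f"97.5% quantile = {z_975:.4f}")
print(f"pdf(0) = {pdf_at_zero:.4f}")
```

With SciPy, the corresponding calls would be `stats.norm.cdf`, `stats.norm.ppf`, and `stats.norm.pdf`.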

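4. Central Limit Theorem

Demonstrate the central limit theorem:

import numpy as np
import matplotlib.pyplot as plt

# 1,000 sample means, each computed from 100 draws of an exponential distribution
sample_means = [np.mean(np.random.exponential(1, 100)) for _ in range(1000)]
plt.hist(sample_means, bins=30, edgecolor='black')
plt.title("Distribution of Sample Means")
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.show()

Although the exponential distribution itself is strongly skewed, the histogram shows that the means of samples drawn from it are approximately normally distributed.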
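The theorem also predicts the spread of the sample means: for samples of size n from a distribution with standard deviation σ, the means have standard deviation of about σ/√n. The following standard-library sketch checks this for the exponential case (σ = 1, n = 100); the seed and variable names are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (assumption) for reproducibility
sample_size = 100
n_samples = 1000

# Each entry is the mean of 100 draws from an exponential distribution with rate 1
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
predicted_std = 1.0 / sample_size ** 0.5  # sigma / sqrt(n) with sigma = 1

print(f"Mean of sample means: {mean_of_means:.3f} (theory: 1.000)")
print(f"Std of sample means:  {std_of_means:.3f} (theory: {predicted_std:.3f})")
```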

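5. Hypothesis Testing

Run a t-test:

import numpy as np
from scipy import stats

# Two independent samples whose true means differ by 0.5
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

t_statistic, p_value = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

This example runs an independent two-sample t-test to check whether the means of the two groups differ significantly. A small p-value (for example, below 0.05) is evidence against the hypothesis that the means are equal.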
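To make the test less of a black box, the sketch below recomputes the statistic by hand using the pooled-variance formula, which is what `stats.ttest_ind` uses under its default equal-variance assumption, and compares it with SciPy's result. The seeded generator and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
group1 = rng.normal(0.0, 1.0, 100)
group2 = rng.normal(0.5, 1.0, 100)

n1, n2 = len(group1), len(group2)
df = n1 + n2 - 2

# Pooled sample variance of the two groups
pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / df

# t statistic: difference in means divided by its standard error
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(group1, group2)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```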

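6. Confidence Intervals

Compute a confidence interval for the mean:

import numpy as np
from scipy import stats

data = np.random.normal(0, 1, 100)
mean = np.mean(data)
se = stats.sem(data)  # standard error of the mean
# 95% interval based on the t distribution with n - 1 degrees of freedom
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=se)
print(f"95% Confidence Interval: {ci}")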
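The interval returned by `stats.t.interval` is the sample mean plus or minus a t critical value times the standard error. The following sketch spells that out and checks it against SciPy; the seeded generator and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
data = rng.normal(0, 1, 100)

n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Critical value leaving 2.5% in each tail of the t distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_manual = (mean - t_crit * se, mean + t_crit * se)

ci_scipy = stats.t.interval(0.95, n - 1, loc=mean, scale=se)
print(f"manual: ({ci_manual[0]:.4f}, {ci_manual[1]:.4f})")
print(f"scipy:  ({ci_scipy[0]:.4f}, {ci_scipy[1]:.4f})")
```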

7. Linear Regression

Fit a simple linear regression with sklearn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 2x + 1 plus a small amount of Gaussian noise
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1) * 0.1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Coefficient: {model.coef_[0][0]:.2f}")
print(f"Intercept: {model.intercept_[0]:.2f}")
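For a single feature, the fitted line is the ordinary least-squares solution, which can also be obtained directly with NumPy. The sketch below does this on data generated the same way (true slope 2, true intercept 1); the seed, the use of `np.linalg.lstsq`, and the variable names are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
x = rng.random(100)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)

# Design matrix: one column for x and a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] ~= y
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient of determination on the data used for fitting
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```

The recovered slope and intercept should land close to 2 and 1, since the noise level is small relative to the signal.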

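8. Polynomial Regression

Use NumPy's polyfit function to fit a polynomial regression:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y = x^2 plus Gaussian noise
x = np.linspace(0, 1, 100)
y = x**2 + np.random.randn(100) * 0.1

# Fit a degree-2 polynomial and wrap the coefficients in a callable
coeffs = np.polyfit(x, y, 2)
p = np.poly1d(coeffs)

plt.scatter(x, y)
plt.plot(x, p(x), color='red')
plt.title("Polynomial Regression")
plt.show()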

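9. Bayesian Inference

Use PyMC3 for a simple Bayesian inference:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

with pm.Model() as model:
    # Prior on the unknown mean
    mu = pm.Normal('mu', mu=0, sd=1)
    # Likelihood: 100 observations drawn from a standard normal
    obs = pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
    # Draw 1,000 posterior samples per chain
    trace = pm.sample(1000)

pm.plot_posterior(trace)
plt.show()

This example shows how to perform Bayesian inference on the mean of a normal distribution.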
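For this particular model (normal prior on the mean, normal likelihood with known variance), the posterior is available in closed form, which gives a quick sanity check on the sampled result. The sketch below computes it with the standard library only; the seed and variable names are assumptions made for illustration, and the observations are freshly simulated rather than shared with the PyMC3 run.

```python
import random

rng = random.Random(0)  # fixed seed (assumption) for reproducibility
observations = [rng.gauss(0.0, 1.0) for _ in range(100)]

prior_mean, prior_var = 0.0, 1.0  # prior mu ~ N(0, 1), as in the model above
likelihood_var = 1.0              # known observation variance
n = len(observations)

# Conjugate update: precisions (inverse variances) add up
posterior_precision = 1.0 / prior_var + n / likelihood_var
posterior_var = 1.0 / posterior_precision
posterior_mean = posterior_var * (
    prior_mean / prior_var + sum(observations) / likelihood_var
)

print(f"Posterior mean: {posterior_mean:.4f}")
print(f"Posterior std:  {posterior_var ** 0.5:.4f}")
```

With 100 observations the posterior standard deviation is about 0.1, so the posterior plotted from the MCMC samples should be a narrow bell curve of roughly that width.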

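10. Monte Carlo Simulation

Estimate π with the Monte Carlo method:

import random

def estimate_pi(n):
    inside_circle = 0
    total_points = n

    for _ in range(total_points):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1]
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1

    # Area of unit circle / area of square = pi / 4
    return 4 * inside_circle / total_points

print(f"Estimated value of π: {estimate_pi(1000000):.6f}")

This example estimates π from random points: the fraction of points that fall inside the unit circle approximates π/4.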
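The pure-Python loop is easy to read but slow for large n. A vectorized NumPy variant, sketched below, draws all points at once and also reports a standard error for the estimate. The function name `estimate_pi_vectorized` and the `seed` parameter are hypothetical, introduced here for illustration.

```python
import math
import numpy as np

def estimate_pi_vectorized(n, seed=0):
    """Monte Carlo estimate of pi and its standard error from n random points."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n)
    y = rng.uniform(-1, 1, n)
    inside = (x**2 + y**2) <= 1          # boolean mask of points in the unit circle
    p_hat = inside.mean()                # estimated probability of landing inside
    pi_hat = 4 * p_hat
    # Binomial standard error of p_hat, scaled by 4
    std_err = 4 * math.sqrt(p_hat * (1 - p_hat) / n)
    return pi_hat, std_err

pi_hat, std_err = estimate_pi_vectorized(1_000_000)
print(f"Estimated value of pi: {pi_hat:.6f} +/- {std_err:.6f}")
```

The standard error shrinks like 1/√n, so each additional correct decimal digit costs about 100 times more samples.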

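11. Markov Chains

Implement a simple Markov chain:

import random

states = ['A', 'B', 'C']
# transition_matrix[i][j] is the probability of moving from state i to state j
transition_matrix = {
    'A': {'A': 0.3, 'B': 0.6, 'C': 0.1},
    'B': {'A': 0.4, 'B': 0.2, 'C': 0.4},
    'C': {'A': 0.1, 'B': 0.3, 'C': 0.6}
}

def next_state(current):
    # Sample the next state using the current state's row as weights
    weights = [transition_matrix[current][s] for s in states]
    return random.choices(states, weights=weights)[0]

current = 'A'
for _ in range(10):
    print(current, end=' -> ')
    current = next_state(current)
print(current)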
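Simulating a single path shows the chain's behavior step by step; its long-run behavior is described by the stationary distribution, the probability vector that one more transition leaves unchanged. The sketch below finds it by repeatedly propagating a distribution through the same transition matrix (power iteration). The helper name `step` and the iteration count are assumptions made for illustration.

```python
states = ['A', 'B', 'C']
transition_matrix = {
    'A': {'A': 0.3, 'B': 0.6, 'C': 0.1},
    'B': {'A': 0.4, 'B': 0.2, 'C': 0.4},
    'C': {'A': 0.1, 'B': 0.3, 'C': 0.6},
}

def step(dist):
    """Advance a probability distribution over states by one transition."""
    return {
        target: sum(dist[source] * transition_matrix[source][target] for source in states)
        for target in states
    }

# Start with all probability mass on state A and iterate until it settles
dist = {'A': 1.0, 'B': 0.0, 'C': 0.0}
for _ in range(200):
    dist = step(dist)
stationary = dist

for state in states:
    print(f"{state}: {stationary[state]:.4f}")
```

Because every transition probability here is positive, the iteration converges to the same distribution regardless of the starting state.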

12. Principal Component Analysis (PCA)

Run PCA with sklearn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 100 samples with 5 features, reduced to 2 principal components
data = np.random.randn(100, 5)
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)

plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.title("PCA Reduced Data")
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.show()
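As a rough picture of what the PCA step computes, the sketch below performs the same kind of reduction with a singular value decomposition of the centered data in plain NumPy. It is an illustration under stated assumptions (seeded data, no whitening), not a reproduction of sklearn's internals; the signs of the components may differ from what sklearn returns.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
data = rng.standard_normal((100, 5))

# PCA operates on mean-centered data
centered = data - data.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Variance captured along each direction, and its share of the total
explained_variance = S**2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal directions
reduced = centered @ Vt[:2].T

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Reduced shape:", reduced.shape)
```

For data like this, where all five features are independent noise, no component dominates: the explained-variance shares are all of similar size, which signals that two components discard a large part of the variance.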

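13. Time Series Analysis

Fit an ARIMA model with statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(1)
# A random walk: the cumulative sum of Gaussian noise
ts = pd.Series(np.cumsum(np.random.randn(100)))

# ARIMA(p=1, d=1, q=1): one AR term, first differencing, one MA term
model = ARIMA(ts, order=(1,1,1))
results = model.fit()
print(results.summary())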

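14. Kernel Density Estimation

Use seaborn for kernel density estimation:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Bimodal sample: a mixture of N(-2, 1) and N(2, 1)
data = np.concatenate([np.random.normal(-2, 1, 1000), np.random.normal(2, 1, 1000)])
sns.kdeplot(data)
plt.title("Kernel Density Estimation")
plt.show()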

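15. The Bootstrap Method

Use the bootstrap to estimate a confidence interval for the mean:

import numpy as np

def bootstrap_mean(data, num_samples, size):
    # Resample with replacement and record the mean of each resample
    means = [np.mean(np.random.choice(data, size=size)) for _ in range(num_samples)]
    # Percentile interval: the middle 95% of the bootstrap means
    return np.percentile(means, [2.5, 97.5])

data = np.random.normal(0, 1, 1000)
ci = bootstrap_mean(data, 10000, len(data))
print(f"95% CI for the mean: {ci}")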

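16. Power Analysis for Hypothesis Tests

Run a power analysis for the t-test:

from statsmodels.stats.power import TTestIndPower

effect = 0.5   # standardized effect size (Cohen's d)
alpha = 0.05   # significance level
power = 0.8    # desired probability of detecting the effect

analysis = TTestIndPower()
# Solve for the sample size of group 1; ratio=1.0 means equal group sizes
sample_size = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

print(f"Required sample size: {sample_size:.0f}")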
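The analytic answer for these settings is roughly 64 observations per group. One way to sanity-check such a number is to simulate many experiments at that sample size and count how often the t-test rejects. The sketch below does this with NumPy and SciPy; the seed, the number of simulations, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
n_per_group = 64
effect = 0.5   # difference in means, in units of the standard deviation
alpha = 0.05
n_sim = 2000   # number of simulated experiments

# Each row is one simulated experiment
a = rng.normal(0.0, 1.0, size=(n_sim, n_per_group))
b = rng.normal(effect, 1.0, size=(n_sim, n_per_group))

# Row-wise independent two-sample t-tests
_, p_values = stats.ttest_ind(a, b, axis=1)

# Power is the fraction of experiments in which the null is rejected
empirical_power = np.mean(p_values < alpha)
print(f"Empirical power at n = {n_per_group} per group: {empirical_power:.3f}")
```

The simulated rejection rate should come out near the targeted 0.8, up to Monte Carlo noise.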

17. Bayesian Information Criterion (BIC)

Use BIC for model selection:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(100, 3)
# Only the first two features influence y; the third is irrelevant
y = X[:, 0] + 2*X[:, 1] + np.random.randn(100) * 0.1

def bic(y, y_pred, n_params):
    # BIC for Gaussian errors, up to an additive constant: n*ln(MSE) + k*ln(n)
    mse = mean_squared_error(y, y_pred)
    return len(y) * np.log(mse) + n_params * np.log(len(y))

# Candidate models using the first one, two, and three features
models = [
    LinearRegression().fit(X[:, :1], y),
    LinearRegression().fit(X[:, :2], y),
    LinearRegression().fit(X, y)
]

bic_scores = [bic(y, model.predict(X[:, :i+1]), i+1) for i, model in enumerate(models)]
best_model = np.argmin(bic_scores)
print(f"Best model (lowest BIC): {best_model + 1} features")

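18. Nonparametric Tests

Use the Mann-Whitney U test:

import numpy as np
from scipy import stats

group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

# Rank-based test that does not assume normally distributed data
statistic, p_value = stats.mannwhitneyu(group1, group2)
print(f"Mann-Whitney U statistic: {statistic}")
print(f"P-value: {p_value:.4f}")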

19. Survival Analysis

Use lifelines for a Kaplan-Meier survival analysis:

import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

T = np.random.exponential(10, size=100)   # observed durations
E = np.random.binomial(1, 0.7, size=100)  # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(T, E, label="KM Estimate")
kmf.plot()
plt.title("Kaplan-Meier Survival Curve")
plt.show()
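The Kaplan-Meier curve is built from a simple product: at each time where an event occurs, the current survival probability is multiplied by (1 - events / number still at risk), while censored subjects simply leave the risk set. The standard-library sketch below implements that rule on a tiny hand-made dataset; the function name `kaplan_meier` and the toy data are assumptions made for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed event."""
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects that share the same time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t
        n_at_risk -= leaving
    return curve

# Toy data: 1 = event observed, 0 = censored
toy_durations = [1, 2, 3, 4, 5]
toy_events = [1, 0, 1, 1, 0]
curve = kaplan_meier(toy_durations, toy_events)

for t, s in curve:
    print(f"t = {t}: S(t) = {s:.4f}")
```

In the toy data, the subject censored at time 2 lowers the number at risk for later events without causing a drop in the curve itself.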

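20. Cluster Analysis

Use K-means clustering:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 300 points from a single 2-D Gaussian, so the data has no true cluster structure
X = np.random.randn(300, 2)
# n_init=10 runs the algorithm from 10 random initializations and keeps the best
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(X)

# Color each point by its assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.title("K-means Clustering")
plt.show()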
