沧海拾珠

Machine Learning: Predicting Soccer Player Ratings with Regression

1. Introduction to Regression Models in sklearn

A linear regression model fits coefficients W = (w_1, …, w_n) so as to minimize the residual sum of squares between the observed targets in the data and the targets predicted by the linear approximation; the fitted model is then used for prediction.
If a regression analysis involves only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple (univariate) linear regression. If it involves two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression.
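The least-squares objective described above can be sketched with a tiny synthetic example (the data and coefficients below are invented for illustration, not taken from the soccer dataset):

```python
import numpy as np

# Hypothetical toy data: y is an exact linear function of two features,
# with intercept 1 and coefficients (2, 3).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0

# Append a column of ones for the intercept, then solve min ||A w - y||^2.
A = np.hstack([X, np.ones((X.shape[0], 1))])
w, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # recovers approximately [2., 3., 1.]
```

Because the toy targets are noise-free, the least-squares solution recovers the generating coefficients exactly; on real data the minimization only reduces, never eliminates, the residuals.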

2. Import the Required Libraries

import sqlite3
import pandas as pd
from sklearn.tree import DecisionTreeRegressor  # decision tree regression model
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

3. Load the Dataset (from the Kaggle European Soccer Database)

cnx = sqlite3.connect("database.sqlite")
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
df.columns
Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
'potential', 'preferred_foot', 'attacking_work_rate',
'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
'gk_reflexes'],
dtype='object')

4. Select the Features and the Target

features = [
'potential', 'crossing', 'finishing', 'heading_accuracy',
'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
'gk_reflexes']
target = ['overall_rating']
X = df[features]
y = df[target]

5. Drop NaN Values, Then Split the Data

# Drop rows with NaN before extracting X and y; otherwise X and y,
# created from the original df, would still contain missing values.
df = df.dropna()
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

6. Build the Linear Regression Model and Make Predictions

regressor = LinearRegression()
regressor.fit(X_train,y_train)
y_prediction = regressor.predict(X_test)
y_prediction
array([[ 66.51284879],
[ 79.77234615],
[ 66.57371825],
...,
[ 69.23780133],
[ 64.58351696],
[ 73.6881185 ]])
# root mean squared error (RMSE)
sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))
2.805303046855228
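After fitting, a LinearRegression exposes the learned weights through its coef_ and intercept_ attributes, which can help judge which features drive the prediction. A minimal self-contained sketch on synthetic data (the feature ranges and true weights below are made up, not from the soccer dataset):

```python
import numpy as np
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(40, 90, size=(200, 3))  # three hypothetical skill ratings
# True relation: weights (0.5, 0.3, 0.2) plus Gaussian noise with std 1.0.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 1.0, 200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
rmse = sqrt(mean_squared_error(y, pred))
print(model.coef_)  # close to the true weights (0.5, 0.3, 0.2)
print(rmse)         # close to the noise level of 1.0
```

The recovered coefficients approximate the generating weights, and the training RMSE approaches the irreducible noise level, which mirrors what the soccer-rating model's residual of about 2.8 represents.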

7. Build the Decision Tree Regression Model and Make Predictions

regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train,y_train)
y_predict = regressor.predict(X_test)
sqrt(mean_squared_error(y_true = y_test,y_pred = y_predict))
1.4637145204101003
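The max_depth parameter controls how finely the tree can partition the feature space, and it strongly affects accuracy; a common check is to sweep it with cross-validation. A sketch on synthetic non-linear data (the data-generating rule below is invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(500, 4))
# Non-linear rule: the target switches between two features at a threshold.
y = np.where(X[:, 0] > 50, X[:, 1], X[:, 2]) + rng.normal(0, 2.0, 500)

rmse_by_depth = {}
for depth in (2, 5, 10, 20):
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=0),
        X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse_by_depth[depth] = -scores.mean()
    print(depth, rmse_by_depth[depth])
```

On data with this kind of interaction, shallow trees underfit badly while deeper trees capture the threshold structure, so the cross-validated RMSE drops sharply as depth grows before flattening out.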

As the results show, the decision tree regression model achieves a lower RMSE (higher accuracy) than the linear regression model on this dataset, although I am not yet sure of the underlying reason.