Tennis Data Analysis Diary

(Updated: Nov 5, 2019)

Goal: for every player entered in a tournament, predict the probability of winning the title.

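[Failed] Hypothesis: a player's recent percentage of consecutive wins in tourneys affects the probability of winning the F round of the next tourney. Result: no correlation observed.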

[Choosing hypotheses] Design the data and check which features are correlated.

[Other] Start by reproducing the results of other people's papers...

See how TrueSkill handles this, then bring TrueSkill into the calculation.
- Tennis ranking: refer to ATP
- TrueSkill: refer to TrueSkill Rank

Express the required data as a dataframe:
- y: final_winner, the probability of being the tourney final winner.
- x: player_name, tourney_level, surface, player_age, player_ht, player_hand, player_rank_points (NULL values need to be handled), recent_period_win_percent (maybe the 1-year average of win_percent * tourney_level)
  - Use pandas df.rolling for this; the window length can be set freely.
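A minimal sketch of the rolling feature, assuming one row per player per tourney and a 365-day window. The sample values, the `level_weight` numbers, and the choice of `closed="left"` (so that a tourney's own result does not leak into its feature) are assumptions, not something decided in these notes.

```python
import pandas as pd

# Made-up results, already sorted by player and date.
results = pd.DataFrame(
    {
        "player_name": ["A", "A", "A", "A", "B", "B"],
        "tourney_date": pd.to_datetime(
            ["2018-01-15", "2018-06-01", "2018-12-20", "2019-03-01",
             "2018-02-01", "2018-09-01"]
        ),
        "win_percent": [50.0, 100.0, 75.0, 25.0, 0.0, 50.0],
        "level_weight": [1.0, 2.0, 1.0, 1.0, 1.0, 2.0],
    }
)
results["wt_win_percent"] = results["win_percent"] * results["level_weight"]

# Rolling mean over the previous 365 days, computed per player.
results["recent_period_win_percent"] = float("nan")
for _, group in results.groupby("player_name"):
    series = group.set_index("tourney_date")["wt_win_percent"]
    # closed="left" excludes the current tourney from its own window.
    rolled = series.rolling("365D", closed="left").mean()
    results.loc[group.index, "recent_period_win_percent"] = rolled.to_numpy()

print(results[["player_name", "tourney_date", "recent_period_win_percent"]])
```

The first tourney of each player has no history, so its feature is NaN and needs the same missing-value handling as player_rank_points.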
One-hot encode the categorical data:
- Use pandas get_dummies.
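For reference, a small sketch of the encoding step with `pd.get_dummies`. The rows are invented, and the level codes other than "A" are placeholders.

```python
import pandas as pd

# Invented rows; only the column names come from the notes.
df = pd.DataFrame(
    {
        "surface": ["hard", "clay", "grass", "hard"],
        "player_hand": ["R", "L", "R", "R"],
        "tourney_level": ["A", "G", "M", "A"],
        "player_age": [20.0, 31.5, 27.2, 24.0],
    }
)

# Each listed column is replaced by one 0/1 column per distinct value;
# numeric columns such as player_age pass through unchanged.
encoded = pd.get_dummies(
    df, columns=["surface", "player_hand", "tourney_level"], dtype=int
)
print(encoded.columns.tolist())
```

A high-cardinality column such as player_name produces one new column per distinct player, which is how the column count can grow into the hundreds.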
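Dimensionality reduction:
- [Failed] Tried PCA, but the data turned out to have 330 columns (because of the one-hot encoding), and the cumulative explained variance of the first four components was only just above 5%.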
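The cumulative percentage can be read from `explained_variance_ratio_`. The sketch below uses a synthetic one-hot matrix (50 categories, random rows) rather than the real data. It only illustrates one plausible reason for the low figure: one-hot columns spread the variance almost evenly, so a handful of components cannot capture much.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a one-hot block: 500 rows, 50 categories.
codes = rng.integers(0, 50, size=500)
X = np.eye(50)[codes]

pca = PCA(n_components=4)
pca.fit(X)

# Cumulative share of variance explained by the first 1..4 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```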
Linear data processing:
- EDA: use pairplot to check the relationships in the data.
  - No obvious correlation showed up between the x features and the y outcome.
  - player_age / player_ht contain 0 values.
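One way to deal with the zeros, assuming they stand in for missing values (an age or height of 0 is not plausible): convert them to NaN so they are counted together with the NULL rank points, then impute. Median imputation is an assumption chosen for the sketch, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Invented rows with the problems described above.
df = pd.DataFrame(
    {
        "player_age": [20.0, 0.0, 31.0, 27.0],
        "player_ht": [189.0, 0.0, 0.0, 185.0],
        "player_rank_points": [2190.0, np.nan, 450.0, 1200.0],
    }
)

# Treat 0 as "missing" only in columns where 0 cannot be a real value.
zero_as_missing = ["player_age", "player_ht"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Count missing values per column before deciding how to fill them.
missing = df.isna().sum()

# Fill every gap with the column median (computed ignoring NaN).
filled = df.fillna(df.median())
print(missing)
print(filled)
```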
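Split the data into a training dataset and a test dataset, and verify that each y label is evenly distributed across the two;
Normalize X_train, then apply the same transformation to X_test;
Then use an off-the-shelf estimator from sklearn;
Then compute the accuracy.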
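The last four steps can be sketched as follows. The data is synthetic, and the concrete choices (`StandardScaler` for the normalization, `LogisticRegression` as the off-the-shelf estimator, a 75/25 split) are assumptions, since the notes do not name them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced binary label (few "winners").
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.3).astype(int)

# stratify=y keeps the share of each label the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
train_rate, test_rate = y_train.mean(), y_test.mean()

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression().fit(X_train_s, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_s))
print(train_rate, test_rate, accuracy)
```

Because only one player per draw is the final winner, the positive label is rare, and a model that always predicts 0 already scores a high accuracy. It is worth comparing against that baseline.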

player_name	tourney_id	tourney_name	surface	draw_size	tourney_level	tourney_date	final_winner	round_total	winner_max_round	win_percent	player_hand	player_age	player_ht	player_rank_points	wt_win_percent
roger xxx	idxxx	xxx master	hard	32	A	2019-01-01	1	4	4	100	R	20	189	2190	win% * level weight

player_name	tourney_id	tourney_name	surface	draw_size	tourney_level	level_weight	tourney_date	player_hand	player_age	player_ht	player_rank_points
roger xxx	xxx	xx master	hard	32	A	1	2019-01-01	R	20	189	2190

round	round_count	t_id
RR	4	which is tourney_id

tourney_id	round_total
xxx	5

winner_name	winner_max_round	tourney_id
roger xxx	4	xxx

loser_name	loser_count	tourney_id
roger xxx	1	xxx

final_winner_name	tourney_id
roger xx	xxx
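The intermediate tables above can be combined into the final table with left merges. This is a sketch with invented rows (p1, p2, p3 in tourney t1). The formula win_percent = 100 * winner_max_round / round_total is an assumption read off the sample row (4 of 4 rounds gives 100), and treating players with no recorded win as 0 is also an assumption.

```python
import pandas as pd

# Invented stand-ins for the intermediate tables.
players = pd.DataFrame(
    {
        "player_name": ["p1", "p2", "p3"],
        "tourney_id": ["t1", "t1", "t1"],
        "level_weight": [1.0, 1.0, 1.0],
    }
)
round_total = pd.DataFrame({"tourney_id": ["t1"], "round_total": [4]})
wins = pd.DataFrame(
    {
        "winner_name": ["p1", "p2"],
        "winner_max_round": [4, 2],
        "tourney_id": ["t1", "t1"],
    }
)
finals = pd.DataFrame({"final_winner_name": ["p1"], "tourney_id": ["t1"]})

keys = ["player_name", "tourney_id"]
table = players.merge(round_total, on="tourney_id", how="left")
table = table.merge(
    wins.rename(columns={"winner_name": "player_name"}), on=keys, how="left"
)
# Players who never won a match have no row in `wins`.
table["winner_max_round"] = table["winner_max_round"].fillna(0)

flags = finals.rename(columns={"final_winner_name": "player_name"})
flags = flags.assign(final_winner=1)
table = table.merge(flags, on=keys, how="left")
table["final_winner"] = table["final_winner"].fillna(0).astype(int)

table["win_percent"] = 100 * table["winner_max_round"] / table["round_total"]
table["wt_win_percent"] = table["win_percent"] * table["level_weight"]
print(table)
```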

Is the probability of winning a tourney related to whether the player is left-handed or right-handed?
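A first look at this question could be a cross-tabulation of player_hand against final_winner. The rows below are invented, so the rates say nothing about real players; a significance test would be a natural next step on the real data.

```python
import pandas as pd

# Invented rows: one per player per tourney.
df = pd.DataFrame(
    {
        "player_hand": ["R", "R", "R", "R", "L", "L", "R", "L"],
        "final_winner": [1, 0, 0, 0, 1, 0, 0, 0],
    }
)

# normalize="index" turns counts into per-hand shares, so column 1 is the
# fraction of entries by that hand that ended in a title.
rates = pd.crosstab(df["player_hand"], df["final_winner"], normalize="index")
print(rates)
```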

Categorical input to numerical data
- nominal input: try the one-hot-encoding first

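I have only just worked out what the columns mean, so I am putting the terms I have learned here in case I need them later. (It is complicated.)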