

JPX-2. The Top-10 Solutions [1/2]

  • The code the original authors submitted to the Tokyo Stock Exchange is in places quite messy. That is no surprise: during a competition, time is tight and the workload heavy, so code quality is rarely a priority. I am no different myself.
  • For this article I have lightly refactored the original authors' code.

Notes

The Tokyo Stock Exchange has published the top-10 solutions here:
https://github.com/J-Quants/JPXTokyoStockExchangePrediction

The exchange also published an official review of the top-10 solutions:
https://www.youtube.com/watch?v=Ax3ON-2FLBM

The exchange grouped the solutions into two categories: the 1st-, 2nd-, 3rd-, 6th-, 7th-, and 8th-place solutions are conventional approaches, while the 4th-, 5th-, and 10th-place solutions are innovative approaches.
The 9th-place solution was submitted later and had not yet reached the exchange at the time, so it was left unclassified. Having read it, I would count it among the innovative approaches as well.

Grouping

This chapter covers the conventional approaches:

  • 1st place
  • 2nd place
  • 3rd place
  • 6th place
  • 7th place
  • 8th place

The next chapter, "JPX-3. The Top-10 Solutions [2/2]", covers the innovative approaches:

  • 4th place
  • 5th place
  • 9th place
  • 10th place

1st Place

Code

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy import stats
import jpx_tokyo_market_prediction
train_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")
train_secondary_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/secondary_stock_prices.csv")
supplemental_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv")
supplemental_secondary_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/secondary_stock_prices.csv")

stock_prices = pd.concat([train_stock_prices,train_secondary_stock_prices,supplemental_stock_prices,supplemental_secondary_stock_prices])
def featuring_train(data):

    data['Date'] = pd.to_datetime(data['Date'])
    data['Target'] = data['Target'].fillna(0)
    data["SupervisionFlag"] = data["SupervisionFlag"].astype(int)

    # Handle missing values
    data['ExpectedDividend'] = data['ExpectedDividend'].fillna(0)
    cols = ['Open', 'High', 'Low', 'Close']
    data.loc[:, cols] = data.loc[:, cols].ffill()
    data.loc[:, cols] = data.loc[:, cols].bfill()

    # Derive features
    data['Daily_Range'] = data['Close'] - data['Open']
    data['Mean'] = (data['High'] + data['Low']) / 2
    data['Mean'] = data['Mean'].astype(int)

    # Standardize (z-score)
    data['Open'] = stats.zscore(data['Open'])
    data['High'] = stats.zscore(data['High'])
    data['Low'] = stats.zscore(data['Low'])
    data['Close'] = stats.zscore(data['Close'])
    data['Volume'] = stats.zscore(data['Volume'])
    data['Daily_Range'] = stats.zscore(data['Daily_Range'])
    data['Mean'] = stats.zscore(data['Mean'])

    # Drop unused columns
    data = data.drop(['RowId'], axis=1)

    return data

data = featuring_train(stock_prices)
data_train = data[data['Date']<'2022-04-01']

data_test = data[data['Date']>'2022-04-01']
data_test = data_test.reset_index(drop=True)

data_train = data_train.drop(['Date'], axis=1)
data_test = data_test.drop(['Date'], axis=1)

X_train = data_train.drop(['Target'], axis=1)
y_train = data_train['Target']

X_test = data_test.drop(['Target'], axis=1)
y_test = data_test['Target']
model = LinearRegression()
model.fit(X_train, y_train)
def featuring_test(data):

    data["SupervisionFlag"] = data["SupervisionFlag"].astype(int)

    # Handle missing values
    data['ExpectedDividend'] = data['ExpectedDividend'].fillna(0)
    cols = ['Open', 'High', 'Low', 'Close']
    data.loc[:, cols] = data.loc[:, cols].ffill()
    data.loc[:, cols] = data.loc[:, cols].bfill()

    # Derive features
    data['Daily_Range'] = data['Close'] - data['Open']
    data['Mean'] = (data['High'] + data['Low']) / 2
    data['Mean'] = data['Mean'].astype(int)

    # Standardize (z-score)
    data['Open'] = stats.zscore(data['Open'])
    data['High'] = stats.zscore(data['High'])
    data['Low'] = stats.zscore(data['Low'])
    data['Close'] = stats.zscore(data['Close'])
    data['Volume'] = stats.zscore(data['Volume'])
    data['Daily_Range'] = stats.zscore(data['Daily_Range'])
    data['Mean'] = stats.zscore(data['Mean'])

    # Drop unused columns
    data = data.drop(['RowId', 'Date'], axis=1)

    return data
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:

    x_test = featuring_test(prices)
    y_pred = model.predict(x_test)

    sample_prediction['Target'] = y_pred
    sample_prediction = sample_prediction.sort_values(by="Target", ascending=False)

    sample_prediction['Rank'] = np.arange(len(sample_prediction.index))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    sample_prediction = sample_prediction.drop(["Target"], axis=1)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]

    env.predict(submission)

Reproduction

The 1st-place winner not only submitted the code to the Tokyo Stock Exchange but also open-sourced it on Kaggle; the two copies are identical.

However, the published code does not reproduce the winning score of 0.381. I ran it several times and always got 0.277, so the winner may have held something back.

(This has nothing to do with my refactoring: running the original, unrefactored code also yields 0.277.)

Kaggle link: https://www.kaggle.com/code/shokisakai/jpx-regression

Commentary

Model

A linear regression model.

Missing-Value Handling

ExpectedDividend is filled with 0.

Open, High, Low, and Close are forward-filled and then backward-filled. Sample code:

cols = ['Open', 'High', 'Low', 'Close']
data.loc[:,cols] = data.loc[:,cols].ffill()
data.loc[:,cols] = data.loc[:,cols].bfill()

The solution leaves Volume's missing values untouched.

Derived Features

The author derived two features:

  • Daily_Range: Close minus Open.
  • Mean: (High + Low) / 2, truncated to an integer.
data['Daily_Range'] = data['Close'] - data['Open']
data['Mean'] = (data['High']+data['Low']) / 2
data['Mean'] = data['Mean'].astype(int)

Standardization

Finally, the author applied z-score standardization to Open, High, Low, Close, Volume, Daily_Range, and Mean, i.e. mapped the raw values to zero mean and unit variance:

x' = \frac{x - \mathrm{mean}}{\sigma}

where

  • mean denotes the average
  • \sigma denotes the standard deviation

TSE Second Section

The author also fed the TSE Second Section data (the secondary stock price files) into training.

Points of Confusion

Feature Engineering

One part of the feature-engineering code is illogical: the missing-value handling.
The author should have grouped by stock and sorted by date before filling Open, High, Low, and Close from neighboring values; as written, the fill can carry one stock's prices into another. A sketch of the grouped fill follows.
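A minimal sketch of that grouped fill, assuming the same `data` frame and column names as the author's code:

cols = ['Open', 'High', 'Low', 'Close']
# Put each stock's rows in date order, then fill gaps within each stock only
data = data.sort_values(['SecuritiesCode', 'Date'])
data[cols] = data.groupby('SecuritiesCode')[cols].ffill()
data[cols] = data.groupby('SecuritiesCode')[cols].bfill()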

If I perturb the row order slightly, for example by adding this one line:

data.sort_values(by='Close', inplace=True)

the score drops to -0.063.

Test-Data Processing

One further point is genuinely debatable: the feature processing of the test data.

The training data are standardized with statistics computed over all stocks and all training dates.

The test data, however, are standardized with statistics from all stocks on a single day, which is inconsistent.

Arguably, with a cross-section of roughly 2,000 stocks per day, the effect can be ignored.
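The more consistent alternative is to compute the statistics on the training data and reuse them on the test data. A minimal sketch with scikit-learn's StandardScaler (my substitution, not the author's code), assuming the data_train and data_test frames from above:

from sklearn.preprocessing import StandardScaler

num_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Range', 'Mean']
scaler = StandardScaler().fit(data_train[num_cols])            # statistics from training data only
data_train[num_cols] = scaler.transform(data_train[num_cols])
data_test[num_cols] = scaler.transform(data_test[num_cols])    # reuse the same statistics at test time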

2nd Place

Code

import math
import os

import jpx_tokyo_market_prediction
import lightgbm as lgb
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import mean_squared_error
# Set the random seeds
def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)


SEED = 42
seed_everything(SEED)
# Only the TSE First Section data are used
train = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv", parse_dates=["Date"])
# Drop the row-id, expected-dividend, adjustment-factor, and supervision-flag columns
train = train.drop(columns=['RowId', 'ExpectedDividend', 'AdjustmentFactor', 'SupervisionFlag']).dropna().reset_index(drop=True)
# Derive features
def add_features(feats):
    # Change vs. 20 trading days ago
    feats["return_1month"] = feats["Close"].pct_change(20)
    # Change vs. 40 trading days ago
    feats["return_2month"] = feats["Close"].pct_change(40)
    # Change vs. 60 trading days ago
    feats["return_3month"] = feats["Close"].pct_change(60)
    # Volatility over the past 20 trading days
    feats["volatility_1month"] = (
        np.log(feats["Close"]).diff().rolling(20).std()
    )
    # Volatility over the past 40 trading days
    feats["volatility_2month"] = (
        np.log(feats["Close"]).diff().rolling(40).std()
    )
    # Volatility over the past 60 trading days
    feats["volatility_3month"] = (
        np.log(feats["Close"]).diff().rolling(60).std()
    )
    # Gap vs. the 20-day moving average
    feats["MA_gap_1month"] = feats["Close"] / (
        feats["Close"].rolling(20).mean()
    )
    # Gap vs. the 40-day moving average
    feats["MA_gap_2month"] = feats["Close"] / (
        feats["Close"].rolling(40).mean()
    )
    # Gap vs. the 60-day moving average
    feats["MA_gap_3month"] = feats["Close"] / (
        feats["Close"].rolling(60).mean()
    )

    return feats
# Mean squared error
def feval_rmse(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'rmse', mean_squared_error(y_true, y_pred), False


# Pearson correlation coefficient
def feval_pearsonr(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'pearsonr', stats.pearsonr(y_true, y_pred)[0], True


# Daily spread return
def calc_spread_return_per_day(df, portfolio_size=200, toprank_weight_ratio=2):
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short


# Sharpe ratio of the daily spread returns
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size=200, toprank_weight_ratio=2):
    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio


# Rank within each day
def add_rank(df):
    df["Rank"] = df.groupby("Date")["Target"].rank(ascending=False, method="first") - 1
    df["Rank"] = df["Rank"].astype("int")
    return df


# Replace NaN and inf with 0
def fill_nan_inf(df):
    df = df.fillna(0)
    df = df.replace([np.inf, -np.inf], 0)
    return df


# Compute the score
def check_score(df, preds, Securities_filter=[]):
    tmp_preds = df[['Date', 'SecuritiesCode']].copy()
    tmp_preds['Target'] = preds

    # Rank filter: compute each date's median and assign it to the filtered securities
    tmp_preds['target_mean'] = tmp_preds.groupby("Date")["Target"].transform('median')
    tmp_preds.loc[tmp_preds['SecuritiesCode'].isin(Securities_filter), 'Target'] = tmp_preds['target_mean']

    tmp_preds = add_rank(tmp_preds)
    df['Rank'] = tmp_preds['Rank']
    score = round(calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2), 5)
    score_mean = round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).mean(), 5)
    score_std = round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).std(), 5)
    print(f'Competition_Score:{score}, rank_score_mean:{score_mean}, rank_score_std:{score_std}')
train = add_features(train)
train = fill_nan_inf(train)
# Maximum Target per stock
SecuritiesCode_target_max = train.groupby('SecuritiesCode')['Target'].max()
# Minimum Target per stock
SecuritiesCode_target_min = train.groupby('SecuritiesCode')['Target'].min()

# The 1,000 stocks with the smallest Target spread
list_spred_h = list((SecuritiesCode_target_max - SecuritiesCode_target_min).sort_values()[:1000].index)

# The remaining stocks, with the largest Target spread
list_spred_l = list((SecuritiesCode_target_max - SecuritiesCode_target_min).sort_values()[1000:].index)
features = ['High', 'Low', 'Open', 'Close', 'Volume', 'return_1month', 'return_2month', 'return_3month',
            'volatility_1month', 'volatility_2month', 'volatility_3month',
            'MA_gap_1month', 'MA_gap_2month', 'MA_gap_3month']

# The smallest-spread stocks form the training set
tr_dataset = lgb.Dataset(train[train['SecuritiesCode'].isin(list_spred_h)][features],
                         train[train['SecuritiesCode'].isin(list_spred_h)]["Target"], feature_name=features)

# The largest-spread stocks form the validation set
vl_dataset = lgb.Dataset(train[train['SecuritiesCode'].isin(list_spred_l)][features],
                         train[train['SecuritiesCode'].isin(list_spred_l)]["Target"], feature_name=features)

# LightGBM parameters
params_lgb = {'learning_rate': 0.005, 'metric': 'None', 'objective': 'regression', 'boosting': 'gbdt', 'verbosity': 0,
              'n_jobs': -1, 'force_col_wise': True}

# Train the model
model = lgb.train(params=params_lgb,
                  train_set=tr_dataset,
                  valid_sets=[tr_dataset, vl_dataset],
                  num_boost_round=3000,
                  feval=feval_pearsonr,
                  callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=True), lgb.log_evaluation(period=100)])
# Test on the supplemental period
test = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv", parse_dates=["Date"])
test = test.drop(columns=['RowId', 'ExpectedDividend', 'AdjustmentFactor', 'SupervisionFlag'])
test = add_features(test)
test = fill_nan_inf(test)
preds = model.predict(test[features])
print(math.sqrt(mean_squared_error(preds, test.Target)))

check_score(test, preds)
check_score(test, preds, list_spred_h)
check_score(test, preds, list_spred_l)
sample_submission = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/example_test_files/sample_submission.csv")

env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = add_features(prices)
    prices['Target'] = model.predict(fill_nan_inf(prices)[features])
    prices['target_mean'] = prices.groupby("Date")["Target"].transform('median')
    prices.loc[prices['SecuritiesCode'].isin(list_spred_h), 'Target'] = prices['target_mean']
    prices = add_rank(prices)
    sample_prediction['Rank'] = prices['Rank']
    env.predict(sample_prediction)

Reproduction

The 2nd-place solution reproduces: 0.356.

The code also sets a random seed, 42; after changing it to 0, I still got 0.356.

Commentary

Model

LightGBM.

The author gave two reasons for choosing LightGBM:

  1. LGBMRanker and XGBoost were tried first, but both validated poorly, so the author settled on LightGBM.
  2. The author had traded cryptocurrencies before and, from that experience, expected LightGBM to beat the other models.

Missing and Abnormal Values

Missing values, along with the two abnormal values np.inf and -np.inf, are all replaced with 0.

Derived Features

The author derived the following features from the closing price Close:

  • change vs. 20 trading days ago
  • change vs. 40 trading days ago
  • change vs. 60 trading days ago
  • volatility over the past 20 trading days
  • volatility over the past 40 trading days
  • volatility over the past 60 trading days
  • gap vs. the 20-day moving average
  • gap vs. the 40-day moving average
  • gap vs. the 60-day moving average
# Change vs. 20 trading days ago
feats["return_1month"] = feats["Close"].pct_change(20)
# Change vs. 40 trading days ago
feats["return_2month"] = feats["Close"].pct_change(40)
# Change vs. 60 trading days ago
feats["return_3month"] = feats["Close"].pct_change(60)
# Volatility over the past 20 trading days
feats["volatility_1month"] = (
    np.log(feats["Close"]).diff().rolling(20).std()
)
# Volatility over the past 40 trading days
feats["volatility_2month"] = (
    np.log(feats["Close"]).diff().rolling(40).std()
)
# Volatility over the past 60 trading days
feats["volatility_3month"] = (
    np.log(feats["Close"]).diff().rolling(60).std()
)
# Gap vs. the 20-day moving average
feats["MA_gap_1month"] = feats["Close"] / (
    feats["Close"].rolling(20).mean()
)
# Gap vs. the 40-day moving average
feats["MA_gap_2month"] = feats["Close"] / (
    feats["Close"].rolling(40).mean()
)
# Gap vs. the 60-day moving average
feats["MA_gap_3month"] = feats["Close"] / (
    feats["Close"].rolling(60).mean()
)

Train/Validation Split

The author split train and validation neither chronologically nor randomly, but by Target.

First, for each stock, the author computed the spread between its maximum and minimum Target over 2017-01-04 to 2021-12-03, then:

  • training set: the 1,000 stocks with the smallest Target spread (list_spred_h)
  • validation sets: the training set itself, plus the roughly 1,000 stocks with the largest Target spread (list_spred_l)

Finally, the data from 2021-12-06 to 2022-06-24 were used for testing.

Those test-period data were never added back into the training set for a refit; the final model is trained only on the smallest-spread 1,000 stocks over 2017-01-04 to 2021-12-03.

Submission

A Clever Trick

The submission stage contains one clever design; note the following two lines:

prices['target_mean'] = prices.groupby("Date")["Target"].transform('median')
prices.loc[prices['SecuritiesCode'].isin(list_spred_h), 'Target'] = prices['target_mean']

The author overwrites the Target of every stock in list_spred_h (the 1,000 stocks with the smallest Target spread) with the day's median prediction.

Possible Reason

Neither the code nor the document the author submitted to the exchange explains this step. The 10th-place solution, however, makes the relevant observation: the competition is scored by the Sharpe ratio, so maximizing it means maximizing the mean of the daily spread return while minimizing its volatility.

Pinning a fixed group of stocks (here list_spred_h) to the day's median prediction parks them in the middle of the ranking, so they are essentially never selected for the long or the short portfolio. Fewer extreme bets means a smaller day-to-day variance of the spread return, which can lift the Sharpe ratio.

I suspect this is why the author did it.
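To make the mechanism concrete, a toy example with made-up codes and predictions:

import pandas as pd

preds = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333, 1376, 1377],
    "Target": [0.09, -0.07, 0.02, 0.05, -0.03],
})
filtered = [1301, 1332]            # stand-in for list_spred_h
median = preds["Target"].median()  # 0.02
preds.loc[preds["SecuritiesCode"].isin(filtered), "Target"] = median
# 1301 and 1332 now sit exactly at the median, so after ranking they land in
# the middle of the book and are never picked for the long or the short side.
print(preds)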

Points of Confusion

This solution's feature engineering raises the same doubts.

  1. The rolling features (moving averages and so on) should be computed per stock, but the author computes them over the whole concatenated frame; see the sketch after the code below.
  2. In the submission loop, prices holds only a single day's data, so computing moving averages over it is meaningless.
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = add_features(prices)
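On the training side, a minimal sketch of the per-stock variant of the first derived feature, assuming the same train frame and column names as the author's code:

# Compute the 20-day return within each stock, so values never straddle two securities
train = train.sort_values(["SecuritiesCode", "Date"])
train["return_1month"] = train.groupby("SecuritiesCode")["Close"].pct_change(20)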

Verifying the Guess

My guess was that overwriting the Target of list_spred_h (the smallest-spread 1,000 stocks) with the median is what does the work.

To verify it, I resubmitted without the overwrite: the score drops to 0.231, well below the 0.356 obtained with it.

3rd Place

Code

import numpy as np
import pandas as pd
import jpx_tokyo_market_prediction
from sklearn.tree import DecisionTreeRegressor
from tqdm.notebook import tqdm
path = "../input/jpx-tokyo-stock-exchange-prediction/"
train_stock_prices = pd.read_csv(f"{path}train_files/stock_prices.csv")
train_stock_prices = train_stock_prices[~train_stock_prices["Target"].isnull()]
supplemental_stock_prices = pd.read_csv(f"{path}supplemental_files/stock_prices.csv")
df_prices = pd.concat([supplemental_stock_prices, train_stock_prices])
df_prices = df_prices[df_prices.Date >= "2021-10-01"]
def fill_nans(prices):
    prices.set_index(["SecuritiesCode", "Date"], inplace=True)
    prices.ExpectedDividend.fillna(0, inplace=True)
    prices.ffill(inplace=True)
    prices.fillna(0, inplace=True)
    prices.reset_index(inplace=True)
    return prices
def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    weights_mean = weights.mean()
    df = df.sort_values(by='Rank')
    purchase = (df['Target'][:portfolio_size] * weights).sum() / weights_mean
    short = (df['Target'][-portfolio_size:] * weights[::-1]).sum() / weights_mean
    return purchase - short

def calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2):
    grp = df.groupby('Date')
    min_size = grp["Target"].count().min()
    if min_size < 2 * portfolio_size:
        portfolio_size = min_size // 2
        if portfolio_size < 1:
            return 0, None
    buf = grp.apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf
def add_rank(df, col_name="pred"):
    df["Rank"] = df.groupby("Date")[col_name].rank(ascending=False, method="first") - 1
    df["Rank"] = df["Rank"].astype("int")
    return df
def predictor(feature_df):
    return model.predict(feature_df[feats])
df_prices = fill_nans(df_prices)
supplemental_stock_prices = fill_nans(supplemental_stock_prices)
np.random.seed(0)
feats = ['Open', 'High', 'Low', 'Close']
max_score = 0
max_depth = 0
for md in tqdm(range(3, 40)):
    model = DecisionTreeRegressor(max_depth=md)
    model.fit(df_prices[feats], df_prices["Target"])
    supplemental_stock_prices["pred"] = predictor(supplemental_stock_prices)
    score, buf = calc_spread_return_sharpe(add_rank(supplemental_stock_prices))
    if score > max_score:
        max_score = score
        max_depth = md
model = DecisionTreeRegressor(max_depth=max_depth)
model.fit(df_prices[feats], df_prices["Target"])
print(f'Max_depth={max_depth} : Sharpe Ratio Score base -> {max_score}')
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    prices = fill_nans(prices)
    prices.loc[:, "pred"] = predictor(prices)
    prices = add_rank(prices)
    rank = prices.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(rank)
    env.predict(sample_prediction)

Reproduction

The 3rd-place solution reproduces: 0.352.

The code also sets a random seed, 0; changing it to 1 unexpectedly produced an even higher score, 0.406.

Commentary

Model

A decision-tree regressor.

Missing-Value Handling

Missing values are handled in three steps:

  1. ExpectedDividend is filled with 0.
  2. All other fields are forward-filled.
  3. Anything still missing is filled with 0.
def fill_nans(prices):
    prices.set_index(["SecuritiesCode", "Date"], inplace=True)
    prices.ExpectedDividend.fillna(0, inplace=True)
    prices.ffill(inplace=True)
    prices.fillna(0, inplace=True)
    prices.reset_index(inplace=True)
    return prices

df_prices = fill_nans(df_prices)

Derived Features

The solution uses only the four basic price features (Open, High, Low, Close) and derives nothing else.

Training Data

Only data from 2021-10-01 onward are used for training.

Training on a recent subset of the data is common practice; the apology letters of some private funds offer a glimpse of why.

[Figure: apology letter from Minghong Investment (明泓投资)]

Model Training

The decision tree's key hyperparameter here is its maximum depth, and the author searched for the best value:

np.random.seed(0)
feats = ['Open', 'High', 'Low', 'Close']
max_score = 0
max_depth = 0
for md in tqdm(range(3, 40)):
    model = DecisionTreeRegressor(max_depth=md)
    model.fit(df_prices[feats], df_prices["Target"])
    supplemental_stock_prices["pred"] = predictor(supplemental_stock_prices)
    score, buf = calc_spread_return_sharpe(add_rank(supplemental_stock_prices))
    if score > max_score:
        max_score = score
        max_depth = md

Points of Confusion

As with the 1st-place solution, the author should have grouped by stock and sorted by date before forward-filling; as written, fill_nans can carry the last prices of one stock into the first rows of the next. A sketch of a per-stock variant follows.
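A minimal per-stock variant of fill_nans (a sketch, not the author's code):

def fill_nans_grouped(prices):
    # Fill within each stock only, in date order
    prices = prices.sort_values(["SecuritiesCode", "Date"])
    prices["ExpectedDividend"] = prices["ExpectedDividend"].fillna(0)
    cols = ["Open", "High", "Low", "Close", "Volume"]
    prices[cols] = prices.groupby("SecuritiesCode")[cols].ffill()
    return prices.fillna(0)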

6th Place

Code

import numpy as np
import pandas as pd
from lightgbm import Booster, LGBMRegressor
from tqdm import tqdm
from decimal import ROUND_HALF_UP, Decimal

import jpx_tokyo_market_prediction
base_dir = "../input/jpx-tokyo-stock-exchange-prediction"

train_files_dir = f"{base_dir}/train_files"
supplemental_files_dir = f"{base_dir}/supplemental_files"

df_price_train = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
df_price_supplemental = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")

df_price = pd.concat([df_price_train, df_price_supplemental])
TRAIN_END = "2019-12-31"
TEST_START = "2020-01-06"
# Adjust the closing price for splits
def generate_adjusted_close(df):
    df = df.sort_values("Date", ascending=False)
    df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
    df.loc[:, "AdjustedClose"] = (
        df["CumulativeAdjustmentFactor"] * df["Close"]
    ).map(lambda x: float(
        Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
    ))
    df = df.sort_values("Date")
    df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
    df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
    return df

# Apply the adjustment per stock; calls generate_adjusted_close
def adjust_price(price):
    price = price.copy()
    price.loc[:, "Date"] = pd.to_datetime(price.loc[:, "Date"], format="%Y-%m-%d")

    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)

    price.set_index("Date", inplace=True)
    return price
df_price = adjust_price(df_price)
codes = sorted(df_price["SecuritiesCode"].unique())
# Derive features
def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code, ["SecuritiesCode", close_col]].copy()
    # Difference from the previous day
    feats["close_diff1"] = feats[close_col].diff(1)

    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)

    return feats
buff = []
for code in tqdm(codes):
    feat = get_features(df_price, code)
    buff.append(feat)
feature = pd.concat(buff)
def get_label(price, code):
    df = price.loc[price["SecuritiesCode"] == code].copy()
    df.loc[:, "label"] = df["Target"]
    return df.loc[:, ["SecuritiesCode", "label"]]


def get_features_and_label(price, codes, features):
    trains_X, tests_X = [], []
    trains_y, tests_y = [], []

    for code in tqdm(codes):
        feats = features[features["SecuritiesCode"] == code].dropna()
        labels = get_label(price, code).dropna()

        if feats.shape[0] > 0 and labels.shape[0] > 0:
            labels = labels.loc[labels.index.isin(feats.index)]
            feats = feats.loc[feats.index.isin(labels.index)]

            assert (labels.loc[:, "SecuritiesCode"] == feats.loc[:, "SecuritiesCode"]).all()
            labels = labels.loc[:, "label"]

            _train_X = feats[: TRAIN_END]
            _test_X = feats[TEST_START:]

            _train_y = labels[: TRAIN_END]
            _test_y = labels[TEST_START:]

            assert len(_train_X) == len(_train_y)
            assert len(_test_X) == len(_test_y)

            trains_X.append(_train_X)
            tests_X.append(_test_X)

            trains_y.append(_train_y)
            tests_y.append(_test_y)

    train_X = pd.concat(trains_X)
    test_X = pd.concat(tests_X)

    train_y = pd.concat(trains_y)
    test_y = pd.concat(tests_y)

    return train_X, train_y, test_X, test_y
train_X, train_y, test_X, test_y = get_features_and_label(df_price, codes, feature)
lgbm_params = {
    'seed': 42,
    'n_jobs': -1,
}

feat_cols = [
    "close_diff1",
]

pred_model = LGBMRegressor(**lgbm_params)
pred_model.fit(train_X[feat_cols].values, train_y)
df_price_train_raw = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
price_cols = ["Date", "SecuritiesCode", "Close", "AdjustmentFactor",]
df_price_train_raw = df_price_train_raw[price_cols]
df_price_train_raw = df_price_train_raw.loc[df_price_train_raw["Date"] >= "2021-08-01"]
df_price_supplemental_raw = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")
df_price_supplemental_raw = df_price_supplemental_raw[price_cols]
df_price_raw = pd.concat([df_price_train_raw, df_price_supplemental_raw])
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
counter = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    current_date = prices["Date"].iloc[0]
    sample_prediction_date = sample_prediction["Date"].iloc[0]
    print(f"current_date: {current_date}, sample_prediction_date: {sample_prediction_date}")

    if counter == 0:
        df_price_raw = df_price_raw.loc[df_price_raw["Date"] < current_date]

    threshold = (pd.Timestamp(current_date) - pd.offsets.BDay(80)).strftime("%Y-%m-%d")
    print(f"threshold: {threshold}")
    df_price_raw = df_price_raw.loc[df_price_raw["Date"] >= threshold]

    df_price_raw = pd.concat([df_price_raw, prices[price_cols]])
    df_price = adjust_price(df_price_raw)

    codes = sorted(prices["SecuritiesCode"].unique())

    feature = pd.concat([get_features(df_price, code) for code in codes])
    feature = feature.loc[feature.index == current_date]

    feature.loc[:, "predict"] = pred_model.predict(feature[feat_cols])

    feature = feature.sort_values("predict", ascending=False).drop_duplicates(subset=['SecuritiesCode'])
    feature.loc[:, "Rank"] = np.arange(len(feature))
    feature_map = feature.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(feature_map)

    assert sample_prediction["Rank"].notna().all()
    assert sample_prediction["Rank"].min() == 0
    assert sample_prediction["Rank"].max() == len(sample_prediction["Rank"]) - 1
    counter += 1

    env.predict(sample_prediction)

Reproduction

The 6th-place solution reproduces: 0.308.

Commentary

Model

LightGBM.

Missing-Value Handling

Prices are grouped by stock, sorted by date, and forward-filled.

# Adjust the closing price for splits
def generate_adjusted_close(df):
    df = df.sort_values("Date", ascending=False)
    df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
    df.loc[:, "AdjustedClose"] = (
        df["CumulativeAdjustmentFactor"] * df["Close"]
    ).map(lambda x: float(
        Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
    ))
    df = df.sort_values("Date")
    df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
    df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
    return df

Note that the 1st-, 2nd-, and 3rd-place solutions did not adjust prices at all.
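To see what the adjustment achieves, here is a toy example (made-up dates, prices, and a 1:2 split) run through generate_adjusted_close:

import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06"]),
    "AdjustmentFactor": [1.0, 0.5, 1.0],  # the 0.5 marks the 1:2 split
    "Close": [1000.0, 1010.0, 505.0],
})
print(generate_adjusted_close(toy)[["Date", "AdjustedClose"]])
# AdjustedClose becomes 500.0, 505.0, 505.0: the pre-split closes are scaled
# onto the post-split level, so close_diff1 no longer sees a spurious jump.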

Derived Features

Only one feature is derived: the difference between the adjusted Close and its value on the previous trading day.

Sample code:

def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code, ["SecuritiesCode", close_col]].copy()
    # Difference from the previous day
    feats["close_diff1"] = feats[close_col].diff(1)

    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)

    return feats

Training Data

Time range

Unlike the 3rd-place solution, the 6th-place solution trains only on data up to 2019-12-31.

Possible Reason

Neither the code nor the document submitted to the exchange explains this choice.

My guess is that the author treated early 2020, when COVID-19 threw financial markets into turmoil, as anomalous data to be excluded.

Or it was a macro judgment: use 2017-01-04 through 2019-12-31 as training data to predict the 2022-07-06 to 2022-10-07 market.

No Confusion

I have no quibbles with this solution; the missing-value handling, feature derivation, and model training are all easy to follow.

7th Place

Code

import os
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
import torch
from typing import Tuple

import jpx_tokyo_market_prediction
def data_pipeline(dir_path: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    stock_prices_train = pd.read_csv(os.path.join(dir_path, "train_files/stock_prices.csv"))
    stock_prices_train = stock_prices_train.drop(["ExpectedDividend", "RowId"], axis=1)
    stock_prices_train = stock_prices_train.fillna(0)

    stock_prices_supplemental = pd.read_csv(os.path.join(dir_path, "supplemental_files/stock_prices.csv"))
    stock_prices_supplemental = stock_prices_supplemental.drop(["ExpectedDividend", "RowId"], axis=1)
    stock_prices_supplemental = stock_prices_supplemental.fillna(0)

    stock_list = pd.read_csv(os.path.join(dir_path, "stock_list.csv"))
    target_stock_list = stock_list[stock_list["Universe0"]]
    sec_info = target_stock_list[["SecuritiesCode", "33SectorName", "17SectorName"]]

    stock_prices_train = pd.merge(stock_prices_train, sec_info, on="SecuritiesCode")
    stock_prices_train["33SectorName"] = stock_prices_train["33SectorName"].astype("category")
    stock_prices_train["17SectorName"] = stock_prices_train["17SectorName"].astype("category")

    stock_prices_supplemental = pd.merge(stock_prices_supplemental, sec_info, on="SecuritiesCode")
    stock_prices_supplemental["33SectorName"] = stock_prices_supplemental["33SectorName"].astype("category")
    stock_prices_supplemental["17SectorName"] = stock_prices_supplemental["17SectorName"].astype("category")

    stock_prices_train.update(stock_prices_train.groupby("SecuritiesCode")["Target"].ffill().fillna(0))
    stock_prices_supplemental.update(stock_prices_supplemental.groupby("SecuritiesCode")["Target"].ffill().fillna(0))

    stock_prices_train["SupervisionFlag"] = stock_prices_train["SupervisionFlag"].map({True: 1, False: 0})
    stock_prices_supplemental["SupervisionFlag"] = stock_prices_supplemental["SupervisionFlag"].map({True: 1, False: 0})

    time_config = {"train_split_date": "2020-12-23"}
    stock_prices_train = stock_prices_train[stock_prices_train.Date >= time_config["train_split_date"]]

    return stock_prices_train, stock_prices_supplemental, sec_info
train, supplemental, sec_info = data_pipeline("../input/jpx-tokyo-stock-exchange-prediction")
train = pd.concat([train, supplemental])
class LGBMHierarchModel():
    def __init__(self, device=None, seed=69):
        self.seed = seed
        self._best_found_params = {
            "num_leaves": 17,
            "learning_rate": 0.014,
            "n_estimators": 700,
            "max_depth": -1,
        }
        self.models = {}

    def train(self, train: pd.DataFrame, use_params=False):
        for name, group in train.groupby("33SectorName"):
            y = group["Target"].to_numpy()
            X = group.drop(["Target"], axis=1)
            X = X.drop(["Date", "SecuritiesCode"], axis=1)
            model = LGBMRegressor(**self._best_found_params)
            model.fit(X, y, verbose=False)
            self.models[name] = model

    def predict(self, test: pd.DataFrame):
        y_preds = []
        for name, group in test.groupby("33SectorName"):
            sec_codes = group["SecuritiesCode"]
            X_test = group.drop(["Date", "SecuritiesCode"], axis=1)
            y_pred = self.models[name].predict(X_test)
            y_preds.extend(list(zip(sec_codes, y_pred)))
        df = pd.DataFrame(y_preds, columns=["codes", "pred"])
        return df.sort_values("codes", ascending=True)["pred"].values
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LGBMHierarchModel(device=device, seed=69)
model.train(train.copy(), use_params=True)
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (df_test, options, financials, trades, secondary_prices, df_pred) in iter_test:
    x_test = df_test.drop(["ExpectedDividend", "RowId"], axis=1)
    x_test = x_test.fillna(0)

    x_test = pd.merge(x_test, sec_info, on="SecuritiesCode")
    x_test["33SectorName"] = x_test["33SectorName"].astype("category")
    x_test["17SectorName"] = x_test["17SectorName"].astype("category")

    x_test["SupervisionFlag"] = x_test["SupervisionFlag"].map({True: 1, False: 0})

    y_pred = model.predict(x_test)
    df_pred['Target'] = y_pred
    df_pred = df_pred.sort_values(by="Target", ascending=False)
    df_pred['Rank'] = np.arange(len(df_pred.index))
    df_pred = df_pred.sort_values(by="SecuritiesCode", ascending=True)
    df_pred = df_pred.drop(["Target"], axis=1)
    submission = df_pred[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)

Reproduction

The 7th-place solution reproduces: 0.301.

Commentary

Model

LightGBM.

More precisely, 33 LightGBM models.
The author found a sector effect in the Tokyo market: stocks in the same sector tend to behave similarly.
So a separate LightGBM regressor was trained for each of the 33 sectors (33SectorName).

Missing-Value Handling

Missing values are filled with 0 (Target is first forward-filled within each stock, then any remainder is set to 0).

Derived Features

None.

Training Data

Only data from 2020-12-23 onward are used for training.

No Confusion

Again, nothing puzzles me here; the missing-value handling, feature handling, and model training are all easy to follow.

8th Place

Code

Features.py

from enum import Enum
import numpy as np


class FeatureType(Enum):
    GLOBAL = 0
    LOCAL = 1


class Feature:

    def __init__(self, feature_type, name):
        self.feature_type = feature_type
        self.name = name

    def add_feature_pandas(self, df):
        raise NotImplementedError

    def update_row(self, row):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError

    def copy(self):
        raise NotImplementedError


class SMA(Feature):

    def __init__(self, col, period):
        super().__init__(FeatureType.LOCAL, col + "SMA" + str(period))
        self.col = col
        self.period = period
        self.elements = [np.nan] * period
        self.ptr = 0
        self.mean = np.nan

    def add_feature_pandas(self, df):
        df[self.name] = df[self.col].rolling(self.period).mean()
        for k in range(self.period):
            df[self.name].iloc[k] = np.mean(df[self.col].iloc[:k+1])
        return df

    def update_row(self, row):

        dequeue = self.elements[self.ptr % self.period]
        enqueue = row[self.col]
        self.elements[self.ptr % self.period] = enqueue
        self.ptr += 1
        mean = 0
        if self.ptr < self.period:  # We have not yet seen enough elements
            mean = np.mean(self.elements[:self.ptr])
        elif self.ptr == self.period:  # We have the value for the first time
            self.mean = np.mean(self.elements)
            mean = self.mean
        else:
            mean = self.mean + (- dequeue + enqueue) / self.period  # Simple and efficient updates
            self.mean = mean

        row[self.name] = mean
        return row

    def reset(self):
        self.elements = [np.nan] * self.period
        self.ptr = 0
        self.mean = np.nan

    def copy(self):
        return SMA(self.col, self.period)


class Amplitude(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "Amplitude")

    def add_feature_pandas(self, df):
        df[self.name] = df["High"] - df["Low"]
        return df

    def update_row(self, row):
        row[self.name] = row["High"] - row["Low"]
        return row

    def reset(self):
        return

    def copy(self):
        return Amplitude()


class OpenCloseReturn(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "OpenCloseReturn")

    def add_feature_pandas(self, df):
        df[self.name] = (df["Close"] - df["Open"]) / df["Open"]
        return df

    def update_row(self, row):
        row[self.name] = (row["Close"] - row["Open"]) / row["Open"]
        return row

    def reset(self):
        return

    def copy(self):
        return OpenCloseReturn()


class Return(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "Return")
        self.last_close = None

    def add_feature_pandas(self, df):
        df[self.name] = ((df["Close"] - df["Close"].shift(1)) / df["Close"].shift(1)).fillna(0)
        return df

    def update_row(self, row):
        if self.last_close is None:
            row[self.name] = 0
        else:
            row[self.name] = (row["Close"] - self.last_close) / self.last_close
        self.last_close = row["Close"]
        return row

    def reset(self):
        self.last_close = None

    def copy(self):
        return Return()


class Volatility(Feature):

    def __init__(self, n=30):
        super().__init__(FeatureType.LOCAL, "Volatility" + str(n))
        self.n = n
        self.returns = np.ones(n) * np.nan
        self.ptr = 0
        self.index = 0

    def volatility_row_function(self, df, row):
        l = max(0, self.index + 1 - self.n)
        r = self.index + 1
        self.index += 1
        return np.std(df["Return"].to_numpy()[l:r], ddof=1)

    def add_feature_pandas(self, df):
        df[self.name] = df.apply(lambda row: self.volatility_row_function(df, row), axis=1)
        return df.fillna(0)

    def update_row(self, row):
        self.returns[self.ptr % self.n] = row["Return"]

        if self.ptr == 0:
            row[self.name] = 0
        elif self.ptr < self.n - 1:
            vec = self.returns[:(self.ptr + 1) % self.n]
            row[self.name] = np.std(vec, ddof=1)
        else:
            row[self.name] = np.std(self.returns, ddof=1)

        self.ptr += 1
        return row

    def reset(self):
        self.returns = np.ones(self.n) * np.nan
        self.ptr = 0
        self.index = 0

    def copy(self):
        return Volatility(self.n)


class FeatureChecker:
    def verify(feature, df_prices):
        df_pandas = feature.add_feature_pandas(df_prices)
        df_online = df_prices.apply(lambda row: feature.update_row(row), axis=1)

        return np.isclose(df_pandas[feature.name].to_numpy(),
                          df_online[feature.name].to_numpy(), equal_nan=True).all()

    def verify_features(features, df_prices):
        verifications = []
        all_verified = True
        for feature in features:
            this = FeatureChecker.verify(feature, df_prices)
            verifications.append((feature.name, this))
            all_verified = all_verified and this

        if all_verified:
            print("All features passed the check.")
        else:
            print("Some features failed the check.")

        print(verifications)

        return verifications

Preprocessing.py

import numpy as np
import pandas as pd


class StockDataPreprocessor:

    def fill_nans(df_stocks):
        # Dividend NaNs mean zero.
        df_stocks["ExpectedDividend"] = df_stocks["ExpectedDividend"].fillna(0)
        subdfs = []
        for stock_id, subdf in df_stocks.groupby("SecuritiesCode"):
            for i in range(len(subdf)):
                if not np.isnan(subdf.iloc[i]["Open"]):
                    break
            subdf = subdf.iloc[i:]
            subdf = subdf.fillna(method="ffill")
            subdfs.append(subdf)

            if i != 0:
                print(f"Stock id {stock_id} dropping {i} rows")

        new_df_stocks = pd.concat(subdfs).sort_index().reset_index(drop=True)

        print(f"Number of rows dropped is {len(df_stocks) - len(new_df_stocks)}")
        return new_df_stocks

    def add_cum_adj_factor(df_stocks):
        cum_adj_list = []
        for stock_id, subdf in df_stocks.groupby("SecuritiesCode"):
            cum_adj = subdf["AdjustmentFactor"].cumprod().shift(1, fill_value=1)
            cum_adj_list.append(cum_adj)

        df_stocks["CumAdjFactor"] = pd.concat(cum_adj_list).sort_index().to_numpy()
        return df_stocks

    def adjust_prices_and_volume(df_stocks):
        df_stocks["Open"] = df_stocks["Open"] / df_stocks["CumAdjFactor"]
        df_stocks["High"] = df_stocks["High"] / df_stocks["CumAdjFactor"]
        df_stocks["Low"] = df_stocks["Low"] / df_stocks["CumAdjFactor"]
        df_stocks["Close"] = df_stocks["Close"] / df_stocks["CumAdjFactor"]
        df_stocks["Volume"] = df_stocks["Volume"] * df_stocks["CumAdjFactor"]
        return df_stocks

    def preprocess_for_training(df_stocks):
        df_stocks = StockDataPreprocessor.fill_nans(df_stocks)
        df_stocks = StockDataPreprocessor.add_cum_adj_factor(df_stocks)
        df_stocks = StockDataPreprocessor.adjust_prices_and_volume(df_stocks)
        return df_stocks

Trackers.py

from enum import Enum

import numpy as np
import pandas as pd

from Preprocessing import StockDataPreprocessor
from Features import FeatureType


class StockStatusCheck(Enum):
    OK = 0
    ISOLATED_NAN = 1
    INIT_NAN = 2
    ERROR = 3


class StockTracker:

    def __init__(self, stock_id):
        self.stock_id = stock_id
        self.cum_adj_factor = 1.0
        self.last_open = None  # These are unadjusted
        self.last_high = None
        self.last_low = None
        self.last_close = None

    def check_price_data(self, row):

        assert row["SecuritiesCode"] == self.stock_id, "SecuritiesCode does not match the tracked stock"

        status_code = StockStatusCheck.OK

        if np.isnan(row["Open"]) and np.isnan(row["High"]) and np.isnan(row["Low"]) and np.isnan(row["Close"]):
            if self.last_open is None:
                status_code = StockStatusCheck.INIT_NAN
            else:
                row["Open"] = self.last_open
                row["Close"] = self.last_close
                row["Low"] = self.last_low
                row["High"] = self.last_high
                status_code = StockStatusCheck.ISOLATED_NAN
        else:
            date = row["Date"]
            if np.isnan(row["Open"]):
                print(f"Warning: OPEN on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["High"]):
                print(f"Warning: HIGH on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["Low"]):
                print(f"Warning: LOW on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["Close"]):
                print(f"Warning: CLOSE on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR

        if status_code == StockStatusCheck.OK:
            self.last_open = row["Open"]
            self.last_high = row["High"]
            self.last_low = row["Low"]
            self.last_close = row["Close"]

        return row, status_code

    def adjust_prices(self, row):
        row[["Open", "High", "Low", "Close"]] /= self.cum_adj_factor
        row["Volume"] *= self.cum_adj_factor
        return row

    def update(self, row):
        row, status_code = self.check_price_data(row)
        is_okay = True
        if status_code == StockStatusCheck.INIT_NAN or status_code == StockStatusCheck.ERROR:
            is_okay = False
        if np.isnan(row["ExpectedDividend"]):
            row["ExpectedDividend"] = 0.0
        row = self.adjust_prices(row)
        self.cum_adj_factor *= row["AdjustmentFactor"]

        if is_okay:
            pass

        return row, status_code


class StateTracker:

    def __init__(self, features):
self.stock_ids = [1301, 1332, 1333, 1376, 1377, 1379, 1381, 1407, 1414, 1417, 1419, 1429, 1435, 1515, 1518, 1605, 1662, 1663, 1712, 1716, 1719, 1720, 1721, 1723, 1726, 1762, 1766, 1775, 1780, 1787, 1793, 1799, 1801, 1802, 1803, 1805, 1808, 1810, 1811, 1812, 1813, 1814, 1815, 1820, 1821, 1822, 1833, 1835, 1852, 1860, 1861, 1870, 1871, 1873, 1878, 1879, 1882, 1884, 1885, 1888, 1890, 1893, 1898, 1899, 1911, 1914, 1921, 1925, 1926, 1928, 1929, 1930, 1934, 1938, 1939, 1941, 1942, 1944, 1945, 1946, 1949, 1950, 1951, 1952, 1954, 1959, 1961, 1963, 1965, 1967, 1968, 1969, 1973, 1975, 1976, 1979, 1980, 1981, 1982, 2001, 2002, 2003, 2004, 2009, 2053, 2060, 2108, 2109, 2114, 2117, 2120, 2121, 2124, 2127, 2130, 2146, 2148, 2150, 2153, 2154, 2157, 2158, 2160, 2168, 2170, 2175, 2181, 2183, 2185, 2193, 2198, 2201, 2204, 2206, 2207, 2208, 2209, 2211, 2212, 2217, 2220, 2221, 2222, 2226, 2229, 2264, 2266, 2267, 2268, 2269, 2270, 2281, 2282, 2288, 2292, 2294, 2296, 2301, 2305, 2307, 2309, 2315, 2317, 2325, 2326, 2327, 2329, 2331, 2335, 2337, 2349, 2353, 2359, 2371, 2372, 2374, 2378, 2379, 2384, 2389, 2393, 2395, 2412, 2413, 2418, 2427, 2429, 2432, 2433, 2440, 2445, 2453, 2461, 2462, 2469, 2471, 2475, 2477, 2484, 2489, 2491, 2492, 2497, 2498, 2501, 2502, 2503, 2531, 2533, 2540, 2573, 2579, 2587, 2588, 2590, 2593, 2594, 2602, 2607, 2612, 2613, 2651, 2653, 2659, 2664, 2669, 2670, 2676, 2678, 2681, 2685, 2686, 2692, 2694, 2695, 2698, 2702, 2705, 2715, 2726, 2729, 2730, 2733, 2734, 2737, 2742, 2749, 2751, 2752, 2753, 2760, 2761, 2763, 2767, 2768, 2780, 2782, 2784, 2790, 2791, 2792, 2801, 2802, 2804, 2805, 2806, 2809, 2810, 2811, 2814, 2815, 2819, 2830, 2831, 2871, 2874, 2875, 2882, 2884, 2897, 2899, 2904, 2908, 2910, 2914, 2915, 2918, 2922, 2923, 2925, 2929, 2930, 2931, 3001, 3002, 3003, 3028, 3031, 3034, 3036, 3038, 3040, 3046, 3048, 3050, 3064, 3075, 3076, 3085, 3086, 3087, 3088, 3091, 3092, 3097, 3099, 3101, 3103, 3104, 3105, 3106, 3107, 3110, 3116, 3132, 3134, 3139, 3141, 3148, 3150, 3151, 3153, 3154, 3156, 3157, 3159, 3166, 3167, 3176, 3178, 3179, 3180, 3182, 3183, 3186, 3191, 3193, 3196, 3197, 3198, 3199, 3201, 3221, 3222, 3228, 3231, 3232, 3244, 3245, 3252, 3254, 3264, 3276, 3284, 3288, 3289, 3291, 3302, 3315, 3319, 3328, 3333, 3341, 3349, 3355, 3360, 3361, 3371, 3377, 3382, 3387, 3388, 3391, 3395, 3397, 3401, 3402, 3405, 3407, 3415, 3421, 3431, 3433, 3436, 3443, 3445, 3457, 3458, 3465, 3475, 3539, 3543, 3546, 3547, 3548, 3549, 3553, 3569, 3580, 3591, 3593, 3597, 3608, 3626, 3628, 3632, 3635, 3636, 3648, 3649, 3656, 3657, 3659, 3660, 3662, 3663, 3665, 3668, 3673, 3675, 3676, 3677, 3678, 3679, 3681, 3687, 3688, 3694, 3696, 3697, 3708, 3733, 3738, 3762, 3763, 3765, 3769, 3771, 3772, 3774, 3778, 3784, 3788, 3798, 3800, 3817, 3825, 3834, 3835, 3836, 3837, 3843, 3844, 3853, 3854, 3856, 3857, 3861, 3863, 3865, 3880, 3891, 3900, 3901, 3902, 3903, 3906, 3914, 3915, 3916, 3919, 3921, 3922, 3923, 3925, 3926, 3932, 3934, 3937, 3939, 3941, 3946, 3950, 3951, 3962, 3966, 3969, 4004, 4005, 4008, 4021, 4023, 4025, 4026, 4027, 4028, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4061, 4062, 4063, 4078, 4080, 4082, 4088, 4091, 4092, 4094, 4095, 4097, 4099, 4100, 4107, 4109, 4112, 4113, 4114, 4116, 4118, 4151, 4182, 4183, 4185, 4186, 4187, 4188, 4189, 4202, 4203, 4204, 4205, 4206, 4208, 4212, 4215, 4216, 4218, 4220, 4221, 4228, 4229, 4235, 4238, 4246, 4272, 4275, 4286, 4290, 4293, 4298, 4301, 4307, 4308, 4310, 4312, 4318, 4323, 4324, 4326, 4327, 4337, 4343, 4344, 4345, 4348, 4350, 4362, 4365, 4368, 4369, 4401, 4403, 4410, 4452, 
4461, 4462, 4463, 4464, 4471, 4502, 4503, 4506, 4507, 4516, 4519, 4521, 4523, 4526, 4527, 4528, 4530, 4534, 4536, 4538, 4540, 4541, 4543, 4544, 4547, 4548, 4549, 4550, 4551, 4552, 4553, 4554, 4559, 4563, 4565, 4568, 4569, 4571, 4572, 4574, 4577, 4578, 4579, 4581, 4582, 4584, 4587, 4592, 4593, 4595, 4611, 4612, 4613, 4617, 4619, 4620, 4621, 4626, 4628, 4631, 4633, 4634, 4636, 4641, 4658, 4659, 4661, 4662, 4665, 4666, 4668, 4671, 4674, 4676, 4680, 4681, 4684, 4686, 4687, 4689, 4694, 4699, 4704, 4708, 4709, 4714, 4716, 4718, 4719, 4722, 4725, 4726, 4732, 4733, 4739, 4743, 4745, 4746, 4751, 4755, 4763, 4765, 4767, 4768, 4771, 4772, 4776, 4781, 4792, 4800, 4801, 4809, 4812, 4813, 4816, 4819, 4820, 4825, 4826, 4828, 4832, 4837, 4839, 4847, 4848, 4849, 4901, 4902, 4911, 4912, 4914, 4917, 4919, 4921, 4922, 4923, 4927, 4928, 4951, 4955, 4956, 4958, 4963, 4966, 4967, 4968, 4970, 4971, 4973, 4974, 4975, 4978, 4980, 4985, 4992, 4994, 4996, 4997, 4998, 5008, 5011, 5013, 5015, 5017, 5019, 5020, 5021, 5101, 5105, 5108, 5110, 5121, 5122, 5142, 5161, 5184, 5185, 5186, 5191, 5192, 5195, 5201, 5202, 5208, 5214, 5217, 5218, 5232, 5233, 5261, 5262, 5269, 5273, 5288, 5301, 5302, 5304, 5310, 5331, 5332, 5333, 5334, 5344, 5351, 5352, 5357, 5384, 5388, 5393, 5401, 5406, 5408, 5410, 5411, 5423, 5440, 5444, 5449, 5451, 5463, 5464, 5471, 5480, 5481, 5482, 5486, 5541, 5563, 5602, 5631, 5632, 5659, 5698, 5702, 5703, 5706, 5707, 5711, 5713, 5714, 5715, 5726, 5727, 5741, 5801, 5802, 5803, 5805, 5807, 5809, 5821, 5851, 5857, 5901, 5902, 5909, 5911, 5918, 5929, 5930, 5932, 5933, 5938, 5943, 5945, 5946, 5947, 5949, 5951, 5957, 5959, 5970, 5975, 5976, 5982, 5985, 5988, 5989, 5991, 5992, 5999, 6005, 6013, 6023, 6027, 6028, 6030, 6035, 6036, 6047, 6050, 6055, 6058, 6062, 6067, 6070, 6071, 6073, 6078, 6080, 6082, 6088, 6089, 6094, 6095, 6098, 6099, 6101, 6103, 6104, 6113, 6118, 6125, 6134, 6135, 6136, 6140, 6141, 6143, 6144, 6145, 6146, 6149, 6151, 6157, 6178, 6182, 6183, 6184, 6191, 6194, 6196, 6197, 6199, 6200, 6201, 6222, 6237, 6238, 6240, 6245, 6246, 6247, 6250, 6254, 6257, 6258, 6264, 6266, 6268, 6269, 6272, 6273, 6277, 6278, 6279, 6282, 6284, 6287, 6289, 6293, 6301, 6302, 6305, 6306, 6309, 6310, 6315, 6323, 6324, 6326, 6328, 6330, 6331, 6332, 6333, 6339, 6340, 6345, 6349, 6351, 6357, 6361, 6363, 6364, 6365, 6366, 6367, 6368, 6369, 6370, 6371, 6376, 6378, 6379, 6381, 6382, 6383, 6387, 6395, 6406, 6407, 6409, 6411, 6412, 6413, 6417, 6418, 6419, 6420, 6425, 6430, 6432, 6436, 6440, 6444, 6448, 6454, 6455, 6457, 6458, 6459, 6460, 6462, 6463, 6464, 6465, 6470, 6471, 6472, 6473, 6474, 6479, 6480, 6481, 6482, 6484, 6485, 6486, 6490, 6498, 6501, 6502, 6503, 6504, 6506, 6507, 6508, 6516, 6517, 6532, 6533, 6535, 6538, 6539, 6584, 6586, 6588, 6590, 6592, 6594, 6616, 6617, 6619, 6620, 6622, 6625, 6626, 6627, 6629, 6630, 6632, 6637, 6638, 6640, 6641, 6644, 6645, 6651, 6652, 6668, 6670, 6674, 6676, 6701, 6702, 6703, 6706, 6707, 6718, 6723, 6724, 6727, 6728, 6736, 6737, 6740, 6741, 6742, 6744, 6745, 6750, 6752, 6753, 6754, 6755, 6758, 6762, 6768, 6769, 6770, 6777, 6779, 6787, 6788, 6789, 6794, 6798, 6800, 6804, 6806, 6807, 6809, 6810, 6814, 6815, 6817, 6820, 6823, 6824, 6832, 6834, 6841, 6844, 6845, 6848, 6849, 6850, 6855, 6856, 6857, 6859, 6861, 6866, 6869, 6871, 6875, 6877, 6879, 6881, 6882, 6890, 6902, 6904, 6905, 6908, 6912, 6914, 6915, 6918, 6920, 6923, 6924, 6925, 6929, 6932, 6937, 6938, 6941, 6947, 6951, 6952, 6954, 6955, 6957, 6958, 6960, 6961, 6962, 6963, 6965, 6966, 6967, 6971, 6976, 6981, 6986, 6988, 6994, 6995, 6996, 6997, 
6999, 7003, 7004, 7011, 7012, 7013, 7102, 7105, 7148, 7157, 7164, 7167, 7169, 7172, 7173, 7177, 7180, 7181, 7182, 7184, 7186, 7187, 7189, 7191, 7192, 7201, 7202, 7203, 7205, 7211, 7220, 7222, 7224, 7226, 7229, 7231, 7236, 7238, 7239, 7240, 7241, 7242, 7244, 7245, 7246, 7250, 7254, 7259, 7261, 7267, 7269, 7270, 7272, 7276, 7278, 7279, 7280, 7282, 7283, 7287, 7292, 7294, 7296, 7298, 7309, 7313, 7315, 7408, 7412, 7414, 7419, 7420, 7421, 7433, 7438, 7445, 7447, 7451, 7453, 7456, 7458, 7459, 7463, 7466, 7467, 7475, 7476, 7480, 7482, 7483, 7487, 7500, 7504, 7508, 7510, 7512, 7513, 7516, 7518, 7520, 7522, 7532, 7537, 7545, 7550, 7552, 7554, 7564, 7570, 7575, 7581, 7593, 7595, 7596, 7599, 7600, 7605, 7606, 7607, 7609, 7611, 7613, 7616, 7618, 7621, 7628, 7630, 7636, 7637, 7638, 7649, 7701, 7702, 7705, 7715, 7716, 7717, 7718, 7721, 7723, 7725, 7726, 7729, 7730, 7731, 7732, 7733, 7734, 7735, 7739, 7740, 7741, 7744, 7745, 7747, 7749, 7751, 7752, 7762, 7774, 7775, 7777, 7779, 7780, 7814, 7816, 7817, 7818, 7820, 7821, 7823, 7826, 7832, 7839, 7840, 7844, 7846, 7856, 7860, 7864, 7867, 7868, 7874, 7879, 7893, 7905, 7906, 7911, 7912, 7914, 7915, 7917, 7921, 7925, 7936, 7937, 7942, 7943, 7947, 7949, 7951, 7952, 7955, 7956, 7958, 7962, 7965, 7966, 7970, 7972, 7974, 7976, 7979, 7981, 7984, 7987, 7988, 7989, 7990, 7994, 7995, 8001, 8002, 8005, 8008, 8012, 8014, 8015, 8016, 8018, 8020, 8022, 8031, 8032, 8035, 8037, 8038, 8041, 8043, 8050, 8051, 8052, 8053, 8056, 8057, 8058, 8059, 8060, 8061, 8065, 8066, 8068, 8070, 8074, 8075, 8078, 8079, 8081, 8084, 8086, 8088, 8089, 8093, 8095, 8096, 8097, 8098, 8101, 8103, 8111, 8113, 8114, 8117, 8125, 8129, 8130, 8131, 8132, 8133, 8136, 8137, 8140, 8141, 8150, 8151, 8153, 8154, 8155, 8157, 8158, 8159, 8160, 8163, 8165, 8167, 8168, 8173, 8174, 8179, 8182, 8185, 8194, 8198, 8200, 8202, 8203, 8214, 8217, 8218, 8219, 8227, 8233, 8237, 8242, 8244, 8249, 8252, 8253, 8255, 8267, 8273, 8275, 8276, 8278, 8279, 8281, 8282, 8283, 8285, 8289, 8291, 8303, 8304, 8306, 8308, 8309, 8316, 8331, 8334, 8336, 8337, 8338, 8341, 8343, 8344, 8345, 8346, 8354, 8355, 8358, 8359, 8360, 8361, 8362, 8364, 8366, 8367, 8368, 8369, 8370, 8377, 8381, 8382, 8385, 8386, 8387, 8388, 8392, 8393, 8395, 8399, 8410, 8411, 8418, 8424, 8425, 8439, 8473, 8508, 8511, 8515, 8522, 8524, 8527, 8530, 8541, 8544, 8550, 8558, 8566, 8570, 8572, 8584, 8585, 8591, 8593, 8595, 8596, 8600, 8601, 8604, 8609, 8613, 8616, 8622, 8624, 8628, 8630, 8697, 8698, 8699, 8706, 8707, 8708, 8713, 8714, 8715, 8725, 8739, 8750, 8766, 8771, 8772, 8793, 8795, 8798, 8801, 8802, 8803, 8804, 8806, 8818, 8830, 8841, 8842, 8844, 8848, 8850, 8860, 8864, 8869, 8871, 8876, 8877, 8881, 8890, 8892, 8897, 8905, 8909, 8914, 8917, 8920, 8923, 8925, 8928, 8929, 8934, 8935, 8999, 9001, 9003, 9005, 9006, 9007, 9008, 9009, 9010, 9014, 9020, 9021, 9022, 9024, 9025, 9028, 9031, 9033, 9037, 9039, 9041, 9042, 9044, 9045, 9046, 9048, 9052, 9055, 9057, 9058, 9064, 9065, 9066, 9068, 9069, 9070, 9072, 9075, 9076, 9081, 9083, 9086, 9090, 9099, 9101, 9104, 9107, 9110, 9115, 9119, 9142, 9201, 9202, 9232, 9233, 9301, 9302, 9303, 9304, 9305, 9308, 9310, 9319, 9324, 9364, 9368, 9369, 9375, 9381, 9384, 9386, 9401, 9404, 9405, 9409, 9412, 9413, 9414, 9416, 9418, 9422, 9424, 9432, 9433, 9435, 9436, 9438, 9441, 9449, 9467, 9468, 9470, 9474, 9501, 9502, 9503, 9504, 9505, 9506, 9507, 9508, 9509, 9511, 9513, 9517, 9531, 9532, 9533, 9534, 9535, 9536, 9537, 9543, 9551, 9600, 9601, 9602, 9603, 9605, 9612, 9613, 9616, 9619, 9621, 9622, 9627, 9628, 9629, 9631, 9632, 9639, 9640, 9641, 
9658, 9661, 9663, 9672, 9678, 9682, 9684, 9687, 9692, 9697, 9699, 9706, 9708, 9715, 9716, 9717, 9719, 9722, 9726, 9728, 9729, 9733, 9735, 9739, 9740, 9742, 9743, 9744, 9746, 9749, 9755, 9757, 9759, 9766, 9769, 9783, 9787, 9788, 9790, 9793, 9795, 9810, 9823, 9824, 9828, 9830, 9831, 9832, 9837, 9842, 9843, 9850, 9856, 9861, 9869, 9873, 9880, 9882, 9887, 9889, 9896, 9900, 9902, 9903, 9906, 9919, 9928, 9932, 9934, 9936, 9945, 9946, 9948, 9955, 9956, 9960, 9962, 9974, 9977, 9979, 9982, 9983, 9984, 9987, 9989, 9990, 9991, 9993, 9994, 9997, 9539, 9519, 3558, 6544, 5757, 1413, 3561, 3978, 3983, 3479, 3964, 3563, 3984, 3480, 3990, 3993, 7809, 3482, 3994, 9260, 6556, 3484, 6653, 8919, 9143, 7198, 3540, 4249, 6235, 7199, 9267, 6565, 6566, 6569, 9270, 6571, 9450, 6572, 7322, 4382, 4384, 4385, 9273, 6580, 9274, 4390, 7030, 7806, 7033, 3491, 3496, 7036, 7326, 3612, 5290, 7327, 9278, 9279, 3498, 4423, 7931, 9434, 4425, 6232, 6564, 7047, 7048, 4431, 4433, 1887, 4434, 4435, 4436, 7060, 7061, 2975, 7065, 1431, 4443, 4931, 4446, 7803, 4599, 7679, 4449, 4448, 4475, 7071, 4477, 4880, 4251, 7683, 3449, 4479, 4480, 4481, 4483, 4478, 4482, 4485, 7685, 2980, 4488, 7082, 7085, 7088, 4490, 7089, 4493, 7094, 7095, 7351, 4499, 4051, 4495, 4053, 4054, 4883, 4056, 1375, 2932, 4058, 4933, 7337, 7339, 7354, 2987, 4934, 6612, 7092, 7944, 4165, 4167, 7358, 4168, 7342, 4169]
        self.stock_trackers = {}
        for s_id in self.stock_ids:
            self.stock_trackers[s_id] = StockTracker(s_id)
        self.global_features = []

        self.local_features = {}

        for feature in features:

            if feature.feature_type == FeatureType.GLOBAL:
                self.global_features.append(feature)
            elif feature.feature_type == FeatureType.LOCAL:
                for stock_id in self.stock_ids:
                    if stock_id not in self.local_features:
                        self.local_features[stock_id] = []
                    self.local_features[stock_id].append(feature.copy())

    def prepare_data_for_training(self, df):

        # Preprocessing
        df = StockDataPreprocessor.preprocess_for_training(df)

        train_dfs = []

        for stock_id, subdf in df.groupby("SecuritiesCode"):
            for feature in self.local_features[stock_id]:
                subdf = feature.add_feature_pandas(subdf)
            train_dfs.append(subdf)

        return pd.concat(train_dfs).sort_index()

    def update_single_row(self, row):
        stock_id = row["SecuritiesCode"]
        row, status_code = self.stock_trackers[stock_id].update(row)
        row["StatusCode"] = status_code
        for feature in self.local_features[stock_id]:
            row = feature.update_row(row)
        return row

    def online_update_apply(self, prices):
        return prices.apply(lambda row: self.update_single_row(row), axis=1)

    def online_update(self, prices):
        updated_prices = []
        for row_id, row in prices.iterrows():
            stock_id = row["SecuritiesCode"]
            row, status_code = self.stock_trackers[stock_id].update(row)
            row["StatusCode"] = status_code
            for feature in self.local_features[stock_id]:
                row = feature.update_row(row)
            updated_prices.append(row.to_frame().T)
        return pd.concat(updated_prices)

    def set_priors(self):
        pass

Validation.py

import pandas as pd


def KFoldDataPartition(df, K=5):
    df["Date"] = pd.to_datetime(df["Date"])
    dates = sorted(df["Date"].unique())  # sort just in case

    datasets = []
    indices = [int(i / K * len(dates)) for i in range(K + 1)]
    indices[-1] -= 1

    for train_start in range(len(indices[:-1])):
        for train_end in range(train_start + 1, len(indices) - 1):
            start_date = dates[indices[train_start]]
            end_date = dates[indices[train_end]]
            val_end_date = dates[indices[train_end + 1]]
            df_train = df[(df["Date"] >= start_date) & (df["Date"] <= end_date)]
            df_val = df[(df["Date"] > end_date) & (df["Date"] <= val_end_date)]
            datasets.append((df_train, df_val))

    return datasets

Model Training

import os
import pickle
import shutil
import pandas as pd

import lightgbm as lgbm
if not os.path.exists(r"./Features.py"):
    shutil.copyfile(r"../input/codejpx/Features.py", r"./Features.py")
if not os.path.exists(r"./Preprocessing.py"):
    shutil.copyfile(r"../input/codejpx/Preprocessing.py", r"./Preprocessing.py")
if not os.path.exists(r"./Trackers.py"):
    shutil.copyfile(r"../input/codejpx/Trackers.py", r"./Trackers.py")
if not os.path.exists(r"./Validation.py"):
    shutil.copyfile(r"../input/codejpx/Validation.py", r"./Validation.py")

import Features
from Trackers import StateTracker
features = [Features.Amplitude(), Features.OpenCloseReturn(), Features.Return(),
            Features.Volatility(10), Features.Volatility(30), Features.Volatility(50),
            Features.SMA("Close", 3), Features.SMA("Close", 5), Features.SMA("Close", 10),
            Features.SMA("Close", 30),
            Features.SMA("Return", 3), Features.SMA("Return", 5),
            Features.SMA("Return", 10), Features.SMA("Return", 30)]
st = StateTracker(features)
df_train = pd.read_csv(r'../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
df_train = st.prepare_data_for_training(df_train)
training_cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']

for feature in features:
    training_cols.append(feature.name)

categorical_cols = ["SecuritiesCode", "SupervisionFlag"]
target_col = ["Target"]
model = lgbm.LGBMRegressor()
model.fit(df_train[training_cols], df_train[target_col], categorical_feature=categorical_cols, eval_metric='rmse')
with open("./lgbm.pickle", "wb") as file:
    pickle.dump(model, file)

Submission

import os
import pickle
import shutil
import numpy as np

import jpx_tokyo_market_prediction
if not os.path.exists(r"./Features.py"):
    shutil.copyfile(r"../input/codejpx/Features.py", r"./Features.py")
if not os.path.exists(r"./Preprocessing.py"):
    shutil.copyfile(r"../input/codejpx/Preprocessing.py", r"./Preprocessing.py")
if not os.path.exists(r"./Trackers.py"):
    shutil.copyfile(r"../input/codejpx/Trackers.py", r"./Trackers.py")
if not os.path.exists(r"./Validation.py"):
    shutil.copyfile(r"../input/codejpx/Validation.py", r"./Validation.py")

import Features
from Trackers import StateTracker
features = [Features.Amplitude(), Features.OpenCloseReturn(), Features.Return(),
            Features.Volatility(10), Features.Volatility(30), Features.Volatility(50),
            Features.SMA("Close", 3), Features.SMA("Close", 5), Features.SMA("Close", 10),
            Features.SMA("Close", 30),
            Features.SMA("Return", 3), Features.SMA("Return", 5),
            Features.SMA("Return", 10), Features.SMA("Return", 30)]
st = StateTracker(features)
training_cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend',  'SupervisionFlag']

for feature in features:
    training_cols.append(feature.name)

categorical_cols = ["SecuritiesCode", "SupervisionFlag"]
target_col = ["Target"]
model = None
with open(r"../input/lgbm-model/lgbm.pickle", "rb") as file:
    model = pickle.load(file)

class Algo:

    def __init__(self, model, state_tracker):
        self.model = model
        self.st = state_tracker
        self.cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']

        for feature in self.st.local_features[1301]:
            self.cols.append(feature.name)

    def add_rank(self, df):
        predictions = df["Prediction"]
        ranks = np.arange(2000)
        zipped = list(zip(predictions, ranks))
        zipped.sort(key=lambda x: -x[0])
        sorted_predictions, sorted_ranks = map(list, zip(*zipped))
        # Note: sorted_ranks is the argsort of the predictions; assigning it back
        # in row order gives row k the index of the k-th best prediction, not
        # row k's own rank.
        df["Rank"] = sorted_ranks
        return df

    def predict(self, prices, options, financials, trades, secondary_prices):
        prices = self.st.online_update_apply(prices)[self.cols]
        if not prices["SecuritiesCode"].is_monotonic_increasing:
            prices = prices.sort_values(by="SecuritiesCode")
        prices["Prediction"] = self.model.predict(prices)
        return self.add_rank(prices)

algo = Algo(model, st)
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:

    if not sample_prediction["SecuritiesCode"].is_monotonic_increasing:
        sample_prediction = sample_prediction.sort_values("SecuritiesCode")

    sample_prediction['Rank'] = algo.predict(prices, options, financials, trades, secondary_prices)['Rank']
    env.predict(sample_prediction)

复现

第八名的方案无法复现,分数是0.001,达不到0.289。

第八名可能有所保留。

(和我对第八名的方案进行重构没有关系,即使我用重构前的代码,也还是0.001,无法复现。)

Commentary

Model

LightGBM.

Missing-Value Handling

ExpectedDividend NaNs are filled with 0; within each stock, leading rows whose prices are still NaN are dropped and the remaining gaps are forward-filled (see fill_nans in Preprocessing.py).

Derived Features

The author derived the following features:

  • Features.Amplitude(): the day's amplitude, High minus Low.
  • Features.OpenCloseReturn(): Close minus Open, divided by Open.
  • Features.Return(): the daily return (Close minus the previous trading day's Close, divided by the previous day's Close).
  • Features.Volatility(10): volatility of Return over the past 10 trading days.
  • Features.Volatility(30): volatility of Return over the past 30 trading days.
  • Features.Volatility(50): volatility of Return over the past 50 trading days.
  • Features.SMA("Close", 3): 3-day moving average of Close.
  • Features.SMA("Close", 5): 5-day moving average of Close.
  • Features.SMA("Close", 10): 10-day moving average of Close.
  • Features.SMA("Close", 30): 30-day moving average of Close.
  • Features.SMA("Return", 3): 3-day moving average of Return.
  • Features.SMA("Return", 5): 5-day moving average of Return.
  • Features.SMA("Return", 10): 10-day moving average of Return.
  • Features.SMA("Return", 30): 30-day moving average of Return.

Taking the SMA (simple moving average) of Close as an example:

\text{SMA} = \frac{C_1 + C_2 + \cdots + C_n}{n}
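Each feature has both a batch (pandas) implementation and an online (row-by-row) one, and the author ships a FeatureChecker to confirm they agree. A minimal sketch on a made-up price series:

import pandas as pd
from Features import Return, FeatureChecker

# Toy prices for a single stock (made-up numbers)
df = pd.DataFrame({"Open": [99.0, 101.0, 100.0, 104.0],
                   "High": [101.0, 103.0, 102.0, 106.0],
                   "Low": [98.0, 100.0, 99.0, 103.0],
                   "Close": [100.0, 102.0, 101.0, 105.0]})

# True if the pandas and online versions produce the same "Return" column
print(FeatureChecker.verify(Return(), df))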

Points of Confusion

train_files/stock_prices.csv covers 2017-01-04 to 2021-12-03.
supplemental_files/stock_prices.csv covers 2021-12-06 to 2022-06-24.
When predicting 2022-07-05 to 2022-10-07, the author should have loaded the supplemental file as well; without it, the moving averages entering the test period are computed from stale state. The author did not do this. A sketch of the fix follows.
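A minimal sketch of the fix, concatenating both files before preparing the training state (paths as in the author's notebook):

import pandas as pd

df_train = pd.concat([
    pd.read_csv(r'../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv'),
    pd.read_csv(r'../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv'),
])
df_train = st.prepare_data_for_training(df_train)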

Author: Kaka Wan Yifan
Link: https://kakawanyifan.com/20102
Copyright: All articles on this blog belong to their author. Reprinting, excerpting, or copying in any form without written permission is prohibited.
