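JPX-3. Top-Ten Solutions [2/2]

  • Some of the code the original authors submitted to the Tokyo Stock Exchange is messy. That is no surprise: during a competition, time is short and the workload is heavy, so code quality usually gets little attention. I work the same way.
  • In this article I have done some light refactoring of the original authors' code.

This chapter covers the innovative solutions:

  • Fourth place
  • Fifth place
  • Ninth place
  • Tenth place

The previous chapter, "JPX-3. Top-Ten Solutions [1/2]", covered the conventional solutions:

  • First place
  • Second place
  • Third place
  • Sixth place
  • Seventh place
  • Eighth place

Fourth Place

Code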

import numpy as np
import pandas as pd
from decimal import ROUND_HALF_UP, Decimal

import jpx_tokyo_market_prediction

base_dir = "../input/jpx-tokyo-stock-exchange-prediction"
train_files_dir = f"{base_dir}/train_files"
supplemental_files_dir = f"{base_dir}/supplemental_files"
# Compute an adjusted close (AdjustedClose) for every security.
def adjust_price(price):
    # Work on a copy so the caller's frame keeps its original string dates.
    price = price.copy()
    price["Date"] = pd.to_datetime(price["Date"], format="%Y-%m-%d")

    def generate_adjusted_close(df):
        # Accumulate the adjustment factor from the most recent date backwards.
        df = df.sort_values("Date", ascending=False)
        df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
        df.loc[:, "AdjustedClose"] = (
            df["CumulativeAdjustmentFactor"] * df["Close"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        df = df.sort_values("Date")
        # Treat a zero close as missing and carry the last valid value forward.
        df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
        df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
        return df

    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)
    price.set_index("Date", inplace=True)
    return price

# Build the two features for one security: the 1-day return and a binary dividend flag.
def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code, ["SecuritiesCode", close_col, "ExpectedDividend"]].copy()
    feats["return_1day"] = feats[close_col].pct_change(1)
    feats["ExpectedDividend"] = feats["ExpectedDividend"].mask(feats["ExpectedDividend"] > 0, 1)
    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)

    return feats
# Load the train and supplemental prices, keeping only the columns the rule needs.
price_cols = ["Date", "SecuritiesCode", "Close", "AdjustmentFactor", "ExpectedDividend"]

df_price_train = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
df_price_train = df_price_train[price_cols]

df_price_supplemental = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")
df_price_supplemental = df_price_supplemental[price_cols]

df_price_raw = pd.concat([df_price_train, df_price_supplemental])
df_price_raw = df_price_raw.loc[df_price_raw["Date"] >= "2022-07-01"]

# Online inference: append each day's prices, recompute the features, and rank.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
counter = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    current_date = prices["Date"].iloc[0]
    sample_prediction_date = sample_prediction["Date"].iloc[0]
    print(f"current_date: {current_date}, sample_prediction_date: {sample_prediction_date}")

    if counter == 0:
        # Drop any history that overlaps the test period.
        df_price_raw = df_price_raw.loc[df_price_raw["Date"] < current_date]

    df_price_raw = pd.concat([df_price_raw, prices[price_cols]])
    df_price = adjust_price(df_price_raw)
    codes = sorted(prices["SecuritiesCode"].unique())

    feature = pd.concat([get_features(df_price, code) for code in codes])
    feature = feature.loc[feature.index == current_date]

    # Score = 1-day return + 100 * dividend flag.
    feature.loc[:, "predict"] = feature["return_1day"] + feature["ExpectedDividend"] * 100

    # Ascending sort: the smallest score receives Rank 0.
    feature = feature.sort_values("predict", ascending=True).drop_duplicates(subset=['SecuritiesCode'])
    feature.loc[:, "Rank"] = np.arange(len(feature))
    feature_map = feature.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(feature_map)

    assert sample_prediction["Rank"].notna().all()
    assert sample_prediction["Rank"].min() == 0
    assert sample_prediction["Rank"].max() == len(sample_prediction["Rank"]) - 1

    counter += 1
    env.predict(sample_prediction)

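Reproduction

The fourth-place solution can be reproduced, with a score of 0.347.

Interpretation

Overall steps

The fourth-place solution is an innovative solution that does not rely on any machine learning model.

The overall steps are:

  1. Adjust the closing price using the cumulative AdjustmentFactor.
  2. Derive two features:
    • Feature 1: the day's return, i.e. (today's close - previous day's close) / previous day's close.
    • Feature 2: a binary flag set to 1 if the original ExpectedDividend (the expected dividend value on the ex-rights date) is greater than 0, and 0 otherwise.
  3. Sort in ascending order by:
    Feature 1 + Feature 2 * 100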
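The ranking rule can be tried on a few made-up rows. This is only a minimal sketch: the security codes and values are hypothetical, and `dividend_flag` is a name I introduce for the binarized ExpectedDividend (the original code overwrites the column in place). As I understand the competition, Rank 0 is the stock expected to have the highest return.

```python
import numpy as np
import pandas as pd

# Toy values for one prediction date (hypothetical, not competition data).
day = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333, 1375],
    "return_1day": [0.02, -0.01, 0.005, -0.03],
    "ExpectedDividend": [0.0, 0.0, 25.0, 0.0],
})

# Feature 2: 1 if a dividend is expected, else 0.
day["dividend_flag"] = (day["ExpectedDividend"] > 0).astype(int)

# Score = 1-day return + 100 * dividend flag, sorted ascending.
day["predict"] = day["return_1day"] + day["dividend_flag"] * 100
day = day.sort_values("predict", ascending=True).reset_index(drop=True)

# The smallest score gets Rank 0, the largest gets the last rank.
day["Rank"] = np.arange(len(day))
ranks = day.set_index("SecuritiesCode")["Rank"].to_dict()
print(ranks)
```

With these numbers, the stock that fell the most today ends up at Rank 0, and the only stock with an expected dividend is pushed to the last rank regardless of its return, because the factor of 100 dominates any realistic daily return.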

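Rationale

The author found that the score of a random model is approximately normally distributed with mean μ = 0 and standard deviation σ = 0.13785.
So even if we build a machine learning model that scores somewhere in [-0.3, 0.3], which is only about ±2.2σ of pure chance, it is hard to say whether that model is good or bad.
The author therefore gave up on machine learning models and looked for a different approach.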
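The claim about random models can be explored with a small simulation. The sketch below is not the author's experiment: it uses synthetic normal returns, smaller portfolio sizes, and a hypothetical helper `spread_sharpe` modeled on my understanding of the competition metric (Sharpe ratio of the daily spread between the top-ranked and bottom-ranked stocks, with linear weights from 2 down to 1).

```python
import numpy as np

def spread_sharpe(returns, ranks, k):
    """Sharpe ratio of the daily top-k minus bottom-k spread.

    returns, ranks: arrays of shape (n_days, n_stocks); rank 0 is the top pick.
    """
    weights = np.linspace(2.0, 1.0, k)          # assumed linear weights, 2 down to 1
    order = np.argsort(ranks, axis=1)           # stock positions sorted by rank
    sorted_ret = np.take_along_axis(returns, order, axis=1)
    top = (sorted_ret[:, :k] * weights).sum(axis=1) / weights.sum()
    # Reverse the last k columns so the worst-ranked stock gets the largest weight.
    bottom = (sorted_ret[:, -k:][:, ::-1] * weights).sum(axis=1) / weights.sum()
    spread = top - bottom
    return spread.mean() / spread.std(ddof=1)

rng = np.random.default_rng(0)
n_days, n_stocks, k, n_models = 60, 500, 50, 200

# Synthetic daily returns (an assumption; the real Target column is not used).
returns = rng.normal(0.0, 0.02, size=(n_days, n_stocks))

# Each "random model" assigns an independent random permutation of ranks per day.
scores = np.array([
    spread_sharpe(returns, np.argsort(rng.random((n_days, n_stocks)), axis=1), k)
    for _ in range(n_models)
])
print(f"mean={scores.mean():.3f}, std={scores.std(ddof=1):.3f}")
```

Under the rough approximation that a random model's daily spreads are independent with zero mean, the Sharpe ratio over T days has a standard deviation of about 1/√T, so the spread of the simulated scores depends mainly on the number of days being scored.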

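ExpectedDividend is defined as the expected dividend value on the ex-rights date; the value is recorded two business days before the ex-dividend date.
In other words, if ExpectedDividend is greater than 0 on day t, then the shareholders entitled to the dividend are registered on day t+1 and the dividend is paid out on day t+2.
The Close on day t+2 is therefore likely to be lower than the Close on day t+1.
Because the Target for day t is the return from the Close on day t+1 to the Close on day t+2, the Target for day t is likely to be negative.
For example, suppose a stock is worth 1 yuan and pays a dividend of 0.2 yuan. The days are only one or two apart, so the time value of money can be ignored. In a rational market, the price on day t+1 is then at most 1.2, and it pulls back on day t+2.
In short: once a dividend is announced the stock rises, and once the dividend has been distributed the price falls back.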
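The arithmetic of that example, worked through with the competition's Target definition (the return from the close on day t+1 to the close on day t+2). The two prices are the hypothetical ones from the example, assuming the price falls back fully to 1.0:

```python
# Hypothetical prices taken from the example in the text.
close_t1 = 1.2   # close on day t+1, dividend still priced in
close_t2 = 1.0   # close on day t+2, after the pullback (assumed to be complete)

# Target for day t: return from the t+1 close to the t+2 close.
target_t = (close_t2 - close_t1) / close_t1
print(round(target_t, 4))  # -0.1667
```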

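Of course, most of the time a stock pays no dividend, so ExpectedDividend is almost always zero and cannot yield a good score by itself. That is why the author also uses the day's return as a feature.

However, neither the code nor the documentation the author submitted to the Tokyo Stock Exchange explains why the day's return was chosen.

Fifth Place

Code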

import numpy as np
import pandas as pd
import lightgbm as lgb

import jpx_tokyo_market_prediction

data_df_train = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
data_df_supplemental = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')
data_df = pd.concat([data_df_train,data_df_supplemental])

data_df.sort_values(by=['SecuritiesCode', 'Date'], inplace=True)
data_df.reset_index(drop=True, inplace=True)
data_df['Close'] = data_df.groupby(['SecuritiesCode'])['Close'].ffill()
data_df['Close'] = data_df.groupby(['SecuritiesCode'])['Close'].bfill()
data_df['SecuritiesCode'] = data_df['SecuritiesCode'].astype('int')
# Target shifted by two rows is the return of day t over day t-1; missing values are treated as 0.
data_df.loc[:, 'r1dprev_clean'] = data_df.groupby(['SecuritiesCode'])['Target'].shift(2)
data_df['r1dprev_clean'] = data_df['r1dprev_clean'].fillna(0)
data_df.loc[:, 'ave1dprev_clean'] = data_df.groupby(['Date'])['r1dprev_clean'].transform('mean')
# Rebuild a price index from the returns (np.cumproduct was removed in NumPy 2.0, so cumprod is used).
data_df.loc[:, 'clean_prices'] = data_df.groupby(['SecuritiesCode'])['r1dprev_clean'].transform(lambda x: (x + 1).cumprod() * 100)
data_df.loc[:, 'r1dprev_abs'] = np.abs(data_df['r1dprev_clean'])
# alt_Target: two-day forward return on the rebuilt index; alpha_alt_Target: its excess over the daily mean.
data_df.loc[:, 'alt_Target'] = data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(-2) / data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) - 1
data_df.loc[:, 'ave_alt_Target'] = data_df.groupby(['Date'])['alt_Target'].transform('mean')
data_df.loc[:, 'alpha_alt_Target'] = data_df['alt_Target'] - data_df['ave_alt_Target']
data_df.loc[:, 'target_rank'] = data_df.groupby(['Date'])['Target'].rank()

master_df_col = ['Close', 'Date', 'SecuritiesCode', 'clean_prices', 'Target', 'r1dprev_abs', 'Volume', 'alt_Target']
master_df = data_df[master_df_col].copy()

# transform keeps the original row index, so the results align across pandas versions.
master_df.loc[:, 'volsignal'] = master_df.groupby(['SecuritiesCode'])['r1dprev_abs'].transform(lambda x: x.rolling(231).sum())
master_df.loc[:, 'adv'] = master_df.groupby(['SecuritiesCode'])['Volume'].transform(lambda x: x.rolling(11).mean())
master_df.loc[:, 'ave_volsignal'] = master_df.groupby(['Date'])['volsignal'].transform('mean')
master_df.loc[:, 'momsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(25) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(131)
master_df.loc[:, 'ave_momsignal'] = master_df.groupby(['Date'])['momsignal'].transform('mean')
master_df.loc[:, 'mrsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(2)
master_df.loc[:, 'ave_mrsignal'] = master_df.groupby(['Date'])['mrsignal'].transform('mean')
master_df.loc[:, 'Close_rank'] = master_df.groupby(['Date'])['Close'].rank()
master_df.loc[:, 'volume_rank'] = master_df.groupby(['Date'])['Volume'].rank()
master_df.loc[:, 'adv_rank'] = master_df.groupby(['Date'])['adv'].rank()
master_df.dropna(inplace=True)
master_df.loc[:, 'Target_rank'] = master_df.groupby(['Date'])['Target'].rank()
master_df.loc[:, 'ave_ret'] = master_df.groupby(['Date'])['alt_Target'].transform('mean')
master_df.loc[:, 'alpha_alt_Target'] = master_df['alt_Target'] - master_df['ave_ret']

# Train only on the two tails of each day's Target ranking.
master_df = master_df.loc[(master_df['Target_rank'] >= 1750) | (master_df['Target_rank'] <= 250)]

features = ['volsignal', 'ave_momsignal', 'momsignal', 'mrsignal', 'ave_mrsignal', 'SecuritiesCode', 'adv_rank', 'Close_rank']
x_train = master_df[features]
y_train = master_df['alpha_alt_Target']
cat_feat = ['SecuritiesCode']

model = lgb.LGBMRegressor(boosting_type='gbdt', max_depth=2, learning_rate=0.2, n_estimators=2000, seed=42)
model.fit(x_train, y_train, categorical_feature=cat_feat)
# Online inference: extend the price history day by day and recompute the features from the raw Close.
sub_df = data_df[['Date', 'SecuritiesCode', 'Close', 'Volume']].copy()
count_ = 0
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = prices[['Date', 'SecuritiesCode', 'Close', 'Volume']]
    date_val = prices['Date'].unique()[0]
    if count_ == 0:
        # Drop any history that overlaps the test period.
        sub_df = sub_df.loc[(sub_df['Date'] < date_val)]
        count_ = 1
    date_val_prev = sub_df.iloc[-1, 0]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    sub_df = pd.concat([sub_df, prices])
    sub_df.sort_values(by=['SecuritiesCode', 'Date'], inplace=True)
    sub_df.reset_index(drop=True, inplace=True)
    sub_df = sub_df.ffill()
    pred_df = sub_df.copy()
    pred_df.loc[:, 'r1dprev_clean'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(0) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(1) - 1
    pred_df.loc[:, 'r1dprev_abs'] = np.abs(pred_df['r1dprev_clean'])
    pred_df.loc[:, 'adv'] = pred_df.groupby(['SecuritiesCode'])['Volume'].transform(lambda x: x.rolling(11).mean())
    pred_df.loc[:, 'volsignal'] = pred_df.groupby(['SecuritiesCode'])['r1dprev_abs'].transform(lambda x: x.rolling(231).sum())
    pred_df.loc[:, 'momsignal'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(25) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(131)
    pred_df.loc[:, 'ave_momsignal'] = pred_df.groupby(['Date'])['momsignal'].transform('mean')
    pred_df.loc[:, 'mrsignal'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(0) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(2)
    pred_df.loc[:, 'ave_mrsignal'] = pred_df.groupby(['Date'])['mrsignal'].transform('mean')
    pred_df.loc[:, 'Close_rank'] = pred_df.groupby(['Date'])['Close'].rank()
    pred_df.loc[:, 'volume_rank'] = pred_df.groupby(['Date'])['Volume'].rank()
    pred_df.loc[:, 'adv_rank'] = pred_df.groupby(['Date'])['adv'].rank()
    pred_df = pred_df.loc[(pred_df['Date'] == date_val), features]
    pred_df.loc[:, 'y_pred'] = model.predict(pred_df[features])
    # Highest predicted excess return gets Rank 0.
    pred_df.sort_values('y_pred', ascending=False, inplace=True)
    pred_df.reset_index(inplace=True, drop=True)
    pred_df.loc[:, 'Rank'] = np.arange(len(pred_df))
    sample_prediction.drop(['Rank'], axis=1, inplace=True)

    sample_prediction = pd.merge(sample_prediction, pred_df[['SecuritiesCode', 'Rank']], how='left',on=(['SecuritiesCode']))
    sample_prediction['Rank'] = sample_prediction['Rank'].fillna(1000)

    env.predict(sample_prediction)

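Reproduction

The fifth-place solution can be reproduced, with a score of 0.339.

Interpretation

Model

LightGBM

The author states that, after comparing it with TabNet and a DNN, LightGBM performed best.

Missing values

Only the closing price is treated: the data are grouped by stock and sorted by date, then filled forward and then backward.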
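The effect of filling forward and then backward within each stock can be seen on a toy frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values).
toy = pd.DataFrame({
    "SecuritiesCode": [1301, 1301, 1301, 1332, 1332, 1332],
    "Date": ["2022-01-04", "2022-01-05", "2022-01-06"] * 2,
    "Close": [np.nan, 100.0, np.nan, 50.0, np.nan, 52.0],
})

toy = toy.sort_values(["SecuritiesCode", "Date"]).reset_index(drop=True)
# Forward fill carries the last known close forward within each stock...
toy["Close"] = toy.groupby("SecuritiesCode")["Close"].ffill()
# ...and backward fill covers any gap left at the start of a stock's history.
toy["Close"] = toy.groupby("SecuritiesCode")["Close"].bfill()
print(toy["Close"].tolist())  # [100.0, 100.0, 100.0, 50.0, 50.0, 52.0]
```

Grouping by SecuritiesCode matters here: without it, a gap at the start of one stock's history would be filled with a price from a neighbouring stock.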

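Feature derivation

Derived features

The model uses 8 features. Seven are derived; the eighth is the raw SecuritiesCode, which is passed to LightGBM as a categorical feature:

  • volsignal
  • ave_momsignal
  • momsignal
  • mrsignal
  • ave_mrsignal
  • adv_rank
  • Close_rank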
# Target shifted by two rows is the return of day t over day t-1; missing values are treated as 0.
data_df.loc[:, 'r1dprev_clean'] = data_df.groupby(['SecuritiesCode'])['Target'].shift(2)
data_df['r1dprev_clean'] = data_df['r1dprev_clean'].fillna(0)
data_df.loc[:, 'ave1dprev_clean'] = data_df.groupby(['Date'])['r1dprev_clean'].transform('mean')
# Rebuild a price index from the returns (np.cumproduct was removed in NumPy 2.0, so cumprod is used).
data_df.loc[:, 'clean_prices'] = data_df.groupby(['SecuritiesCode'])['r1dprev_clean'].transform(lambda x: (x + 1).cumprod() * 100)
data_df.loc[:, 'r1dprev_abs'] = np.abs(data_df['r1dprev_clean'])
data_df.loc[:, 'alt_Target'] = data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(-2) / data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) - 1
data_df.loc[:, 'ave_alt_Target'] = data_df.groupby(['Date'])['alt_Target'].transform('mean')
data_df.loc[:, 'alpha_alt_Target'] = data_df['alt_Target'] - data_df['ave_alt_Target']
data_df.loc[:, 'target_rank'] = data_df.groupby(['Date'])['Target'].rank()

master_df_col = ['Close', 'Date', 'SecuritiesCode', 'clean_prices', 'Target', 'r1dprev_abs', 'Volume', 'alt_Target']
master_df = data_df[master_df_col].copy()

# volsignal: 231-day sum of absolute returns; adv: 11-day average volume.
master_df.loc[:, 'volsignal'] = master_df.groupby(['SecuritiesCode'])['r1dprev_abs'].transform(lambda x: x.rolling(231).sum())
master_df.loc[:, 'adv'] = master_df.groupby(['SecuritiesCode'])['Volume'].transform(lambda x: x.rolling(11).mean())
master_df.loc[:, 'ave_volsignal'] = master_df.groupby(['Date'])['volsignal'].transform('mean')
# momsignal: price 25 days ago over price 131 days ago; mrsignal: today's price over the price 2 days ago.
master_df.loc[:, 'momsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(25) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(131)
master_df.loc[:, 'ave_momsignal'] = master_df.groupby(['Date'])['momsignal'].transform('mean')
master_df.loc[:, 'mrsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(2)
master_df.loc[:, 'ave_mrsignal'] = master_df.groupby(['Date'])['mrsignal'].transform('mean')
master_df.loc[:, 'Close_rank'] = master_df.groupby(['Date'])['Close'].rank()
master_df.loc[:, 'volume_rank'] = master_df.groupby(['Date'])['Volume'].rank()
master_df.loc[:, 'adv_rank'] = master_df.groupby(['Date'])['adv'].rank()
master_df.dropna(inplace=True)
master_df.loc[:, 'Target_rank'] = master_df.groupby(['Date'])['Target'].rank()
master_df.loc[:, 'ave_ret'] = master_df.groupby(['Date'])['alt_Target'].transform('mean')
master_df.loc[:, 'alpha_alt_Target'] = master_df['alt_Target'] - master_df['ave_ret']
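The way clean_prices is rebuilt from Target is easy to miss, so here is a minimal single-stock sketch. It assumes the competition's Target is the return from the close on day t+1 to the close on day t+2, and it uses a hypothetical price series with no adjustment events:

```python
import numpy as np
import pandas as pd

# Hypothetical closes for one stock.
close = pd.Series([100.0, 102.0, 99.0, 101.0, 104.0, 103.0])

# Competition-style Target (as I understand it): return from close[t+1] to close[t+2].
target = close.shift(-2) / close.shift(-1) - 1

# Shifting Target by two rows recovers the return of day t relative to day t-1.
r1dprev = target.shift(2).fillna(0)

# The cumulative product of (1 + r) rebuilds a price index that starts at 100.
clean_prices = (r1dprev + 1).cumprod() * 100

# alt_Target: two-day forward return on the rebuilt index.
alt_target = clean_prices.shift(-2) / clean_prices - 1
```

From the second row onward the rebuilt index is just the close rescaled by a constant, so ratios of clean_prices equal ratios of Close. Note that alt_Target spans two days (from day t to day t+2), which is not the same quantity as the original Target.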

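Rationale

The author says he studied many features and found that, over a very short prediction horizon, behavioral factors are a far stronger driver than fundamentals; predicting long-term outperformance, by contrast, would probably need to make better use of fundamental valuation data.

This reminds me of Benjamin Graham's remark that in the short run the market is a voting machine, but in the long run it is a weighing machine.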

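Excess return as the label

The competition's original label actually includes the market's risk-free return; that is, it is a total return (pure return).
The author's view:

  • From a signal perspective, financial market prediction is generally considered an extremely fragile process. Market conditions keep changing and financial time series carry all kinds of idiosyncratic risk, so predicting total return (pure return) is very hard, and the dependencies involved are inconsistent and highly variable.
  • By comparison, excess return is more predictable.

So the author computed the excess return and used it as the label. In the code this is alpha_alt_Target: each stock's alt_Target minus the mean alt_Target of all stocks on the same date.

Some private funds do something similar: they first predict which stocks will clearly outperform the index, hold those stocks, and short the index.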
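The label construction can be sketched on a toy frame. The column names follow the original code; the values are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2022-01-04"] * 3 + ["2022-01-05"] * 3,
    "SecuritiesCode": [1301, 1332, 1333] * 2,
    "alt_Target": [0.03, 0.01, -0.01, -0.02, 0.00, 0.05],
})

# Market proxy: the equal-weighted mean return of all stocks on the same date.
toy["ave_ret"] = toy.groupby("Date")["alt_Target"].transform("mean")
# Excess return: each stock's return minus that day's cross-sectional mean.
toy["alpha_alt_Target"] = toy["alt_Target"] - toy["ave_ret"]
print(toy)
```

By construction the excess returns sum to zero on every date, so the model only has to learn which stocks do better or worse than the rest of the market on a given day, not the direction of the market itself.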

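Training data

The author trained only on the extremes: the 250 stocks with the highest returns and the 250 with the lowest returns on each date.

This contrasts with other competitors, and with some industry practice, where the training set is chosen as a particular time period. The idea is worth borrowing.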
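A small sketch of this tail selection, generalized to k stocks per tail. The original hard-codes the thresholds 250 and 1750, which assumes roughly 2000 stocks per date; `k` and `n_per_day` are names I introduce here, and the data are synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_stocks, k = 20, 3   # toy sizes; the original uses about 2000 stocks and 250 per tail

toy = pd.DataFrame({
    "Date": np.repeat(["2022-01-04", "2022-01-05"], n_stocks),
    "SecuritiesCode": np.tile(np.arange(n_stocks), 2),
    "Target": rng.normal(0.0, 0.02, size=2 * n_stocks),
})

# Rank Target within each date (1 = lowest return).
toy["Target_rank"] = toy.groupby("Date")["Target"].rank(method="first")
# Number of stocks available on each date.
n_per_day = toy.groupby("Date")["Target"].transform("count")

# Keep only the k lowest and the k highest returns of each date.
tails = toy.loc[(toy["Target_rank"] <= k) | (toy["Target_rank"] > n_per_day - k)]
print(tails.groupby("Date").size())
```

Computing the upper threshold from the per-date count keeps both tails the same size even when the number of listed stocks changes from day to day.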

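Commentary

In the Tokyo Stock Exchange's classification, the fifth-place solution counts as an innovative solution.
In fact, the underlying model is still LightGBM, a machine learning model.

Its feature derivation, however, is genuinely innovative: it is based neither on generic feature-engineering recipes nor on the usual price-volume technical indicators.

Ninth Place

Code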

import os
import pandas as pd

import jpx_tokyo_market_prediction

TRAIN_DIR = "../input/jpx-tokyo-stock-exchange-prediction/train_files/"
PRICE_COLS = ["Open", "High", "Low", "Close"]
PK = ["Date", "SecuritiesCode"]

# Adjust the prices and volume of one stock with the cumulative AdjustmentFactor.
def adjust_pnv(df_stock: pd.DataFrame) -> pd.DataFrame:
    df_stock.sort_values(by=["Date"], inplace=True)
    # Accumulate the factor from the most recent date backwards.
    adj_factor_cum_prod = df_stock["AdjustmentFactor"][::-1].cumprod()

    for price in PRICE_COLS:
        df_stock[f"Adj{price}"] = df_stock[price] * adj_factor_cum_prod
        df_stock[f"Adj{price}"] = df_stock[f"Adj{price}"].ffill()
    df_stock["AdjVol"] = df_stock["Volume"] / adj_factor_cum_prod

    return df_stock

df = pd.read_csv(os.path.join(TRAIN_DIR, 'stock_prices.csv'))
# group_keys=False keeps the original row index instead of adding SecuritiesCode as an index level.
df = df.groupby("SecuritiesCode", group_keys=False).apply(adjust_pnv)
df["IntradayReturn"] = (df["AdjClose"] - df["AdjOpen"]) / df["AdjOpen"]
df["IntradayReturn"] = df["IntradayReturn"].fillna(0)
df["Target"] = df["Target"].fillna(0)

# Rank the stocks by IntradayReturn within each date (ascending, starting at 0).
rank = df[PK + ["IntradayReturn"]].set_index("SecuritiesCode")
rank = (rank.groupby("Date").apply(lambda x: x["IntradayReturn"].rank(method='first').astype(int) - 1))
rank.name = "Rank"

pred = rank.reset_index()
target = df[PK + ['Target']]
pred = pred.merge(target, left_on=PK, right_on=PK)
# Online inference: rank each day's stocks by their raw intraday return.
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, _, _, _, _, sample_prediction) in iter_test:
    prices["IntradayReturn"] = (prices["Close"] - prices["Open"]) / prices["Open"]
    prices["IntradayReturn"] = prices["IntradayReturn"].fillna(0)
    df = prices[["SecuritiesCode", "IntradayReturn"]].set_index("SecuritiesCode")
    df["Rank"] = df["IntradayReturn"].rank(method='first').astype(int) - 1
    rank = df["Rank"]
    sample_prediction['Rank'] = sample_prediction["SecuritiesCode"].map(rank)
    env.predict(sample_prediction)

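Reproduction

The ninth-place solution can be reproduced, with a score of 0.281.

Interpretation

The ninth-place solution is an innovative solution that does not rely on any machine learning model.

Overall steps:

  1. Handle missing values: group by stock, then forward fill.
  2. Derive one feature:
    df["IntradayReturn"] = (df["AdjClose"] - df["AdjOpen"]) / df["AdjOpen"]
  3. Sort by IntradayReturn in ascending order.

However, neither the code nor the documentation the author submitted to the Tokyo Stock Exchange explains why the ranking is based directly on IntradayReturn.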
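The whole rule fits in a few lines. A minimal sketch on hypothetical prices for a single day (`rank_map` is a name I introduce here):

```python
import pandas as pd

# Hypothetical open and close prices for one day.
prices = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333, 1375],
    "Open": [100.0, 200.0, 50.0, 80.0],
    "Close": [103.0, 198.0, 50.0, 76.0],
})

prices["IntradayReturn"] = (prices["Close"] - prices["Open"]) / prices["Open"]
prices["IntradayReturn"] = prices["IntradayReturn"].fillna(0)

# Ascending rank: the stock with the lowest intraday return gets Rank 0.
prices["Rank"] = prices["IntradayReturn"].rank(method="first").astype(int) - 1
rank_map = prices.set_index("SecuritiesCode")["Rank"]
print(rank_map.to_dict())
```

Because Rank 0 goes to the stock that fell the most between open and close, the rule amounts to betting that the day's intraday losers will outperform and the day's winners will underperform, which resembles a short-term reversal bet. That reading is mine; the author's submission does not state it.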

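Tenth Place

According to the documentation and code the tenth-place entrant submitted to the Tokyo Stock Exchange, the solution consists of three .ipynb files:

  1. simulations.ipynb
  2. simulation_aggregation.ipynb
  3. submission-notebook.ipynb

The first two, simulations.ipynb and simulation_aggregation.ipynb, use Monte Carlo simulation to determine an important parameter in submission-notebook.ipynb.

In my own tests, however, I ran into two problems:

  1. simulations.ipynb and simulation_aggregation.ipynb do not run at all; part of the code is clearly missing.
  2. submission-notebook.ipynb cannot be reproduced: the result is only 0.023, far short of 0.280.

In addition, I honestly did not understand the idea behind the tenth-place solution.

For these reasons, the tenth-place solution is not discussed further here.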

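Author: Kaka Wan Yifan
Article link: https://kakawanyifan.com/20103
Copyright notice: All articles on this blog are copyrighted by the author. No organization or individual may reproduce, excerpt, or copy them in any form without written permission.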
