- Some of the code the original authors submitted to the Tokyo Stock Exchange is quite messy. That is no surprise: during a competition, time is tight and the workload heavy, so code quality is rarely a priority. I work the same way myself.
- In this article I have lightly refactored the original authors' code.
The previous chapter, 《JPX-3.前十名的方案 [1/2]》, discussed the conventional solutions.
This chapter discusses the innovative ones:
4th Place
Code
```python
import numpy as np
import pandas as pd
from decimal import ROUND_HALF_UP, Decimal

import jpx_tokyo_market_prediction
```
```python
base_dir = "../input/jpx-tokyo-stock-exchange-prediction"
train_files_dir = f"{base_dir}/train_files"
supplemental_files_dir = f"{base_dir}/supplemental_files"
```
```python
def adjust_price(price):
    price.loc[:, "Date"] = pd.to_datetime(price.loc[:, "Date"], format="%Y-%m-%d")

    def generate_adjusted_close(df):
        df = df.sort_values("Date", ascending=False)
        df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
        df.loc[:, "AdjustedClose"] = (
            df["CumulativeAdjustmentFactor"] * df["Close"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        df = df.sort_values("Date")
        df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
        df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
        return df

    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)
    price.set_index("Date", inplace=True)
    return price
```
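To make the adjustment concrete, here is a toy version of the same backward-cumulative-product logic. The numbers are hypothetical: a 1:2 split, assuming the AdjustmentFactor of 0.5 is recorded on the last day the stock traded at the pre-split price.

```python
import pandas as pd

# Hypothetical series around a 1:2 split (AdjustmentFactor = 0.5 on 2022-01-05).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]),
    "Close": [100.0, 100.0, 50.0, 50.0],
    "AdjustmentFactor": [1.0, 0.5, 1.0, 1.0],
})

# Same idea as generate_adjusted_close: cumulate factors backwards in time,
# then scale each close by its cumulative factor.
df = df.sort_values("Date", ascending=False)
df["CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
df["AdjustedClose"] = df["CumulativeAdjustmentFactor"] * df["Close"]
df = df.sort_values("Date")

# The pre-split closes of 100 become 50, giving a continuous series.
print(df["AdjustedClose"].tolist())  # [50.0, 50.0, 50.0, 50.0]
```

Without this step, the raw close would show a spurious -50% "return" on the split date.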
```python
def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code,
                      ["SecuritiesCode", close_col, "ExpectedDividend"]].copy()
    feats["return_1day"] = feats[close_col].pct_change(1)
    feats["ExpectedDividend"] = feats["ExpectedDividend"].mask(feats["ExpectedDividend"] > 0, 1)
    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)
    return feats
```
```python
price_cols = ["Date", "SecuritiesCode", "Close", "AdjustmentFactor", "ExpectedDividend"]

df_price_train = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
df_price_train = df_price_train[price_cols]

df_price_supplemental = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")
df_price_supplemental = df_price_supplemental[price_cols]

df_price_raw = pd.concat([df_price_train, df_price_supplemental])
df_price_raw = df_price_raw.loc[df_price_raw["Date"] >= "2022-07-01"]
```
```python
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

counter = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    current_date = prices["Date"].iloc[0]
    sample_prediction_date = sample_prediction["Date"].iloc[0]
    print(f"current_date: {current_date}, sample_prediction_date: {sample_prediction_date}")

    if counter == 0:
        df_price_raw = df_price_raw.loc[df_price_raw["Date"] < current_date]

    df_price_raw = pd.concat([df_price_raw, prices[price_cols]])
    df_price = adjust_price(df_price_raw)
    codes = sorted(prices["SecuritiesCode"].unique())

    feature = pd.concat([get_features(df_price, code) for code in codes])
    feature = feature.loc[feature.index == current_date]

    feature.loc[:, "predict"] = feature["return_1day"] + feature["ExpectedDividend"] * 100

    feature = feature.sort_values("predict", ascending=True).drop_duplicates(subset=['SecuritiesCode'])
    feature.loc[:, "Rank"] = np.arange(len(feature))
    feature_map = feature.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(feature_map)

    assert sample_prediction["Rank"].notna().all()
    assert sample_prediction["Rank"].min() == 0
    assert sample_prediction["Rank"].max() == len(sample_prediction["Rank"]) - 1

    counter += 1
    env.predict(sample_prediction)
```
Reproduction
The 4th-place solution reproduces, scoring 0.347.
Interpretation
Overall Steps
The 4th-place solution counts as an innovative one: it uses no machine learning model at all.
The overall steps are:
- Adjust the close prices for corporate actions.
- Derive two features:
  - Feature 1: today's close minus the previous day's close, divided by the previous day's close, i.e. the daily return.
  - Feature 2: 1 if the original ExpectedDividend (the expected dividend value on the ex-rights date) is greater than 0, otherwise 0.
- Sort ascending by: feature 1 + feature 2 × 100.
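The ranking rule above can be sketched on hypothetical numbers (the securities codes and feature values here are made up for illustration):

```python
import pandas as pd

# Hypothetical daily snapshot: three stocks with their two derived features.
snap = pd.DataFrame({
    "SecuritiesCode": [1301, 1332, 1333],
    "return_1day": [0.012, -0.004, 0.020],   # feature 1: daily return
    "ExpectedDividend": [0, 1, 0],           # feature 2: dividend flag
})

# predict = feature 1 + feature 2 * 100. The * 100 makes the dividend flag
# dominate, pushing flagged stocks to the largest Rank values (the short side).
snap["predict"] = snap["return_1day"] + snap["ExpectedDividend"] * 100
snap = snap.sort_values("predict")            # ascending: Rank 0 = lowest predict
snap["Rank"] = range(len(snap))
print(snap[["SecuritiesCode", "Rank"]])
```

Stock 1332, carrying the dividend flag, ends up last (Rank 2), exactly as the dividend argument below requires; among the rest, the stock with the lowest daily return is ranked first.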
Why This Works
The author observed that the scores of random models approximately follow a normal distribution with mean μ = 0 and standard deviation σ = 0.13785.
So even if we built a machine learning model scoring somewhere in [−0.3, 0.3], it would be hard to say whether the model is genuinely good or genuinely bad.
The author therefore abandoned machine learning models and approached the problem from a different angle.
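To see why random scores spread this widely, here is a simplified Monte Carlo sketch of the competition metric: the Sharpe ratio of daily long-short spread returns, with linear 2-to-1 weights over the top and bottom 200 ranked stocks. The return distribution and number of evaluation days are hypothetical stand-ins, so the exact spread differs from the author's σ = 0.13785, but random submissions scatter around zero with a standard deviation of roughly 1/√(number of days):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_days, n_trials = 2000, 60, 200

# Linear weights from 2 down to 1 over each 200-stock book, normalized to mean 1.
weights = np.linspace(2, 1, 200)
weights /= weights.mean()

def daily_spread_return(targets, ranks):
    order = np.argsort(ranks)                  # Rank 0 first
    long_leg = targets[order[:200]]            # best-ranked 200: bought
    short_leg = targets[order[-200:]][::-1]    # worst-ranked 200: shorted
    return (long_leg * weights).mean() - (short_leg * weights).mean()

scores = []
for _ in range(n_trials):                      # many random "submissions"
    spreads = np.array([
        daily_spread_return(rng.normal(0, 0.02, n_stocks), rng.permutation(n_stocks))
        for _ in range(n_days)
    ])
    scores.append(spreads.mean() / spreads.std())

# Random scores center on 0; their std is roughly 1 / sqrt(n_days) ≈ 0.13 here.
print(f"mean = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")
```

With an evaluation window of a few dozen trading days, a score of ±0.3 is only about two standard deviations from pure chance, which is the author's point.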
ExpectedDividend is defined as the expected dividend value on the ex-rights date; it is recorded two business days before the ex-dividend date.
In other words, if ExpectedDividend is greater than 0 on day t, then on day t+1 the shareholders entitled to the dividend are recorded, and on day t+2 the dividend is paid out.
So the Close on t+2 is very likely lower than the Close on t+1, which means the Target on day t is negative.
For example, suppose a stock is worth 1 and pays a dividend of 0.2. Since only a day or two passes, the time value of money can be ignored. In a rational market, the price on day t+1 is then at most 1.2, and on day t+2 it pulls back.
In short: when a dividend is announced, the stock rises; once the dividend is paid, it pulls back.
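The dividend example above, in numbers (hypothetical prices, time value of money ignored as in the text):

```python
# Stock worth 1.0, announced dividend 0.2, per the example above.
close_t1 = 1.2   # t+1: rational price just before going ex-dividend
close_t2 = 1.0   # t+2: price drops back by roughly the dividend

# Target on day t is the t+1 -> t+2 rate of change of the close.
target_t = (close_t2 - close_t1) / close_t1
print(f"Target on day t = {target_t:.3f}")   # -0.167: negative, so rank it last
```

This is why the dividend flag, scaled by 100, forces such stocks to the bottom (shorted) end of the ranking.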
Of course, most of the time stocks pay no dividend, so ExpectedDividend is almost always zero, and using it alone cannot produce a good score. That is why the author also brought in the daily return feature.
However, the code and documentation the author submitted to the Tokyo Stock Exchange give no explanation for why the daily return feature was included.
5th Place
Code
```python
import numpy as np
import pandas as pd
import lightgbm as lgb

import jpx_tokyo_market_prediction
```
```python
data_df_train = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
data_df_supplemental = pd.read_csv('/kaggle/input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')
data_df = pd.concat([data_df_train, data_df_supplemental])
```
```python
data_df.sort_values(by=['SecuritiesCode', 'Date'], inplace=True)
data_df.reset_index(drop=True, inplace=True)
data_df['Close'] = data_df.groupby(['SecuritiesCode'])['Close'].ffill()
data_df['Close'] = data_df.groupby(['SecuritiesCode'])['Close'].bfill()
data_df['SecuritiesCode'] = data_df['SecuritiesCode'].astype('int')
```
```python
data_df.loc[:, 'r1dprev_clean'] = data_df.groupby(['SecuritiesCode'])['Target'].shift(2)
data_df['r1dprev_clean'].fillna(0, inplace=True)
data_df.loc[:, 'ave1dprev_clean'] = data_df.groupby(['Date'])['r1dprev_clean'].transform(np.mean)
data_df.loc[:, 'clean_prices'] = data_df.groupby(['SecuritiesCode'])['r1dprev_clean'].apply(lambda x: np.cumprod(x + 1) * 100)
data_df.loc[:, 'r1dprev_abs'] = np.abs(data_df['r1dprev_clean'])
data_df.loc[:, 'alt_Target'] = data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(-2) / data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) - 1
data_df.loc[:, 'ave_alt_Target'] = data_df.groupby(['Date'])['alt_Target'].transform(np.mean)
data_df.loc[:, 'alpha_alt_Target'] = data_df['alt_Target'] - data_df['ave_alt_Target']
data_df.loc[:, 'target_rank'] = data_df.groupby(['Date'])['Target'].rank()
```
```python
master_df_col = ['Close', 'Date', 'SecuritiesCode', 'clean_prices', 'Target', 'r1dprev_abs', 'Volume', 'alt_Target']
master_df = data_df[master_df_col].copy()

master_df.loc[:, 'volsignal'] = master_df.groupby(['SecuritiesCode'])['r1dprev_abs'].apply(lambda x: x.rolling(231).sum())
master_df.loc[:, 'adv'] = master_df.groupby(['SecuritiesCode'])['Volume'].apply(lambda x: x.rolling(11).mean())
master_df.loc[:, 'ave_volsignal'] = master_df.groupby(['Date'])['volsignal'].transform(np.mean)
master_df.loc[:, 'momsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(25) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(131)
master_df.loc[:, 'ave_momsignal'] = master_df.groupby(['Date'])['momsignal'].transform(np.mean)
master_df.loc[:, 'mrsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(2)
master_df.loc[:, 'ave_mrsignal'] = master_df.groupby(['Date'])['mrsignal'].transform(np.mean)
master_df.loc[:, 'Close_rank'] = master_df.groupby(['Date'])['Close'].rank()
master_df.loc[:, 'volume_rank'] = master_df.groupby(['Date'])['Volume'].rank()
master_df.loc[:, 'adv_rank'] = master_df.groupby(['Date'])['adv'].rank()
master_df.dropna(inplace=True)
master_df.loc[:, 'Target_rank'] = master_df.groupby(['Date'])['Target'].rank()
master_df.loc[:, 'ave_ret'] = master_df.groupby(['Date'])['alt_Target'].transform(np.mean)
master_df.loc[:, 'alpha_alt_Target'] = master_df['alt_Target'] - master_df['ave_ret']
```
```python
master_df = master_df.loc[(master_df['Target_rank'] >= 1750) | (master_df['Target_rank'] <= 250)]
```
```python
features = ['volsignal', 'ave_momsignal', 'momsignal', 'mrsignal', 'ave_mrsignal', 'SecuritiesCode', 'adv_rank', 'Close_rank']
x_train = master_df[features]
y_train = master_df['alpha_alt_Target']
cat_feat = ['SecuritiesCode']
```
```python
model = lgb.LGBMRegressor(boosting_type='gbdt', max_depth=2, learning_rate=0.2, n_estimators=2000, seed=42)
model.fit(x_train, y_train, categorical_feature=cat_feat)
```
```python
sub_df = data_df[['Date', 'SecuritiesCode', 'Close', 'Volume']].copy()

count_ = 0
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = prices[['Date', 'SecuritiesCode', 'Close', 'Volume']]
    date_val = prices['Date'].unique()[0]

    if count_ == 0:
        sub_df = sub_df.loc[(sub_df['Date'] < date_val)]
        count_ = 1

    date_val_prev = sub_df.iloc[-1, 0]
    sub_df = pd.concat([sub_df, prices])
    sub_df.sort_values(by=['SecuritiesCode', 'Date'], inplace=True)
    sub_df.reset_index(drop=True, inplace=True)
    sub_df.ffill(inplace=True)

    pred_df = sub_df.copy()
    pred_df.loc[:, 'r1dprev_clean'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(0) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(1) - 1
    pred_df.loc[:, 'r1dprev_abs'] = np.abs(pred_df['r1dprev_clean'])
    pred_df.loc[:, 'adv'] = pred_df.groupby(['SecuritiesCode'])['Volume'].apply(lambda x: x.rolling(11).mean())
    pred_df.loc[:, 'volsignal'] = pred_df.groupby(['SecuritiesCode'])['r1dprev_abs'].apply(lambda x: x.rolling(231).sum())
    pred_df.loc[:, 'momsignal'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(25) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(131)
    pred_df.loc[:, 'ave_momsignal'] = pred_df.groupby(['Date'])['momsignal'].transform(np.mean)
    pred_df.loc[:, 'mrsignal'] = pred_df.groupby(['SecuritiesCode'])['Close'].shift(0) / pred_df.groupby(['SecuritiesCode'])['Close'].shift(2)
    pred_df.loc[:, 'ave_mrsignal'] = pred_df.groupby(['Date'])['mrsignal'].transform(np.mean)
    pred_df.loc[:, 'Close_rank'] = pred_df.groupby(['Date'])['Close'].rank()
    pred_df.loc[:, 'volume_rank'] = pred_df.groupby(['Date'])['Volume'].rank()
    pred_df.loc[:, 'adv_rank'] = pred_df.groupby(['Date'])['adv'].rank()

    pred_df = pred_df.loc[(pred_df['Date'] == date_val), features]
    pred_df.loc[:, 'y_pred'] = model.predict(pred_df[features])
    pred_df.sort_values('y_pred', ascending=False, inplace=True)
    pred_df.reset_index(inplace=True, drop=True)
    pred_df.loc[:, 'Rank'] = np.arange(len(pred_df))

    sample_prediction.drop(['Rank'], axis=1, inplace=True)
    sample_prediction = pd.merge(sample_prediction, pred_df[['SecuritiesCode', 'Rank']], how='left', on=['SecuritiesCode'])
    sample_prediction['Rank'].fillna(1000, inplace=True)
    env.predict(sample_prediction)
```
Reproduction
The 5th-place solution reproduces, scoring 0.339.
Interpretation
Model
LightGBM.
The author reports that after comparing it against TabNet and a DNN, LightGBM performed best.
Missing Values
Only the close price is treated: group by stock, sort by date, then forward fill followed by backward fill.
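A minimal illustration of the per-stock fill (the codes and prices here are made up):

```python
import numpy as np
import pandas as pd

# Two stocks, already sorted by (SecuritiesCode, Date), with gaps in Close.
df = pd.DataFrame({
    "SecuritiesCode": [1301, 1301, 1301, 1332, 1332],
    "Close": [np.nan, 100.0, np.nan, np.nan, 50.0],
})

# Forward fill then backward fill, strictly within each stock:
df["Close"] = df.groupby("SecuritiesCode")["Close"].ffill()
df["Close"] = df.groupby("SecuritiesCode")["Close"].bfill()
print(df["Close"].tolist())  # [100.0, 100.0, 100.0, 50.0, 50.0]
```

Because the fill is grouped, the 1301 prices never leak into 1332: the leading NaN of 1332 is filled backward from 50, not forward from 100.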
Feature Engineering
Derived Features
The author fed eight features to the model. Seven of them are derived (the eighth is SecuritiesCode itself, used as a categorical feature):
volsignal
ave_momsignal
momsignal
mrsignal
ave_mrsignal
adv_rank
Close_rank
```python
data_df.loc[:, 'r1dprev_clean'] = data_df.groupby(['SecuritiesCode'])['Target'].shift(2)
data_df['r1dprev_clean'].fillna(0, inplace=True)
data_df.loc[:, 'ave1dprev_clean'] = data_df.groupby(['Date'])['r1dprev_clean'].transform(np.mean)
data_df.loc[:, 'clean_prices'] = data_df.groupby(['SecuritiesCode'])['r1dprev_clean'].apply(lambda x: np.cumprod(x + 1) * 100)
data_df.loc[:, 'r1dprev_abs'] = np.abs(data_df['r1dprev_clean'])
data_df.loc[:, 'alt_Target'] = data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(-2) / data_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) - 1
data_df.loc[:, 'ave_alt_Target'] = data_df.groupby(['Date'])['alt_Target'].transform(np.mean)
data_df.loc[:, 'alpha_alt_Target'] = data_df['alt_Target'] - data_df['ave_alt_Target']
data_df.loc[:, 'target_rank'] = data_df.groupby(['Date'])['Target'].rank()

master_df_col = ['Close', 'Date', 'SecuritiesCode', 'clean_prices', 'Target', 'r1dprev_abs', 'Volume', 'alt_Target']
master_df = data_df[master_df_col].copy()

master_df.loc[:, 'volsignal'] = master_df.groupby(['SecuritiesCode'])['r1dprev_abs'].apply(lambda x: x.rolling(231).sum())
master_df.loc[:, 'adv'] = master_df.groupby(['SecuritiesCode'])['Volume'].apply(lambda x: x.rolling(11).mean())
master_df.loc[:, 'ave_volsignal'] = master_df.groupby(['Date'])['volsignal'].transform(np.mean)
master_df.loc[:, 'momsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(25) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(131)
master_df.loc[:, 'ave_momsignal'] = master_df.groupby(['Date'])['momsignal'].transform(np.mean)
master_df.loc[:, 'mrsignal'] = master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(0) / master_df.groupby(['SecuritiesCode'])['clean_prices'].shift(2)
master_df.loc[:, 'ave_mrsignal'] = master_df.groupby(['Date'])['mrsignal'].transform(np.mean)
master_df.loc[:, 'Close_rank'] = master_df.groupby(['Date'])['Close'].rank()
master_df.loc[:, 'volume_rank'] = master_df.groupby(['Date'])['Volume'].rank()
master_df.loc[:, 'adv_rank'] = master_df.groupby(['Date'])['adv'].rank()
master_df.dropna(inplace=True)
master_df.loc[:, 'Target_rank'] = master_df.groupby(['Date'])['Target'].rank()
master_df.loc[:, 'ave_ret'] = master_df.groupby(['Date'])['alt_Target'].transform(np.mean)
master_df.loc[:, 'alpha_alt_Target'] = master_df['alt_Target'] - master_df['ave_ret']
```
Why These Features
The author explains that, after studying many candidate features, he found that at very short prediction horizons behavioral factors far outweigh fundamentals as the driving force, whereas forecasting long-horizon outperformance would likely require making better use of fundamental valuation data.
This consideration reminded me of a line from Benjamin Graham.
Excess Return as the Label
The competition's original label actually includes the market-wide return component; that is, it is a total return.
The author argues:
- From a signal perspective, financial market prediction is generally regarded as an extremely fragile process. Market conditions change constantly and financial time series carry all sorts of idiosyncratic risk, so predicting total returns is very hard, with inconsistent and highly variable dependencies.
- By contrast, excess returns are more predictable.
So the author computed excess returns and used them as the label.
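Computing the excess-return label amounts to demeaning each day's cross-section, as the `alpha_alt_Target` lines in the code do. A toy version with made-up returns:

```python
import pandas as pd

# Hypothetical panel: three stocks over two days, with total returns.
panel = pd.DataFrame({
    "Date": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "SecuritiesCode": [1301, 1332, 1333] * 2,
    "alt_Target": [0.03, 0.01, -0.01, -0.02, 0.00, 0.02],
})

# Excess return = each stock's return minus that day's cross-sectional mean.
panel["ave"] = panel.groupby("Date")["alt_Target"].transform("mean")
panel["alpha_alt_Target"] = panel["alt_Target"] - panel["ave"]
print(panel["alpha_alt_Target"].tolist())  # ≈ [0.02, 0.0, -0.02, -0.02, 0.0, 0.02]
```

On d1 every stock rose, yet only 1301 has positive alpha; the market-wide move is stripped out, which is exactly what makes the label more stable to predict.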
Some private funds in the market operate along similar lines: predict which stocks will clearly outperform the index, hold them, and short the index.
Training Data
The author trained only on the extremes: the 250 highest-return and 250 lowest-return stocks each day.
Compared with the common practice, among other competitors and in industry, of simply training on some window of historical data, this idea is genuinely worth noting.
Commentary
In the Tokyo Stock Exchange's classification, the 5th-place solution was counted as an innovative one.
Yet the underlying model is still LightGBM, a standard machine learning model.
The real innovation is in the feature engineering: the features come neither from generic feature-derivation recipes nor from the usual price-volume technical indicators.
9th Place
Code
```python
import os

import pandas as pd

import jpx_tokyo_market_prediction
```
```python
TRAIN_DIR = "../input/jpx-tokyo-stock-exchange-prediction/train_files/"
PRICE_COLS = ["Open", "High", "Low", "Close"]
PK = ["Date", "SecuritiesCode"]
```
```python
def adjust_pnv(df_stock: pd.DataFrame) -> pd.DataFrame:
    df_stock.sort_values(by=["Date"], inplace=True)
    adj_factor_cum_prod = df_stock["AdjustmentFactor"][::-1].cumprod()

    for price in PRICE_COLS:
        df_stock[f"Adj{price}"] = df_stock[price] * adj_factor_cum_prod
        df_stock[f"Adj{price}"] = df_stock[f"Adj{price}"].ffill()
    df_stock["AdjVol"] = df_stock["Volume"] / adj_factor_cum_prod

    return df_stock
```
```python
df = pd.read_csv(os.path.join(TRAIN_DIR, 'stock_prices.csv'))
df = df.groupby("SecuritiesCode").apply(adjust_pnv)
df["IntradayReturn"] = (df["AdjClose"] - df["AdjOpen"]) / df["AdjOpen"]
df["IntradayReturn"].fillna(0, inplace=True)
df["Target"].fillna(0, inplace=True)
```
```python
rank = df[PK + ["IntradayReturn"]].set_index("SecuritiesCode")
rank = (rank.groupby("Date").apply(lambda x: x["IntradayReturn"].rank(method='first').astype(int) - 1))
rank.name = "Rank"
```
```python
pred = rank.reset_index()
target = df[PK + ['Target']]
pred = pred.merge(target, left_on=PK, right_on=PK)
```
```python
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, _, _, _, _, sample_prediction) in iter_test:
    prices["IntradayReturn"] = (prices["Close"] - prices["Open"]) / prices["Open"]
    prices["IntradayReturn"].fillna(0, inplace=True)
    df = prices[["SecuritiesCode", "IntradayReturn"]].set_index("SecuritiesCode")
    df["Rank"] = df["IntradayReturn"].rank(method='first').astype(int) - 1
    rank = df["Rank"]
    sample_prediction['Rank'] = sample_prediction["SecuritiesCode"].map(rank)
    env.predict(sample_prediction)
```
Reproduction
The 9th-place solution reproduces, scoring 0.281.
Interpretation
The 9th-place solution is an innovative one: it uses no machine learning model at all.
Overall steps:
- Handle missing values: group by stock, then forward fill.
- Derive a single feature:

```python
df["IntradayReturn"] = (df["AdjClose"] - df["AdjOpen"]) / df["AdjOpen"]
```

- Sort ascending by IntradayReturn.
However, the code and documentation the author submitted to the Tokyo Stock Exchange give no explanation for why ranking directly by IntradayReturn works.
10th Place
According to the documentation and code the 10th-place team submitted to the Tokyo Stock Exchange, their solution consists of three .ipynb files:
simulations.ipynb
simulation_aggregation.ipynb
submission-notebook.ipynb
The first two, simulations.ipynb and simulation_aggregation.ipynb, use Monte Carlo simulation to determine an important parameter used in submission-notebook.ipynb.
However, in my own testing I ran into two problems:
- The first two notebooks, simulations.ipynb and simulation_aggregation.ipynb, do not even run; part of the code is clearly missing.
- The third, submission-notebook.ipynb, does not reproduce: it scores only 0.023, nowhere near 0.280.
Beyond that, I honestly do not understand the reasoning behind the 10th-place approach.
Copyright notice: All articles on this blog are copyrighted by their authors. Without written permission, no organization or individual may reproduce, excerpt, or copy them in any form.