

JPX-2. The Top-10 Solutions [1/2]

  • The code the original authors submitted to the Tokyo Stock Exchange is in places quite messy. That is no surprise: during a competition, time is short and the workload heavy, so code quality rarely gets much attention. I am the same way.
  • In this article, I lightly refactor the original authors' code.

Background

The Tokyo Stock Exchange has open-sourced the top-10 solutions:
https://github.com/J-Quants/JPXTokyoStockExchangePrediction

The Tokyo Stock Exchange also published an official review of the top-10 solutions:
https://www.youtube.com/watch?v=Ax3ON-2FLBM

The Tokyo Stock Exchange grouped the solutions into two camps: the 1st, 2nd, 3rd, 6th, 7th, and 8th place solutions are conventional, while the 4th, 5th, and 10th place solutions are innovative.
The 9th place solution was submitted later; at the time of the review it had not yet been handed to the Tokyo Stock Exchange, so it was not categorized. Having read it, I would classify the 9th place solution as innovative as well.

Categories

This chapter covers the conventional solutions:

  • 1st place
  • 2nd place
  • 3rd place
  • 6th place
  • 7th place
  • 8th place

The next chapter, JPX-3. The Top-10 Solutions [2/2], covers the innovative solutions:

  • 4th place
  • 5th place
  • 9th place
  • 10th place

1st Place

Code

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy import stats
import jpx_tokyo_market_prediction
train_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv")
train_secondary_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/secondary_stock_prices.csv")
supplemental_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv")
supplemental_secondary_stock_prices = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/secondary_stock_prices.csv")

stock_prices = pd.concat([train_stock_prices,train_secondary_stock_prices,supplemental_stock_prices,supplemental_secondary_stock_prices])
def featuring_train(data):

    data['Date'] = pd.to_datetime(data['Date'])
    data['Target'] = data['Target'].fillna(0)
    data["SupervisionFlag"] = data["SupervisionFlag"].astype(int)

    # Handle missing values
    data['ExpectedDividend'] = data['ExpectedDividend'].fillna(0)
    cols = ['Open', 'High', 'Low', 'Close']
    data.loc[:, cols] = data.loc[:, cols].ffill()
    data.loc[:, cols] = data.loc[:, cols].bfill()

    # Derive features
    data['Daily_Range'] = data['Close'] - data['Open']
    data['Mean'] = (data['High'] + data['Low']) / 2
    data['Mean'] = data['Mean'].astype(int)

    # Standardize
    data['Open'] = stats.zscore(data['Open'])
    data['High'] = stats.zscore(data['High'])
    data['Low'] = stats.zscore(data['Low'])
    data['Close'] = stats.zscore(data['Close'])
    data['Volume'] = stats.zscore(data['Volume'])
    data['Daily_Range'] = stats.zscore(data['Daily_Range'])
    data['Mean'] = stats.zscore(data['Mean'])

    # Drop some columns
    data = data.drop(['RowId'], axis=1)

    return data

data = featuring_train(stock_prices)
data_train = data[data['Date']<'2022-04-01']

data_test = data[data['Date']>'2022-04-01']
data_test = data_test.reset_index(drop=True)

data_train = data_train.drop(['Date'], axis=1)
data_test = data_test.drop(['Date'], axis=1)

X_train = data_train.drop(['Target'], axis=1)
y_train = data_train['Target']

X_test = data_test.drop(['Target'], axis=1)
y_test = data_test['Target']
model = LinearRegression()
model.fit(X_train, y_train)
def featuring_test(data):

    data["SupervisionFlag"] = data["SupervisionFlag"].astype(int)

    # Handle missing values
    data['ExpectedDividend'] = data['ExpectedDividend'].fillna(0)
    cols = ['Open', 'High', 'Low', 'Close']
    data.loc[:, cols] = data.loc[:, cols].ffill()
    data.loc[:, cols] = data.loc[:, cols].bfill()

    # Derive features
    data['Daily_Range'] = data['Close'] - data['Open']
    data['Mean'] = (data['High'] + data['Low']) / 2
    data['Mean'] = data['Mean'].astype(int)

    # Standardize
    data['Open'] = stats.zscore(data['Open'])
    data['High'] = stats.zscore(data['High'])
    data['Low'] = stats.zscore(data['Low'])
    data['Close'] = stats.zscore(data['Close'])
    data['Volume'] = stats.zscore(data['Volume'])
    data['Daily_Range'] = stats.zscore(data['Daily_Range'])
    data['Mean'] = stats.zscore(data['Mean'])

    # Drop some columns
    data = data.drop(['RowId', 'Date'], axis=1)

    return data
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:

    x_test = featuring_test(prices)
    y_pred = model.predict(x_test)

    sample_prediction['Target'] = y_pred
    sample_prediction = sample_prediction.sort_values(by="Target", ascending=False)

    sample_prediction['Rank'] = np.arange(len(sample_prediction.index))
    sample_prediction = sample_prediction.sort_values(by="SecuritiesCode", ascending=True)
    sample_prediction.drop(["Target"], axis=1)
    submission = sample_prediction[["Date", "SecuritiesCode", "Rank"]]

    env.predict(submission)

Reproduction

The 1st place winner not only submitted code to the Tokyo Stock Exchange but also open-sourced it on Kaggle; the two copies are identical.

However, the provided code does not reproduce the 0.381 score. I ran it several times and always got 0.277; the winner may have held something back.

(My refactoring is not the cause: even with the unrefactored code, the score is still 0.277.)

Kaggle link: https://www.kaggle.com/code/shokisakai/jpx-regression

Analysis

Model

A linear regression model.

Missing-value handling

ExpectedDividend is filled with 0.

Open, High, Low, and Close are forward-filled and then backward-filled. Example:

cols = ['Open', 'High', 'Low', 'Close']
data.loc[:,cols] = data.loc[:,cols].ffill()
data.loc[:,cols] = data.loc[:,cols].bfill()

The solution does not handle missing values in Volume.

Feature derivation

The author derives two features:

  • Daily_Range: Close minus Open.
  • Mean: (High + Low) / 2, rounded down to an integer.
data['Daily_Range'] = data['Close'] - data['Open']
data['Mean'] = (data['High']+data['Low']) / 2
data['Mean'] = data['Mean'].astype(int)

Standardization

Finally, the author applies Z-score standardization to Open, High, Low, Close, Volume, Daily_Range, and Mean, mapping each feature to zero mean and unit variance:

$$x' = \frac{x - \mathrm{mean}}{\sigma}$$

where

  • $\mathrm{mean}$ is the average
  • $\sigma$ is the standard deviation
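As a quick sanity check, stats.zscore matches computing the formula by hand; a minimal sketch with toy numbers of my choosing (note that stats.zscore uses the population standard deviation, ddof=0):

import numpy as np
from scipy import stats

x = np.array([10.0, 12.0, 14.0, 18.0])
manual = (x - x.mean()) / x.std(ddof=0)      # (x - mean) / sigma
print(np.allclose(manual, stats.zscore(x)))  # True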

TSE Second Section

The author also feeds the TSE Second Section data (the secondary stock prices) into training.

Puzzles

Feature engineering

One step in the author's feature-engineering code is illogical: the missing-value handling.
The author should have grouped by stock and sorted by date before forward- and backward-filling Open, High, Low, and Close; as written, the fill can leak values from one stock into the next.
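A minimal sketch of the grouped fill (my suggestion, not the author's code; column names as in the competition data):

# Sort each stock's rows into date order, then fill within each stock only.
data = data.sort_values(['SecuritiesCode', 'Date'])
cols = ['Open', 'High', 'Low', 'Close']
data[cols] = data.groupby('SecuritiesCode')[cols].ffill()
data[cols] = data.groupby('SecuritiesCode')[cols].bfill()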

If I perturb the data order slightly, for example by adding this one line

data.sort_values(by='Close', inplace=True)

the score drops to -0.063.

Test-data processing

Another point is genuinely debatable: the feature processing of the test data.

When the training data is standardized, the statistics come from all stocks across all training dates.

But when the test data is standardized, the statistics come from all stocks on a single day, which is inconsistent with training.

Granted, with a large daily cross-section (about 2,000 stocks), the effect is arguably negligible.
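If one wanted the two to be consistent, a minimal sketch would be to fit the statistics on the training data once and reuse them at test time (my suggestion, not the author's code):

# Fit the standardization on the training data once...
norm_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Range', 'Mean']
train_mean = data_train[norm_cols].mean()
train_std = data_train[norm_cols].std(ddof=0)

# ...and apply the same statistics to each day's test data.
def standardize_test(df):
    df[norm_cols] = (df[norm_cols] - train_mean) / train_std
    return df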

2nd Place

Code

import math
import os

import jpx_tokyo_market_prediction
import lightgbm as lgb
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import mean_squared_error
# Set the random seeds
def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)


SEED = 42
seed_everything(SEED)
# Only the TSE First Section data is used
train = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv", parse_dates=["Date"])
# The row id, dividend, adjustment-factor, and supervision-flag columns are dropped outright
train = train.drop(columns=['RowId', 'ExpectedDividend', 'AdjustmentFactor', 'SupervisionFlag']).dropna().reset_index(drop=True)
# Feature derivation
def add_features(feats):
    # Change versus 20 trading days ago
    feats["return_1month"] = feats["Close"].pct_change(20)
    # Change versus 40 trading days ago
    feats["return_2month"] = feats["Close"].pct_change(40)
    # Change versus 60 trading days ago
    feats["return_3month"] = feats["Close"].pct_change(60)
    # Volatility over the past 20 trading days
    feats["volatility_1month"] = (
        np.log(feats["Close"]).diff().rolling(20).std()
    )
    # Volatility over the past 40 trading days
    feats["volatility_2month"] = (
        np.log(feats["Close"]).diff().rolling(40).std()
    )
    # Volatility over the past 60 trading days
    feats["volatility_3month"] = (
        np.log(feats["Close"]).diff().rolling(60).std()
    )
    # Gap to the 20-day moving average
    feats["MA_gap_1month"] = feats["Close"] / (
        feats["Close"].rolling(20).mean()
    )
    # Gap to the 40-day moving average
    feats["MA_gap_2month"] = feats["Close"] / (
        feats["Close"].rolling(40).mean()
    )
    # Gap to the 60-day moving average
    feats["MA_gap_3month"] = feats["Close"] / (
        feats["Close"].rolling(60).mean()
    )

    return feats
# Mean squared error (despite the name, no square root is taken)
def feval_rmse(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'rmse', mean_squared_error(y_true, y_pred), False


# Pearson correlation coefficient
def feval_pearsonr(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    return 'pearsonr', stats.pearsonr(y_true, y_pred)[0], True


# Daily spread return
def calc_spread_return_per_day(df, portfolio_size=200, toprank_weight_ratio=2):
    assert df['Rank'].min() == 0
    assert df['Rank'].max() == len(df['Rank']) - 1
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
    return purchase - short


# Sharpe ratio of the daily spread returns
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size=200, toprank_weight_ratio=2):
    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio


# Assign ranks
def add_rank(df):
    df["Rank"] = df.groupby("Date")["Target"].rank(ascending=False, method="first") - 1
    df["Rank"] = df["Rank"].astype("int")
    return df


# Fill NaNs and infs
def fill_nan_inf(df):
    df = df.fillna(0)
    df = df.replace([np.inf, -np.inf], 0)
    return df


# Compute the score
def check_score(df, preds, Securities_filter=[]):
    tmp_preds = df[['Date', 'SecuritiesCode']].copy()
    tmp_preds['Target'] = preds

    # Rank filter: compute the median for each date and assign it to the filtered securities.
    tmp_preds['target_mean'] = tmp_preds.groupby("Date")["Target"].transform('median')
    tmp_preds.loc[tmp_preds['SecuritiesCode'].isin(Securities_filter), 'Target'] = tmp_preds['target_mean']

    tmp_preds = add_rank(tmp_preds)
    df['Rank'] = tmp_preds['Rank']
    score = round(calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2), 5)
    score_mean = round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).mean(), 5)
    score_std = round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).std(), 5)
    print(f'Competition_Score:{score}, rank_score_mean:{score_mean}, rank_score_std:{score_std}')
train = add_features(train)
train = fill_nan_inf(train)
# Maximum Target per stock
SecuritiesCode_target_max = train.groupby('SecuritiesCode')['Target'].max()
# Minimum Target per stock
SecuritiesCode_target_min = train.groupby('SecuritiesCode')['Target'].min()

# The 1000 stocks with the smallest Target spread
list_spred_h = list((SecuritiesCode_target_max - SecuritiesCode_target_min).sort_values()[:1000].index)

# The remaining stocks, with the largest Target spread
list_spred_l = list((SecuritiesCode_target_max - SecuritiesCode_target_min).sort_values()[1000:].index)
features = ['High', 'Low', 'Open', 'Close', 'Volume', 'return_1month', 'return_2month', 'return_3month',
            'volatility_1month', 'volatility_2month', 'volatility_3month',
            'MA_gap_1month', 'MA_gap_2month', 'MA_gap_3month']

# The smallest-Target-spread stocks form the training set
tr_dataset = lgb.Dataset(train[train['SecuritiesCode'].isin(list_spred_h)][features],
                         train[train['SecuritiesCode'].isin(list_spred_h)]["Target"], feature_name=features)

# The largest-Target-spread stocks form the validation set
vl_dataset = lgb.Dataset(train[train['SecuritiesCode'].isin(list_spred_l)][features],
                         train[train['SecuritiesCode'].isin(list_spred_l)]["Target"], feature_name=features)

# LightGBM parameters
params_lgb = {'learning_rate': 0.005, 'metric': 'None', 'objective': 'regression', 'boosting': 'gbdt', 'verbosity': 0,
              'n_jobs': -1, 'force_col_wise': True}

# Train the model
model = lgb.train(params=params_lgb,
                  train_set=tr_dataset,
                  valid_sets=[tr_dataset, vl_dataset],
                  num_boost_round=3000,
                  feval=feval_pearsonr,
                  callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=True), lgb.log_evaluation(period=100)])
# Testing
test = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv", parse_dates=["Date"])
test = test.drop(columns=['RowId', 'ExpectedDividend', 'AdjustmentFactor', 'SupervisionFlag'])
test = add_features(test)
test = fill_nan_inf(test)
preds = model.predict(test[features])
print(math.sqrt(mean_squared_error(preds, test.Target)))

check_score(test, preds)
check_score(test, preds, list_spred_h)
check_score(test, preds, list_spred_l)
sample_submission = pd.read_csv("../input/jpx-tokyo-stock-exchange-prediction/example_test_files/sample_submission.csv")

env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = add_features(prices)
    prices['Target'] = model.predict(fill_nan_inf(prices)[features])
    prices['target_mean'] = prices.groupby("Date")["Target"].transform('median')
    prices.loc[prices['SecuritiesCode'].isin(list_spred_h), 'Target'] = prices['target_mean']
    prices = add_rank(prices)
    sample_prediction['Rank'] = prices['Rank']
    env.predict(sample_prediction)

Reproduction

The 2nd place solution reproduces: 0.356.

Its code also sets a random seed, which the author fixed at 42; after changing it to 0, I still got 0.356.

Analysis

Model

LightGBM.

The author gave two reasons for choosing LightGBM:

  1. He tried LGBMRanker and XGBoost, but both scored poorly on the validation set, so he settled on LightGBM.
  2. He had previously traded cryptocurrencies, and from that experience he expected LightGBM to beat the other models.

Missing and abnormal values

Missing values, and the two abnormal values np.inf and -np.inf, are all replaced with 0.

Feature derivation

The author derives the following features from the close price Close:

  • change versus 20 trading days ago
  • change versus 40 trading days ago
  • change versus 60 trading days ago
  • volatility over the past 20 trading days
  • volatility over the past 40 trading days
  • volatility over the past 60 trading days
  • gap to the 20-day moving average
  • gap to the 40-day moving average
  • gap to the 60-day moving average
# Change versus 20 trading days ago
feats["return_1month"] = feats["Close"].pct_change(20)
# Change versus 40 trading days ago
feats["return_2month"] = feats["Close"].pct_change(40)
# Change versus 60 trading days ago
feats["return_3month"] = feats["Close"].pct_change(60)
# Volatility over the past 20 trading days
feats["volatility_1month"] = (
    np.log(feats["Close"]).diff().rolling(20).std()
)
# Volatility over the past 40 trading days
feats["volatility_2month"] = (
    np.log(feats["Close"]).diff().rolling(40).std()
)
# Volatility over the past 60 trading days
feats["volatility_3month"] = (
    np.log(feats["Close"]).diff().rolling(60).std()
)
# Gap to the 20-day moving average
feats["MA_gap_1month"] = feats["Close"] / (
    feats["Close"].rolling(20).mean()
)
# Gap to the 40-day moving average
feats["MA_gap_2month"] = feats["Close"] / (
    feats["Close"].rolling(40).mean()
)
# Gap to the 60-day moving average
feats["MA_gap_3month"] = feats["Close"] / (
    feats["Close"].rolling(60).mean()
)

Train/validation split

The author splits train and validation neither chronologically nor by shuffling, but by Target.

He first computes, for each stock, the spread between its maximum and minimum Target over 2017-01-04 to 2021-12-03, and then takes:

  • training set: the 1000 stocks with the smallest Target spread;
  • validation set: the remaining roughly 1000 stocks, the ones with the largest Target spread.

Finally, the data from 2021-12-06 to 2022-06-24 is used for testing.

That 2021-12-06 to 2022-06-24 data is not folded back into the training set to retrain the model. The final model is trained only on the 2017-01-04 to 2021-12-03 data of the 1000 smallest-Target-spread stocks.

Submission

A clever trick

The submission stage has one clever trick; note these two lines:

prices['target_mean'] = prices.groupby("Date")["Target"].transform('median')
prices.loc[prices['SecuritiesCode'].isin(list_spred_h), 'Target'] = prices['target_mean']

The author overwrites the Target of the stocks in list_spred_h (the 1000 stocks with the smallest Target spread) with the day's median.

Possible reason

The code and documentation the author submitted to the Tokyo Stock Exchange do not explain this. But the 10th place solution makes the relevant observation: the competition is scored by the Sharpe ratio, and maximizing the Sharpe ratio means maximizing the mean of the daily returns while minimizing their volatility.
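In the notation of calc_spread_return_sharpe above, with $R_d$ the spread return on day $d$:

$$\text{Sharpe} = \frac{\operatorname{mean}(R_d)}{\operatorname{std}(R_d)}$$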

In market terms, every stock has its own character. Some stocks are intrinsically placid, with a Target that barely moves, and those are exactly the stocks that land in list_spred_h. Pinning their predictions to the day's median parks them in the middle of the ranking, so they are unlikely to be counted among the stocks to buy long or sell short, and the long and short buckets are left to stocks whose returns can actually move.

I suspect this is why the author does it.

Puzzles

This solution's feature engineering raises questions as well.

  1. The author should have computed the rolling features grouped by stock (SecuritiesCode), but did not.
  2. In the submission loop, prices holds only a single day's data, so computing 20/40/60-day rolling statistics on it is meaningless; a sketch of a fix follows the snippet below.
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices = add_features(prices)
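A minimal sketch of one way to give the rolling features real history at submission time: keep a running buffer of recent closes and recompute the features on the buffer plus the new day (the buffer name history is my own; this mirrors what the 6th place solution below actually does):

# Seed the buffer with the closes already seen during training.
history = train[["Date", "SecuritiesCode", "Close"]].copy()

for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    prices["Date"] = pd.to_datetime(prices["Date"])
    history = pd.concat([history, prices[["Date", "SecuritiesCode", "Close"]]])
    # Recompute the rolling features per stock, over the full history.
    feats = (history.sort_values(["SecuritiesCode", "Date"])
                    .groupby("SecuritiesCode", group_keys=False)
                    .apply(add_features))
    today = feats[feats["Date"] == feats["Date"].max()]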

Testing the guess

My guess was that overwriting the Target of list_spred_h with the median is what does the work.

To test it, I resubmitted without modifying the Target of list_spred_h; the score came out 0.231, below the roughly 0.36 achieved with the modification.

3rd Place

Code

import numpy as np
import pandas as pd
import jpx_tokyo_market_prediction
from sklearn.tree import DecisionTreeRegressor
from tqdm.notebook import tqdm
path = "../input/jpx-tokyo-stock-exchange-prediction/"
train_stock_prices = pd.read_csv(f"{path}train_files/stock_prices.csv")
train_stock_prices = train_stock_prices[~train_stock_prices["Target"].isnull()]
supplemental_stock_prices = pd.read_csv(f"{path}supplemental_files/stock_prices.csv")
df_prices = pd.concat([supplemental_stock_prices, train_stock_prices])
df_prices = df_prices[df_prices.Date >= "2021-10-01"]
def fill_nans(prices):
    prices.set_index(["SecuritiesCode", "Date"], inplace=True)
    prices.ExpectedDividend.fillna(0, inplace=True)
    prices.ffill(inplace=True)
    prices.fillna(0, inplace=True)
    prices.reset_index(inplace=True)
    return prices
def calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
    weights_mean = weights.mean()
    df = df.sort_values(by='Rank')
    purchase = (df['Target'][:portfolio_size] * weights).sum() / weights_mean
    short = (df['Target'][-portfolio_size:] * weights[::-1]).sum() / weights_mean
    return purchase - short

def calc_spread_return_sharpe(df, portfolio_size=200, toprank_weight_ratio=2):
    grp = df.groupby('Date')
    min_size = grp["Target"].count().min()
    if min_size < 2 * portfolio_size:
        portfolio_size = min_size // 2
        if portfolio_size < 1:
            return 0, None
    buf = grp.apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio, buf
def add_rank(df, col_name="pred"):
    df["Rank"] = df.groupby("Date")[col_name].rank(ascending=False, method="first") - 1
    df["Rank"] = df["Rank"].astype("int")
    return df
def predictor(feature_df):
    return model.predict(feature_df[feats])
df_prices = fill_nans(df_prices)
supplemental_stock_prices = fill_nans(supplemental_stock_prices)
np.random.seed(0)
feats = ['Open', 'High', 'Low', 'Close']
max_score = 0
max_depth = 0
for md in tqdm(range(3, 40)):
    model = DecisionTreeRegressor(max_depth=md)
    model.fit(df_prices[feats], df_prices["Target"])
    supplemental_stock_prices["pred"] = predictor(supplemental_stock_prices)
    score, buf = calc_spread_return_sharpe(add_rank(supplemental_stock_prices))
    if score > max_score:
        max_score = score
        max_depth = md
model = DecisionTreeRegressor(max_depth=max_depth)
model.fit(df_prices[feats], df_prices["Target"])
print(f'Max_depth={max_depth} : Sharpe Ratio Score base -> {max_score}')
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for prices, options, financials, trades, secondary_prices, sample_prediction in iter_test:
    prices = fill_nans(prices)
    prices.loc[:, "pred"] = predictor(prices)
    prices = add_rank(prices)
    rank = prices.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(rank)
    env.predict(sample_prediction)

Reproduction

The 3rd place solution reproduces: 0.352.

Its code also sets a random seed, which the author fixed at 0; after changing it to 1, I unexpectedly got an even higher score, 0.406.

Analysis

Model

A decision tree regression model.

Missing-value handling

Missing values are handled in three steps:

  1. ExpectedDividend: fill with 0.
  2. Forward-fill all other fields.
  3. Fill anything still missing with 0.
def fill_nans(prices):
    prices.set_index(["SecuritiesCode", "Date"], inplace=True)
    prices.ExpectedDividend.fillna(0, inplace=True)
    prices.ffill(inplace=True)
    prices.fillna(0, inplace=True)
    prices.reset_index(inplace=True)
    return prices

df_prices = fill_nans(df_prices)

Feature derivation

The solution uses only the four basic price features and derives nothing else.

Training data

The solution trains only on data from 2021-10-01 onward.

Training on just a recent slice of the data is a common practice; the apology letters of some private funds give a glimpse of it.

(Screenshot: an apology letter from Minghong Investment)

Model training

The decision tree has one key hyperparameter, the maximum depth, which the author tunes with a search:

np.random.seed(0)
feats = ['Open', 'High', 'Low', 'Close']
max_score = 0
max_depth = 0
for md in tqdm(range(3, 40)):
    model = DecisionTreeRegressor(max_depth=md)
    model.fit(df_prices[feats], df_prices["Target"])
    supplemental_stock_prices["pred"] = predictor(supplemental_stock_prices)
    score, buf = calc_spread_return_sharpe(add_rank(supplemental_stock_prices))
    if score > max_score:
        max_score = score
        max_depth = md

Puzzles

The author should have grouped by stock and sorted by date before forward-filling. fill_nans forward-fills the whole frame at once, so one stock's last prices can leak into the next stock's rows.
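A minimal sketch of the grouped version (my suggestion, not the author's code):

def fill_nans_grouped(prices):
    prices = prices.sort_values(["SecuritiesCode", "Date"])
    prices["ExpectedDividend"] = prices["ExpectedDividend"].fillna(0)
    # Forward-fill within each stock only, so values cannot leak across stocks.
    prices = prices.groupby("SecuritiesCode", group_keys=False).apply(lambda g: g.ffill())
    return prices.fillna(0)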

6th Place

Code

import numpy as np
import pandas as pd
from lightgbm import Booster, LGBMRegressor
from tqdm import tqdm
from decimal import ROUND_HALF_UP, Decimal

import jpx_tokyo_market_prediction
base_dir = "../input/jpx-tokyo-stock-exchange-prediction"

train_files_dir = f"{base_dir}/train_files"
supplemental_files_dir = f"{base_dir}/supplemental_files"

df_price_train = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
df_price_supplemental = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")

df_price = pd.concat([df_price_train, df_price_supplemental])
TRAIN_END = "2019-12-31"
TEST_START = "2020-01-06"
# Adjust the close price for splits
def generate_adjusted_close(df):
    df = df.sort_values("Date", ascending=False)
    df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
    df.loc[:, "AdjustedClose"] = (
        df["CumulativeAdjustmentFactor"] * df["Close"]
    ).map(lambda x: float(
        Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
    ))
    df = df.sort_values("Date")
    df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
    df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
    return df

# Price adjustment; calls generate_adjusted_close per stock
def adjust_price(price):
    price = price.copy()
    price.loc[:, "Date"] = pd.to_datetime(price.loc[:, "Date"], format="%Y-%m-%d")

    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)

    price.set_index("Date", inplace=True)
    return price
df_price = adjust_price(df_price)
codes = sorted(df_price["SecuritiesCode"].unique())
# Feature derivation
def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code, ["SecuritiesCode", close_col]].copy()
    # Difference from the previous day
    feats["close_diff1"] = feats[close_col].diff(1)

    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)

    return feats
buff = []
for code in tqdm(codes):
    feat = get_features(df_price, code)
    buff.append(feat)
feature = pd.concat(buff)
def get_label(price, code):
    df = price.loc[price["SecuritiesCode"] == code].copy()
    df.loc[:, "label"] = df["Target"]
    return df.loc[:, ["SecuritiesCode", "label"]]


def get_features_and_label(price, codes, features):
    trains_X, tests_X = [], []
    trains_y, tests_y = [], []

    for code in tqdm(codes):
        feats = features[features["SecuritiesCode"] == code].dropna()
        labels = get_label(price, code).dropna()

        if feats.shape[0] > 0 and labels.shape[0] > 0:
            labels = labels.loc[labels.index.isin(feats.index)]
            feats = feats.loc[feats.index.isin(labels.index)]

            assert (labels.loc[:, "SecuritiesCode"] == feats.loc[:, "SecuritiesCode"]).all()
            labels = labels.loc[:, "label"]

            _train_X = feats[: TRAIN_END]
            _test_X = feats[TEST_START:]

            _train_y = labels[: TRAIN_END]
            _test_y = labels[TEST_START:]

            assert len(_train_X) == len(_train_y)
            assert len(_test_X) == len(_test_y)

            trains_X.append(_train_X)
            tests_X.append(_test_X)

            trains_y.append(_train_y)
            tests_y.append(_test_y)

    train_X = pd.concat(trains_X)
    test_X = pd.concat(tests_X)

    train_y = pd.concat(trains_y)
    test_y = pd.concat(tests_y)

    return train_X, train_y, test_X, test_y
train_X, train_y, test_X, test_y = get_features_and_label(df_price, codes, feature)
lgbm_params = {
    'seed': 42,
    'n_jobs': -1,
}

feat_cols = [
    "close_diff1",
]

pred_model = LGBMRegressor(**lgbm_params)
pred_model.fit(train_X[feat_cols].values, train_y)
df_price_train_raw = pd.read_csv(f"{train_files_dir}/stock_prices.csv")
price_cols = ["Date", "SecuritiesCode", "Close", "AdjustmentFactor",]
df_price_train_raw = df_price_train_raw[price_cols]
df_price_train_raw = df_price_train_raw.loc[df_price_train_raw["Date"] >= "2021-08-01"]
df_price_supplemental_raw = pd.read_csv(f"{supplemental_files_dir}/stock_prices.csv")
df_price_supplemental_raw = df_price_supplemental_raw[price_cols]
df_price_raw = pd.concat([df_price_train_raw, df_price_supplemental_raw])
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
counter = 0
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:
    current_date = prices["Date"].iloc[0]
    sample_prediction_date = sample_prediction["Date"].iloc[0]
    print(f"current_date: {current_date}, sample_prediction_date: {sample_prediction_date}")

    if counter == 0:
        df_price_raw = df_price_raw.loc[df_price_raw["Date"] < current_date]

    threshold = (pd.Timestamp(current_date) - pd.offsets.BDay(80)).strftime("%Y-%m-%d")
    print(f"threshold: {threshold}")
    df_price_raw = df_price_raw.loc[df_price_raw["Date"] >= threshold]

    df_price_raw = pd.concat([df_price_raw, prices[price_cols]])
    df_price = adjust_price(df_price_raw)

    codes = sorted(prices["SecuritiesCode"].unique())

    feature = pd.concat([get_features(df_price, code) for code in codes])
    feature = feature.loc[feature.index == current_date]

    feature.loc[:, "predict"] = pred_model.predict(feature[feat_cols])

    feature = feature.sort_values("predict", ascending=False).drop_duplicates(subset=['SecuritiesCode'])
    feature.loc[:, "Rank"] = np.arange(len(feature))
    feature_map = feature.set_index('SecuritiesCode')['Rank'].to_dict()
    sample_prediction['Rank'] = sample_prediction['SecuritiesCode'].map(feature_map)

    assert sample_prediction["Rank"].notna().all()
    assert sample_prediction["Rank"].min() == 0
    assert sample_prediction["Rank"].max() == len(sample_prediction["Rank"]) - 1
    counter += 1

    env.predict(sample_prediction)

Reproduction

The 6th place solution reproduces: 0.308.

Analysis

Model

LightGBM.

Missing-value handling

Grouped by stock, sorted by date, then forward-filled.

# Adjust the close price for splits
def generate_adjusted_close(df):
    df = df.sort_values("Date", ascending=False)
    df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
    df.loc[:, "AdjustedClose"] = (
        df["CumulativeAdjustmentFactor"] * df["Close"]
    ).map(lambda x: float(
        Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
    ))
    df = df.sort_values("Date")
    df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
    df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
    return df

The 1st, 2nd, and 3rd place solutions did not adjust prices at all.
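A toy example of what the adjustment does (hypothetical numbers; the convention implied by the descending cumprod above is that a day's close is scaled by its own factor and all later ones, i.e. the split itself shows up in prices from the following day):

import pandas as pd

# A 1:2 split shows up in prices on 2021-01-06; the factor 0.5 is recorded on 2021-01-05.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
    "Close": [1000.0, 1010.0, 505.0],
    "AdjustmentFactor": [1.0, 0.5, 1.0],
})

# Same logic as generate_adjusted_close: walk backwards in time,
# accumulate the factors, and scale each close.
df = df.sort_values("Date", ascending=False)
df["AdjustedClose"] = df["AdjustmentFactor"].cumprod() * df["Close"]
print(df.sort_values("Date")[["Date", "Close", "AdjustedClose"]])
# AdjustedClose becomes 500.0, 505.0, 505.0: the pre-split closes are now comparable.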

Feature derivation

Only one feature is derived: the difference between the day's Close and the previous trading day's Close.

Example:

1
2
3
4
5
6
7
8
9
10
11
def get_features(price, code):
    close_col = "AdjustedClose"
    feats = price.loc[price["SecuritiesCode"] == code, ["SecuritiesCode", close_col]].copy()
    # Difference from the previous day
    feats["close_diff1"] = feats[close_col].diff(1)

    feats = feats.fillna(0)
    feats = feats.replace([np.inf, -np.inf], 0)
    feats = feats.drop([close_col], axis=1)

    return feats

Training data

Time range

Unlike the 3rd place solution, the 6th place solution trains only on data up to 2019-12-31.

Possible reason

The code and documentation the author submitted to the Tokyo Stock Exchange do not explain this.

My guess is that the author wanted to exclude the market turmoil of early 2020 caused by the COVID-19 pandemic, treating that period as anomalous data.

Or perhaps, on macro grounds, the 2017-01-04 to 2019-12-31 data was judged the better basis for predicting the 2022-07-06 to 2022-10-07 market.

No puzzles

I have no questions about this solution; the missing-value handling, the feature derivation, and the model training are all easy to follow.

7th Place

Code

import os
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
import torch
from typing import Tuple

import jpx_tokyo_market_prediction
def data_pipeline(dir_path: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    stock_prices_train = pd.read_csv(os.path.join(dir_path, "train_files/stock_prices.csv"))
    stock_prices_train = stock_prices_train.drop(["ExpectedDividend", "RowId"], axis=1)
    stock_prices_train = stock_prices_train.fillna(0)

    stock_prices_supplemental = pd.read_csv(os.path.join(dir_path, "supplemental_files/stock_prices.csv"))
    stock_prices_supplemental = stock_prices_supplemental.drop(["ExpectedDividend", "RowId"], axis=1)
    stock_prices_supplemental = stock_prices_supplemental.fillna(0)

    stock_list = pd.read_csv(os.path.join(dir_path, "stock_list.csv"))
    target_stock_list = stock_list[stock_list["Universe0"]]
    sec_info = target_stock_list[["SecuritiesCode", "33SectorName", "17SectorName"]]

    stock_prices_train = pd.merge(stock_prices_train, sec_info, on="SecuritiesCode")
    stock_prices_train["33SectorName"] = stock_prices_train["33SectorName"].astype("category")
    stock_prices_train["17SectorName"] = stock_prices_train["17SectorName"].astype("category")

    stock_prices_supplemental = pd.merge(stock_prices_supplemental, sec_info, on="SecuritiesCode")
    stock_prices_supplemental["33SectorName"] = stock_prices_supplemental["33SectorName"].astype("category")
    stock_prices_supplemental["17SectorName"] = stock_prices_supplemental["17SectorName"].astype("category")

    stock_prices_train.update(stock_prices_train.groupby("SecuritiesCode")["Target"].ffill().fillna(0))
    stock_prices_supplemental.update(stock_prices_supplemental.groupby("SecuritiesCode")["Target"].ffill().fillna(0))

    stock_prices_train["SupervisionFlag"] = stock_prices_train["SupervisionFlag"].map({True: 1, False: 0})
    stock_prices_supplemental["SupervisionFlag"] = stock_prices_supplemental["SupervisionFlag"].map({True: 1, False: 0})

    time_config = {"train_split_date": "2020-12-23"}
    stock_prices_train = stock_prices_train[stock_prices_train.Date >= time_config["train_split_date"]]

    return stock_prices_train, stock_prices_supplemental, sec_info
train, supplemental, sec_info = data_pipeline("../input/jpx-tokyo-stock-exchange-prediction")
train = pd.concat([train, supplemental])
class LGBMHierarchModel():
    def __init__(self, device=None, seed=69):
        self.seed = seed
        self._best_found_params = {
            "num_leaves": 17,
            "learning_rate": 0.014,
            "n_estimators": 700,
            "max_depth": -1,
        }
        self.models = {}

    def train(self, train: pd.DataFrame, use_params=False):
        for name, group in train.groupby("33SectorName"):
            y = group["Target"].to_numpy()
            X = group.drop(["Target"], axis=1)
            X = X.drop(["Date", "SecuritiesCode"], axis=1)
            model = LGBMRegressor(**self._best_found_params)
            model.fit(X, y, verbose=False)
            self.models[name] = model

    def predict(self, test: pd.DataFrame):
        y_preds = []
        for name, group in test.groupby("33SectorName"):
            sec_codes = group["SecuritiesCode"]
            X_test = group.drop(["Date", "SecuritiesCode"], axis=1)
            y_pred = self.models[name].predict(X_test)
            y_preds.extend(list(zip(sec_codes, y_pred)))
        df = pd.DataFrame(y_preds, columns=["codes", "pred"])
        return df.sort_values("codes", ascending=True)["pred"].values
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LGBMHierarchModel(device=device, seed=69)
model.train(train.copy(), use_params=True)
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (df_test, options, financials, trades, secondary_prices, df_pred) in iter_test:
    x_test = df_test.drop(["ExpectedDividend", "RowId"], axis=1)
    x_test = x_test.fillna(0)

    x_test = pd.merge(x_test, sec_info, on="SecuritiesCode")
    x_test["33SectorName"] = x_test["33SectorName"].astype("category")
    x_test["17SectorName"] = x_test["17SectorName"].astype("category")

    x_test["SupervisionFlag"] = x_test["SupervisionFlag"].map({True: 1, False: 0})

    y_pred = model.predict(x_test)
    df_pred['Target'] = y_pred
    df_pred = df_pred.sort_values(by="Target", ascending=False)
    df_pred['Rank'] = np.arange(len(df_pred.index))
    df_pred = df_pred.sort_values(by="SecuritiesCode", ascending=True)
    df_pred.drop(["Target"], axis=1)
    submission = df_pred[["Date", "SecuritiesCode", "Rank"]]
    env.predict(submission)

Reproduction

The 7th place solution reproduces: 0.301.

Analysis

Model

LightGBM.

More precisely, 33 LightGBM models.
The author noticed a sector effect on the Tokyo Stock Exchange: stocks in the same sector tend to behave alike.
So the author trains one LightGBM regressor per sector, using the 33-sector classification.

Missing-value handling

All missing values are filled with 0.

Feature derivation

None.

Training data

The author trains only on data from 2020-12-23 onward.

No puzzles

I have no questions about this solution; the missing-value handling, the feature derivation, and the model training are all easy to follow.

8th Place

Code

Features.py

from enum import Enum
import numpy as np


class FeatureType(Enum):
    GLOBAL = 0
    LOCAL = 1


class Feature:

    def __init__(self, feature_type, name):
        self.feature_type = feature_type
        self.name = name

    def add_feature_pandas(self, df):
        raise NotImplementedError

    def update_row(self, row):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError

    def copy(self):
        raise NotImplementedError


class SMA(Feature):

    def __init__(self, col, period):
        super().__init__(FeatureType.LOCAL, col + "SMA" + str(period))
        self.col = col
        self.period = period
        self.elements = [np.nan] * period
        self.ptr = 0
        self.mean = np.nan

    def add_feature_pandas(self, df):
        df[self.name] = df[self.col].rolling(self.period).mean()
        for k in range(self.period):
            df[self.name].iloc[k] = np.mean(df[self.col].iloc[:k+1])
        return df

    def update_row(self, row):

        dequeue = self.elements[self.ptr % self.period]
        enqueue = row[self.col]
        self.elements[self.ptr % self.period] = enqueue
        self.ptr += 1
        mean = 0
        if self.ptr < self.period:  # We have not yet seen enough elements
            mean = np.mean(self.elements[:self.ptr])
        elif self.ptr == self.period:  # We have the value for the first time
            self.mean = np.mean(self.elements)
            mean = self.mean
        else:
            mean = self.mean + (- dequeue + enqueue) / self.period  # Simple and efficient updates
            self.mean = mean

        row[self.name] = mean
        return row

    def reset(self):
        self.elements = [np.nan] * self.period
        self.ptr = 0
        self.mean = np.nan

    def copy(self):
        return SMA(self.col, self.period)


class Amplitude(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "Amplitude")

    def add_feature_pandas(self, df):
        df[self.name] = df["High"] - df["Low"]
        return df

    def update_row(self, row):
        row[self.name] = row["High"] - row["Low"]
        return row

    def reset(self):
        return

    def copy(self):
        return Amplitude()


class OpenCloseReturn(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "OpenCloseReturn")

    def add_feature_pandas(self, df):
        df[self.name] = (df["Close"] - df["Open"]) / df["Open"]
        return df

    def update_row(self, row):
        row[self.name] = (row["Close"] - row["Open"]) / row["Open"]
        return row

    def reset(self):
        return

    def copy(self):
        return OpenCloseReturn()


class Return(Feature):
    def __init__(self):
        super().__init__(FeatureType.LOCAL, "Return")
        self.last_close = None

    def add_feature_pandas(self, df):
        df[self.name] = ((df["Close"] - df["Close"].shift(1)) / df["Close"].shift(1)).fillna(0)
        return df

    def update_row(self, row):
        if self.last_close is None:
            row[self.name] = 0
        else:
            row[self.name] = (row["Close"] - self.last_close) / self.last_close
        self.last_close = row["Close"]
        return row

    def reset(self):
        self.last_close = None

    def copy(self):
        return Return()


class Volatility(Feature):

    def __init__(self, n=30):
        super().__init__(FeatureType.LOCAL, "Volatility" + str(n))
        self.n = n
        self.returns = np.ones(n) * np.nan
        self.ptr = 0
        self.index = 0

    def volatility_row_function(self, df, row):
        l = max(0, self.index + 1 - self.n)
        r = self.index + 1
        self.index += 1
        return np.std(df["Return"].to_numpy()[l:r], ddof=1)

    def add_feature_pandas(self, df):
        df[self.name] = df.apply(lambda row: self.volatility_row_function(df, row), axis=1)
        return df.fillna(0)

    def update_row(self, row):
        self.returns[self.ptr % self.n] = row["Return"]

        if self.ptr == 0:
            row[self.name] = 0
        elif self.ptr < self.n - 1:
            vec = self.returns[:(self.ptr + 1) % self.n]
            row[self.name] = np.std(vec, ddof=1)
        else:
            row[self.name] = np.std(self.returns, ddof=1)

        self.ptr += 1
        return row

    def reset(self):
        self.returns = np.ones(self.n) * np.nan
        self.ptr = 0
        self.index = 0

    def copy(self):
        return Volatility(self.n)


class FeatureChecker:
    def verify(feature, df_prices):
        df_pandas = feature.add_feature_pandas(df_prices)
        df_online = df_prices.apply(lambda row: feature.update_row(row), axis=1)

        return np.isclose(df_pandas[feature.name].to_numpy(),
                          df_online[feature.name].to_numpy(), equal_nan=True).all()

    def verify_features(features, df_prices):
        verifications = []
        all_verified = True
        for feature in features:
            this = FeatureChecker.verify(feature, df_prices)
            verifications.append((feature.name, this))
            all_verified = all_verified and this

        if all_verified:
            print("All features passed the check.")
        else:
            print("Some features failed the check.")

        print(verifications)

        return verifications

Preprocessing.py

import numpy as np
import pandas as pd


class StockDataPreprocessor:

    def fill_nans(df_stocks):
        # Dividend NaNs mean zero.
        df_stocks["ExpectedDividend"] = df_stocks["ExpectedDividend"].fillna(0)
        subdfs = []
        for stock_id, subdf in df_stocks.groupby("SecuritiesCode"):
            for i in range(len(subdf)):
                if not np.isnan(subdf.iloc[i]["Open"]):
                    break
            subdf = subdf.iloc[i:]
            subdf = subdf.fillna(method="ffill")
            subdfs.append(subdf)

            if i != 0:
                print(f"Stock id {stock_id} dropping {i} rows")

        new_df_stocks = pd.concat(subdfs).sort_index().reset_index(drop=True)

        print(f"Number of rows dropped is {len(df_stocks) - len(new_df_stocks)}")
        return new_df_stocks

    def add_cum_adj_factor(df_stocks):
        cum_adj_list = []
        for stock_id, subdf in df_stocks.groupby("SecuritiesCode"):
            cum_adj = subdf["AdjustmentFactor"].cumprod().shift(1, fill_value=1)
            cum_adj_list.append(cum_adj)

        df_stocks["CumAdjFactor"] = pd.concat(cum_adj_list).sort_index().to_numpy()
        return df_stocks

    def adjust_prices_and_volume(df_stocks):
        df_stocks["Open"] = df_stocks["Open"] / df_stocks["CumAdjFactor"]
        df_stocks["High"] = df_stocks["High"] / df_stocks["CumAdjFactor"]
        df_stocks["Low"] = df_stocks["Low"] / df_stocks["CumAdjFactor"]
        df_stocks["Close"] = df_stocks["Close"] / df_stocks["CumAdjFactor"]
        df_stocks["Volume"] = df_stocks["Volume"] * df_stocks["CumAdjFactor"]
        return df_stocks

    def preprocess_for_training(df_stocks):
        df_stocks = StockDataPreprocessor.fill_nans(df_stocks)
        df_stocks = StockDataPreprocessor.add_cum_adj_factor(df_stocks)
        df_stocks = StockDataPreprocessor.adjust_prices_and_volume(df_stocks)
        return df_stocks

Trackers.py

from enum import Enum

import numpy as np
import pandas as pd

from Preprocessing import StockDataPreprocessor
from Features import FeatureType


class StockStatusCheck(Enum):
    OK = 0
    ISOLATED_NAN = 1
    INIT_NAN = 2
    ERROR = 3


class StockTracker:

    def __init__(self, stock_id):
        self.stock_id = stock_id
        self.cum_adj_factor = 1.0
        self.last_open = None  # These are unadjusted
        self.last_high = None
        self.last_low = None
        self.last_close = None

    def check_price_data(self, row):

        assert row["SecuritiesCode"] == self.stock_id, "SecuritiesCode does not match the tracked stock"

        status_code = StockStatusCheck.OK

        if np.isnan(row["Open"]) and np.isnan(row["High"]) and np.isnan(row["Low"]) and np.isnan(row["Close"]):
            if self.last_open is None:
                status_code = StockStatusCheck.INIT_NAN
            else:
                row["Open"] = self.last_open
                row["Close"] = self.last_close
                row["Low"] = self.last_low
                row["High"] = self.last_high
                status_code = StockStatusCheck.ISOLATED_NAN
        else:
            date = row["Date"]
            if np.isnan(row["Open"]):
                print(f"Warning: OPEN on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["High"]):
                print(f"Warning: HIGH on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["Low"]):
                print(f"Warning: LOW on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR
            if np.isnan(row["Close"]):
                print(f"Warning: CLOSE on {date} for {self.stock_id} is NaN")
                status_code = StockStatusCheck.ERROR

        if status_code == StockStatusCheck.OK:
            self.last_open = row["Open"]
            self.last_high = row["High"]
            self.last_low = row["Low"]
            self.last_close = row["Close"]

        return row, status_code

    def adjust_prices(self, row):
        row[["Open", "High", "Low", "Close"]] /= self.cum_adj_factor
        row["Volume"] *= self.cum_adj_factor
        return row

    def update(self, row):
        row, status_code = self.check_price_data(row)
        is_okay = True
        if status_code == StockStatusCheck.INIT_NAN or status_code == StockStatusCheck.ERROR:
            is_okay = False
        if np.isnan(row["ExpectedDividend"]):
            row["ExpectedDividend"] = 0.0
        row = self.adjust_prices(row)
        self.cum_adj_factor *= row["AdjustmentFactor"]

        if is_okay:
            pass

        return row, status_code


class StateTracker:

    def __init__(self, features):
        self.stock_ids = [1301, 1332, 1333, 1376, 1377, 1379, 1381, 1407, 1414, 1417, 1419, 1429, 1435, 1515, 1518, 1605, 1662, 1663, 1712, 1716, 1719, 1720, 1721, 1723, 1726, 1762, 1766, 1775, 1780, 1787, 1793, 1799, 1801, 1802, 1803, 1805, 1808, 1810, 1811, 1812, 1813, 1814, 1815, 1820, 1821, 1822, 1833, 1835, 1852, 1860, 1861, 1870, 1871, 1873, 1878, 1879, 1882, 1884, 1885, 1888, 1890, 1893, 1898, 1899, 1911, 1914, 1921, 1925, 1926, 1928, 1929, 1930, 1934, 1938, 1939, 1941, 1942, 1944, 1945, 1946, 1949, 1950, 1951, 1952, 1954, 1959, 1961, 1963, 1965, 1967, 1968, 1969, 1973, 1975, 1976, 1979, 1980, 1981, 1982, 2001, 2002, 2003, 2004, 2009, 2053, 2060, 2108, 2109, 2114, 2117, 2120, 2121, 2124, 2127, 2130, 2146, 2148, 2150, 2153, 2154, 2157, 2158, 2160, 2168, 2170, 2175, 2181, 2183, 2185, 2193, 2198, 2201, 2204, 2206, 2207, 2208, 2209, 2211, 2212, 2217, 2220, 2221, 2222, 2226, 2229, 2264, 2266, 2267, 2268, 2269, 2270, 2281, 2282, 2288, 2292, 2294, 2296, 2301, 2305, 2307, 2309, 2315, 2317, 2325, 2326, 2327, 2329, 2331, 2335, 2337, 2349, 2353, 2359, 2371, 2372, 2374, 2378, 2379, 2384, 2389, 2393, 2395, 2412, 2413, 2418, 2427, 2429, 2432, 2433, 2440, 2445, 2453, 2461, 2462, 2469, 2471, 2475, 2477, 2484, 2489, 2491, 2492, 2497, 2498, 2501, 2502, 2503, 2531, 2533, 2540, 2573, 2579, 2587, 2588, 2590, 2593, 2594, 2602, 2607, 2612, 2613, 2651, 2653, 2659, 2664, 2669, 2670, 2676, 2678, 2681, 2685, 2686, 2692, 2694, 2695, 2698, 2702, 2705, 2715, 2726, 2729, 2730, 2733, 2734, 2737, 2742, 2749, 2751, 2752, 2753, 2760, 2761, 2763, 2767, 2768, 2780, 2782, 2784, 2790, 2791, 2792, 2801, 2802, 2804, 2805, 2806, 2809, 2810, 2811, 2814, 2815, 2819, 2830, 2831, 2871, 2874, 2875, 2882, 2884, 2897, 2899, 2904, 2908, 2910, 2914, 2915, 2918, 2922, 2923, 2925, 2929, 2930, 2931, 3001, 3002, 3003, 3028, 3031, 3034, 3036, 3038, 3040, 3046, 3048, 3050, 3064, 3075, 3076, 3085, 3086, 3087, 3088, 3091, 3092, 3097, 3099, 3101, 3103, 3104, 3105, 3106, 3107, 3110, 3116, 3132, 3134, 3139, 3141, 3148, 3150, 3151, 3153, 3154, 3156, 3157, 3159, 3166, 3167, 3176, 3178, 3179, 3180, 3182, 3183, 3186, 3191, 3193, 3196, 3197, 3198, 3199, 3201, 3221, 3222, 3228, 3231, 3232, 3244, 3245, 3252, 3254, 3264, 3276, 3284, 3288, 3289, 3291, 3302, 3315, 3319, 3328, 3333, 3341, 3349, 3355, 3360, 3361, 3371, 3377, 3382, 3387, 3388, 3391, 3395, 3397, 3401, 3402, 3405, 3407, 3415, 3421, 3431, 3433, 3436, 3443, 3445, 3457, 3458, 3465, 3475, 3539, 3543, 3546, 3547, 3548, 3549, 3553, 3569, 3580, 3591, 3593, 3597, 3608, 3626, 3628, 3632, 3635, 3636, 3648, 3649, 3656, 3657, 3659, 3660, 3662, 3663, 3665, 3668, 3673, 3675, 3676, 3677, 3678, 3679, 3681, 3687, 3688, 3694, 3696, 3697, 3708, 3733, 3738, 3762, 3763, 3765, 3769, 3771, 3772, 3774, 3778, 3784, 3788, 3798, 3800, 3817, 3825, 3834, 3835, 3836, 3837, 3843, 3844, 3853, 3854, 3856, 3857, 3861, 3863, 3865, 3880, 3891, 3900, 3901, 3902, 3903, 3906, 3914, 3915, 3916, 3919, 3921, 3922, 3923, 3925, 3926, 3932, 3934, 3937, 3939, 3941, 3946, 3950, 3951, 3962, 3966, 3969, 4004, 4005, 4008, 4021, 4023, 4025, 4026, 4027, 4028, 4041, 4042, 4043, 4044, 4045, 4046, 4047, 4061, 4062, 4063, 4078, 4080, 4082, 4088, 4091, 4092, 4094, 4095, 4097, 4099, 4100, 4107, 4109, 4112, 4113, 4114, 4116, 4118, 4151, 4182, 4183, 4185, 4186, 4187, 4188, 4189, 4202, 4203, 4204, 4205, 4206, 4208, 4212, 4215, 4216, 4218, 4220, 4221, 4228, 4229, 4235, 4238, 4246, 4272, 4275, 4286, 4290, 4293, 4298, 4301, 4307, 4308, 4310, 4312, 4318, 4323, 4324, 4326, 4327, 4337, 4343, 4344, 4345, 4348, 4350, 4362, 4365, 4368, 4369, 4401, 4403, 4410, 4452,
4461, 4462, 4463, 4464, 4471, 4502, 4503, 4506, 4507, 4516, 4519, 4521, 4523, 4526, 4527, 4528, 4530, 4534, 4536, 4538, 4540, 4541, 4543, 4544, 4547, 4548, 4549, 4550, 4551, 4552, 4553, 4554, 4559, 4563, 4565, 4568, 4569, 4571, 4572, 4574, 4577, 4578, 4579, 4581, 4582, 4584, 4587, 4592, 4593, 4595, 4611, 4612, 4613, 4617, 4619, 4620, 4621, 4626, 4628, 4631, 4633, 4634, 4636, 4641, 4658, 4659, 4661, 4662, 4665, 4666, 4668, 4671, 4674, 4676, 4680, 4681, 4684, 4686, 4687, 4689, 4694, 4699, 4704, 4708, 4709, 4714, 4716, 4718, 4719, 4722, 4725, 4726, 4732, 4733, 4739, 4743, 4745, 4746, 4751, 4755, 4763, 4765, 4767, 4768, 4771, 4772, 4776, 4781, 4792, 4800, 4801, 4809, 4812, 4813, 4816, 4819, 4820, 4825, 4826, 4828, 4832, 4837, 4839, 4847, 4848, 4849, 4901, 4902, 4911, 4912, 4914, 4917, 4919, 4921, 4922, 4923, 4927, 4928, 4951, 4955, 4956, 4958, 4963, 4966, 4967, 4968, 4970, 4971, 4973, 4974, 4975, 4978, 4980, 4985, 4992, 4994, 4996, 4997, 4998, 5008, 5011, 5013, 5015, 5017, 5019, 5020, 5021, 5101, 5105, 5108, 5110, 5121, 5122, 5142, 5161, 5184, 5185, 5186, 5191, 5192, 5195, 5201, 5202, 5208, 5214, 5217, 5218, 5232, 5233, 5261, 5262, 5269, 5273, 5288, 5301, 5302, 5304, 5310, 5331, 5332, 5333, 5334, 5344, 5351, 5352, 5357, 5384, 5388, 5393, 5401, 5406, 5408, 5410, 5411, 5423, 5440, 5444, 5449, 5451, 5463, 5464, 5471, 5480, 5481, 5482, 5486, 5541, 5563, 5602, 5631, 5632, 5659, 5698, 5702, 5703, 5706, 5707, 5711, 5713, 5714, 5715, 5726, 5727, 5741, 5801, 5802, 5803, 5805, 5807, 5809, 5821, 5851, 5857, 5901, 5902, 5909, 5911, 5918, 5929, 5930, 5932, 5933, 5938, 5943, 5945, 5946, 5947, 5949, 5951, 5957, 5959, 5970, 5975, 5976, 5982, 5985, 5988, 5989, 5991, 5992, 5999, 6005, 6013, 6023, 6027, 6028, 6030, 6035, 6036, 6047, 6050, 6055, 6058, 6062, 6067, 6070, 6071, 6073, 6078, 6080, 6082, 6088, 6089, 6094, 6095, 6098, 6099, 6101, 6103, 6104, 6113, 6118, 6125, 6134, 6135, 6136, 6140, 6141, 6143, 6144, 6145, 6146, 6149, 6151, 6157, 6178, 6182, 6183, 6184, 6191, 6194, 6196, 6197, 6199, 6200, 6201, 6222, 6237, 6238, 6240, 6245, 6246, 6247, 6250, 6254, 6257, 6258, 6264, 6266, 6268, 6269, 6272, 6273, 6277, 6278, 6279, 6282, 6284, 6287, 6289, 6293, 6301, 6302, 6305, 6306, 6309, 6310, 6315, 6323, 6324, 6326, 6328, 6330, 6331, 6332, 6333, 6339, 6340, 6345, 6349, 6351, 6357, 6361, 6363, 6364, 6365, 6366, 6367, 6368, 6369, 6370, 6371, 6376, 6378, 6379, 6381, 6382, 6383, 6387, 6395, 6406, 6407, 6409, 6411, 6412, 6413, 6417, 6418, 6419, 6420, 6425, 6430, 6432, 6436, 6440, 6444, 6448, 6454, 6455, 6457, 6458, 6459, 6460, 6462, 6463, 6464, 6465, 6470, 6471, 6472, 6473, 6474, 6479, 6480, 6481, 6482, 6484, 6485, 6486, 6490, 6498, 6501, 6502, 6503, 6504, 6506, 6507, 6508, 6516, 6517, 6532, 6533, 6535, 6538, 6539, 6584, 6586, 6588, 6590, 6592, 6594, 6616, 6617, 6619, 6620, 6622, 6625, 6626, 6627, 6629, 6630, 6632, 6637, 6638, 6640, 6641, 6644, 6645, 6651, 6652, 6668, 6670, 6674, 6676, 6701, 6702, 6703, 6706, 6707, 6718, 6723, 6724, 6727, 6728, 6736, 6737, 6740, 6741, 6742, 6744, 6745, 6750, 6752, 6753, 6754, 6755, 6758, 6762, 6768, 6769, 6770, 6777, 6779, 6787, 6788, 6789, 6794, 6798, 6800, 6804, 6806, 6807, 6809, 6810, 6814, 6815, 6817, 6820, 6823, 6824, 6832, 6834, 6841, 6844, 6845, 6848, 6849, 6850, 6855, 6856, 6857, 6859, 6861, 6866, 6869, 6871, 6875, 6877, 6879, 6881, 6882, 6890, 6902, 6904, 6905, 6908, 6912, 6914, 6915, 6918, 6920, 6923, 6924, 6925, 6929, 6932, 6937, 6938, 6941, 6947, 6951, 6952, 6954, 6955, 6957, 6958, 6960, 6961, 6962, 6963, 6965, 6966, 6967, 6971, 6976, 6981, 6986, 6988, 6994, 6995, 6996, 6997, 
6999, 7003, 7004, 7011, 7012, 7013, 7102, 7105, 7148, 7157, 7164, 7167, 7169, 7172, 7173, 7177, 7180, 7181, 7182, 7184, 7186, 7187, 7189, 7191, 7192, 7201, 7202, 7203, 7205, 7211, 7220, 7222, 7224, 7226, 7229, 7231, 7236, 7238, 7239, 7240, 7241, 7242, 7244, 7245, 7246, 7250, 7254, 7259, 7261, 7267, 7269, 7270, 7272, 7276, 7278, 7279, 7280, 7282, 7283, 7287, 7292, 7294, 7296, 7298, 7309, 7313, 7315, 7408, 7412, 7414, 7419, 7420, 7421, 7433, 7438, 7445, 7447, 7451, 7453, 7456, 7458, 7459, 7463, 7466, 7467, 7475, 7476, 7480, 7482, 7483, 7487, 7500, 7504, 7508, 7510, 7512, 7513, 7516, 7518, 7520, 7522, 7532, 7537, 7545, 7550, 7552, 7554, 7564, 7570, 7575, 7581, 7593, 7595, 7596, 7599, 7600, 7605, 7606, 7607, 7609, 7611, 7613, 7616, 7618, 7621, 7628, 7630, 7636, 7637, 7638, 7649, 7701, 7702, 7705, 7715, 7716, 7717, 7718, 7721, 7723, 7725, 7726, 7729, 7730, 7731, 7732, 7733, 7734, 7735, 7739, 7740, 7741, 7744, 7745, 7747, 7749, 7751, 7752, 7762, 7774, 7775, 7777, 7779, 7780, 7814, 7816, 7817, 7818, 7820, 7821, 7823, 7826, 7832, 7839, 7840, 7844, 7846, 7856, 7860, 7864, 7867, 7868, 7874, 7879, 7893, 7905, 7906, 7911, 7912, 7914, 7915, 7917, 7921, 7925, 7936, 7937, 7942, 7943, 7947, 7949, 7951, 7952, 7955, 7956, 7958, 7962, 7965, 7966, 7970, 7972, 7974, 7976, 7979, 7981, 7984, 7987, 7988, 7989, 7990, 7994, 7995, 8001, 8002, 8005, 8008, 8012, 8014, 8015, 8016, 8018, 8020, 8022, 8031, 8032, 8035, 8037, 8038, 8041, 8043, 8050, 8051, 8052, 8053, 8056, 8057, 8058, 8059, 8060, 8061, 8065, 8066, 8068, 8070, 8074, 8075, 8078, 8079, 8081, 8084, 8086, 8088, 8089, 8093, 8095, 8096, 8097, 8098, 8101, 8103, 8111, 8113, 8114, 8117, 8125, 8129, 8130, 8131, 8132, 8133, 8136, 8137, 8140, 8141, 8150, 8151, 8153, 8154, 8155, 8157, 8158, 8159, 8160, 8163, 8165, 8167, 8168, 8173, 8174, 8179, 8182, 8185, 8194, 8198, 8200, 8202, 8203, 8214, 8217, 8218, 8219, 8227, 8233, 8237, 8242, 8244, 8249, 8252, 8253, 8255, 8267, 8273, 8275, 8276, 8278, 8279, 8281, 8282, 8283, 8285, 8289, 8291, 8303, 8304, 8306, 8308, 8309, 8316, 8331, 8334, 8336, 8337, 8338, 8341, 8343, 8344, 8345, 8346, 8354, 8355, 8358, 8359, 8360, 8361, 8362, 8364, 8366, 8367, 8368, 8369, 8370, 8377, 8381, 8382, 8385, 8386, 8387, 8388, 8392, 8393, 8395, 8399, 8410, 8411, 8418, 8424, 8425, 8439, 8473, 8508, 8511, 8515, 8522, 8524, 8527, 8530, 8541, 8544, 8550, 8558, 8566, 8570, 8572, 8584, 8585, 8591, 8593, 8595, 8596, 8600, 8601, 8604, 8609, 8613, 8616, 8622, 8624, 8628, 8630, 8697, 8698, 8699, 8706, 8707, 8708, 8713, 8714, 8715, 8725, 8739, 8750, 8766, 8771, 8772, 8793, 8795, 8798, 8801, 8802, 8803, 8804, 8806, 8818, 8830, 8841, 8842, 8844, 8848, 8850, 8860, 8864, 8869, 8871, 8876, 8877, 8881, 8890, 8892, 8897, 8905, 8909, 8914, 8917, 8920, 8923, 8925, 8928, 8929, 8934, 8935, 8999, 9001, 9003, 9005, 9006, 9007, 9008, 9009, 9010, 9014, 9020, 9021, 9022, 9024, 9025, 9028, 9031, 9033, 9037, 9039, 9041, 9042, 9044, 9045, 9046, 9048, 9052, 9055, 9057, 9058, 9064, 9065, 9066, 9068, 9069, 9070, 9072, 9075, 9076, 9081, 9083, 9086, 9090, 9099, 9101, 9104, 9107, 9110, 9115, 9119, 9142, 9201, 9202, 9232, 9233, 9301, 9302, 9303, 9304, 9305, 9308, 9310, 9319, 9324, 9364, 9368, 9369, 9375, 9381, 9384, 9386, 9401, 9404, 9405, 9409, 9412, 9413, 9414, 9416, 9418, 9422, 9424, 9432, 9433, 9435, 9436, 9438, 9441, 9449, 9467, 9468, 9470, 9474, 9501, 9502, 9503, 9504, 9505, 9506, 9507, 9508, 9509, 9511, 9513, 9517, 9531, 9532, 9533, 9534, 9535, 9536, 9537, 9543, 9551, 9600, 9601, 9602, 9603, 9605, 9612, 9613, 9616, 9619, 9621, 9622, 9627, 9628, 9629, 9631, 9632, 9639, 9640, 9641, 
9658, 9661, 9663, 9672, 9678, 9682, 9684, 9687, 9692, 9697, 9699, 9706, 9708, 9715, 9716, 9717, 9719, 9722, 9726, 9728, 9729, 9733, 9735, 9739, 9740, 9742, 9743, 9744, 9746, 9749, 9755, 9757, 9759, 9766, 9769, 9783, 9787, 9788, 9790, 9793, 9795, 9810, 9823, 9824, 9828, 9830, 9831, 9832, 9837, 9842, 9843, 9850, 9856, 9861, 9869, 9873, 9880, 9882, 9887, 9889, 9896, 9900, 9902, 9903, 9906, 9919, 9928, 9932, 9934, 9936, 9945, 9946, 9948, 9955, 9956, 9960, 9962, 9974, 9977, 9979, 9982, 9983, 9984, 9987, 9989, 9990, 9991, 9993, 9994, 9997, 9539, 9519, 3558, 6544, 5757, 1413, 3561, 3978, 3983, 3479, 3964, 3563, 3984, 3480, 3990, 3993, 7809, 3482, 3994, 9260, 6556, 3484, 6653, 8919, 9143, 7198, 3540, 4249, 6235, 7199, 9267, 6565, 6566, 6569, 9270, 6571, 9450, 6572, 7322, 4382, 4384, 4385, 9273, 6580, 9274, 4390, 7030, 7806, 7033, 3491, 3496, 7036, 7326, 3612, 5290, 7327, 9278, 9279, 3498, 4423, 7931, 9434, 4425, 6232, 6564, 7047, 7048, 4431, 4433, 1887, 4434, 4435, 4436, 7060, 7061, 2975, 7065, 1431, 4443, 4931, 4446, 7803, 4599, 7679, 4449, 4448, 4475, 7071, 4477, 4880, 4251, 7683, 3449, 4479, 4480, 4481, 4483, 4478, 4482, 4485, 7685, 2980, 4488, 7082, 7085, 7088, 4490, 7089, 4493, 7094, 7095, 7351, 4499, 4051, 4495, 4053, 4054, 4883, 4056, 1375, 2932, 4058, 4933, 7337, 7339, 7354, 2987, 4934, 6612, 7092, 7944, 4165, 4167, 7358, 4168, 7342, 4169]
        self.stock_trackers = {}
        for s_id in self.stock_ids:
            self.stock_trackers[s_id] = StockTracker(s_id)
        self.global_features = []

        self.local_features = {}

        for feature in features:

            if feature.feature_type == FeatureType.GLOBAL:
                self.global_features.append(feature)
            elif feature.feature_type == FeatureType.LOCAL:
                for stock_id in self.stock_ids:
                    if stock_id not in self.local_features:
                        self.local_features[stock_id] = []
                    self.local_features[stock_id].append(feature.copy())

    def prepare_data_for_training(self, df):

        # Preprocessing
        df = StockDataPreprocessor.preprocess_for_training(df)

        train_dfs = []

        for stock_id, subdf in df.groupby("SecuritiesCode"):
            for feature in self.local_features[stock_id]:
                subdf = feature.add_feature_pandas(subdf)
            train_dfs.append(subdf)

        return pd.concat(train_dfs).sort_index()

    def update_single_row(self, row):
        stock_id = row["SecuritiesCode"]
        row, status_code = self.stock_trackers[stock_id].update(row)
        row["StatusCode"] = status_code
        for feature in self.local_features[stock_id]:
            row = feature.update_row(row)
        return row

    def online_update_apply(self, prices):
        return prices.apply(lambda row: self.update_single_row(row), axis=1)

    def online_update(self, prices):
        updated_prices = []
        for row_id, row in prices.iterrows():
            stock_id = row["SecuritiesCode"]
            row, status_code = self.stock_trackers[stock_id].update(row)
            row["StatusCode"] = status_code
            for feature in self.local_features[stock_id]:
                row = feature.update_row(row)
            updated_prices.append(row.to_frame().T)
        return pd.concat(updated_prices)

    def set_priors(self):
        pass

Validation.py

import pandas as pd


def KFoldDataPartition(df, K=5):
    df["Date"] = pd.to_datetime(df["Date"])
    dates = sorted(df["Date"].unique())  # sort just in case

    datasets = []
    indices = [int(i / K * len(dates)) for i in range(K + 1)]
    indices[-1] -= 1

    for train_start in range(len(indices[:-1])):
        for train_end in range(train_start + 1, len(indices) - 1):
            start_date = dates[indices[train_start]]
            end_date = dates[indices[train_end]]
            val_end_date = dates[indices[train_end + 1]]
            df_train = df[(df["Date"] >= start_date) & (df["Date"] <= end_date)]
            df_val = df[(df["Date"] > end_date) & (df["Date"] <= val_end_date)]
            datasets.append((df_train, df_val))

    return datasets

Model training

import os
import pickle
import shutil
import pandas as pd

import lightgbm as lgbm
if not os.path.exists(r"./Features.py"):
    shutil.copyfile(r"../input/codejpx/Features.py", r"./Features.py")
if not os.path.exists(r"./Preprocessing.py"):
    shutil.copyfile(r"../input/codejpx/Preprocessing.py", r"./Preprocessing.py")
if not os.path.exists(r"./Trackers.py"):
    shutil.copyfile(r"../input/codejpx/Trackers.py", r"./Trackers.py")
if not os.path.exists(r"./Validation.py"):
    shutil.copyfile(r"../input/codejpx/Validation.py", r"./Validation.py")

import Features
from Trackers import StateTracker
features = [Features.Amplitude(), Features.OpenCloseReturn(), Features.Return(),
            Features.Volatility(10), Features.Volatility(30), Features.Volatility(50),
            Features.SMA("Close", 3), Features.SMA("Close", 5), Features.SMA("Close", 10),
            Features.SMA("Close", 30),
            Features.SMA("Return", 3), Features.SMA("Return", 5),
            Features.SMA("Return", 10), Features.SMA("Return", 30)]
st = StateTracker(features)
df_train = pd.read_csv(r'../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
df_train = st.prepare_data_for_training(df_train)
training_cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']

for feature in features:
    training_cols.append(feature.name)

categorical_cols = ["SecuritiesCode", "SupervisionFlag"]
target_col = ["Target"]
model = lgbm.LGBMRegressor()
model.fit(df_train[training_cols], df_train[target_col], categorical_feature=categorical_cols, eval_metric='rmse')
with open("./lgbm.pickle", "wb") as file:
    pickle.dump(model, file)

Submission

import os
import pickle
import shutil
import numpy as np

import jpx_tokyo_market_prediction
if not os.path.exists(r"./Features.py"):
    shutil.copyfile(r"../input/codejpx/Features.py", r"./Features.py")
if not os.path.exists(r"./Preprocessing.py"):
    shutil.copyfile(r"../input/codejpx/Preprocessing.py", r"./Preprocessing.py")
if not os.path.exists(r"./Trackers.py"):
    shutil.copyfile(r"../input/codejpx/Trackers.py", r"./Trackers.py")
if not os.path.exists(r"./Validation.py"):
    shutil.copyfile(r"../input/codejpx/Validation.py", r"./Validation.py")

import Features
from Trackers import StateTracker
features = [Features.Amplitude(), Features.OpenCloseReturn(), Features.Return(),
            Features.Volatility(10), Features.Volatility(30), Features.Volatility(50),
            Features.SMA("Close", 3), Features.SMA("Close", 5), Features.SMA("Close", 10),
            Features.SMA("Close", 30),
            Features.SMA("Return", 3), Features.SMA("Return", 5),
            Features.SMA("Return", 10), Features.SMA("Return", 30)]
st = StateTracker(features)
training_cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']

for feature in features:
    training_cols.append(feature.name)

categorical_cols = ["SecuritiesCode", "SupervisionFlag"]
target_col = ["Target"]
model = None
with open(r"../input/lgbm-model/lgbm.pickle", "rb") as file:
    model = pickle.load(file)

class Algo:

    def __init__(self, model, state_tracker):
        self.model = model
        self.st = state_tracker
        self.cols = ['SecuritiesCode', 'Open', 'High', 'Low', 'Close', 'Volume', 'AdjustmentFactor', 'ExpectedDividend', 'SupervisionFlag']

        for feature in self.st.local_features[1301]:
            self.cols.append(feature.name)

    def add_rank(self, df):
        predictions = df["Prediction"]
        ranks = np.arange(2000)
        zipped = list(zip(predictions, ranks))
        zipped.sort(key=lambda x: -x[0])
        sorted_predictions, sorted_ranks = map(list, zip(*zipped))
        df["Rank"] = sorted_ranks
        return df

    def predict(self, prices, options, financials, trades, secondary_prices):
        prices = st.online_update_apply(prices)[self.cols]
        if not prices["SecuritiesCode"].is_monotonic_increasing:
            prices = prices.sort_values(by="SecuritiesCode")
        prices["Prediction"] = self.model.predict(prices)
        return self.add_rank(prices)


algo = Algo(model, st)
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()
for (prices, options, financials, trades, secondary_prices, sample_prediction) in iter_test:

    if not sample_prediction["SecuritiesCode"].is_monotonic_increasing:
        sample_prediction = sample_prediction.sort_values("SecuritiesCode")

    sample_prediction['Rank'] = algo.predict(prices, options, financials, trades, secondary_prices)['Rank']
    env.predict(sample_prediction)

Reproduction

The 8th place solution does not reproduce: the score is 0.001, nowhere near 0.289.

The 8th place winner may have held something back.

(My refactoring is not the cause: even with the unrefactored code, the score is still 0.001.)

Analysis

Model

LightGBM.

Missing-value handling

ExpectedDividend NaNs are filled with 0; for the price fields, each stock's leading all-NaN rows are dropped and the rest are forward-filled (see fill_nans in Preprocessing.py above).

Feature derivation

The author derives the following features:

  • Features.Amplitude(): the day's amplitude, High minus Low.
  • Features.OpenCloseReturn(): Close minus Open, divided by Open.
  • Features.Return(): the day's return (Close minus the previous trading day's Close, divided by the previous Close).
  • Features.Volatility(10): volatility of Return over the past 10 trading days.
  • Features.Volatility(30): volatility of Return over the past 30 trading days.
  • Features.Volatility(50): volatility of Return over the past 50 trading days.
  • Features.SMA("Close", 3): 3-day simple moving average of Close.
  • Features.SMA("Close", 5): 5-day simple moving average of Close.
  • Features.SMA("Close", 10): 10-day simple moving average of Close.
  • Features.SMA("Close", 30): 30-day simple moving average of Close.
  • Features.SMA("Return", 3): 3-day simple moving average of Return.
  • Features.SMA("Return", 5): 5-day simple moving average of Return.
  • Features.SMA("Return", 10): 10-day simple moving average of Return.
  • Features.SMA("Return", 30): 30-day simple moving average of Return.

Take the SMA (simple moving average) of Close as an example:

$$\text{SMA} = \frac{C_1 + C_2 + C_3 + \cdots + C_n}{n}$$
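As a quick check that the O(1) update in the SMA class matches this plain definition, a minimal sketch with toy numbers of my choosing:

import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0, 105.0])
rolling = close.rolling(3).mean()

# Incremental version: drop the oldest element, add the newest.
window, means = [], []
mean = np.nan
for c in close:
    window.append(c)
    if len(window) > 3:
        oldest = window.pop(0)
        mean = mean + (c - oldest) / 3  # the same O(1) update as SMA.update_row
    elif len(window) == 3:
        mean = np.mean(window)
    means.append(mean)

print(np.allclose(rolling[2:], means[2:]))  # True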

Puzzles

train_files/stock_prices.csv covers 2017-01-04 to 2021-12-03.
supplemental_files/stock_prices.csv covers 2021-12-06 to 2022-06-24.
When predicting the 2022-07-05 to 2022-10-07 period, the author should have added the supplemental_files/stock_prices.csv data as well; without it, the moving averages computed at the start of the prediction period are inaccurate. The author did not do this; a sketch of the fix follows.
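A minimal sketch of adding it in the training script (paths as elsewhere in the code; this is my suggestion, not the author's):

import pandas as pd

base = "../input/jpx-tokyo-stock-exchange-prediction"
df_train = pd.read_csv(f"{base}/train_files/stock_prices.csv")
df_supplemental = pd.read_csv(f"{base}/supplemental_files/stock_prices.csv")

# Concatenate before computing the rolling features, so the SMAs and volatilities
# at the start of the prediction period have real history behind them.
df_all = pd.concat([df_train, df_supplemental]).sort_values(["SecuritiesCode", "Date"])
df_all = st.prepare_data_for_training(df_all)  # st is the StateTracker defined above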

Author: Kaka Wan Yifan
Link: https://kakawanyifan.com/20102
Copyright: All articles on this blog belong to their author; without written permission, no organization or individual may reprint, excerpt, or reproduce them in any form.
