Competition link

JPX Tokyo Stock Exchange Prediction
https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction

Data

Data overview

+--data_specifications
|      +--stock_fin_spec.csv
|      +--options_spec.csv
|      +--trades_spec.csv
|      +--stock_list_spec.csv
|      +--stock_price_spec.csv
+--stock_list.csv
+--jpx_tokyo_market_prediction
|      +--competition.cpython-37m-x86_64-linux-gnu.so
|      +--__init__.py
+--example_test_files
|      +--options.csv
|      +--secondary_stock_prices.csv
|      +--stock_prices.csv
|      +--trades.csv
|      +--financials.csv
|      +--sample_submission.csv
+--supplemental_files
|      +--options.csv
|      +--secondary_stock_prices.csv
|      +--stock_prices.csv
|      +--trades.csv
|      +--financials.csv
+--train_files
|      +--options.csv
|      +--secondary_stock_prices.csv
|      +--stock_prices.csv
|      +--trades.csv
|      +--financials.csv

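data_specifications
Descriptions of the data (column meanings, data definitions, and so on).
stock_list.csv
The list of stocks.
jpx_tokyo_market_prediction
This is a Kaggle "Code Competition"; jpx_tokyo_market_prediction is the package that provides the competition runtime on Linux (the .so file is a compiled CPython 3.7 extension).
example_test_files
Sample data.
supplemental_files
Can be treated as validation data.
train_files
Training data.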

train_files

We focus on the files in train_files.

+--train_files
|      +--options.csv
|      +--secondary_stock_prices.csv
|      +--stock_prices.csv
|      +--trades.csv
|      +--financials.csv

financials.csv

financials.csv contains the financial reports of listed companies.

Inspect the contents of financials.csv. Example code:

import pandas as pd

financials = pd.read_csv('../jpxd/train_files/financials.csv')
financials

Output:

       DisclosureNumber       DateCode        Date  SecuritiesCode DisclosedDate DisclosedTime  DisclosedUnixTime                         TypeOfDocument CurrentPeriodEndDate TypeOfCurrentPeriod CurrentFiscalYearStartDate CurrentFiscalYearEndDate      NetSales OperatingProfit OrdinaryProfit      Profit EarningsPerShare     TotalAssets          Equity EquityToAssetRatio BookValuePerShare ResultDividendPerShare1stQuarter ResultDividendPerShare2ndQuarter ResultDividendPerShare3rdQuarter ResultDividendPerShareFiscalYearEnd ResultDividendPerShareAnnual ForecastDividendPerShare1stQuarter ForecastDividendPerShare2ndQuarter ForecastDividendPerShare3rdQuarter ForecastDividendPerShareFiscalYearEnd ForecastDividendPerShareAnnual ForecastNetSales ForecastOperatingProfit ForecastOrdinaryProfit ForecastProfit ForecastEarningsPerShare ApplyingOfSpecificAccountingOfTheQuarterlyFinancialStatements MaterialChangesInSubsidiaries ChangesBasedOnRevisionsOfAccountingStandard ChangesOtherThanOnesBasedOnRevisionsOfAccountingStandard ChangesInAccountingEstimates RetrospectiveRestatement NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock NumberOfTreasuryStockAtTheEndOfFiscalYear AverageNumberOfShares
0          2.016121e+13  20170104_2753  2017-01-04          2753.0    2017-01-04      07:30:00       1.483483e+09  3QFinancialStatements_Consolidated_JP           2016-12-31                  3Q                 2016-04-01               2017-03-31   22761000000      2147000000     2234000000  1494000000           218.23   22386000000.0   18295000000.0              0.817           2671.42                                －                             50.0                                －                                 NaN                          NaN                                NaN                                NaN                                NaN                                  50.0                          100.0      31800000000              3255000000             3300000000     2190000000                   319.76                                                NaN                                    False                                        True                                              False                              False                    False                                          6848800.0                                                                   －             6848800.0
1          2.017010e+13  20170104_3353  2017-01-04          3353.0    2017-01-04      15:00:00       1.483510e+09  3QFinancialStatements_Consolidated_JP           2016-11-30                  3Q                 2016-03-01               2017-02-28   22128000000       820000000      778000000   629000000           328.57   25100000000.0    7566000000.0              0.301               NaN                                －                             36.0                                －                                 NaN                          NaN                                NaN                                NaN                                NaN                                  36.0                           72.0      30200000000              1350000000             1300000000      930000000                   485.36                                                NaN                                    False                                        True                                              False                              False                    False                                          2035000.0                                                              118917             1916083.0
2          2.016123e+13  20170104_4575  2017-01-04          4575.0    2017-01-04      12:00:00       1.483499e+09                       ForecastRevision           2016-12-31                  2Q                 2016-07-01               2017-06-30           NaN             NaN            NaN         NaN              NaN             NaN             NaN                NaN               NaN                              NaN                              NaN                              NaN                                 NaN                          NaN                                NaN                                NaN                                NaN                                   NaN                            NaN        110000000              -465000000             -466000000     -467000000                   -93.11                                                NaN                                      NaN                                         NaN                                                NaN                                NaN                      NaN                                                NaN                                                                 NaN                   NaN
3          2.017010e+13  20170105_2659  2017-01-05          2659.0    2017-01-05      15:00:00       1.483596e+09  3QFinancialStatements_Consolidated_JP           2016-11-30                  3Q                 2016-03-01               2017-02-28  134781000000     11248000000    11558000000  7171000000           224.35  128464000000.0  100905000000.0              0.765           3073.12                                －                              0.0                                －                                 NaN                          NaN                                NaN                                NaN                                NaN                                  42.0                           42.0     177683000000             14168000000            14473000000     9111000000                   285.05                                                NaN                                    False                                        True                                              False                              False                    False                                         31981654.0                                                               18257            31963405.0
4          2.017011e+13  20170105_3050  2017-01-05          3050.0    2017-01-05      15:30:00       1.483598e+09                       ForecastRevision           2017-02-28                  FY                 2016-02-29               2017-02-28           NaN             NaN            NaN         NaN              NaN             NaN             NaN                NaN               NaN                              NaN                              NaN                              NaN                                 NaN                          NaN                                NaN                                  －                                  －                                  13.0                           24.0              NaN                     NaN                    NaN            NaN                      NaN                                                NaN                                      NaN                                         NaN                                                NaN                                NaN                      NaN                                                NaN                                                                 NaN                   NaN
...                 ...            ...         ...             ...           ...           ...                ...                                    ...                  ...                 ...                        ...                      ...           ...             ...            ...         ...              ...             ...             ...                ...               ...                              ...                              ...                              ...                                 ...                          ...                                ...                                ...                                ...                                   ...                            ...              ...                     ...                    ...            ...                      ...                                                ...                                      ...                                         ...                                                ...                                ...                      ...                                                ...                                                                 ...                   ...
92951      2.021112e+13  20211203_6040  2021-12-03          6040.0    2021-12-03      15:00:00       1.638511e+09  1QFinancialStatements_Consolidated_JP           2021-10-31                  1Q                 2021-08-01               2022-07-31     732000000      -274000000     -272000000  -206000000           -13.59    6952000000.0    4771000000.0              0.653             299.3                                －                              NaN                              NaN                                 NaN                          NaN                                NaN                                0.0                                  －                                   7.0                            7.0                －                       －                      －              －                        －                                                NaN                                    False                                        True                                              False                              False                    False                                         16000400.0                                                              836400            15164000.0
92952      2.021120e+13  20211203_6898  2021-12-03          6898.0    2021-12-03      16:00:00       1.638515e+09  3QFinancialStatements_Consolidated_JP           2021-10-31                  3Q                 2021-02-01               2022-01-31    1293000000       144000000      147000000   121000000           184.73    4246000000.0    3284000000.0              0.774               NaN                                －                              0.0                                －                                 NaN                          NaN                                NaN                                NaN                                NaN                                   0.0                            0.0       1479000000               106000000              107000000       93000000                   142.01                                                NaN                                    False                                       False                                              False                              False                    False                                           816979.0                                                              157541              659486.0
92953      2.021120e+13  20211203_6969  2021-12-03          6969.0    2021-12-03      15:00:00       1.638511e+09                       ForecastRevision           2022-03-31                  FY                 2021-04-01               2022-03-31           NaN             NaN            NaN         NaN              NaN             NaN             NaN                NaN               NaN                              NaN                              NaN                              NaN                                 NaN                          NaN                                NaN                                NaN                                NaN                                   NaN                            NaN       4500000000               450000000              420000000     -380000000                  -147.87                                                NaN                                      NaN                                         NaN                                                NaN                                NaN                      NaN                                                NaN                                                                 NaN                   NaN
92954      2.021112e+13  20211203_8057  2021-12-03          8057.0    2021-12-03      17:00:00       1.638518e+09  1QFinancialStatements_Consolidated_JP           2021-10-20                  1Q                 2021-07-21               2022-07-20   43071000000      2565000000     2860000000  1507000000           153.74  116016000000.0   51116000000.0              0.396               NaN                                －                              NaN                              NaN                                 NaN                          NaN                                NaN                                  －                                  －                                 110.0                          110.0     210000000000              5300000000             5900000000     3250000000                   330.92                                                NaN                                    False                                        True                                              False                              False                    False                                         10419371.0                                                              614032             9805339.0
92955      2.021120e+13  20211203_9627  2021-12-03          9627.0    2021-12-03      15:30:00       1.638513e+09  2QFinancialStatements_Consolidated_JP           2021-10-31                  2Q                 2021-05-01               2022-04-30  152972000000      5776000000     6127000000  3338000000            94.68  210442000000.0  115810000000.0               0.55               NaN                                －                              0.0                              NaN                                 NaN                          NaN                                NaN                                NaN                                  －                                  55.0                           55.0     315000000000             15000000000            15500000000     8300000000                   234.28                                                NaN                                    False                                        True                                              False                              False                    False                                         35428212.0                                                              200911            35260638.0

[92956 rows x 45 columns]
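
In the output above, several numeric-looking columns (for example the Forecast* and dividend columns) contain the full-width dash "－" as a placeholder, which would make pandas load them as strings. The following is a minimal sketch of one way to coerce such columns to floats. The helper name coerce_numeric and the small demo frame are hypothetical; the demo values are copied from the rows shown above.

```python
import pandas as pd


def coerce_numeric(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given columns converted to floats.

    Anything that cannot be parsed as a number (such as the placeholder
    "－") becomes NaN. Note that this also hides any other unexpected
    strings, so it is worth inspecting the raw values first.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Hypothetical demo frame; values taken from the output shown above.
demo = pd.DataFrame({
    "SecuritiesCode": [2753, 6040, 6969],
    "ForecastNetSales": ["31800000000", "－", "4500000000"],
    "ForecastEarningsPerShare": ["319.76", "－", "-147.87"],
})

cleaned = coerce_numeric(demo, ["ForecastNetSales", "ForecastEarningsPerShare"])
print(cleaned.dtypes)
```

The same call could be applied to the real financials frame by passing the list of columns that should be numeric.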

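Assuming there is no insider trading, a report that misses expectations may keep moving the price for some time after it is published. However, it is hard to judge whether a report met expectations, so we will not pay much attention to this data.

As an aside, one thing in the financial data is quite interesting.
Many of the fields are forecasts of future figures: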

ForecastDividendPerShare2ndQuarter
ForecastDividendPerShare3rdQuarter
ForecastDividendPerShareFiscalYearEnd
ForecastDividendPerShareAnnual
ForecastNetSales
ForecastOperatingProfit
ForecastOrdinaryProfit
ForecastProfit
ForecastEarningsPerShare

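I doubt fields like these would exist in China's A-share market.
Letting listed companies publish their own forecast dividends and earnings amounts to saying that bragging is not against the law.
I am also curious how companies listed on the Tokyo Stock Exchange handle these forecasts when they publish their reports.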

options.csv

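options.csv contains options data.

Options do serve one function: price discovery.
However, according to the Tokyo Stock Exchange document linked below, there are no single-stock options and no sector-index options; the underlying assets of the exchange's options are all macro-related indices or commodities.
In other words, these options provide no price discovery for individual stocks or sectors.
I therefore expect the options data to be of limited help.

Tokyo Stock Exchange reference: https://www.jpx.co.jp/english/sicc/regulations/b5b4pj0000023mqo-att/(HP)sakimono20220208-e.pdf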

secondary_stock_prices.csv

Stock data for the Second Section of the Tokyo Stock Exchange.

stock_prices.csv

stock_prices.csv contains stock price data.

Fields

Inspecting stock_prices.csv shows that it is daily bar data. Example code:

import pandas as pd

stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
stock_prices

Output:

                 RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
0        20170104_1301  2017-01-04            1301  2734.0  2755.0  2730.0  2742.0    31400               1.0               NaN            False  0.000730
1        20170104_1332  2017-01-04            1332   568.0   576.0   563.0   571.0  2798500               1.0               NaN            False  0.012324
2        20170104_1333  2017-01-04            1333  3150.0  3210.0  3140.0  3210.0   270800               1.0               NaN            False  0.006154
3        20170104_1376  2017-01-04            1376  1510.0  1550.0  1510.0  1550.0    11300               1.0               NaN            False  0.011053
4        20170104_1377  2017-01-04            1377  3270.0  3350.0  3270.0  3330.0   150800               1.0               NaN            False  0.003026
...                ...         ...             ...     ...     ...     ...     ...      ...               ...               ...              ...       ...
2332526  20211203_9990  2021-12-03            9990   514.0   528.0   513.0   528.0    44200               1.0               NaN            False  0.034816
2332527  20211203_9991  2021-12-03            9991   782.0   794.0   782.0   794.0    35900               1.0               NaN            False  0.025478
2332528  20211203_9993  2021-12-03            9993  1690.0  1690.0  1645.0  1645.0     7200               1.0               NaN            False -0.004302
2332529  20211203_9994  2021-12-03            9994  2388.0  2396.0  2380.0  2389.0     6500               1.0               NaN            False  0.009098
2332530  20211203_9997  2021-12-03            9997   690.0   711.0   686.0   696.0   381100               1.0               NaN            False  0.018414

[2332531 rows x 12 columns]

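RowId: the unique identifier of each row.
Date: the date.
SecuritiesCode: the securities code.
Open, High, Low, Close, Volume: the opening price, highest price, lowest price, closing price, and trading volume.
AdjustmentFactor: the adjustment factor (used to adjust prices for corporate actions).
Stock splits, reverse splits, rights issues, bonus shares, and dividend payouts can all affect the adjustment factor.
For example, if a share priced at 100 is split into 10 shares, each share is then worth about 10, and the trading volume becomes about 10 times what it was.
ExpectedDividend: the expected dividend value for the ex-rights date, recorded 2 business days before the ex-dividend date.
SupervisionFlag: whether the security is under supervision or about to be delisted (similar to the ST risk warning in China's A-share market).
Target: the target value, that is, the label.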
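
The split example in the AdjustmentFactor description can be written out as a small piece of arithmetic. This is only an illustration of that example: the function name apply_split is hypothetical, and the sketch does not claim to reproduce how the AdjustmentFactor column itself is encoded.

```python
from typing import Tuple


def apply_split(price: float, volume: float, ratio: float) -> Tuple[float, float]:
    """Theoretical (price, volume) after one share is split into `ratio` shares.

    The price is divided by the ratio and the volume is multiplied by it,
    so the traded value (price * volume) stays the same.
    """
    return price / ratio, volume * ratio


# One share at 100 split into 10 shares, with a hypothetical volume of 31400.
new_price, new_volume = apply_split(100.0, 31400, 10)
print(new_price, new_volume)
```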

Target is computed as follows:

$r_{(k,t)} = \frac{C_{(k,t+2)} - C_{(k,t+1)}}{C_{(k,t+1)}}$

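$C_{(k,t+2)}$ is the closing price of stock $k$ on day $t+2$.
$C_{(k,t+1)}$ is the closing price of stock $k$ on day $t+1$.
$r_{(k,t)}$ is the return of stock $k$ for day $t$, that is, the Target.

In other words, the Target for day $t$ is the price change on day $t+2$: the return from buying at the close on day $t+1$ and selling at the close on day $t+2$.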
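
The formula can be checked with a short pandas sketch that shifts each stock's Close by one and two rows. The function name compute_target and the demo frame are hypothetical. The sketch uses the raw Close column and ignores AdjustmentFactor, so it may not match the published Target around split dates if that Target is based on adjusted prices.

```python
import pandas as pd


def compute_target(prices: pd.DataFrame) -> pd.Series:
    """r(k, t) = (C(k, t+2) - C(k, t+1)) / C(k, t+1), computed per stock.

    Rows are ordered by date within each SecuritiesCode; the last two
    rows of every stock have no future closes and therefore get NaN.
    """
    ordered = prices.sort_values(["SecuritiesCode", "Date"])
    close = ordered.groupby("SecuritiesCode")["Close"]
    c_next = close.shift(-1)   # C(k, t+1)
    c_after = close.shift(-2)  # C(k, t+2)
    return ((c_after - c_next) / c_next).rename("Target")


# Hypothetical demo data: two stocks, three trading days each.
demo = pd.DataFrame({
    "Date": ["2017-01-04", "2017-01-05", "2017-01-06"] * 2,
    "SecuritiesCode": [1301, 1301, 1301, 1332, 1332, 1332],
    "Close": [100.0, 110.0, 121.0, 50.0, 40.0, 30.0],
})
demo["Target"] = compute_target(demo)
print(demo)
```

Because the result keeps the original index labels, assigning it back to the frame aligns each value with the row for day $t$.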

Time range

stock_prices.csv appears in two places:

train_files/stock_prices.csv
supplemental_files/stock_prices.csv

Check the date range of each file. Example code:

import pandas as pd

train_files_stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
print(f'train start: {train_files_stock_prices.Date.min()}, train end: {train_files_stock_prices.Date.max()}')

supplemental_files_stock_prices = pd.read_csv('../jpxd/supplemental_files/stock_prices.csv')
print(f'supplemental start: {supplemental_files_stock_prices.Date.min()}, supplemental end: {supplemental_files_stock_prices.Date.max()}')

Output:

train start: 2017-01-04, train end: 2021-12-03
supplemental start: 2021-12-06, supplemental end: 2022-06-24
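
Since supplemental_files starts after train_files ends, it can be used as a later hold-out period. Below is a minimal sketch of a check that two frames do not overlap in time before they are used that way. The helper names are hypothetical, and the demo frames contain only the boundary dates printed above.

```python
import pandas as pd


def date_range(df: pd.DataFrame):
    """Return the (earliest, latest) value of the Date column as timestamps."""
    dates = pd.to_datetime(df["Date"])
    return dates.min(), dates.max()


def ranges_are_disjoint(train: pd.DataFrame, holdout: pd.DataFrame) -> bool:
    """True when every holdout date is strictly later than every train date."""
    _, train_end = date_range(train)
    holdout_start, _ = date_range(holdout)
    return bool(train_end < holdout_start)


# Hypothetical stand-ins holding only the boundary dates of each file.
train_demo = pd.DataFrame({"Date": ["2017-01-04", "2021-12-03"]})
supplemental_demo = pd.DataFrame({"Date": ["2021-12-06", "2022-06-24"]})

print(ranges_are_disjoint(train_demo, supplemental_demo))
```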

trades.csv

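trades.csv contains weekly market trading summaries.

Each row describes how one market section traded during a given week.
In my view, this data is not helpful.

Inspect trades.csv. Example code:

import pandas as pd

trades = pd.read_csv('../jpxd/train_files/trades.csv')
trades

Output: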

            Date   StartDate     EndDate                           Section    TotalSales  TotalPurchases    TotalTotal  TotalBalance  ProprietarySales  ProprietaryPurchases  ProprietaryTotal  ProprietaryBalance  BrokerageSales  BrokeragePurchases  BrokerageTotal  BrokerageBalance  IndividualsSales  IndividualsPurchases  IndividualsTotal  IndividualsBalance  ForeignersSales  ForeignersPurchases  ForeignersTotal  ForeignersBalance  SecuritiesCosSales  SecuritiesCosPurchases  SecuritiesCosTotal  SecuritiesCosBalance  InvestmentTrustsSales  InvestmentTrustsPurchases  InvestmentTrustsTotal  InvestmentTrustsBalance  BusinessCosSales  BusinessCosPurchases  BusinessCosTotal  BusinessCosBalance  OtherInstitutionsSales  OtherInstitutionsPurchases  OtherInstitutionsTotal  OtherInstitutionsBalance  InsuranceCosSales  InsuranceCosPurchases  InsuranceCosTotal  InsuranceCosBalance  CityBKsRegionalBKsEtcSales  CityBKsRegionalBKsEtcPurchase  CityBKsRegionalBKsEtcTotal  CityBKsRegionalBKsEtcBalance  TrustBanksSales  TrustBanksPurchases  TrustBanksTotal  TrustBanksBalance  OtherFinancialInstitutionsSales  OtherFinancialInstitutionsPurchases  OtherFinancialInstitutionsTotal  OtherFinancialInstitutionsBalance
0     2017-01-04         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
1     2017-01-05         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
2     2017-01-06         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
3     2017-01-10         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
4     2017-01-11         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
...          ...         ...         ...                               ...           ...             ...           ...           ...               ...                   ...               ...                 ...             ...                 ...             ...               ...               ...                   ...               ...                 ...              ...                  ...              ...                ...                 ...                     ...                 ...                   ...                    ...                        ...                    ...                      ...               ...                   ...               ...                 ...                     ...                         ...                     ...                       ...                ...                    ...                ...                  ...                         ...                            ...                         ...                           ...              ...                  ...              ...                ...                              ...                                  ...                              ...                                ...
1707  2021-12-01         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN
1708  2021-12-02  2021-11-22  2021-11-26    Growth Market (Mothers/JASDAQ)  1.143466e+09    1.143923e+09  2.287389e+09      456677.0      3.663919e+07          3.496068e+07      7.159987e+07          -1678508.0    1.106827e+09        1.108962e+09    2.215789e+09         2135185.0      6.317277e+08          6.508934e+08      1.282621e+09          19165664.0     4.222267e+08         3.995653e+08     8.217920e+08        -22661428.0          19532301.0              20335113.0          39867414.0              802812.0              5712311.0                  6802056.0             12514367.0                1089745.0        13241913.0            16476738.0        29718651.0           3234825.0               6947933.0                   7211377.0              14159310.0                  263444.0           170792.0               433284.0           604076.0             262492.0                    335919.0                        60311.0                    396230.0                     -275608.0        6696755.0            6886122.0       13582877.0           189367.0                         234653.0                             298525.0                         533178.0                            63872.0
1709  2021-12-02  2021-11-22  2021-11-26      Prime Market (First Section)  1.138343e+10    1.137621e+10  2.275964e+10    -7214179.0      1.499660e+09          1.230944e+09      2.730604e+09        -268716111.0    9.883766e+09        1.014527e+10    2.002903e+10       261501932.0      2.042100e+09          2.433004e+09      4.475104e+09         390904459.0     7.137596e+09         6.912257e+09     1.404985e+10       -225339255.0          74894037.0              88791160.0         163685197.0            13897123.0            183078463.0                159026769.0            342105232.0              -24051694.0       123642633.0           211502023.0       335144656.0          87859390.0              18341982.0                  43479826.0              61821808.0                25137844.0         10839136.0              9695681.0         20534817.0           -1143455.0                  26734116.0                      9223824.0                  35957940.0                   -17510292.0      254580089.0          261919512.0      516499601.0          7339423.0                       11959898.0                           16368287.0                       28328185.0                          4408389.0
1710  2021-12-02  2021-11-22  2021-11-26  Standard Market (Second Section)  1.069969e+08    1.075036e+08  2.145004e+08      506702.0      2.811025e+06          3.273163e+06      6.084188e+06            462138.0    1.041858e+08        1.042304e+08    2.084162e+08           44564.0      6.587397e+07          6.573161e+07      1.316056e+08           -142356.0     2.898821e+07         2.868161e+07     5.766982e+07          -306605.0           2983832.0               3003763.0           5987595.0               19931.0               543907.0                   367291.0               911198.0                -176616.0         4948282.0             5634326.0        10582608.0            686044.0                258986.0                    560994.0                819980.0                  302008.0            47298.0                    0.0            47298.0             -47298.0                     42127.0                            0.0                     42127.0                      -42127.0         438928.0             243817.0         682745.0          -195111.0                          60291.0                               6985.0                          67276.0                           -53306.0
1711  2021-12-03         NaN         NaN                               NaN           NaN             NaN           NaN           NaN               NaN                   NaN               NaN                 NaN             NaN                 NaN             NaN               NaN               NaN                   NaN               NaN                 NaN              NaN                  NaN              NaN                NaN                 NaN                     NaN                 NaN                   NaN                    NaN                        NaN                    NaN                      NaN               NaN                   NaN               NaN                 NaN                     NaN                         NaN                     NaN                       NaN                NaN                    NaN                NaN                  NaN                         NaN                            NaN                         NaN                           NaN              NaN                  NaN              NaN                NaN                              NaN                                  NaN                              NaN                                NaN

[1712 rows x 56 columns]

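Aside

According to official Tokyo Stock Exchange material, the market was restructured on April 4, 2022, and only then did the three segments Growth Market, Prime Market, and Standard Market come into existence. In trades.csv, however, the weekly trading data for November 2021 is already reported under Growth Market, Prime Market, and Standard Market. The labels in the Section column, such as "Prime Market (First Section)", suggest that the old sections were mapped onto the new segment names.

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/equities/market-restructure/market-segments/index.html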

stock_list.csv

stock_list.csv contains basic information about each stock.

Load stock_list.csv and inspect it. Sample code:

stock_list = pd.read_csv('../jpxd/stock_list.csv')
stock_list

Output:

      SecuritiesCode  EffectiveDate                                               Name             Section/Products NewMarketSegment 33SectorCode                       33SectorName 17SectorCode                   17SectorName NewIndexSeriesSizeCode NewIndexSeriesSize   TradeDate    Close  IssuedShares  MarketCapitalization  Universe0
0               1301       20211230                                   KYOKUYO CO.,LTD.     First Section (Domestic)     Prime Market           50  Fishery, Agriculture and Forestry            1                         FOODS                       7      TOPIX Small 2  20211230.0   3080.0  1.092828e+07          3.365911e+10       True
1               1305       20211230                                    Daiwa ETF-TOPIX                   ETFs/ ETNs              NaN            -                                  -            -                              -                      -                  -  20211230.0   2097.0  3.634636e+09          7.621831e+12      False
2               1306       20211230              NEXT FUNDS TOPIX Exchange Traded Fund                   ETFs/ ETNs              NaN            -                                  -            -                              -                      -                  -  20211230.0   2073.5  7.917718e+09          1.641739e+13      False
3               1308       20211230             Nikko Exchange Traded Index Fund TOPIX                   ETFs/ ETNs              NaN            -                                  -            -                              -                      -                  -  20211230.0   2053.0  3.736943e+09          7.671945e+12      False
4               1309       20211230  NEXT FUNDS ChinaAMC SSE50 Index Exchange Trade...                   ETFs/ ETNs              NaN            -                                  -            -                              -                      -                  -  20211230.0  44280.0  7.263200e+04          3.216145e+09      False
...              ...            ...                                                ...                          ...              ...          ...                                ...          ...                            ...                    ...                ...         ...      ...           ...                   ...        ...
4412            9994       20211230                                 YAMAYA CORPORATION     First Section (Domestic)  Standard Market         6100                       Retail Trade           14                  RETAIL TRADE                       7      TOPIX Small 2  20211230.0   2447.0  1.084787e+07          2.654474e+10       True
4413            9995       20211230                                    GLOSEL Co.,Ltd.     First Section (Domestic)     Prime Market         6050                    Wholesale Trade           13  COMMERCIAL & WHOLESALE TRADE                       7      TOPIX Small 2  20211230.0    410.0  2.642680e+07          1.083499e+10      False
4414            9996       20211230                                     Satoh&Co.,Ltd.  JASDAQ(Standard / Domestic)  Standard Market         6050                    Wholesale Trade           13  COMMERCIAL & WHOLESALE TRADE                       -                  -  20211230.0   1488.0  9.152640e+06          1.361913e+10      False
4415            9997       20211230                                   BELLUNA CO.,LTD.     First Section (Domestic)     Prime Market         6100                       Retail Trade           14                  RETAIL TRADE                       6      TOPIX Small 1  20211230.0    709.0  9.724447e+07          6.894633e+10       True
4416           25935       20211230             ITO EN,LTD.(shares of preferred stock)     First Section (Domestic)              NaN         3050                              Foods            1                         FOODS                       -                  -  20211230.0   1934.0  3.424696e+07          6.623362e+10      False

[4417 rows x 16 columns]

33SectorCode、33SectorName
17SectorCode、17SectorName

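These are two industry classification schemes. According to Tokyo Stock Exchange material, the "33" classification is the finer one, and the "17" classification is built by grouping the "33" sectors.

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_13_sector.pdf

Because of this, the two classifications never partially overlap: it cannot happen that they share some stocks but differ on others.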
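One way to check this nesting is to count how many distinct 17-sector codes each 33-sector code maps to. The sketch below runs on a small hand-made table whose codes are copied from the stock_list.csv listing above; the name `toy_list` is an assumption for illustration, and on the real data one would use `stock_list` instead.

```python
import pandas as pd

# Illustrative rows only; codes copied from the stock_list.csv listing above.
toy_list = pd.DataFrame({
    "SecuritiesCode": [1301, 9994, 9995, 9996, 9997],
    "33SectorCode": ["50", "6100", "6050", "6050", "6100"],
    "17SectorCode": ["1", "14", "13", "13", "14"],
})

# For every 33-sector code, count the distinct 17-sector codes it maps to.
n_parents = toy_list.groupby("33SectorCode")["17SectorCode"].nunique()

# The classification is nested if every 33-sector code has exactly one parent.
is_nested = bool((n_parents == 1).all())
print(n_parents)
print("nested:", is_nested)
```

If any 33-sector code showed more than one parent, the two schemes would overlap only partially.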

NewIndexSeriesSizeCode、NewIndexSeriesSize

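These fields indicate which size-based index a stock is a constituent of.
Each index imposes its own requirements on its constituents, so these fields reflect certain characteristics of the stock.

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_12_size.pdf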

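TradeDate
The trade date used to calculate market capitalization.
Close
The closing price used to calculate market capitalization.
IssuedShares
The number of issued shares.
MarketCapitalization
Market capitalization.
Universe0
Flag marking the prediction target universe (the top 2000 stocks by market capitalization).
In other words, whether the stock has to be predicted in this competition.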

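Scoring

Portfolio

Although the dataset contains a Target column, the task is not to predict Target itself; in other words, this is not scored as a regression problem.

On day $t$ we rank the 2000 given stocks. On day $t+1$ we go long the top 200 and short the bottom 200, and on day $t+2$ we close the positions. Within the long leg and the short leg, positions are weighted according to rank.
The whole procedure therefore forms a portfolio, and the final metric is the Sharpe ratio of that portfolio.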
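The scoring code later in this article expects, for each day, a Rank column that runs from 0 (the first stock to go long) up to N-1. A minimal sketch of turning model scores into such ranks follows; the column name `pred` and the toy values are assumptions for illustration.

```python
import pandas as pd

preds = pd.DataFrame({
    "Date": ["2021-12-06"] * 4 + ["2021-12-07"] * 4,
    "SecuritiesCode": [1301, 1332, 1333, 1376] * 2,
    "pred": [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.01, 0.04],  # hypothetical model scores
})

# Highest score gets Rank 0; method="first" breaks ties so ranks stay unique.
preds["Rank"] = (
    preds.groupby("Date")["pred"]
    .rank(ascending=False, method="first")
    .astype(int) - 1
)
print(preds)
```

Breaking ties with `method="first"` matters because the evaluation asserts that the ranks of a day are exactly 0 to N-1.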

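Sharpe ratio

What is the Sharpe ratio

Overview

The Sharpe ratio measures the excess return earned per additional unit of risk taken.

Several versions of the Sharpe ratio

In 1966, William F. Sharpe proposed the first version of the Sharpe ratio: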

$\text{S}=\frac {\text{E}[R-R_{f}]}{\sqrt {\text{Var} [R]}}$

where $R$ is the asset return and $R_{f}$ is the risk-free return.

In 1994, William F. Sharpe revised the first version and proposed the second version of the Sharpe ratio:

$\text{S}=\frac {\text{E}[R-R_{f}]}{\sqrt {\text{Var} [R-R_{f}]}}$

In the second version, if $R_{f}$ is constant over the whole period, then $\text{Var} [R] = \text{Var} [R-R_{f}]$ , so the two versions give the same value.
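The equality of the two versions under a constant risk-free rate can be checked numerically. The sketch below uses simulated returns and an assumed risk-free rate; none of the numbers come from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(loc=0.001, scale=0.02, size=250)  # simulated daily returns
Rf = 0.0001                                      # constant risk-free rate (assumed value)

s_1966 = np.mean(R - Rf) / np.std(R)        # first version: Var[R] in the denominator
s_1994 = np.mean(R - Rf) / np.std(R - Rf)   # second version: Var[R - Rf] in the denominator
print(s_1966, s_1994, np.isclose(s_1966, s_1994))
```

Subtracting a constant shifts every observation by the same amount, so the spread around the mean is unchanged.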

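There are many more versions of the Sharpe ratio.

Some use, instead of the risk-free rate, a risky return (for example a stock index) as the benchmark return.
Some ignore the risk-free rate entirely.
(This is the case in the Tokyo Stock Exchange competition.)
Others use the sample standard deviation instead of the population standard deviation.

These are all practical adaptations made in industry to suit the situation at hand.

For the Sharpe ratio, the population standard deviation is the usual choice and the sample standard deviation is rare. Realized volatility is the opposite: the sample standard deviation is the usual choice and the population standard deviation is rare.
For realized volatility, see 《Optiver-1.金融基础与赛题解析》 (Optiver-1: Financial Basics and Competition Analysis).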
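In code, the difference between the two standard deviations is the `ddof` argument. The sketch below uses made-up daily returns to show the defaults: `np.std` computes the population version, while pandas' `Series.std` computes the sample version. This is worth keeping in mind because the scoring function later in this article calls the pandas method.

```python
import numpy as np
import pandas as pd

daily = [0.010, -0.004, 0.006, 0.002]  # made-up daily returns

pop_std = np.std(daily)              # ddof=0: population standard deviation
sample_std = np.std(daily, ddof=1)   # ddof=1: sample standard deviation
pandas_std = pd.Series(daily).std()  # pandas defaults to ddof=1

print(pop_std, sample_std, pandas_std)
```

With only a few observations the two differ noticeably; with several hundred trading days the gap becomes small.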

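Examples

Example 1

Suppose the asset's expected return exceeds the risk-free rate by $15\%$ , and the asset's risk (defined as the standard deviation of its excess return) is estimated at $10\%$ . With a constant risk-free return, the Sharpe ratio (using the first version) is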

$\begin{aligned} S = \frac{0.15}{0.10} = 1.5\end{aligned}$

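Example 2

Date	Asset return	S&P 500 total return	Excess return
Day 1	-0.0050000	-0.0048419	-0.0001581
Day 2	0.0010000	0.0017234	-0.0007234
Day 3	0.0050000	0.0046110	0.0003890

Sample code:

import numpy as np

# Excess returns
crsy = [-0.0001581, -0.0007234, 0.0003890]

# Asset returns
zcsy = [-0.005, 0.001, 0.005]

# Mean of the excess returns divided by the standard deviation of the excess returns
print(np.mean(crsy) / np.std(crsy))

# Mean of the excess returns divided by the standard deviation of the asset returns
print(np.mean(crsy) / np.std(zcsy))

Output:

-0.3614766513736107
-0.03994702495345027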

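Application in the competition

According to Tokyo Stock Exchange material, the Sharpe ratio here is calculated without a risk-free rate (benchmark return).

The calculation proceeds as follows:

Calculate each stock's return.
For each day, calculate the return from going long the top 200.
For each day, calculate the return from going short the bottom 200.
Calculate the total return for that day.
Calculate the Sharpe ratio (ignoring the risk-free rate).

Calculating the stock return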

$r_{(k, t)} = \frac{C_{(k, t+2)} - C_{(k, t+1)}}{C_{(k, t+1)}}$

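where $C_{(k, t+1)}$ is the closing price of stock $k$ on day $t+1$ and $C_{(k, t+2)}$ is the closing price of stock $k$ on day $t+2$ .
That is, the decision is made on day $t$ , the long or short position is opened at the close of day $t+1$ and closed at the close of day $t+2$ ; $r_{(k, t)}$ is the return from investing in that stock.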
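The formula can be reproduced with two shifts of the closing price series. The sketch below uses the adjusted closing prices of SecuritiesCode 1805 that appear in the price adjustment section later in this article; the resulting values agree with the Target column shown there.

```python
import pandas as pd

# Adjusted closing prices of SecuritiesCode 1805, copied from the adjusted table later in this article.
close = pd.Series(
    [1860.0, 1890.0, 1880.0, 1890.0, 1891.0, 1920.0],
    index=pd.to_datetime(["2018-09-20", "2018-09-21", "2018-09-25",
                          "2018-09-26", "2018-09-27", "2018-09-28"]),
)

# r_t = (C_{t+2} - C_{t+1}) / C_{t+1}
r = (close.shift(-2) - close.shift(-1)) / close.shift(-1)
print(r.round(6))
```

The last two rows are NaN because the closes of day $t+1$ and day $t+2$ are not yet known for them.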

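Going long the top 200

The 200 top-ranked stocks are bought. The weights follow a linear function from $2$ down to $1$ : the first weight is $2$ , the second about $1.995$ , the third about $1.990$ , and so on down to $1$ for the 200th. The weighted sum is finally divided by the mean of the linear function, $1.5$ .

$S_{\text{up},t} = \frac{\sum^{200}_{i=1}(r_{({\text{up}_i}, t)} * \text{linear function}(2, 1)_i)}{\text{Average}(\text{linear function}(2, 1))}$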
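The weights can be generated with `np.linspace`, which is also what the implementation later in this article uses. A short sketch to inspect them:

```python
import numpy as np

weights = np.linspace(start=2, stop=1, num=200)

print(weights[:3])     # first three weights
print(weights[-1])     # weight of the 200th stock
print(weights.mean())  # normalizing constant in the denominator
```

Since 200 points span the interval from 2 to 1, the step between neighbouring weights is 1/199 (roughly 0.00503), so 1.995 and 1.990 are rounded values.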

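Going short the bottom 200

The 200 bottom-ranked stocks are sold short. The weights follow the same linear function from $2$ down to $1$ : the last-ranked stock gets weight $2$ , the second to last about $1.995$ , the third to last about $1.990$ , and so on. The weighted sum is finally divided by the mean, $1.5$ .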

$S_{\text{down},t} = \frac{\sum^{200}_{i=1}(r_{({\text{down}_i}, t)} * \text{linear function}(2, 1)_i)}{\text{Average}(\text{linear function}(2, 1))}$

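Calculating the daily total return

$R_{t} = S_{\text{up},t} - S_{\text{down},t}$

Because the bottom 200 stocks are sold short, $S_{\text{down},t}$ is subtracted when computing the total return.

Calculating the Sharpe ratio

The score is the mean of the daily total return series divided by its standard deviation. No risk-free rate is considered, and the Nikkei index is not used as a benchmark return either.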

$\text{Score} = \frac{\text{Average}(R_{\text{series}})}{\text{STD}(R_{\text{series}})}$

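Implementation

import numpy as np
import pandas as pd


def calc_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, top_rank_weight_ratio: float = 2) -> float:
    """
    Calculate the Sharpe ratio
    :param df: ranking (needs the columns Date, Rank and Target)
    :param portfolio_size: maximum number of assets on each side of the portfolio
    :param top_rank_weight_ratio: weight of the first-ranked (or last-ranked) stock
    :return: Sharpe ratio
    """

    def _calc_return_per_day(df, portfolio_size_val, top_rank_weight_ratio_val):
        """
        Calculate the return of one day
        :param df: ranking for a single day
        :param portfolio_size_val: maximum number of assets on each side of the portfolio
        :param top_rank_weight_ratio_val: weight of the first-ranked (or last-ranked) stock
        :return: return of that day
        """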
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=top_rank_weight_ratio_val, stop=1, num=portfolio_size_val)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size_val] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size_val] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_return_per_day, portfolio_size, top_rank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio
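To see the metric in action without the competition data, the sketch below restates the per-day logic in a standalone helper and scores a toy "oracle" ranking on random targets. The helper name `daily_spread_return`, the universe of 10 stocks, and the portfolio size of 3 are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd


def daily_spread_return(day: pd.DataFrame, size: int, top_weight: float) -> float:
    # Same logic as _calc_return_per_day above, restated so the sketch runs on its own.
    weights = np.linspace(start=top_weight, stop=1, num=size)
    ordered = day.sort_values("Rank")["Target"].to_numpy()
    long_leg = (ordered[:size] * weights).sum() / weights.mean()
    short_leg = (ordered[::-1][:size] * weights).sum() / weights.mean()
    return long_leg - short_leg


rng = np.random.default_rng(42)
rows = []
for date in pd.date_range("2022-01-04", periods=5, freq="B"):
    target = rng.normal(0, 0.02, size=10)
    # Oracle ranking: the stock with the best future return gets Rank 0.
    rank = (-target).argsort().argsort()
    rows.append(pd.DataFrame({"Date": date, "Rank": rank, "Target": target}))
toy = pd.concat(rows, ignore_index=True)

daily = pd.Series({d: daily_spread_return(g, 3, 2.0) for d, g in toy.groupby("Date")})
score = daily.mean() / daily.std()
print(daily)
print("score:", score)
```

With a perfect ranking every daily spread is positive, so the score is driven by how stable the spread is rather than by its sign. On real predictions one would call `calc_return_sharpe` directly.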

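Missing values (trading halts)

Why there are missing values

In daily stock data, missing values usually mean that trading in the stock was halted (though a failure at the exchange itself cannot be ruled out).

Are there missing values

Finding the columns that contain missing values

The result is True if the column contains missing values and False otherwise. Sample code: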

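train_files_stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
supplemental_files_stock_prices = pd.read_csv('../jpxd/supplemental_files/stock_prices.csv')

stock_prices = pd.concat([train_files_stock_prices, supplemental_files_stock_prices], axis=0)
stock_prices.reset_index(drop=True, inplace=True)
stock_prices.isnull().any()

Output: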

RowId               False
Date                False
SecuritiesCode      False
Open                 True
High                 True
Low                  True
Close                True
Volume              False
AdjustmentFactor    False
ExpectedDividend     True
SupervisionFlag     False
Target               True
dtype: bool

Counting missing values per column

Count the missing values in each column. Sample code:

stock_prices.isnull().sum()

Output:

RowId                     0
Date                      0
SecuritiesCode            0
Open                   8426
High                   8426
Low                    8426
Close                  8426
Volume                    0
AdjustmentFactor          0
ExpectedDividend    2581536
SupervisionFlag           0
Target                  246
dtype: int64

Inspecting the missing values

Rows where Open is missing

Show the rows where Open is missing. Sample code:

stock_prices[stock_prices.Open.isnull()]

Output:

                 RowId        Date  SecuritiesCode  Open  High  Low  Close  Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
401      20170104_3540  2017-01-04            3540   NaN   NaN  NaN    NaN       0               1.0               NaN            False       NaN
1753     20170104_9539  2017-01-04            9539   NaN   NaN  NaN    NaN       0               1.0               NaN            False -0.004149
2266     20170105_3540  2017-01-05            3540   NaN   NaN  NaN    NaN       0               1.0               NaN            False       NaN
2511     20170105_4621  2017-01-05            4621   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.000000
4131     20170106_3540  2017-01-06            3540   NaN   NaN  NaN    NaN       0               1.0               NaN            False       NaN
...                ...         ...             ...   ...   ...  ...    ...     ...               ...               ...              ...       ...
2600516  20220624_1981  2022-06-24            1981   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.007634
2600687  20220624_2814  2022-06-24            2814   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.000000
2601130  20220624_4628  2022-06-24            4628   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.000000
2601397  20220624_6144  2022-06-24            6144   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.040578
2601516  20220624_6484  2022-06-24            6484   NaN   NaN  NaN    NaN       0               1.0               NaN            False  0.013667

[8426 rows x 12 columns]

A look at 2020-10-01

Count the missing values on 2020-10-01. Sample code:

stock_prices[stock_prices.Date == '2020-10-01'].isnull().sum()

Output:

RowId                  0
Date                   0
SecuritiesCode         0
Open                1988
High                1988
Low                 1988
Close               1988
Volume                 0
AdjustmentFactor       0
ExpectedDividend    1988
SupervisionFlag        0
Target                 0
dtype: int64

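We see that a large number of stocks have missing values on this day. The reason is that the Tokyo Stock Exchange suffered a system failure on that day.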
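A day like this can be found without knowing the date in advance by counting missing prices per day and sorting. The sketch below uses a tiny made-up table (`toy_prices` and its values are illustrative); on the real data the same expression would be applied to `stock_prices`.

```python
import numpy as np
import pandas as pd

# Made-up prices for three stocks over three days; NaN marks a missing quote.
toy_prices = pd.DataFrame({
    "Date": ["2020-09-30"] * 3 + ["2020-10-01"] * 3 + ["2020-10-02"] * 3,
    "SecuritiesCode": [1301, 1332, 2266] * 3,
    "Open": [2734.0, np.nan, 1912.0, np.nan, np.nan, np.nan, 2740.0, 570.0, 1910.0],
})

# Number of stocks without an opening price on each day, largest first.
missing_per_day = (
    toy_prices["Open"].isnull()
    .groupby(toy_prices["Date"]).sum()
    .sort_values(ascending=False)
)
print(missing_per_day.head())
```

A day on which nearly the whole universe is missing points to a market-wide event rather than to individual trading halts.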

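Handling missing values

Possible approaches

In 《机器学习实战方法(Python)：特征工程-1.特征预处理》 (Practical Machine Learning Methods (Python): Feature Engineering-1. Feature Preprocessing), we discussed how to fill missing values. Technically, the mean or median can be used for continuous variables and the mode for discrete ones.

In this article, because the missing values are usually caused by trading halts, I consider the mean and the median both inappropriate.
The suggested approaches are:

Leave the values untouched
Drop the data of the halted days
Set the volume to 0 and set open, close, high, and low to the closing price of the most recent trading day.

First approach

With the first approach, leaving the values untouched, arithmetic on missing values needs special care. Sample code:

1 + 2 + 3 + None

Output:

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'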
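In a pandas DataFrame the missing prices are stored as NaN rather than None, and NaN behaves differently: it raises no error and instead propagates or is skipped silently. A short sketch of the cases that tend to cause surprises:

```python
import numpy as np
import pandas as pd

plain_sum = 1 + 2 + 3 + np.nan   # nan: plain arithmetic propagates NaN without an error
print(plain_sum)

s = pd.Series([1.0, 2.0, 3.0, np.nan])
print(s.sum())                   # pandas reductions skip NaN by default
print(s.sum(skipna=False))       # NaN is kept when skipna=False

rolled = s.rolling(2).mean()     # any window touching the NaN becomes NaN
print(rolled.tolist())
```

So leaving the gaps in place is safe only if every downstream step handles NaN the way one expects.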

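Second approach

Implementation

dropna removes the rows that contain missing values.

For dropna, see 《经典机器学习及其Python实现：2.特征预处理》 (Classic Machine Learning and Its Python Implementation: 2. Feature Preprocessing).

Sample code:

stock_prices.dropna(subset=['Open', 'Target'], inplace=True)
stock_prices.reset_index(drop=True, inplace=True)
print(stock_prices)
print(stock_prices.isnull().sum())

Output: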

                 RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
0        20170104_1301  2017-01-04            1301  2734.0  2755.0  2730.0  2742.0    31400               1.0               NaN            False  0.000730
1        20170104_1332  2017-01-04            1332   568.0   576.0   563.0   571.0  2798500               1.0               NaN            False  0.012324
2        20170104_1333  2017-01-04            1333  3150.0  3210.0  3140.0  3210.0   270800               1.0               NaN            False  0.006154
3        20170104_1376  2017-01-04            1376  1510.0  1550.0  1510.0  1550.0    11300               1.0               NaN            False  0.011053
4        20170104_1377  2017-01-04            1377  3270.0  3350.0  3270.0  3330.0   150800               1.0               NaN            False  0.003026
...                ...         ...             ...     ...     ...     ...     ...      ...               ...               ...              ...       ...
2593973  20220624_9990  2022-06-24            9990   576.0   576.0   563.0   564.0    24200               1.0               NaN            False  0.027073
2593974  20220624_9991  2022-06-24            9991   810.0   815.0   804.0   815.0     8700               1.0               NaN            False  0.001220
2593975  20220624_9993  2022-06-24            9993  1548.0  1548.0  1497.0  1497.0    12600               1.0               NaN            False  0.001329
2593976  20220624_9994  2022-06-24            9994  2507.0  2527.0  2498.0  2527.0     7300               1.0               NaN            False  0.003185
2593977  20220624_9997  2022-06-24            9997   710.0   725.0   710.0   719.0   139600               1.0               NaN            False  0.015089

[2593978 rows x 12 columns]
RowId                     0
Date                      0
SecuritiesCode            0
Open                      0
High                      0
Low                       0
Close                     0
Volume                    0
AdjustmentFactor          0
ExpectedDividend    2573131
SupervisionFlag           0
Target                    0
dtype: int64

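Why it is not recommended

This approach is nevertheless not recommended.

Once the halted days are deleted, sliding-window processing becomes awkward, because it is hard to tell whether a row is absent because we deleted a halted day or because the day was not a trading day and never had data.
For example, suppose we want statistics over the 5 trading days from day 1 to day 5. If trading was halted on day 3, that row has been deleted, and we simply take "5 rows in order", we may end up with statistics over day 1 to day 6.
In short, not recommended.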
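The effect is easy to reproduce. The sketch below uses the closing prices of SecuritiesCode 2266 around 2020-10-01 (taken from the tables in this article) and a 3-day window instead of 5 to keep it short; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2020-09-28", periods=6, freq="B")  # six consecutive business days
close = pd.Series([1871.0, 1916.0, 1884.0, np.nan, 1818.0, 1859.0], index=days)

# Second approach: drop the halted day, then take "the last 3 rows".
dropped = close.dropna()
naive = dropped.rolling(3).mean()

# Keeping the row: the window always covers 3 trading days,
# and windows that touch the halted day come out as NaN.
aware = close.rolling(3, min_periods=3).mean()

print(naive)
print(aware)
```

On 2020-10-02 the naive window averages the closes of 09-29, 09-30 and 10-02, which spans four trading days, while the calendar-aware window flags the gap instead.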

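Third approach

Idea

Set the volume to 0 and set open, close, high, and low to the closing price of the most recent trading day.

For "set the volume to 0", fillna is enough.
For "set open, close, high, and low to the closing price of the most recent trading day":
- First group by stock with groupby.
- Then use ffill down the rows to fill the closing price.
- Then use bfill across the columns to fill the opening, high, and low prices.

For fillna and how it combines with groupby, see 《经典机器学习及其Python实现：2.特征预处理》 (Classic Machine Learning and Its Python Implementation: 2. Feature Preprocessing).

Implementation

Sample code: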

train_files_stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
supplemental_files_stock_prices = pd.read_csv('../jpxd/supplemental_files/stock_prices.csv')

stock_prices = pd.concat([train_files_stock_prices, supplemental_files_stock_prices], axis=0)
stock_prices.reset_index(drop=True, inplace=True)

stock_prices_2266 = stock_prices[stock_prices['SecuritiesCode'] == 2266]
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] <= '2020-10-10']
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] >= '2020-09-20']
print(stock_prices_2266)

# Volume: fill missing values with 0
stock_prices['Volume'] = stock_prices['Volume'].fillna(0)

# Close: forward-fill within each stock
stock_prices['Close'] = stock_prices.groupby('SecuritiesCode')['Close'].ffill()

# Low: forward-fill within each stock
stock_prices['Low'] = stock_prices.groupby('SecuritiesCode')['Low'].ffill()
# High: forward-fill within each stock
stock_prices['High'] = stock_prices.groupby('SecuritiesCode')['High'].ffill()
# Open: forward-fill within each stock
stock_prices['Open'] = stock_prices.groupby('SecuritiesCode')['Open'].ffill()

stock_prices_2266 = stock_prices[stock_prices['SecuritiesCode'] == 2266]
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] <= '2020-10-10']
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] >= '2020-09-20']
print(stock_prices_2266)

Output:

                 RowId        Date  SecuritiesCode    Open    High     Low   Close  Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
1743279  20200923_2266  2020-09-23            2266  1849.0  1850.0  1817.0  1826.0   19500               1.0               NaN            False -0.028649
1745262  20200924_2266  2020-09-24            2266  1847.0  1853.0  1831.0  1850.0   24200               1.0               NaN            False  0.041180
1747246  20200925_2266  2020-09-25            2266  1850.0  1865.0  1775.0  1797.0   61100               1.0               NaN            False  0.024051
1749232  20200928_2266  2020-09-28            2266  1831.0  1873.0  1797.0  1871.0   42800               1.0               NaN            False -0.016701
1751218  20200929_2266  2020-09-29            2266  1880.0  1925.0  1855.0  1916.0   35700               1.0               NaN            False  0.000000
1753204  20200930_2266  2020-09-30            2266  1912.0  1945.0  1884.0  1884.0   25700               1.0               NaN            False -0.035032
1755190  20201001_2266  2020-10-01            2266     NaN     NaN     NaN     NaN       0               1.0               NaN            False  0.022552
1757178  20201002_2266  2020-10-02            2266  1910.0  1910.0  1810.0  1818.0   22800               1.0               NaN            False -0.002152
1759167  20201005_2266  2020-10-05            2266  1834.0  1880.0  1834.0  1859.0   10700               1.0               NaN            False -0.015094
1761157  20201006_2266  2020-10-06            2266  1864.0  1864.0  1827.0  1855.0   14100               1.0               NaN            False -0.008758
1763147  20201007_2266  2020-10-07            2266  1842.0  1851.0  1827.0  1827.0   11900               1.0               NaN            False -0.014909
1765137  20201008_2266  2020-10-08            2266  1828.0  1852.0  1804.0  1811.0   14500               1.0               NaN            False -0.011771
1767127  20201009_2266  2020-10-09            2266  1812.0  1812.0  1760.0  1784.0   15000               1.0               NaN            False  0.000567

                 RowId        Date  SecuritiesCode    Open    High     Low   Close  Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
1743279  20200923_2266  2020-09-23            2266  1849.0  1850.0  1817.0  1826.0   19500               1.0               NaN            False -0.028649
1745262  20200924_2266  2020-09-24            2266  1847.0  1853.0  1831.0  1850.0   24200               1.0               NaN            False  0.041180
1747246  20200925_2266  2020-09-25            2266  1850.0  1865.0  1775.0  1797.0   61100               1.0               NaN            False  0.024051
1749232  20200928_2266  2020-09-28            2266  1831.0  1873.0  1797.0  1871.0   42800               1.0               NaN            False -0.016701
1751218  20200929_2266  2020-09-29            2266  1880.0  1925.0  1855.0  1916.0   35700               1.0               NaN            False  0.000000
1753204  20200930_2266  2020-09-30            2266  1912.0  1945.0  1884.0  1884.0   25700               1.0               NaN            False -0.035032
1755190  20201001_2266  2020-10-01            2266  1912.0  1945.0  1884.0  1884.0       0               1.0               NaN            False  0.022552
1757178  20201002_2266  2020-10-02            2266  1910.0  1910.0  1810.0  1818.0   22800               1.0               NaN            False -0.002152
1759167  20201005_2266  2020-10-05            2266  1834.0  1880.0  1834.0  1859.0   10700               1.0               NaN            False -0.015094
1761157  20201006_2266  2020-10-06            2266  1864.0  1864.0  1827.0  1855.0   14100               1.0               NaN            False -0.008758
1763147  20201007_2266  2020-10-07            2266  1842.0  1851.0  1827.0  1827.0   11900               1.0               NaN            False -0.014909
1765137  20201008_2266  2020-10-08            2266  1828.0  1852.0  1804.0  1811.0   14500               1.0               NaN            False -0.011771
1767127  20201009_2266  2020-10-09            2266  1812.0  1812.0  1760.0  1784.0   15000               1.0               NaN            False  0.000567
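In the second table above, the 2020-10-01 row is filled with Open 1912, High 1945 and Low 1884, that is, each column carried its own previous value forward rather than the previous close of 1884. A variant that follows the stated idea literally (all four prices equal to the last close) could look like the sketch below; it runs on three rows copied from the first table, and the name `toy` is illustrative.

```python
import numpy as np
import pandas as pd

# Three rows of SecuritiesCode 2266 copied from the first table above.
toy = pd.DataFrame({
    "Date": ["2020-09-30", "2020-10-01", "2020-10-02"],
    "SecuritiesCode": [2266, 2266, 2266],
    "Open": [1912.0, np.nan, 1910.0],
    "High": [1945.0, np.nan, 1910.0],
    "Low": [1884.0, np.nan, 1810.0],
    "Close": [1884.0, np.nan, 1818.0],
    "Volume": [25700, 0, 22800],
})

toy = toy.sort_values(["SecuritiesCode", "Date"])
# Step 1: carry the last known close forward within each stock.
toy["Close"] = toy.groupby("SecuritiesCode")["Close"].ffill()
# Step 2: copy that close into the still-missing Open/High/Low cells.
for col in ["Open", "High", "Low"]:
    toy[col] = toy[col].fillna(toy["Close"])
print(toy)
```

Which of the two fills is preferable depends on the features built afterwards; a flat bar at the last close gives a zero intraday range on halted days, which matches the zero volume.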

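Price adjustment

Two kinds of adjustment

Forward adjustment
Keeps the current prices unchanged and rescales the earlier prices (downwards in the case of a stock split).
Backward adjustment
Keeps the earlier prices unchanged and rescales the later prices (upwards in the case of a stock split).

Implementing the adjustment (forward adjustment)

Prices are multiplied by the adjustment factor.
Volumes are divided by the adjustment factor.

The data before adjustment. Sample code: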

check_1805 = stock_prices[stock_prices['SecuritiesCode'] == 1805]

check_1805 = check_1805[check_1805['Date'] >= '2018-09-20']
check_1805 = check_1805[check_1805['Date'] <= '2018-09-30']

print(check_1805)

Output:

                RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
800771  20180920_1805  2018-09-20            1805   183.0   187.0   181.0   186.0  2196400               1.0               NaN            False -0.005291
802685  20180921_1805  2018-09-21            1805   187.0   191.0   186.0   189.0  2283600               1.0               NaN            False  0.005319
804600  20180925_1805  2018-09-25            1805   186.0   190.0   185.0   188.0  1979800              10.0               NaN            False  0.000529
806515  20180926_1805  2018-09-26            1805  1911.0  1911.0  1853.0  1890.0   190800               1.0               NaN            False  0.015336
808430  20180927_1805  2018-09-27            1805  1880.0  1907.0  1879.0  1891.0   133300               1.0               NaN            False -0.005208
810346  20180928_1805  2018-09-28            1805  1900.0  1938.0  1895.0  1920.0   145500               1.0               NaN            False -0.026178

The adjustment itself. Sample code:

def generate_adjusted_feature_one_stock(one_stock_df: pd.DataFrame):
    """
    Add adjusted price and volume columns for a single stock
    :param one_stock_df: rows of a single stock
    :return: the rows sorted by date (newest first) with the adjusted columns added
    """
    # Sort by date, newest first
    adjusted_df = one_stock_df.sort_values("Date", ascending=False)
    # Cumulative adjustment factor, accumulated from the newest date backwards
    adjusted_df.loc[:, "CumulativeAdjustmentFactor"] = adjusted_df["AdjustmentFactor"].cumprod()

    # Open price
    adjusted_df.loc[:, "AdjustedOpen"] = adjusted_df["Open"] * adjusted_df["CumulativeAdjustmentFactor"]
    # High price
    adjusted_df.loc[:, "AdjustedHigh"] = adjusted_df["High"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Low price
    adjusted_df.loc[:, "AdjustedLow"] = adjusted_df["Low"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Close price
    adjusted_df.loc[:, "AdjustedClose"] = adjusted_df["Close"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Volume
    adjusted_df.loc[:, "AdjustedVolume"] = adjusted_df["Volume"] / adjusted_df["CumulativeAdjustmentFactor"]

    return adjusted_df

adjusted_stock_prices = stock_prices.groupby("SecuritiesCode").apply(generate_adjusted_feature_one_stock).reset_index(drop=True)

The data after adjustment; note AdjustedOpen, AdjustedHigh, AdjustedLow, AdjustedClose and AdjustedVolume. Sample code:

check_1805 = adjusted_stock_prices[adjusted_stock_prices['SecuritiesCode'] == 1805]

check_1805 = check_1805[check_1805['Date'] >= '2018-09-20']
check_1805 = check_1805[check_1805['Date'] <= '2018-09-30']

print(check_1805)

Output:

               RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target  CumulativeAdjustmentFactor  AdjustedOpen  AdjustedHigh  AdjustedLow  AdjustedClose  AdjustedVolume
50114  20180928_1805  2018-09-28            1805  1900.0  1938.0  1895.0  1920.0   145500               1.0               NaN            False -0.026178                         1.0        1900.0        1938.0       1895.0         1920.0        145500.0
50115  20180927_1805  2018-09-27            1805  1880.0  1907.0  1879.0  1891.0   133300               1.0               NaN            False -0.005208                         1.0        1880.0        1907.0       1879.0         1891.0        133300.0
50116  20180926_1805  2018-09-26            1805  1911.0  1911.0  1853.0  1890.0   190800               1.0               NaN            False  0.015336                         1.0        1911.0        1911.0       1853.0         1890.0        190800.0
50117  20180925_1805  2018-09-25            1805   186.0   190.0   185.0   188.0  1979800              10.0               NaN            False  0.000529                        10.0        1860.0        1900.0       1850.0         1880.0        197980.0
50118  20180921_1805  2018-09-21            1805   187.0   191.0   186.0   189.0  2283600               1.0               NaN            False  0.005319                        10.0        1870.0        1910.0       1860.0         1890.0        228360.0
50119  20180920_1805  2018-09-20            1805   183.0   187.0   181.0   186.0  2196400               1.0               NaN            False -0.005291                        10.0        1830.0        1870.0       1810.0         1860.0        219640.0
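Why the adjustment matters becomes obvious when returns are computed from the raw and from the adjusted closes. The sketch below uses three rows of SecuritiesCode 1805 from the tables above; the variable names are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(["2018-09-21", "2018-09-25", "2018-09-26"])
raw_close = pd.Series([189.0, 188.0, 1890.0], index=dates)
factor = pd.Series([1.0, 10.0, 1.0], index=dates)  # AdjustmentFactor column

# Cumulative factor, accumulated from the latest date backwards.
cum_factor = factor[::-1].cumprod()[::-1]
adj_close = raw_close * cum_factor

raw_ret = raw_close.pct_change()
adj_ret = adj_close.pct_change()
print(raw_ret)  # artificial jump of roughly +905% on 2018-09-26
print(adj_ret)  # about +0.53% on the same day
```

The adjusted return for 2018-09-26 equals the Target value 0.005319 listed for 2018-09-21, which is consistent with Target being defined on adjusted closes.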

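Controversy

Some sources adjust only the prices, or even only the closing price. If only prices are considered and price-volume relationships are ignored, or if only the closing price is used, that kind of adjustment is fine.
But if prices other than the closing price are used and price-volume relationships are taken into account, the volume must be adjusted as well.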
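One way to see why price and volume belong together is the traded value, price times volume, which should not change under adjustment. A small sketch with the 2018-09-25 row of SecuritiesCode 1805 from the tables above:

```python
raw_close, raw_volume, cum_factor = 188.0, 1979800, 10.0  # SecuritiesCode 1805 on 2018-09-25

adj_close = raw_close * cum_factor
adj_volume = raw_volume / cum_factor

value_raw = raw_close * raw_volume      # traded value before adjustment
value_adj = adj_close * adj_volume      # unchanged when both are adjusted
value_mixed = adj_close * raw_volume    # ten times too large when only the price is adjusted
print(value_raw, value_adj, value_mixed)
```

Any feature that multiplies or compares price and volume inherits this factor-of-ten distortion if only one of the two is adjusted.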

Exploration

Overall market

Sample code:

# Drop the unadjusted columns and the helper column
stock = adjusted_stock_prices.drop(columns=['Open', 'High', 'Low', 'Close', 'Volume', 'CumulativeAdjustmentFactor'])
# Rename the adjusted columns back to the plain names
stock.rename(
    columns={'AdjustedOpen': 'Open', 'AdjustedHigh': 'High', 'AdjustedLow': 'Low',
             'AdjustedClose': 'Close', 'AdjustedVolume': 'Volume'},inplace=True)

# ExpectedDividend: treat missing values as 0
stock['ExpectedDividend'] = stock['ExpectedDividend'].fillna(0)

stock['Date'] = pd.to_datetime(stock['Date'])

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

colors=px.colors.qualitative.Plotly
temp = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12), width=800))

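# Average return per day (mean of Target), in percent
returns=stock.groupby('Date')['Target'].mean().mul(100).rename('Average Return')
# Average closing price per day
close_avg=stock.groupby('Date')['Close'].mean().rename('Closing Price')
# Average volume per day
vol_avg=stock.groupby('Date')['Volume'].mean().rename('Volume')
# Trading days, taken from the groupby index (ascending) so that x and y line up;
# stock.Date.unique() would be in descending order here because of the sort done during adjustment
stock_date=returns.index

fig = make_subplots(rows=3, cols=1, shared_xaxes=True)
for i, j in enumerate([returns, close_avg, vol_avg]):
    fig.add_trace(go.Scatter(x=stock_date, y=j, mode='lines',
                             name=j.name, marker_color=colors[i]), row=i+1, col=1)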
fig.update_xaxes(rangeslider_visible=False,
                 rangeselector=dict(
                     buttons=list([
                         dict(count=6, label="6m", step="month", stepmode="backward"),
                         dict(count=1, label="1y", step="year", stepmode="backward"),
                         dict(count=2, label="2y", step="year", stepmode="backward"),
                         dict(step="all")])),
                 row=1,col=1)
fig.update_layout(template=temp,title='JPX Market Average Stock Return, Closing Price, and Shares Traded', 
                  hovermode='x unified', height=700, 
                  yaxis1=dict(title='Stock Return', ticksuffix='%'), 
                  yaxis2_title='Closing Price', yaxis3_title='Shares Traded',
                  showlegend=False)
fig.show()

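Output:

JPX Market Average Stock Return, Closing Price, and Shares Traded

Some sources claim that trading volume has been gradually declining, but that is really because they did not adjust the volume.

As an aside, saving a plotly figure to an HTML file. Sample code:

import plotly
plotly.offline.plot(fig, filename='file1.html')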

Yearly Return by sector

Preparing the data:

stock_list=pd.read_csv("../jpxd/stock_list.csv")

# Sector name
stock_list['SectorName']=[i.rstrip().lower().capitalize() for i in stock_list['17SectorName']]
# Stock name
stock_list['Name']=[i.rstrip().lower().capitalize() for i in stock_list['Name']]
# Merge sector and name into the price data
stock_df = stock.merge(stock_list[['SecuritiesCode','Name','SectorName']], on='SecuritiesCode', how='left')
# Year
stock_df['Year'] = stock_df['Date'].dt.year

years = {year: pd.DataFrame() for year in stock_df.Year.unique()[::-1]}
for key in years.keys():
    df=stock_df[stock_df.Year == key]
    years[key] = df.groupby('SectorName')['Target'].mean().mul(100).rename("Avg_return_{}".format(key))
df=pd.concat((years[i].to_frame() for i in years.keys()), axis=1)
df=df.sort_values(by="Avg_return_2022")
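The same sector-by-year table can also be built in one step with `pivot_table`. The sketch below is an alternative formulation, not the code used for the figures in this article, and it runs on a tiny made-up frame (`toy_df` and its values are illustrative).

```python
import pandas as pd

toy_df = pd.DataFrame({
    "SectorName": ["Foods", "Foods", "Retail trade", "Retail trade", "Foods", "Retail trade"],
    "Year": [2021, 2021, 2021, 2021, 2022, 2022],
    "Target": [0.01, -0.02, 0.00, 0.02, 0.03, -0.01],  # made-up values
})

# Mean Target per sector and year, in percent, with the same column naming as above.
sector_year = (
    toy_df.pivot_table(index="SectorName", columns="Year", values="Target", aggfunc="mean")
    .mul(100)
    .add_prefix("Avg_return_")
    .sort_values(by="Avg_return_2022")
)
print(sector_year)
```

Here the columns come out in ascending year order, whereas the loop above builds them from the reversed list of years.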

Plotting:

fig = make_subplots(rows=1, cols=6, shared_yaxes=True)

for i, col in enumerate(df.columns):
    x = df[col]
    mask = x >= 0
    fig.add_trace(go.Bar(x=x[mask], y=df.index[mask], orientation='h',
                         text=x[mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Target in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='red', opacity=0.7), name=col[-4:]),
                  row=1, col=i + 1)
    fig.add_trace(go.Bar(x=x[~mask], y=df.index[~mask], orientation='h',
                         text=x[~mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Target in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='green', opacity=0.7), name=col[-4:]),
                  row=1, col=i + 1)
    fig.update_xaxes(range=(x.min() - .15, x.max() + .15), title='{} Target'.format(col[-4:]),
                     showticklabels=False, row=1, col=i + 1)
fig.update_layout(title='Yearly Average Stock Target by Sector',
                  hovermode='closest', margin=dict(l=250, r=50),
                  showlegend=False)
fig.show()

Output:

Yearly Average Stock Target by Sector

Distribution of Return

Overall Distribution of Return

Example code:

fig = go.Figure()
x_hist=stock_df['Target']
fig.add_trace(go.Histogram(x=x_hist*100,
                           marker=dict(color=colors[0], opacity=0.7, 
                                       line=dict(width=1, color=colors[0])),
                           xbins=dict(start=-40,end=40,size=1)))
fig.update_layout(template=temp,title='Target Distribution', 
                  xaxis=dict(title='Stock Return',ticksuffix='%'), height=450)
fig.show()

Output:

Target Distribution

Distribution of Return within Each Sector

Example code:

pal = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, 18)]
fig = go.Figure()
for i, sector in enumerate(df.index[::-1]):
    y_data=stock_df[stock_df['SectorName']==sector]['Target']
    fig.add_trace(go.Box(y=y_data*100, name=sector,
                         marker_color=pal[i], showlegend=False))
fig.update_layout(template=temp, title='Target Distribution by Sector',
                  yaxis=dict(title='Stock Return',ticksuffix='%'),
                  margin=dict(b=150), height=750, width=900)
fig.show()

Output:

Target Distribution by Sector

Looking at the numbers directly makes the picture clearer. Example code:

stock_df_group = stock_df.groupby('SectorName')['Target'].describe()
stock_df_group

Output:

                                                count      mean       std       min       25%  50%       75%       max
SectorName                                                                                                            
Automobiles & transportation equipment        85568.0  0.000025  0.021919 -0.226629 -0.011147  0.0  0.010606  0.349345
Banks                                         83919.0 -0.000258  0.018002 -0.194805 -0.010256  0.0  0.009404  0.298780
Commercial & wholesale trade                 199701.0  0.000429  0.020714 -0.273723 -0.009161  0.0  0.009588  1.119512
Construction & materials                     207533.0  0.000296  0.020469 -0.294974 -0.009662  0.0  0.009666  0.304878
Electric appliances & precision instruments  251506.0  0.000512  0.025248 -0.244200 -0.012066  0.0  0.012162  0.420168
Electric power & gas                          30716.0  0.000118  0.018744 -0.355000 -0.008626  0.0  0.008333  0.247770
Energy resources                              18718.0  0.000348  0.021965 -0.177386 -0.010886  0.0  0.010856  0.242009
Financials （ex banks）                         73424.0  0.000262  0.022010 -0.244618 -0.010340  0.0  0.010359  0.317460
Foods                                        123864.0  0.000167  0.017064 -0.252101 -0.007503  0.0  0.007483  0.264000
It & services, others                        592412.0  0.000638  0.028094 -0.578541 -0.012422  0.0  0.012436  0.585366
Machinery                                    170410.0  0.000317  0.023419 -0.233909 -0.011614  0.0  0.011558  0.263852
Pharmaceutical                                60641.0  0.000316  0.027722 -0.524904 -0.011689  0.0  0.011111  0.597907
Raw materials & chemicals                    225677.0  0.000301  0.021783 -0.226074 -0.010682  0.0  0.010627  0.322581
Real estate                                   87896.0  0.000464  0.025345 -0.254939 -0.010880  0.0  0.010978  0.404255
Retail trade                                 236712.0  0.000228  0.020412 -0.263955 -0.009054  0.0  0.008999  0.407407
Steel & nonferrous metals                     58776.0  0.000243  0.023447 -0.226006 -0.011905  0.0  0.011673  0.290070
Transportation & logistics                    94693.0  0.000317  0.019579 -0.168594 -0.009346  0.0  0.009116  0.575264

Highest and Lowest Returns by Sector

Example code:

stock_data=stock_df.groupby('Name')['Target'].mean().mul(100)
stock_low=stock_data.nsmallest(7)[::-1].rename("Return")
stock_high=stock_data.nlargest(7).rename("Return")
stock_data=pd.concat([stock_high, stock_low], axis=0).reset_index()
stock_data['Sector']='All'
for i in stock_df.SectorName.unique():
    sector=stock_df[stock_df.SectorName==i].groupby('Name')['Target'].mean().mul(100)
    stock_low=sector.nsmallest(7)[::-1].rename("Return")
    stock_high=sector.nlargest(7).rename("Return")
    sector_stock=pd.concat([stock_high, stock_low], axis=0).reset_index()
    sector_stock['Sector']=i
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    stock_data=pd.concat([stock_data, sector_stock], ignore_index=True)
    
fig=go.Figure()
buttons = []
for i, sector in enumerate(stock_data.Sector.unique()):
    
    x=stock_data[stock_data.Sector==sector]['Name']
    y=stock_data[stock_data.Sector==sector]['Return']
    mask=y>0
    fig.add_trace(go.Bar(x=x[mask], y=y[mask], text=y[mask], 
                         texttemplate='%{text:.2f}%',
                         textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='green', opacity=0.7)))
    fig.add_trace(go.Bar(x=x[~mask], y=y[~mask], text=y[~mask], 
                         texttemplate='%{text:.2f}%',
                         textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='red', opacity=0.7)))
    
    visibility=[False]*2*len(stock_data.Sector.unique())
    visibility[i*2],visibility[i*2+1]=True,True
    button = dict(label = sector,
                  method = "update",
                  args=[{"visible": visibility}])
    buttons.append(button)

fig.update_layout(title='Stocks with Highest and Lowest Returns by Sector',
                  template=temp, yaxis=dict(title='Average Return', ticksuffix='%'),
                  updatemenus=[dict(active=0, type="dropdown",
                                    buttons=buttons, xanchor='left',
                                    yanchor='bottom', y=1.01, x=.01)], 
                  margin=dict(b=150),showlegend=False,height=700, width=900)
fig.show()

Output:

Stocks with Highest and Lowest Returns by Sector

Candlestick Charts by Sector

Example code:

stock_date=stock_df.Date.unique()
sectors=stock_df.SectorName.unique().tolist()
sectors.insert(0, 'All')
open_avg=stock_df.groupby('Date')['Open'].mean()
high_avg=stock_df.groupby('Date')['High'].mean()
low_avg=stock_df.groupby('Date')['Low'].mean()
close_avg=stock_df.groupby('Date')['Close'].mean() 
buttons=[]

fig = go.Figure()
for i in range(18):
    if i != 0:
        open_avg=stock_df[stock_df.SectorName==sectors[i]].groupby('Date')['Open'].mean()
        high_avg=stock_df[stock_df.SectorName==sectors[i]].groupby('Date')['High'].mean()
        low_avg=stock_df[stock_df.SectorName==sectors[i]].groupby('Date')['Low'].mean()
        close_avg=stock_df[stock_df.SectorName==sectors[i]].groupby('Date')['Close'].mean()        
    
    fig.add_trace(go.Candlestick(x=stock_date, open=open_avg, high=high_avg,
                                 low=low_avg, close=close_avg, name=sectors[i],
                                 visible=(True if i==0 else False)))
    
    visibility=[False]*len(sectors)
    visibility[i]=True
    button = dict(label = sectors[i],
                  method = "update",
                  args=[{"visible": visibility}])
    buttons.append(button)
    
fig.update_xaxes(rangeslider_visible=True,
                 rangeselector=dict(
                     buttons=list([
                         dict(count=3, label="3m", step="month", stepmode="backward"),
                         dict(count=6, label="6m", step="month", stepmode="backward"),
                         dict(step="all")]), xanchor='left',yanchor='bottom', y=1.16, x=.01))
fig.update_layout(template=temp,title='Stock Price Movements by Sector', 
                  hovermode='x unified', showlegend=False, width=1000,
                  updatemenus=[dict(active=0, type="dropdown",
                                    buttons=buttons, xanchor='left',
                                    yanchor='bottom', y=1.01, x=.01)],
                  yaxis=dict(title='Stock Price'))
fig.show()

Output:

Stock Price Movements by Sector

Correlation between Sectors

Example code:

import plotly.figure_factory as ff

df_pivot=stock_df.pivot_table(index='Date', columns='SectorName', values='Close').reset_index()
corr=df_pivot.corr().round(2)
mask=np.triu(np.ones_like(corr, dtype=bool))
c_mask = np.where(~mask, corr, 100)
c=[]
for i in c_mask.tolist()[1:]:
    c.append([x for x in i if x != 100])
    
cor=c[::-1]
x=corr.index.tolist()[:-1]
y=corr.columns.tolist()[1:][::-1]
fig=ff.create_annotated_heatmap(z=cor, x=x, y=y, 
                                hovertemplate='Correlation between %{x} and %{y} stocks = %{z}',
                                colorscale='viridis', name='')
fig.update_layout(template=temp, title='Stock Correlation between Sectors',
                  margin=dict(l=250,t=270),height=800,width=900,
                  yaxis=dict(showgrid=False, autorange='reversed'),
                  xaxis=dict(showgrid=False))
fig.show()

Output:

Stock Correlation between Sectors
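The heatmap code keeps only the part of the correlation matrix below the diagonal, so that each sector pair appears once. A minimal sketch of that masking step on a made-up 3x3 matrix follows; the labels A, B, C and the values are hypothetical, and NaN is used as the marker for hidden cells instead of the sentinel value 100.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric correlation matrix for three sectors.
corr = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# True on and above the diagonal: these cells are duplicates or self-correlations.
upper = np.triu(np.ones(corr.shape, dtype=bool))

# Hide the masked cells; only the strictly lower triangle keeps its values.
lower = corr.mask(upper)

# Ragged rows of the lower triangle (the first row is empty, so it is skipped).
rows = [[v for v in row if not np.isnan(v)] for row in lower.to_numpy()[1:]]
print(rows)
```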

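Supplementary Data

The competition does not allow notebooks (ipynb) to connect to the internet, but it does allow us to supplement the data offline.

jpx-jquants

Overview of jpx-jquants

jpx-jquants is the official JPX API.

Website: https://jpx-jquants.com
Documentation: https://jpx.gitbook.io/j-quants-en/

Obtaining a Refresh Token

URL: https://api.jquants.com/v1/token/auth_user

Method: POST

Request parameters:

Header: none required
Body:
- mailaddress: email address
- password: password

Response:

Status codes: 200, OK; 400, Bad Request; 403, Forbidden; 500, Internal Server Error.
Field: refreshToken

The Refresh Token is valid for one week.

Example code: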

import requests
import json

data={"mailaddress":"<email>", "password":"<password>"}

r_post = requests.post("https://api.jquants.com/v1/token/auth_user", data=json.dumps(data))

r_post.json()

Output:

{'refreshToken': 'XXX'}

Obtaining an ID Token

URL: https://api.jquants.com/v1/token/auth_refresh

Method: POST

Request parameters:

Header: none required
Query: refreshtoken

Response:

Status codes: 200, OK; 400, Bad Request; 403, Forbidden; 500, Internal Server Error.
Field: idToken

The ID Token is valid for 24 hours.

Example code:

import requests
import json

REFRESH_TOKEN = "XXX"

r_post = requests.post(f"https://api.jquants.com/v1/token/auth_refresh?refreshtoken={REFRESH_TOKEN}")

r_post.json()

Output:

{'idToken': 'XXX'}
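Because the two tokens have different lifetimes (one week and 24 hours), a long-running download script needs to know which of the two endpoints to call next. The following is only a sketch of that bookkeeping and makes no network calls; the names Token and next_step and the five-minute safety margin are hypothetical and not part of the J-Quants API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REFRESH_TOKEN_LIFETIME = timedelta(days=7)   # lifetime stated above
ID_TOKEN_LIFETIME = timedelta(hours=24)      # lifetime stated above


@dataclass
class Token:
    value: str
    issued_at: datetime
    lifetime: timedelta

    def is_valid(self, now: datetime, margin: timedelta = timedelta(minutes=5)) -> bool:
        # Treat the token as expired slightly early to avoid racing the deadline.
        return now + margin < self.issued_at + self.lifetime


def next_step(refresh: Optional[Token], id_token: Optional[Token], now: datetime) -> str:
    """Return which endpoint has to be called before data can be requested."""
    if refresh is None or not refresh.is_valid(now):
        return "auth_user"      # log in again with mailaddress and password
    if id_token is None or not id_token.is_valid(now):
        return "auth_refresh"   # exchange the Refresh Token for a new ID Token
    return "ready"


issued = datetime(2023, 1, 1, 9, 0)
refresh = Token("XXX", issued, REFRESH_TOKEN_LIFETIME)
id_token = Token("XXX", issued, ID_TOKEN_LIFETIME)

step_same_day = next_step(refresh, id_token, datetime(2023, 1, 1, 12, 0))
step_two_days = next_step(refresh, id_token, datetime(2023, 1, 3, 9, 0))
step_eight_days = next_step(refresh, id_token, datetime(2023, 1, 9, 9, 0))
print(step_same_day, step_two_days, step_eight_days)
```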

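Fetching Daily Quotes

URL: https://api.jquants.com/v1/prices/daily_quotes

Method: GET

Request parameters:

Header: Authorization, carrying the idToken.
Query:
- code: stock code.
- from: start date, for example 20210901 or 2021-09-01.
- to: end date, for example 20210907 or 2021-09-07.
- date: a single specific day; it takes effect only when from and to are not specified. For example, 20210907 or 2021-09-07.

Response status codes: 200, OK; 400, Bad Request; 401, Unauthorized; 403, Forbidden; 413, Payload Too Large; 500, Internal Server Error.

Response fields: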

Variables	Description	Data type	Remark
Date	Date	String	YYYY-MM-DD
Code	Issue code	String
Open	Open Price (before adjustment)	Number
High	High price (before adjustment)	Number
Low	Low price (before adjustment)	Number
Close	Close price (before adjustment)	Number
Volume	Trading volume (before Adjustment)	Number
TurnoverValue	Trading value	Number
AdjustmentFactor	Adjustment factor	Number	In the case of a two-for-one stock split, "0.5" will be set in the record on the ex-rights date.
AdjustmentOpen	Adjusted open price	Number	※1
AdjustmentHigh	Adjusted high price	Number	※1
AdjustmentLow	Adjusted low price	Number	※1
AdjustmentClose	Adjusted close price	Number	※1
AdjustmentVolume	Adjusted volume	Number	※1
MorningOpen	Open price of the morning session (before Adjustment)	Number	※2
MorningHigh	High price of the morning session (before Adjustment)	Number	※2
MorningLow	Low price of the morning session (before Adjustment)	Number	※2
MorningClose	Close price of the morning session (before Adjustment)	Number	※2
MorningVolume	Trading volume of the morning session (before Adjustment)	Number	※2
MorningTurnoverValue	Trading value of the morning session	Number	※2
MorningAdjustmentOpen	Adjusted open price of the morning session	Number	※1, ※2
MorningAdjustmentHigh	Adjusted high price of the morning session	Number	※1, ※2
MorningAdjustmentLow	Adjusted low price of the morning session	Number	※1, ※2
MorningAdjustmentClose	Adjusted close price of the morning session	Number	※1, ※2
MorningAdjustmentVolume	Adjusted trading volume of the morning session	Number	※1, ※2
AfternoonOpen	Open price of the afternoon session (before Adjustment)	Number	※2
AfternoonHigh	High price of the afternoon session (before Adjustment)	Number	※2
AfternoonLow	Low price of the afternoon session (before Adjustment)	Number	※2
AfternoonClose	Close price of the afternoon session (before Adjustment)	Number	※2
AfternoonVolume	Trading volume of the afternoon session (before Adjustment)	Number	※2
AfternoonAdjustmentOpen	Adjusted open price of the afternoon session	Number	※1, ※2
AfternoonAdjustmentHigh	Adjusted high price of the afternoon session	Number	※1, ※2
AfternoonAdjustmentLow	Adjusted low price of the afternoon session	Number	※1, ※2
AfternoonAdjustmentClose	Adjusted close price of the afternoon session	Number	※1, ※2
AfternoonAdjustmentVolume	Adjusted trading volume of the afternoon session	Number	※1, ※2

※1：The item has been adjusted to take into account past divisions, etc.
※2：The item is available only for Premium plan users.

Example code:

import requests
import json

idToken = "XXX"

headers = {'Authorization': 'Bearer {}'.format(idToken)}

r = requests.get("https://api.jquants.com/v1/prices/daily_quotes?code=1414&from=2022-10-01&to=2023-01-31", headers=headers)

r.json()

Output:

{
    'daily_quotes': [{
        'Date': '2022-10-03',
        'Code': '14140',
        'Open': 6220.0,
        'High': 6230.0,
        'Low': 6120.0,
        'Close': 6220.0,
        'Volume': 121700.0,
        'TurnoverValue': 752370000.0,
        'AdjustmentFactor': 1.0,
        'AdjustmentOpen': 6220.0,
        'AdjustmentHigh': 6230.0,
        'AdjustmentLow': 6120.0,
        'AdjustmentClose': 6220.0,
        'AdjustmentVolume': 121700.0
    }
    
    [part of the output omitted]

    {
        'Date': '2023-01-31',
        'Code': '14140',
        'Open': 5510.0,
        'High': 5560.0,
        'Low': 5500.0,
        'Close': 5530.0,
        'Volume': 101500.0,
        'TurnoverValue': 561054000.0,
        'AdjustmentFactor': 1.0,
        'AdjustmentOpen': 5510.0,
        'AdjustmentHigh': 5560.0,
        'AdjustmentLow': 5500.0,
        'AdjustmentClose': 5530.0,
        'AdjustmentVolume': 101500.0
    }]
}
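The response can be turned into a table whose keys resemble those of the competition files. The sketch below uses two records abridged from the output above instead of a live request. It assumes that the five-character Code is the four-digit stock code followed by one extra digit, as in this sample, and it builds a RowId in the form date_code; both points are assumptions for illustration.

```python
import pandas as pd

# Two records abridged from the response shown above (no network call is made here).
payload = {
    "daily_quotes": [
        {"Date": "2022-10-03", "Code": "14140", "Open": 6220.0, "High": 6230.0,
         "Low": 6120.0, "Close": 6220.0, "Volume": 121700.0, "AdjustmentFactor": 1.0},
        {"Date": "2023-01-31", "Code": "14140", "Open": 5510.0, "High": 5560.0,
         "Low": 5500.0, "Close": 5530.0, "Volume": 101500.0, "AdjustmentFactor": 1.0},
    ]
}

quotes = pd.DataFrame(payload["daily_quotes"])
quotes["Date"] = pd.to_datetime(quotes["Date"])

# Assumption: the first four characters of Code are the stock code used in the query.
quotes["SecuritiesCode"] = quotes["Code"].str[:4].astype(int)

# Hypothetical RowId in the form YYYYMMDD_code.
quotes["RowId"] = quotes["Date"].dt.strftime("%Y%m%d") + "_" + quotes["SecuritiesCode"].astype(str)
print(quotes[["RowId", "Date", "SecuritiesCode", "Close", "Volume"]])
```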

Application in This Article

Example code:

import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
import json

idToken = "XXX"

headers = {'Authorization': 'Bearer {}'.format(idToken)}

stock_list = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/stock_list.csv')
stock_list = stock_list[stock_list['Universe0'] == True]
codes = stock_list['SecuritiesCode'].unique()

all_hist = []
for tick in tqdm(codes):
    r = requests.get(f'https://api.jquants.com/v1/prices/daily_quotes?code={tick}&from=2022-10-01&to=2023-01-31', headers=headers)
    hist = pd.DataFrame(r.json()['daily_quotes'])
    all_hist.append(hist)

df = pd.concat(all_hist)
df.to_csv('df.csv',index=False)

yfinance

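Overview of yfinance

What is yfinance

Yahoo! Finance originally offered an official, free API, but because it was abused, the official API was shut down on May 15, 2017.
yfinance is unofficial. Its data still comes from Yahoo, but it is obtained through a technique similar to web scraping.

Website: https://aroussi.com/post/python-yahoo-finance
GitHub: https://github.com/ranaroussi/yfinance

Installation

Install with:

pip install yfinance

Using yfinance in Mainland China

Yahoo stopped providing services to Mainland China on November 1, 2021. In addition, because yfinance has no proxy feature, using yfinance in Mainland China may produce an error like the following:

HTTPError: 403 Client Error: Forbidden for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/600000.SS?modules=summaryProfile%2CfinancialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&ssl=true

Therefore, to use it in Mainland China, you need to set up a proxy yourself.

If you get errors similar to the following:

error: No timezone found, symbol may be delisted
error: No data found for this date range, symbol may be delisted

the cause is that the requests were blocked by anti-scraping measures.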

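Advantages

The advantages of yfinance are:

Free
No registration and no token required
Fine-grained data (1min/2min/5min bars)
Data is returned directly as a pandas DataFrame or Series.

Disadvantages

The disadvantages of yfinance are:

yfinance retrieves data from Yahoo Finance through a scraping-like technique, so it still has to be used with care to avoid being blocked.
Inconvenient to use in Mainland China

Modules

yfinance consists of three modules:

Tickers
download
pandas_datareader

Of these, download cannot be used in Mainland China (it may work through a high-anonymity proxy), and pandas_datareader exists for backward compatibility with legacy code. We discuss only Tickers.

Tickers

info: basic stock information

info returns the basic information of a stock, which covers many items. Example code: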

import yfinance as yf

ss600000 = yf.Ticker('600000.SS')
ss600000.info

Output:

{
  'address1': '12 First East Zhongshan Road',
  'city': 'Shanghai',
  'country': 'China',
  'phone': '86 21 6161 8888',
  'fax': '86 21 6323 2036',
  'website': 'https://www.spdb.com.cn',
  'industry': 'Banks—Regional',
  'industryDisp': 'Banks—Regional',
  'sector': 'Financial Services',
  'longBusinessSummary': "Shanghai Pudong Development Bank Co., Ltd. provides commercial banking products and services in the People's Republic of China. The company offers various personal banking services, including savings and account management products; wealth management services comprising open-ended funds, special investments, SPDB structured deposits, security investment custody accounts, individual FX trading, collective security investments, and insurance products distribution; cards; lending services, such as deposit treasury bond pledged loans and car mortgages; online payment and consumption services; and instant messaging services. It also provides corporate banking services comprising cash management, supply chain financing, investment banking, entrusted assets, and occupational pension solutions, as well as offshore banking services; assets custody services; trade finance services; and services to small and medium enterprises, and multi-national companies. In addition, the company offers treasury and market products, such as foreign exchange risk management, interest rate risk management, structured, fixed income, and commodity exchange products; and financial leasing and trust services. The company was incorporated in 1992 and is headquartered in Shanghai, the People's Republic of China.",
  'fullTimeEmployees': 64731,
  'companyOfficers': [{'maxAge': 1,'name': 'Mr. Yang  Zheng','age': 56,'title': 'Exec. Chairman','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Weidong  Pan','age': 56,'title': 'Pres & Vice Chairman','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Xinhao  Wang','age': 55,'title': 'CFO & VP','yearBorn': 1967,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Bingwen  Cui','age': 53,'title': 'Gen. Counsel, VP & Chief Legal Advisor','yearBorn': 1969,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Yiyan  Liu','age': 58,'title': 'Chief Risk Officer, VP & Exec. Director','yearBorn': 1964,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': "Mr. Zheng'an  Chen",'age': 59,'title': 'Exec. Director','yearBorn': 1963,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Fangping  Jiang','age': 56,'title': 'Head of the Discipline Inspection & Supervision Team','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Wei  Xie','age': 51,'title': 'VP & Sec. of the Board','yearBorn': 1971,'exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Lianquan  Li','title': 'Head of the Fin. & Accounting Department','exercisedValue': 0,'unexercisedValue': 0},
                      {'maxAge': 1,'name': 'Mr. Alex  Dai','title': 'IR Officer','exercisedValue': 0,'unexercisedValue': 0}],
  'auditRisk': 9,
  'boardRisk': 8,
  'compensationRisk': 3,
  'shareHolderRightsRisk': 2,
  'overallRisk': 5,
  'governanceEpochDate': 1685577600,
  'maxAge': 86400,
  'priceHint': 2,
  'previousClose': 7.57,
  'open': 7.57,
  'dayLow': 7.53,
  'dayHigh': 7.6,
  'regularMarketPreviousClose': 7.57,
  'regularMarketOpen': 7.57,
  'regularMarketDayLow': 7.53,
  'regularMarketDayHigh': 7.6,
  'dividendRate': 0.41,
  'dividendYield': 0.0542,
  'exDividendDate': 1658361600,
  'payoutRatio': 0.3083,
  'fiveYearAvgDividendYield': 4.24,
  'beta': 0.492847,
  'trailingPE': 5.6842103,
  'forwardPE': 4.3953485,
  'volume': 22288077,
  'regularMarketVolume': 22288077,
  'averageVolume': 37246368,
  'averageVolume10days': 26097882,
  'averageDailyVolume10Day': 26097882,
  'bid': 7.55,
  'ask': 7.56,
  'bidSize': 0,
  'askSize': 0,
  'marketCap': 221902635008,
  'fiftyTwoWeekLow': 6.63,
  'fiftyTwoWeekHigh': 8.22,
  'priceToSalesTrailing12Months': 2.0457323,
  'fiftyDayAverage': 7.4534,
  'twoHundredDayAverage': 7.23055,
  'trailingAnnualDividendRate': 0.32,
  'trailingAnnualDividendYield': 0.042272124,
  'currency': 'CNY',
  'enterpriseValue': 1083258437632,
  'profitMargins': 0.43896,
  'floatShares': 9531825231,
  'sharesOutstanding': 29352200192,
  'heldPercentInsiders': 0.49045,
  'heldPercentInstitutions': 0.22323,
  'impliedSharesOutstanding': 0,
  'bookValue': 24.321,
  'priceToBook': 0.31084248,
  'lastFiscalYearEnd': 1672444800,
  'nextFiscalYearEnd': 1703980800,
  'mostRecentQuarter': 1680220800,
  'earningsQuarterlyGrowth': -0.183,
  'netIncomeToCommon': 41539497984,
  'trailingEps': 1.33,
  'forwardEps': 1.72,
  'pegRatio': -2.68,
  'lastSplitFactor': '13:10',
  'lastSplitDate': 1495670400,
  'enterpriseToRevenue': 9.987,
  '52WeekChange': -0.02702701,
  'SandP52WeekChange': 0.14647579,
  'lastDividendValue': 0.41,
  'lastDividendDate': 1658361600,
  'exchange': 'SHH',
  'quoteType': 'EQUITY',
  'symbol': '600000.SS',
  'underlyingSymbol': '600000.SS',
  'shortName': 'SHANGHAI PUDONG DEVELOPMENT BAN',
  'longName': 'Shanghai Pudong Development Bank Co., Ltd.',
  'firstTradeDateEpochUtc': 942197400,
  'timeZoneFullName': 'Asia/Shanghai',
  'timeZoneShortName': 'CST',
  'uuid': 'bacd6a72-d76b-3eeb-a7dc-aaaef82cf602',
  'messageBoardId': 'finmb_5436383',
  'gmtOffSetMilliseconds': 28800000,
  'currentPrice': 7.56,
  'targetHighPrice': 9.53,
  'targetLowPrice': 4.4,
  'targetMeanPrice': 7.8,
  'targetMedianPrice': 8.56,
  'recommendationMean': 2.6,
  'recommendationKey': 'hold',
  'numberOfAnalystOpinions': 7,
  'totalCash': 1347949035520,
  'totalCashPerShare': 45.923,
  'totalDebt': 2201421086720,
  'totalRevenue': 108471001088,
  'revenuePerShare': 3.717,
  'returnOnAssets': 0.00562,
  'returnOnEquity': 0.0685,
  'grossProfits': 112019000000,
  'operatingCashflow': 400627990528,
  'earningsGrowth': -0.19,
  'revenueGrowth': -0.101,
  'grossMargins': 0.0,
  'ebitdaMargins': 0.0,
  'operatingMargins': 0.47944,
  'financialCurrency': 'CNY',
  'trailingPegRatio': None
}

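history: historical data

history returns historical data. The following parameters can be configured:

period: how far back to fetch data. There are 11 choices: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max.
interval: the granularity of the data. There are 13 choices: 1m, 2m, 5m, 15m, 30m, 60m, 90m, 1h, 1d, 5d, 1wk, 1mo, 3mo.
Intraday data (intervals shorter than one day) can be fetched for at most the last 60 days, and the finer the interval, the fewer days are available. For example, 1m data can only be fetched for the last 7 days.
start: if period is not set, start (start date) and end (end date) define the range of data to fetch.
end: see start.
prepost: whether to include pre-market and after-hours prices. Default False (not included).
auto_adjust: whether to adjust prices automatically (past prices are adjusted so that the latest price stays unchanged). Default True.
actions: whether to include dividend and stock split information. Default True.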
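The intraday limits described above can be checked before a request is sent. The helper below is only a sketch that encodes the two limits stated in this list (7 days for 1m, 60 days for the other intraday intervals); the function names are hypothetical, it is not part of yfinance, and the real limits are decided by Yahoo and may differ.

```python
from typing import Optional

# Intervals shorter than one day, taken from the list of interval choices above.
INTRADAY_INTERVALS = {"1m", "2m", "5m", "15m", "30m", "60m", "90m", "1h"}


def max_lookback_days(interval: str) -> Optional[int]:
    """Return the lookback limit in days, or None when no limit is stated."""
    if interval == "1m":
        return 7
    if interval in INTRADAY_INTERVALS:
        return 60
    return None


def request_is_plausible(interval: str, days: int) -> bool:
    """Check a requested window against the limits stated in the text."""
    limit = max_lookback_days(interval)
    return limit is None or days <= limit


print(request_is_plausible("1m", 5))    # within the 7-day limit
print(request_is_plausible("5m", 90))   # exceeds the 60-day limit
print(request_is_plausible("1d", 3650)) # no limit stated for daily data
```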

Example code:

import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

t1414 = yf.Ticker('1414.T')
t1414_y = t1414.history(start = "2013-01-01", end='2022-12-01',auto_adjust=False).reset_index()

df1 = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
df2 = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')
df = pd.concat([df1,df2])
df['Date'] = pd.to_datetime(df.Date)
t1414_jpx = df[df['SecuritiesCode']==1414]

sub_y = t1414_y.set_index('Date')
sub_y.index = pd.to_datetime(sub_y.index)

sub_jpx = t1414_jpx.set_index('Date')

plt.figure(figsize = (12,4))
plt.plot(sub_y['Close'], label='Close yfinance')
plt.plot(sub_jpx['Close'], label='Close JPX' )
plt.legend()  # without this call the labels above are not displayed
plt.show()

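Output:

Note! auto_adjust=False means that no price adjustment is requested. However, as the plot shows, even with auto_adjust=False yfinance still adjusts prices for stock splits; it only leaves dividends unadjusted.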
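One way to make this difference visible in numbers is to look at the ratio between the two close series: it jumps on the day a split takes effect. The sketch below uses made-up prices around a 2-for-1 split; the price values are hypothetical and only illustrate the comparison.

```python
import pandas as pd

dates = pd.to_datetime(["2019-06-24", "2019-06-25", "2019-06-26", "2019-06-27"])

# Hypothetical raw closes (not split-adjusted): the price halves on the split date.
close_jpx = pd.Series([8000.0, 8100.0, 4000.0, 4050.0], index=dates)

# Hypothetical split-adjusted closes: prices before the split are already halved.
close_yf = pd.Series([4000.0, 4050.0, 4000.0, 4050.0], index=dates)

# 0.5 before the split and 1.0 afterwards.
ratio = close_yf / close_jpx

# Dates on which the ratio changes, i.e. candidate split dates.
change_dates = ratio.index[ratio.round(4).diff().abs() > 0]
print(ratio)
print(list(change_dates))
```

A ratio that stays constant between such jumps suggests that the two sources differ only by split adjustment; a slowly drifting ratio would instead point to dividend adjustment.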

actions: dividends and stock splits

Example code:

t1414.actions

Output:

                           Dividends  Stock Splits
Date                                              
2013-06-26 00:00:00+09:00        1.5           0.0
2013-12-26 00:00:00+09:00       11.0           0.0
2014-06-26 00:00:00+09:00        3.5           0.0
2014-12-26 00:00:00+09:00        1.0           0.0
2015-06-26 00:00:00+09:00       26.5           0.0

[part of the output omitted]

2020-06-29 00:00:00+09:00       44.5           0.0
2020-12-29 00:00:00+09:00       40.0           0.0
2021-06-29 00:00:00+09:00       65.5           0.0
2021-12-29 00:00:00+09:00       50.0           0.0
2022-06-29 00:00:00+09:00       68.0           0.0

dividends: dividends only

Example code:

t1414.dividends

Output:

Date
2013-06-26 00:00:00+09:00     1.5
2013-12-26 00:00:00+09:00    11.0
2014-06-26 00:00:00+09:00     3.5
2014-12-26 00:00:00+09:00     1.0
2015-06-26 00:00:00+09:00    26.5

[part of the output omitted]

2020-06-29 00:00:00+09:00    44.5
2020-12-29 00:00:00+09:00    40.0
2021-06-29 00:00:00+09:00    65.5
2021-12-29 00:00:00+09:00    50.0
2022-06-29 00:00:00+09:00    68.0
Name: Dividends, dtype: float64

splits: stock splits only

Example code:

t1414.splits

Output:

Date
2019-06-26 00:00:00+09:00    2.0
Name: Stock Splits, dtype: float64

Application in This Article

Example code:

import pandas as pd
import numpy as np
import yfinance as yf
from tqdm import tqdm
import time

stock_list = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/stock_list.csv')
stock_list = stock_list[stock_list['Universe0'] == True]
codes = stock_list['SecuritiesCode'].unique()

cols = ['RowId','Date','SecuritiesCode','Open','High','Low','Close','Volume','AdjustmentFactor','ExpectedDividend','SupervisionFlag','Target']

all_hist = []
for tick in tqdm(codes):
    ticker = yf.Ticker(f"{tick}.T")
    hist = ticker.history(start = "2022-10-01", end='2023-01-31',back_adjust=True,auto_adjust=False).reset_index()
    # Keep only the calendar date as a YYYY-MM-DD string (drops any time and time-zone part)
    hist['Date'] = pd.to_datetime(hist['Date']).dt.strftime('%Y-%m-%d')
    hist['SecuritiesCode'] = tick
    hist['RowId'] = hist['Date'].str.replace('-', '') + '_' + str(tick)
    for col in ['Open','High','Low','Close','Volume']:
        hist[col] = pd.to_numeric(hist[col], errors='coerce')
    hist['Target'] = (hist['Close'].shift(-2) - hist['Close'].shift(-1))/hist['Close'].shift(-1)
    hist = hist.rename(columns = {'Dividends':'ExpectedDividend'})
    # A dividend of 0 means "no dividend on this day"; store it as NaN
    hist['ExpectedDividend'] = hist['ExpectedDividend'].replace(0, np.nan)
    hist['SupervisionFlag'] = np.nan
    hist['AdjustmentFactor'] = np.nan
    hist = hist[cols]
    all_hist.append(hist)
    time.sleep(2)

df = pd.concat(all_hist)
df.to_csv('df.csv',index=False)

Note! The time.sleep(2) call is essential; without it, the requests will be blocked by anti-scraping measures.
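A fixed pause can be combined with retries, so that a single blocked request does not abort the whole download. The helper below is a generic sketch and not part of yfinance: the name call_with_retry, the number of retries, and the doubling delay are all assumptions. The sleep function is injectable so that the example runs instantly with a fake fetch function.

```python
import time


def call_with_retry(func, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, then twice as long, and so on."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetch that fails twice before succeeding.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporarily blocked")
    return "ok"


# delays.append stands in for time.sleep, so the waits are recorded instead of executed.
result = call_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In the download loop, the call to history could be wrapped as call_with_retry(lambda: ticker.history(...)), keeping the pause between tickers.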

References:

https://www.kaggle.com/code/chumajin/easy-to-understand-the-competition/

https://www.jpx.co.jp/english/equities/market-restructure/market-segments/index.html

https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_13_sector.pdf

https://zh.wikipedia.org/wiki/夏普比率

https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition

https://baike.baidu.com/item/股票复权/2433275

https://www.kaggle.com/code/kellibelcher/jpx-stock-market-analysis-prediction-with-lgbm

https://blog.csdn.net/frank_haha/article/details/115082015

https://jpx.gitbook.io/j-quants-en/api-reference/daily_quotes

https://www.runoob.com/pandas/pandas-json.html

https://zhuanlan.zhihu.com/p/415552328

https://aroussi.com/post/python-yahoo-finance

https://www.wenvenn.com/20211215/shi-yong-python-bao-yfinance-du-qu-ya-hu-cai-jing-shang-de-gu-piao-shu-ju/

https://stackoverflow.com/questions/74942770/yfinance-deployed-with-streamlit-cloud-no-timezone-found-symbol-may-be-deliste

https://www.kaggle.com/code/bowaka/jpx-extended-dataset-with-yfinance

Author: Kaka Wan Yifan

Article link: https://kakawanyifan.com/20101
