

JPX-1. Competition Analysis

Competition link

JPX Tokyo Stock Exchange Prediction
https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction

Data

Data overview

+--data_specifications
| +--stock_fin_spec.csv
| +--options_spec.csv
| +--trades_spec.csv
| +--stock_list_spec.csv
| +--stock_price_spec.csv
+--stock_list.csv
+--jpx_tokyo_market_prediction
| +--competition.cpython-37m-x86_64-linux-gnu.so
| +--__init__.py
+--example_test_files
| +--options.csv
| +--secondary_stock_prices.csv
| +--stock_prices.csv
| +--trades.csv
| +--financials.csv
| +--sample_submission.csv
+--supplemental_files
| +--options.csv
| +--secondary_stock_prices.csv
| +--stock_prices.csv
| +--trades.csv
| +--financials.csv
+--train_files
| +--options.csv
| +--secondary_stock_prices.csv
| +--stock_prices.csv
| +--trades.csv
| +--financials.csv
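  • data_specifications
    Descriptions of the data (meaning of column names, explanations of the values, and so on).
  • stock_list.csv
    The list of stocks.
  • jpx_tokyo_market_prediction
    The JPX competition is a "Code Competition"; jpx_tokyo_market_prediction is the runtime environment for Linux.
  • example_test_files
    Sample data.
  • supplemental_files
    Can be treated as validation data.
  • train_files
    Training data.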

train_files

We focus on the files in train_files.

+--train_files
| +--options.csv
| +--secondary_stock_prices.csv
| +--stock_prices.csv
| +--trades.csv
| +--financials.csv

financials.csv

financials.csv: financial report data of the listed companies.

View the contents of financials.csv. Example code:

import pandas as pd

financials = pd.read_csv('../jpxd/train_files/financials.csv')
financials

Output:

       DisclosureNumber       DateCode        Date  SecuritiesCode DisclosedDate DisclosedTime  DisclosedUnixTime                         TypeOfDocument CurrentPeriodEndDate TypeOfCurrentPeriod CurrentFiscalYearStartDate CurrentFiscalYearEndDate      NetSales OperatingProfit OrdinaryProfit      Profit EarningsPerShare     TotalAssets          Equity EquityToAssetRatio BookValuePerShare ResultDividendPerShare1stQuarter ResultDividendPerShare2ndQuarter ResultDividendPerShare3rdQuarter ResultDividendPerShareFiscalYearEnd ResultDividendPerShareAnnual ForecastDividendPerShare1stQuarter ForecastDividendPerShare2ndQuarter ForecastDividendPerShare3rdQuarter ForecastDividendPerShareFiscalYearEnd ForecastDividendPerShareAnnual ForecastNetSales ForecastOperatingProfit ForecastOrdinaryProfit ForecastProfit ForecastEarningsPerShare ApplyingOfSpecificAccountingOfTheQuarterlyFinancialStatements MaterialChangesInSubsidiaries ChangesBasedOnRevisionsOfAccountingStandard ChangesOtherThanOnesBasedOnRevisionsOfAccountingStandard ChangesInAccountingEstimates RetrospectiveRestatement NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock NumberOfTreasuryStockAtTheEndOfFiscalYear AverageNumberOfShares
0 2.016121e+13 20170104_2753 2017-01-04 2753.0 2017-01-04 07:30:00 1.483483e+09 3QFinancialStatements_Consolidated_JP 2016-12-31 3Q 2016-04-01 2017-03-31 22761000000 2147000000 2234000000 1494000000 218.23 22386000000.0 18295000000.0 0.817 2671.42 - 50.0 - NaN NaN NaN NaN NaN 50.0 100.0 31800000000 3255000000 3300000000 2190000000 319.76 NaN False True False False False 6848800.0 - 6848800.0
1 2.017010e+13 20170104_3353 2017-01-04 3353.0 2017-01-04 15:00:00 1.483510e+09 3QFinancialStatements_Consolidated_JP 2016-11-30 3Q 2016-03-01 2017-02-28 22128000000 820000000 778000000 629000000 328.57 25100000000.0 7566000000.0 0.301 NaN - 36.0 - NaN NaN NaN NaN NaN 36.0 72.0 30200000000 1350000000 1300000000 930000000 485.36 NaN False True False False False 2035000.0 118917 1916083.0
2 2.016123e+13 20170104_4575 2017-01-04 4575.0 2017-01-04 12:00:00 1.483499e+09 ForecastRevision 2016-12-31 2Q 2016-07-01 2017-06-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 110000000 -465000000 -466000000 -467000000 -93.11 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2.017010e+13 20170105_2659 2017-01-05 2659.0 2017-01-05 15:00:00 1.483596e+09 3QFinancialStatements_Consolidated_JP 2016-11-30 3Q 2016-03-01 2017-02-28 134781000000 11248000000 11558000000 7171000000 224.35 128464000000.0 100905000000.0 0.765 3073.12 - 0.0 - NaN NaN NaN NaN NaN 42.0 42.0 177683000000 14168000000 14473000000 9111000000 285.05 NaN False True False False False 31981654.0 18257 31963405.0
4 2.017011e+13 20170105_3050 2017-01-05 3050.0 2017-01-05 15:30:00 1.483598e+09 ForecastRevision 2017-02-28 FY 2016-02-29 2017-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - - 13.0 24.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
92951 2.021112e+13 20211203_6040 2021-12-03 6040.0 2021-12-03 15:00:00 1.638511e+09 1QFinancialStatements_Consolidated_JP 2021-10-31 1Q 2021-08-01 2022-07-31 732000000 -274000000 -272000000 -206000000 -13.59 6952000000.0 4771000000.0 0.653 299.3 - NaN NaN NaN NaN NaN 0.0 - 7.0 7.0 - - - - - NaN False True False False False 16000400.0 836400 15164000.0
92952 2.021120e+13 20211203_6898 2021-12-03 6898.0 2021-12-03 16:00:00 1.638515e+09 3QFinancialStatements_Consolidated_JP 2021-10-31 3Q 2021-02-01 2022-01-31 1293000000 144000000 147000000 121000000 184.73 4246000000.0 3284000000.0 0.774 NaN - 0.0 - NaN NaN NaN NaN NaN 0.0 0.0 1479000000 106000000 107000000 93000000 142.01 NaN False False False False False 816979.0 157541 659486.0
92953 2.021120e+13 20211203_6969 2021-12-03 6969.0 2021-12-03 15:00:00 1.638511e+09 ForecastRevision 2022-03-31 FY 2021-04-01 2022-03-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4500000000 450000000 420000000 -380000000 -147.87 NaN NaN NaN NaN NaN NaN NaN NaN NaN
92954 2.021112e+13 20211203_8057 2021-12-03 8057.0 2021-12-03 17:00:00 1.638518e+09 1QFinancialStatements_Consolidated_JP 2021-10-20 1Q 2021-07-21 2022-07-20 43071000000 2565000000 2860000000 1507000000 153.74 116016000000.0 51116000000.0 0.396 NaN - NaN NaN NaN NaN NaN - - 110.0 110.0 210000000000 5300000000 5900000000 3250000000 330.92 NaN False True False False False 10419371.0 614032 9805339.0
92955 2.021120e+13 20211203_9627 2021-12-03 9627.0 2021-12-03 15:30:00 1.638513e+09 2QFinancialStatements_Consolidated_JP 2021-10-31 2Q 2021-05-01 2022-04-30 152972000000 5776000000 6127000000 3338000000 94.68 210442000000.0 115810000000.0 0.55 NaN - 0.0 NaN NaN NaN NaN NaN - 55.0 55.0 315000000000 15000000000 15500000000 8300000000 234.28 NaN False True False False False 35428212.0 200911 35260638.0

[92956 rows x 45 columns]

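Assuming there is no insider trading, if a financial report falls short of expectations, the price may keep reacting for some time after the report is published. However, it is hard to judge whether a report met expectations, so we will not pay much attention to this data.

As an aside, one thing in the financial data is quite interesting.
There are many fields related to forecasts:

  • ForecastDividendPerShare2ndQuarter
  • ForecastDividendPerShare3rdQuarter
  • ForecastDividendPerShareFiscalYearEnd
  • ForecastDividendPerShareAnnual
  • ForecastNetSales
  • ForecastOperatingProfit
  • ForecastOrdinaryProfit
  • ForecastProfit
  • ForecastEarningsPerShare

My impression is that fields like these would not exist in the A-share market.
Letting listed companies publish forecast dividends amounts to saying that bragging is not against the law.
I am also curious how companies listed on the Tokyo Stock Exchange handle this when they publish their financial reports.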

options.csv

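options.csv: options data.

Options do serve one function: price discovery.
However, according to the Tokyo Stock Exchange material linked below, there are no single-stock options and no sector-index options; the underlyings of the exchange's options are all macro-related indices or commodities.
In other words, the Tokyo Stock Exchange options provide no price discovery for individual stocks or sectors.
I therefore consider the options data to be of limited help.

[Figure: Tokyo Stock Exchange]

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/sicc/regulations/b5b4pj0000023mqo-att/(HP)sakimono20220208-e.pdf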

secondary_stock_prices.csv

Price data for the less actively traded securities outside the scored 2,000-stock universe.

stock_prices.csv

stock_prices.csv: stock price data.

Fields

Looking at the contents of stock_prices.csv shows that it is daily data. Example code:

stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
stock_prices

Output:

                 RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
0 20170104_1301 2017-01-04 1301 2734.0 2755.0 2730.0 2742.0 31400 1.0 NaN False 0.000730
1 20170104_1332 2017-01-04 1332 568.0 576.0 563.0 571.0 2798500 1.0 NaN False 0.012324
2 20170104_1333 2017-01-04 1333 3150.0 3210.0 3140.0 3210.0 270800 1.0 NaN False 0.006154
3 20170104_1376 2017-01-04 1376 1510.0 1550.0 1510.0 1550.0 11300 1.0 NaN False 0.011053
4 20170104_1377 2017-01-04 1377 3270.0 3350.0 3270.0 3330.0 150800 1.0 NaN False 0.003026
... ... ... ... ... ... ... ... ... ... ... ... ...
2332526 20211203_9990 2021-12-03 9990 514.0 528.0 513.0 528.0 44200 1.0 NaN False 0.034816
2332527 20211203_9991 2021-12-03 9991 782.0 794.0 782.0 794.0 35900 1.0 NaN False 0.025478
2332528 20211203_9993 2021-12-03 9993 1690.0 1690.0 1645.0 1645.0 7200 1.0 NaN False -0.004302
2332529 20211203_9994 2021-12-03 9994 2388.0 2396.0 2380.0 2389.0 6500 1.0 NaN False 0.009098
2332530 20211203_9997 2021-12-03 9997 690.0 711.0 686.0 696.0 381100 1.0 NaN False 0.018414

[2332531 rows x 12 columns]
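  • RowId: the unique identifier of each row.
  • Date: the date.
  • SecuritiesCode: the securities code.
  • Open, High, Low, Close, Volume: the opening price, highest price, lowest price, closing price and trading volume.
  • AdjustmentFactor: the adjustment factor used to adjust prices and volumes for corporate actions.
    Stock splits, reverse splits, rights issues, bonus shares, dividends and so on can all affect the adjustment factor.
    For example, if a share priced at 100 is split into 10 shares, each new share may be priced at 10, while the trading volume becomes 10 times the original.
  • ExpectedDividend: the expected dividend value for the ex-right date; the value is recorded 2 business days before the ex-dividend date.
  • SupervisionFlag: the security is under supervision or is about to be delisted (similar to the risk warning, ST, in mainland China).
  • Target: the target value, that is, the label.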

Target is computed as follows:

$$r_{(k,t)} = \frac{C_{(k,t+2)} - C_{(k,t+1)}}{C_{(k,t+1)}}$$

  • $C_{(k,t+2)}$ is the closing price of stock $k$ at time $t+2$.
  • $C_{(k,t+1)}$ is the closing price of stock $k$ at time $t+1$.
  • $r_{(k,t)}$ is the return of stock $k$ at time $t$, that is, Target.

The Target for day $t$ is the price change on day $t+2$: the return obtained by buying at the close on day $t+1$ and selling at the close on day $t+2$.
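The formula can be checked with a small sketch. The prices below are invented, and the column name RecomputedTarget and the helper add_target are hypothetical names chosen for this illustration; the sketch uses raw closing prices, and whether the official Target also accounts for adjustment factors is not covered here.

```python
import pandas as pd

# Toy closing prices for one stock (values invented for illustration).
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]),
    "SecuritiesCode": [1301, 1301, 1301, 1301],
    "Close": [100.0, 110.0, 121.0, 108.9],
})


def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Append RecomputedTarget = (C[t+2] - C[t+1]) / C[t+1], computed per stock."""
    df = df.sort_values(["SecuritiesCode", "Date"]).copy()
    close = df.groupby("SecuritiesCode")["Close"]
    close_t1 = close.shift(-1)  # close on the next trading day
    close_t2 = close.shift(-2)  # close two trading days ahead
    df["RecomputedTarget"] = (close_t2 - close_t1) / close_t1
    return df


result = add_target(prices)
print(result[["Date", "Close", "RecomputedTarget"]])
```

The last two rows of each stock have no value because the closes at t+1 and t+2 are not yet known.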

Time range

stock_prices.csv appears in two places:

  1. train_files/stock_prices.csv
  2. supplemental_files/stock_prices.csv

Check the time range of the two files. Example code:

train_files_stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
print(f'train start: {train_files_stock_prices.Date.min()}, train end: {train_files_stock_prices.Date.max()}')

supplemental_files_stock_prices = pd.read_csv('../jpxd/supplemental_files/stock_prices.csv')
print(f'supplemental start: {supplemental_files_stock_prices.Date.min()}, supplemental end: {supplemental_files_stock_prices.Date.max()}')

Output:

train start: 2017-01-04, train end: 2021-12-03
supplemental start: 2021-12-06, supplemental end: 2022-06-24

trades.csv

trades.csv: weekly market summaries.

Each row describes how one market section traded in a given week, broken down by investor type.
I do not think this data is helpful.

View trades.csv. Example code:

trades = pd.read_csv('../jpxd/train_files/trades.csv')
trades

Output:

            Date   StartDate     EndDate                           Section    TotalSales  TotalPurchases    TotalTotal  TotalBalance  ProprietarySales  ProprietaryPurchases  ProprietaryTotal  ProprietaryBalance  BrokerageSales  BrokeragePurchases  BrokerageTotal  BrokerageBalance  IndividualsSales  IndividualsPurchases  IndividualsTotal  IndividualsBalance  ForeignersSales  ForeignersPurchases  ForeignersTotal  ForeignersBalance  SecuritiesCosSales  SecuritiesCosPurchases  SecuritiesCosTotal  SecuritiesCosBalance  InvestmentTrustsSales  InvestmentTrustsPurchases  InvestmentTrustsTotal  InvestmentTrustsBalance  BusinessCosSales  BusinessCosPurchases  BusinessCosTotal  BusinessCosBalance  OtherInstitutionsSales  OtherInstitutionsPurchases  OtherInstitutionsTotal  OtherInstitutionsBalance  InsuranceCosSales  InsuranceCosPurchases  InsuranceCosTotal  InsuranceCosBalance  CityBKsRegionalBKsEtcSales  CityBKsRegionalBKsEtcPurchase  CityBKsRegionalBKsEtcTotal  CityBKsRegionalBKsEtcBalance  TrustBanksSales  TrustBanksPurchases  TrustBanksTotal  TrustBanksBalance  OtherFinancialInstitutionsSales  OtherFinancialInstitutionsPurchases  OtherFinancialInstitutionsTotal  OtherFinancialInstitutionsBalance
0 2017-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2017-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2017-01-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2017-01-10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2017-01-11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1707 2021-12-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1708 2021-12-02 2021-11-22 2021-11-26 Growth Market (Mothers/JASDAQ) 1.143466e+09 1.143923e+09 2.287389e+09 456677.0 3.663919e+07 3.496068e+07 7.159987e+07 -1678508.0 1.106827e+09 1.108962e+09 2.215789e+09 2135185.0 6.317277e+08 6.508934e+08 1.282621e+09 19165664.0 4.222267e+08 3.995653e+08 8.217920e+08 -22661428.0 19532301.0 20335113.0 39867414.0 802812.0 5712311.0 6802056.0 12514367.0 1089745.0 13241913.0 16476738.0 29718651.0 3234825.0 6947933.0 7211377.0 14159310.0 263444.0 170792.0 433284.0 604076.0 262492.0 335919.0 60311.0 396230.0 -275608.0 6696755.0 6886122.0 13582877.0 189367.0 234653.0 298525.0 533178.0 63872.0
1709 2021-12-02 2021-11-22 2021-11-26 Prime Market (First Section) 1.138343e+10 1.137621e+10 2.275964e+10 -7214179.0 1.499660e+09 1.230944e+09 2.730604e+09 -268716111.0 9.883766e+09 1.014527e+10 2.002903e+10 261501932.0 2.042100e+09 2.433004e+09 4.475104e+09 390904459.0 7.137596e+09 6.912257e+09 1.404985e+10 -225339255.0 74894037.0 88791160.0 163685197.0 13897123.0 183078463.0 159026769.0 342105232.0 -24051694.0 123642633.0 211502023.0 335144656.0 87859390.0 18341982.0 43479826.0 61821808.0 25137844.0 10839136.0 9695681.0 20534817.0 -1143455.0 26734116.0 9223824.0 35957940.0 -17510292.0 254580089.0 261919512.0 516499601.0 7339423.0 11959898.0 16368287.0 28328185.0 4408389.0
1710 2021-12-02 2021-11-22 2021-11-26 Standard Market (Second Section) 1.069969e+08 1.075036e+08 2.145004e+08 506702.0 2.811025e+06 3.273163e+06 6.084188e+06 462138.0 1.041858e+08 1.042304e+08 2.084162e+08 44564.0 6.587397e+07 6.573161e+07 1.316056e+08 -142356.0 2.898821e+07 2.868161e+07 5.766982e+07 -306605.0 2983832.0 3003763.0 5987595.0 19931.0 543907.0 367291.0 911198.0 -176616.0 4948282.0 5634326.0 10582608.0 686044.0 258986.0 560994.0 819980.0 302008.0 47298.0 0.0 47298.0 -47298.0 42127.0 0.0 42127.0 -42127.0 438928.0 243817.0 682745.0 -195111.0 60291.0 6985.0 67276.0 -53306.0
1711 2021-12-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

[1712 rows x 56 columns]

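An aside

According to official Tokyo Stock Exchange material, the market was restructured on April 4, 2022, and only then did the three segments Growth Market, Prime Market and Standard Market come into being. Yet trades.csv already contains weekly data for Growth Market, Prime Market and Standard Market in November 2021. The Section values carry both names, for example "Prime Market (First Section)", which suggests that the historical sections were relabelled with their successor segments.

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/equities/market-restructure/market-segments/index.html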

stock_list.csv

stock_list.csv: basic information about each stock.

View stock_list.csv. Example code:

stock_list = pd.read_csv('../jpxd/stock_list.csv')
stock_list

Output:

      SecuritiesCode  EffectiveDate                                               Name             Section/Products NewMarketSegment 33SectorCode                       33SectorName 17SectorCode                   17SectorName NewIndexSeriesSizeCode NewIndexSeriesSize   TradeDate    Close  IssuedShares  MarketCapitalization  Universe0
0 1301 20211230 KYOKUYO CO.,LTD. First Section (Domestic) Prime Market 50 Fishery, Agriculture and Forestry 1 FOODS 7 TOPIX Small 2 20211230.0 3080.0 1.092828e+07 3.365911e+10 True
1 1305 20211230 Daiwa ETF-TOPIX ETFs/ ETNs NaN - - - - - - 20211230.0 2097.0 3.634636e+09 7.621831e+12 False
2 1306 20211230 NEXT FUNDS TOPIX Exchange Traded Fund ETFs/ ETNs NaN - - - - - - 20211230.0 2073.5 7.917718e+09 1.641739e+13 False
3 1308 20211230 Nikko Exchange Traded Index Fund TOPIX ETFs/ ETNs NaN - - - - - - 20211230.0 2053.0 3.736943e+09 7.671945e+12 False
4 1309 20211230 NEXT FUNDS ChinaAMC SSE50 Index Exchange Trade... ETFs/ ETNs NaN - - - - - - 20211230.0 44280.0 7.263200e+04 3.216145e+09 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4412 9994 20211230 YAMAYA CORPORATION First Section (Domestic) Standard Market 6100 Retail Trade 14 RETAIL TRADE 7 TOPIX Small 2 20211230.0 2447.0 1.084787e+07 2.654474e+10 True
4413 9995 20211230 GLOSEL Co.,Ltd. First Section (Domestic) Prime Market 6050 Wholesale Trade 13 COMMERCIAL & WHOLESALE TRADE 7 TOPIX Small 2 20211230.0 410.0 2.642680e+07 1.083499e+10 False
4414 9996 20211230 Satoh&Co.,Ltd. JASDAQ(Standard / Domestic) Standard Market 6050 Wholesale Trade 13 COMMERCIAL & WHOLESALE TRADE - - 20211230.0 1488.0 9.152640e+06 1.361913e+10 False
4415 9997 20211230 BELLUNA CO.,LTD. First Section (Domestic) Prime Market 6100 Retail Trade 14 RETAIL TRADE 6 TOPIX Small 1 20211230.0 709.0 9.724447e+07 6.894633e+10 True
4416 25935 20211230 ITO EN,LTD.(shares of preferred stock) First Section (Domestic) NaN 3050 Foods 1 FOODS - - 20211230.0 1934.0 3.424696e+07 6.623362e+10 False

[4417 rows x 16 columns]
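  • 33SectorCode, 33SectorName
  • 17SectorCode, 17SectorName

These are two industry classification schemes. According to Tokyo Stock Exchange material, the "33" scheme is the finer one, and the "17" scheme is a regrouping built on top of "33".

[Figure: the 33-sector and 17-sector classifications]

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_13_sector.pdf

So the situation in the figure below, where two classification schemes only partly overlap in the stocks they group together, does not arise.

[Figure: industry classification]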

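NewIndexSeriesSizeCode, NewIndexSeriesSize

These fields indicate whether a stock is a constituent of a particular index.
Different indices impose different requirements on their constituents, so these fields do reflect certain characteristics of a stock.

[Figure: indices]

Tokyo Stock Exchange link: https://www.jpx.co.jp/english/markets/indices/line-up/files/e_fac_12_size.pdf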

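  • TradeDate
    The trading date used to compute the market capitalization.
  • Close
    The closing price used to compute the market capitalization.
  • IssuedShares
    The number of issued shares.
  • MarketCapitalization
    The market capitalization.
  • Universe0
    Flag for the prediction target universe (the top 2,000 stocks by market capitalization).
    That is, whether the stock has to be predicted in the competition.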

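Scoring

Portfolio

Although the dataset contains Target, the task is not to predict Target itself; in other words, this is not a regression problem.

On day $t$ we rank the 2,000 given stocks; on day $t+1$ we go long the top 200 and short the bottom 200, and on day $t+2$ we close the positions. The long and short positions are weighted according to rank.
The whole process therefore forms a portfolio, and what is finally measured is the Sharpe ratio of that portfolio.

Sharpe ratio

What is the Sharpe ratio

Overview

The Sharpe ratio measures the excess return earned per additional unit of risk taken.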

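Multiple versions of the Sharpe ratio

In 1966, William F. Sharpe proposed the first version of the Sharpe ratio:

$$S = \frac{\text{E}[R - R_{f}]}{\sqrt{\text{Var}[R]}}$$

where $R$ is the asset return and $R_{f}$ is the risk-free return.

In 1994, William F. Sharpe revised the first version and proposed the second version of the Sharpe ratio:

$$S = \frac{\text{E}[R - R_{f}]}{\sqrt{\text{Var}[R - R_{f}]}}$$

In the second version, if $R_{f}$ is constant over the whole period, then $\text{Var}[R] = \text{Var}[R - R_{f}]$, so the two versions give the same value.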
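A quick numerical check of this statement, as a sketch: the asset returns are the three values used in Example 2 of this article, while the constant risk_free value is a hypothetical number chosen only for illustration.

```python
import numpy as np

# Asset returns from the three-day example in this article.
returns = np.array([-0.005, 0.001, 0.005])
risk_free = 0.0001  # hypothetical constant risk-free return per period

excess = returns - risk_free

# Subtracting a constant shifts the mean but leaves the variance unchanged.
sharpe_1966 = excess.mean() / returns.std()  # denominator uses Var[R]
sharpe_1994 = excess.mean() / excess.std()   # denominator uses Var[R - Rf]

print(np.isclose(returns.var(), excess.var()))
print(np.isclose(sharpe_1966, sharpe_1994))
```

With a time-varying benchmark, such as an index return, the two denominators differ and so do the two ratios.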

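There are many other versions of the Sharpe ratio.

  1. Some replace the risk-free rate with a risky return (for example, a stock index) and use it as the benchmark return.
  2. Some ignore the risk-free rate entirely.
    (This is the case in the Tokyo Stock Exchange competition.)
  3. Some use the sample standard deviation instead of the population standard deviation.

All of these are ad hoc modifications made in industry practice to suit the situation at hand.

The Sharpe ratio is generally computed with the population standard deviation and rarely with the sample standard deviation. Realized volatility is the opposite: it is generally computed with the sample standard deviation and rarely with the population standard deviation.
For realized volatility, see 《Optiver-1.金融基础与赛题解析》 (Optiver-1. Financial Basics and Competition Analysis).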
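The difference between the two standard deviations is only the divisor (N versus N - 1). A minimal sketch, using the excess returns from Example 2 of this article:

```python
import numpy as np
import pandas as pd

excess = [-0.0001581, -0.0007234, 0.0003890]

population_std = np.std(excess)          # ddof=0: divide by N
sample_std = np.std(excess, ddof=1)      # ddof=1: divide by N - 1
pandas_std = pd.Series(excess).std()     # pandas defaults to ddof=1

print(population_std, sample_std, pandas_std)
```

Note that np.std defaults to the population form while pandas Series.std defaults to the sample form, so the same formula written with the two libraries can give slightly different Sharpe ratios.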

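Examples

Example 1

Suppose the expected return of an asset exceeds the risk-free rate by $15\%$, and the risk of the asset (defined as the standard deviation of its excess return) is estimated at $10\%$. With a constant risk-free return, the Sharpe ratio (using the first version) is

$$S = \frac{0.15}{0.10} = 1.5$$

Example 2

Date     Asset return    S&P 500 total return    Excess return
Day 1    -0.0050000      -0.0048419              -0.0001581
Day 2     0.0010000       0.0017234              -0.0007234
Day 3     0.0050000       0.0046110               0.0003890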

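Example code:

import numpy as np

# Excess returns
crsy = [-0.0001581, -0.0007234, 0.0003890]

# Asset returns
zcsy = [-0.005, 0.001, 0.005]

# Mean of the excess returns divided by the standard deviation of the excess returns
print(np.mean(crsy) / np.std(crsy))

# Mean of the excess returns divided by the standard deviation of the asset returns
print(np.mean(crsy) / np.std(zcsy))

Output:

-0.3614766513736107
-0.03994702495345027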

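Application in the competition

According to Tokyo Stock Exchange material, the competition's Sharpe ratio is computed without a risk-free return (benchmark return).

The calculation proceeds as follows:

  1. Compute the return of each stock.
  2. For a given day, compute the return from going long the top 200.
  3. For a given day, compute the return from shorting the bottom 200.
  4. Compute the total return for that day.
  5. Compute the Sharpe ratio (ignoring the risk-free return).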

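Computing the stock return

$$r_{(k, t)} = \frac{C_{(k, t+2)} - C_{(k, t+1)}}{C_{(k, t+1)}}$$

where $C_{(k, t+1)}$ is the closing price of stock $k$ on day $t+1$, and $C_{(k, t+2)}$ is the closing price of stock $k$ on day $t+2$.
That is, the decision is made on day $t$, the long or short position is opened at the close on day $t+1$ and closed at the close on day $t+2$; $r_{(k, t)}$ is the return from investing in this stock.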

Going long the top 200

The top 200 stocks in the ranking are bought. The weights follow a linear function from $2$ down to $1$: the first weight is $2$, the second is about $1.995$, the third is about $1.990$, and so on. The weighted sum is finally divided by the mean of the linear function, $1.5$.

$$S_{\text{up},t} = \frac{\sum^{200}_{i=1}(r_{({\text{up}_i}, t)} * \text{linear function}(2, 1)_i)}{\text{Average}(\text{linear function}(2, 1))}$$
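The weights can be reproduced with np.linspace, which is also what the implementation in this article uses. The sketch below builds them and evaluates the long leg for a hypothetical case in which all 200 stocks return 1% (an invented input, chosen so the result is easy to verify by hand).

```python
import numpy as np

# 200 linearly spaced weights from 2 down to 1; the step is 1/199, roughly 0.005.
weights = np.linspace(start=2, stop=1, num=200)
print(weights[:3])      # first three weights
print(weights.mean())   # mean of the linear function

# Hypothetical returns: every one of the top 200 stocks gains 1%.
returns_up = np.full(200, 0.01)
s_up = (returns_up * weights).sum() / weights.mean()
print(s_up)
```

Because the weighted sum is divided by the mean weight rather than by the sum of the weights, the leg is 200 times a weighted average return. This constant scale cancels out in the Sharpe ratio, since both the mean and the standard deviation of the daily series are scaled by the same factor.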

Shorting the bottom 200

The bottom 200 stocks in the ranking are shorted. The weights follow a linear function from $2$ down to $1$: the last-ranked stock has weight $2$, the second from last about $1.995$, the third from last about $1.990$, and so on. The weighted sum is finally divided by the mean, $1.5$.

$$S_{\text{down},t} = \frac{\sum^{200}_{i=1}(r_{({\text{down}_i}, t)} * \text{linear function}(2, 1)_i)}{\text{Average}(\text{linear function}(2, 1))}$$

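Computing the total daily return

$$R_{t} = S_{\text{up},t} - S_{\text{down},t}$$

Because the bottom 200 stocks are shorted, $S_{\text{down},t}$ is subtracted when computing the total return.

Computing the Sharpe ratio

The score is the mean of the daily total return series divided by the standard deviation of that series. No risk-free return is used, and the Nikkei index is not used as a benchmark return either.

$$\text{Score} = \frac{\text{Average}(R_{\text{series}})}{\text{STD}(R_{\text{series}})}$$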

Implementation

import numpy as np
import pandas as pd


def calc_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, top_rank_weight_ratio: float = 2) -> float:
    """
    Compute the Sharpe ratio.
    :param df: ranked predictions (uses the columns Date, Rank and Target)
    :param portfolio_size: maximum number of assets on each side of the portfolio
    :param top_rank_weight_ratio: weight of the first (or last) ranked stock
    :return: the Sharpe ratio
    """

    def _calc_return_per_day(df, portfolio_size_val, top_rank_weight_ratio_val):
        """
        Compute the return of one day.
        :param df: ranked predictions for a single date
        :param portfolio_size_val: maximum number of assets on each side of the portfolio
        :param top_rank_weight_ratio_val: weight of the first (or last) ranked stock
        :return: the return of that day
        """
        # Ranks must run from 0 to N - 1 within the day
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=top_rank_weight_ratio_val, stop=1, num=portfolio_size_val)
        # Long leg: the smallest ranks
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size_val] * weights).sum() / weights.mean()
        # Short leg: the largest ranks
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size_val] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_return_per_day, portfolio_size, top_rank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio
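The function above expects a Rank column that runs from 0 to N - 1 within each date. One possible way to derive it from model scores is sketched below; the Prediction column and its values are hypothetical and are not part of the competition data.

```python
import pandas as pd

# Hypothetical model scores for four stocks on two dates (invented values).
preds = pd.DataFrame({
    "Date": ["2022-01-04"] * 4 + ["2022-01-05"] * 4,
    "SecuritiesCode": [1301, 1332, 1333, 1376] * 2,
    "Prediction": [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.015, 0.04],
})

# A higher score gets a smaller rank. method="first" assigns distinct ranks
# even when scores tie, so every date receives 0..N-1 exactly once.
preds["Rank"] = (
    preds.groupby("Date")["Prediction"]
    .rank(ascending=False, method="first")
    .astype(int)
    - 1
)
print(preds)
```

Distinct ranks matter here because the two assertions in _calc_return_per_day only check the minimum and the maximum rank; duplicated ranks in between would pass silently.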

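Missing values (trading halts)

Why there are missing values

In daily stock data, missing values usually mean that trading in the stock was halted (an outage at the exchange itself cannot be ruled out).

Whether missing values exist

Finding which columns contain missing values

The result is True if the column contains missing values and False otherwise. Example code:

stock_prices = pd.concat([train_files_stock_prices, supplemental_files_stock_prices], axis=0)
stock_prices.reset_index(drop=True, inplace=True)
stock_prices.isnull().any()

Output:
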
RowId               False
Date False
SecuritiesCode False
Open True
High True
Low True
Close True
Volume False
AdjustmentFactor False
ExpectedDividend True
SupervisionFlag False
Target True
dtype: bool

Counting the missing values in each column

Count the missing values in each column. Example code:

stock_prices.isnull().sum()

Output:

RowId                     0
Date 0
SecuritiesCode 0
Open 8426
High 8426
Low 8426
Close 8426
Volume 0
AdjustmentFactor 0
ExpectedDividend 2581536
SupervisionFlag 0
Target 246
dtype: int64

Inspecting the missing values

Rows where Open is missing

View the rows where Open is missing. Example code:

stock_prices[stock_prices.Open.isnull()]

Output:

                 RowId        Date  SecuritiesCode  Open  High  Low  Close  Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
401 20170104_3540 2017-01-04 3540 NaN NaN NaN NaN 0 1.0 NaN False NaN
1753 20170104_9539 2017-01-04 9539 NaN NaN NaN NaN 0 1.0 NaN False -0.004149
2266 20170105_3540 2017-01-05 3540 NaN NaN NaN NaN 0 1.0 NaN False NaN
2511 20170105_4621 2017-01-05 4621 NaN NaN NaN NaN 0 1.0 NaN False 0.000000
4131 20170106_3540 2017-01-06 3540 NaN NaN NaN NaN 0 1.0 NaN False NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
2600516 20220624_1981 2022-06-24 1981 NaN NaN NaN NaN 0 1.0 NaN False 0.007634
2600687 20220624_2814 2022-06-24 2814 NaN NaN NaN NaN 0 1.0 NaN False 0.000000
2601130 20220624_4628 2022-06-24 4628 NaN NaN NaN NaN 0 1.0 NaN False 0.000000
2601397 20220624_6144 2022-06-24 6144 NaN NaN NaN NaN 0 1.0 NaN False 0.040578
2601516 20220624_6484 2022-06-24 6484 NaN NaN NaN NaN 0 1.0 NaN False 0.013667

[8426 rows x 12 columns]

Looking at 2020-10-01

Look at the day 2020-10-01. Example code:

stock_prices[stock_prices.Date == '2020-10-01'].isnull().sum()

Output:

RowId                  0
Date 0
SecuritiesCode 0
Open 1988
High 1988
Low 1988
Close 1988
Volume 0
AdjustmentFactor 0
ExpectedDividend 1988
SupervisionFlag 0
Target 0
dtype: int64

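We can see that many stocks have missing values on this day, because the Tokyo Stock Exchange suffered an outage on that day.

[Figure: Tokyo Stock Exchange outage]

Handling missing values

Several approaches

In 《机器学习实战方法(Python):特征工程-1.特征预处理》 we discussed how to fill missing values. Technically, the mean or the median can be used for continuous variables and the mode for discrete variables.

In this article, because the usual cause of a missing value is a trading halt, I think neither the mean nor the median is appropriate.
The suggested approaches are:

  1. Leave the values as they are.
  2. Drop the data for the halt period.
  3. Set the volume to 0 and set the open, close, high and low to the closing price of the most recent trading day.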

Method 1

With the first method, leaving the values untouched, take particular care with arithmetic on missing values. Example code:

1 + 2 + 3 + None

Output:

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
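In a DataFrame loaded with read_csv, missing prices are stored as NaN rather than None, and NaN does not raise an error; it propagates through arithmetic or is skipped, depending on the operation. A small sketch with a toy series (two closing prices from the SecuritiesCode 2266 example in this article with a gap between them):

```python
import numpy as np
import pandas as pd

close = pd.Series([1884.0, np.nan, 1818.0])

print(1 + 2 + 3 + np.nan)        # nan: no exception, NaN propagates
print(close.sum())               # 3702.0: pandas skips NaN by default
print(close.sum(skipna=False))   # nan when NaN is not skipped
print(close.diff())              # every difference touching the gap is NaN
```

This is why leaving the gaps in place calls for care: some aggregates ignore them without warning, while element-wise features around the gap become NaN.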

Method 2

Implementation

dropna can be used to drop the rows that contain missing values.

Example code:

stock_prices.dropna(subset=['Open', 'Target'], inplace=True)
stock_prices.reset_index(drop=True, inplace=True)
print(stock_prices)
print(stock_prices.isnull().sum())

Output:

                 RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
0 20170104_1301 2017-01-04 1301 2734.0 2755.0 2730.0 2742.0 31400 1.0 NaN False 0.000730
1 20170104_1332 2017-01-04 1332 568.0 576.0 563.0 571.0 2798500 1.0 NaN False 0.012324
2 20170104_1333 2017-01-04 1333 3150.0 3210.0 3140.0 3210.0 270800 1.0 NaN False 0.006154
3 20170104_1376 2017-01-04 1376 1510.0 1550.0 1510.0 1550.0 11300 1.0 NaN False 0.011053
4 20170104_1377 2017-01-04 1377 3270.0 3350.0 3270.0 3330.0 150800 1.0 NaN False 0.003026
... ... ... ... ... ... ... ... ... ... ... ... ...
2593973 20220624_9990 2022-06-24 9990 576.0 576.0 563.0 564.0 24200 1.0 NaN False 0.027073
2593974 20220624_9991 2022-06-24 9991 810.0 815.0 804.0 815.0 8700 1.0 NaN False 0.001220
2593975 20220624_9993 2022-06-24 9993 1548.0 1548.0 1497.0 1497.0 12600 1.0 NaN False 0.001329
2593976 20220624_9994 2022-06-24 9994 2507.0 2527.0 2498.0 2527.0 7300 1.0 NaN False 0.003185
2593977 20220624_9997 2022-06-24 9997 710.0 725.0 710.0 719.0 139600 1.0 NaN False 0.015089

[2593978 rows x 12 columns]
RowId 0
Date 0
SecuritiesCode 0
Open 0
High 0
Low 0
Close 0
Volume 0
AdjustmentFactor 0
ExpectedDividend 2573131
SupervisionFlag 0
Target 0
dtype: int64

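Why this is not recommended

However, this method is not recommended.

Once the halted rows are deleted, sliding-window processing becomes troublesome, because it is hard to tell whether a row is absent because a halt was deleted or because the date was not a trading day and never had data.
For example, suppose we want statistics over the 5 trading days from day 1 to day 5, the stock was halted on day 3, that row has been deleted, and we simply take the next 5 rows in order. The statistics may then actually cover day 1 to day 6.
In short, this method is not recommended.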
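The day 1 to day 6 effect can be reproduced with a toy series. The dates and prices below are invented, and business days stand in for trading days:

```python
import numpy as np
import pandas as pd

# Six hypothetical trading days for one stock; the third is a halt (NaN close).
days = pd.date_range("2022-01-03", periods=6, freq="B")
close = pd.Series([100.0, 101.0, np.nan, 103.0, 104.0, 105.0], index=days)

# After dropping the halted row, "the first five rows" reach into the sixth day.
dropped = close.dropna()
first_five_rows = dropped.iloc[:5]
print(first_five_rows.index[-1])

# Keeping the row preserves the calendar: the first five rows are days 1 to 5.
kept = close.iloc[:5]
print(kept.index[-1])
```

Row-based windows such as rolling(5) behave the same way, which is why keeping one row per stock per trading day makes window features easier to reason about.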

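Method 3

Approach

Set the volume to 0 and set the open, close, high and low to the closing price of the most recent trading day.

  • For "set the volume to 0", fillna is enough.
  • For "set the open, close, high and low to the closing price of the most recent trading day":
    • first group by stock with groupby;
    • then use ffill down the rows to fill the closing price;
    • then use bfill across the columns to fill the open, high and low.

For fillna and how it combines with groupby, see 《经典机器学习及其Python实现:2.特征预处理》.

Implementation

Example code:
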
train_files_stock_prices = pd.read_csv('../jpxd/train_files/stock_prices.csv')
supplemental_files_stock_prices = pd.read_csv('../jpxd/supplemental_files/stock_prices.csv')

stock_prices = pd.concat([train_files_stock_prices, supplemental_files_stock_prices], axis=0)
stock_prices.reset_index(drop=True, inplace=True)

# Before filling: SecuritiesCode 2266 around the 2020-10-01 outage
stock_prices_2266 = stock_prices[stock_prices['SecuritiesCode'] == 2266]
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] <= '2020-10-10']
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] >= '2020-09-20']
print(stock_prices_2266)

# Fill missing volume with 0
stock_prices['Volume'] = stock_prices['Volume'].fillna(0)

# Forward-fill Close, Low, High and Open within each stock;
# each column takes its own most recent value
for col in ['Close', 'Low', 'High', 'Open']:
    stock_prices[col] = stock_prices.groupby('SecuritiesCode')[col].ffill()

# After filling: the same rows again
stock_prices_2266 = stock_prices[stock_prices['SecuritiesCode'] == 2266]
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] <= '2020-10-10']
stock_prices_2266 = stock_prices_2266[stock_prices_2266['Date'] >= '2020-09-20']
print(stock_prices_2266)

Output:

                 RowId        Date  SecuritiesCode    Open    High     Low   Close  Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
1743279 20200923_2266 2020-09-23 2266 1849.0 1850.0 1817.0 1826.0 19500 1.0 NaN False -0.028649
1745262 20200924_2266 2020-09-24 2266 1847.0 1853.0 1831.0 1850.0 24200 1.0 NaN False 0.041180
1747246 20200925_2266 2020-09-25 2266 1850.0 1865.0 1775.0 1797.0 61100 1.0 NaN False 0.024051
1749232 20200928_2266 2020-09-28 2266 1831.0 1873.0 1797.0 1871.0 42800 1.0 NaN False -0.016701
1751218 20200929_2266 2020-09-29 2266 1880.0 1925.0 1855.0 1916.0 35700 1.0 NaN False 0.000000
1753204 20200930_2266 2020-09-30 2266 1912.0 1945.0 1884.0 1884.0 25700 1.0 NaN False -0.035032
1755190 20201001_2266 2020-10-01 2266 NaN NaN NaN NaN 0 1.0 NaN False 0.022552
1757178 20201002_2266 2020-10-02 2266 1910.0 1910.0 1810.0 1818.0 22800 1.0 NaN False -0.002152
1759167 20201005_2266 2020-10-05 2266 1834.0 1880.0 1834.0 1859.0 10700 1.0 NaN False -0.015094
1761157 20201006_2266 2020-10-06 2266 1864.0 1864.0 1827.0 1855.0 14100 1.0 NaN False -0.008758
1763147 20201007_2266 2020-10-07 2266 1842.0 1851.0 1827.0 1827.0 11900 1.0 NaN False -0.014909
1765137 20201008_2266 2020-10-08 2266 1828.0 1852.0 1804.0 1811.0 14500 1.0 NaN False -0.011771
1767127 20201009_2266 2020-10-09 2266 1812.0 1812.0 1760.0 1784.0 15000 1.0 NaN False 0.000567

RowId Date SecuritiesCode Open High Low Close Volume AdjustmentFactor ExpectedDividend SupervisionFlag Target
1743279 20200923_2266 2020-09-23 2266 1849.0 1850.0 1817.0 1826.0 19500 1.0 NaN False -0.028649
1745262 20200924_2266 2020-09-24 2266 1847.0 1853.0 1831.0 1850.0 24200 1.0 NaN False 0.041180
1747246 20200925_2266 2020-09-25 2266 1850.0 1865.0 1775.0 1797.0 61100 1.0 NaN False 0.024051
1749232 20200928_2266 2020-09-28 2266 1831.0 1873.0 1797.0 1871.0 42800 1.0 NaN False -0.016701
1751218 20200929_2266 2020-09-29 2266 1880.0 1925.0 1855.0 1916.0 35700 1.0 NaN False 0.000000
1753204 20200930_2266 2020-09-30 2266 1912.0 1945.0 1884.0 1884.0 25700 1.0 NaN False -0.035032
1755190 20201001_2266 2020-10-01 2266 1912.0 1945.0 1884.0 1884.0 0 1.0 NaN False 0.022552
1757178 20201002_2266 2020-10-02 2266 1910.0 1910.0 1810.0 1818.0 22800 1.0 NaN False -0.002152
1759167 20201005_2266 2020-10-05 2266 1834.0 1880.0 1834.0 1859.0 10700 1.0 NaN False -0.015094
1761157 20201006_2266 2020-10-06 2266 1864.0 1864.0 1827.0 1855.0 14100 1.0 NaN False -0.008758
1763147 20201007_2266 2020-10-07 2266 1842.0 1851.0 1827.0 1827.0 11900 1.0 NaN False -0.014909
1765137 20201008_2266 2020-10-08 2266 1828.0 1852.0 1804.0 1811.0 14500 1.0 NaN False -0.011771
1767127 20201009_2266 2020-10-09 2266 1812.0 1812.0 1760.0 1784.0 15000 1.0 NaN False 0.000567
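In the second table above, the 2020-10-01 row is filled with 1912.0, 1945.0, 1884.0 and 1884.0: each column was carried forward from its own previous value. If the intent is for Open, High and Low to equal the most recent close instead, one possible variant is sketched below on three rows copied from that output. This is an illustration under that assumption, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Rows for SecuritiesCode 2266, copied from the output above.
df = pd.DataFrame({
    "Date": ["2020-09-30", "2020-10-01", "2020-10-02"],
    "SecuritiesCode": [2266, 2266, 2266],
    "Open": [1912.0, np.nan, 1910.0],
    "High": [1945.0, np.nan, 1910.0],
    "Low": [1884.0, np.nan, 1810.0],
    "Close": [1884.0, np.nan, 1818.0],
    "Volume": [25700, 0, 22800],
})

df = df.sort_values(["SecuritiesCode", "Date"]).reset_index(drop=True)

# Step 1: carry the last known close forward within each stock.
df["Close"] = df.groupby("SecuritiesCode")["Close"].ffill()

# Step 2: where Open, High or Low is still missing, use the filled close.
for col in ["Open", "High", "Low"]:
    df[col] = df[col].fillna(df["Close"])

print(df)
```

With this variant the halted day becomes a flat bar at 1884.0 with zero volume, so features such as the daily high-low range are 0 for that day instead of repeating the previous day's range.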

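Price adjustment

Two kinds of adjustment

  • Forward adjustment
    Keeps the current prices unchanged and rescales earlier prices (downwards, in the case of a split).
  • Backward adjustment
    Keeps the earlier prices unchanged and rescales later prices (upwards, in the case of a split).

Implementing the adjustment (forward adjustment)

Prices are multiplied by the cumulative adjustment factor.
Volumes are divided by the cumulative adjustment factor.

View the data before adjustment. Example code:
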
check_1805 = stock_prices[stock_prices['SecuritiesCode'] == 1805]

check_1805 = check_1805[check_1805['Date'] >= '2018-09-20']
check_1805 = check_1805[check_1805['Date'] <= '2018-09-30']

print(check_1805)

Output:

                RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target
800771 20180920_1805 2018-09-20 1805 183.0 187.0 181.0 186.0 2196400 1.0 NaN False -0.005291
802685 20180921_1805 2018-09-21 1805 187.0 191.0 186.0 189.0 2283600 1.0 NaN False 0.005319
804600 20180925_1805 2018-09-25 1805 186.0 190.0 185.0 188.0 1979800 10.0 NaN False 0.000529
806515 20180926_1805 2018-09-26 1805 1911.0 1911.0 1853.0 1890.0 190800 1.0 NaN False 0.015336
808430 20180927_1805 2018-09-27 1805 1880.0 1907.0 1879.0 1891.0 133300 1.0 NaN False -0.005208
810346 20180928_1805 2018-09-28 1805 1900.0 1938.0 1895.0 1920.0 145500 1.0 NaN False -0.026178

Adjustment, example code:

def generate_adjusted_feature_one_stock(one_stock_df: pd.DataFrame) -> pd.DataFrame:
    """
    Add forward-adjusted OHLCV columns for a single stock.

    :param one_stock_df: the stock_prices rows of one SecuritiesCode
    :return: a copy sorted by Date (newest first) with the adjusted columns added
    """
    # Sort newest first so that cumprod accumulates every factor dated on or after each row
    adjusted_df = one_stock_df.sort_values("Date", ascending=False)
    # Cumulative adjustment factor
    adjusted_df.loc[:, "CumulativeAdjustmentFactor"] = adjusted_df["AdjustmentFactor"].cumprod()

    # Open
    adjusted_df.loc[:, "AdjustedOpen"] = adjusted_df["Open"] * adjusted_df["CumulativeAdjustmentFactor"]
    # High
    adjusted_df.loc[:, "AdjustedHigh"] = adjusted_df["High"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Low
    adjusted_df.loc[:, "AdjustedLow"] = adjusted_df["Low"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Close
    adjusted_df.loc[:, "AdjustedClose"] = adjusted_df["Close"] * adjusted_df["CumulativeAdjustmentFactor"]
    # Volume
    adjusted_df.loc[:, "AdjustedVolume"] = adjusted_df["Volume"] / adjusted_df["CumulativeAdjustmentFactor"]

    return adjusted_df

adjusted_stock_prices = stock_prices.groupby("SecuritiesCode").apply(generate_adjusted_feature_one_stock).reset_index(drop=True)

After adjustment; note the AdjustedOpen, AdjustedHigh, AdjustedLow, AdjustedClose and AdjustedVolume columns. Example code:

check_1805 = adjusted_stock_prices[adjusted_stock_prices['SecuritiesCode'] == 1805]

check_1805 = check_1805[check_1805['Date'] >= '2018-09-20']
check_1805 = check_1805[check_1805['Date'] <= '2018-09-30']

print(check_1805)

Output:

               RowId        Date  SecuritiesCode    Open    High     Low   Close   Volume  AdjustmentFactor  ExpectedDividend  SupervisionFlag    Target  CumulativeAdjustmentFactor  AdjustedOpen  AdjustedHigh  AdjustedLow  AdjustedClose  AdjustedVolume
50114 20180928_1805 2018-09-28 1805 1900.0 1938.0 1895.0 1920.0 145500 1.0 NaN False -0.026178 1.0 1900.0 1938.0 1895.0 1920.0 145500.0
50115 20180927_1805 2018-09-27 1805 1880.0 1907.0 1879.0 1891.0 133300 1.0 NaN False -0.005208 1.0 1880.0 1907.0 1879.0 1891.0 133300.0
50116 20180926_1805 2018-09-26 1805 1911.0 1911.0 1853.0 1890.0 190800 1.0 NaN False 0.015336 1.0 1911.0 1911.0 1853.0 1890.0 190800.0
50117 20180925_1805 2018-09-25 1805 186.0 190.0 185.0 188.0 1979800 10.0 NaN False 0.000529 10.0 1860.0 1900.0 1850.0 1880.0 197980.0
50118 20180921_1805 2018-09-21 1805 187.0 191.0 186.0 189.0 2283600 1.0 NaN False 0.005319 10.0 1870.0 1910.0 1860.0 1890.0 228360.0
50119 20180920_1805 2018-09-20 1805 183.0 187.0 181.0 186.0 2196400 1.0 NaN False -0.005291 10.0 1830.0 1870.0 1810.0 1860.0 219640.0

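Debate

Some references adjust only the prices, or even only the closing price. That is acceptable if the analysis uses prices alone and ignores the price-volume relationship, or uses the closing price alone.
However, as soon as prices other than the close are used they must be adjusted too, and if the price-volume relationship matters, the volume must be adjusted as well.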
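One way to see why the volume matters is to check the traded value (price times volume), which a share consolidation should leave unchanged. The sketch below uses two illustrative rows; the names raw_turnover, full_turnover and price_only_turnover are made up for this example.

```python
import pandas as pd

# Two illustrative rows: the day before and the day after a 10-for-1 consolidation.
bars = pd.DataFrame({
    "Close": [188.0, 1890.0],
    "Volume": [1979800.0, 190800.0],
    "CumulativeAdjustmentFactor": [10.0, 1.0],
})
bars["AdjustedClose"] = bars["Close"] * bars["CumulativeAdjustmentFactor"]
bars["AdjustedVolume"] = bars["Volume"] / bars["CumulativeAdjustmentFactor"]

# Traded value from the raw columns.
raw_turnover = bars["Close"] * bars["Volume"]
# Traded value when both price and volume are adjusted: unchanged.
full_turnover = bars["AdjustedClose"] * bars["AdjustedVolume"]
# Traded value when only the price is adjusted: inflated by the factor.
price_only_turnover = bars["AdjustedClose"] * bars["Volume"]

print(pd.DataFrame({"raw": raw_turnover,
                    "price_and_volume": full_turnover,
                    "price_only": price_only_turnover}))
```

With the price adjusted but the volume left raw, the first row's traded value is overstated tenfold, which would distort any price-volume feature.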

Exploration

Overall market

Example code:

# Drop the unadjusted columns
stock = adjusted_stock_prices.drop(columns=['Open', 'High', 'Low', 'Close', 'Volume', 'CumulativeAdjustmentFactor'])
# Rename the adjusted columns back to the original names
stock.rename(
    columns={'AdjustedOpen': 'Open', 'AdjustedHigh': 'High', 'AdjustedLow': 'Low',
             'AdjustedClose': 'Close', 'AdjustedVolume': 'Volume'}, inplace=True)

# ExpectedDividend: treat missing values as 0
stock['ExpectedDividend'] = stock['ExpectedDividend'].fillna(0)

stock['Date'] = pd.to_datetime(stock['Date'])

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

colors = px.colors.qualitative.Plotly
temp = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12), width=800))

# Average return per trading day (mean of Target, in percent)
returns = stock.groupby('Date')['Target'].mean().mul(100).rename('Average Return')
# Average closing price
close_avg = stock.groupby('Date')['Close'].mean().rename('Closing Price')
# Average volume
vol_avg = stock.groupby('Date')['Volume'].mean().rename('Volume')

fig = make_subplots(rows=3, cols=1, shared_xaxes=True)
for i, j in enumerate([returns, close_avg, vol_avg]):
    # Use the date-sorted groupby index as x. stock.Date.unique() follows row order,
    # which is newest first after the adjustment step and would mirror the chart.
    fig.add_trace(go.Scatter(x=j.index, y=j, mode='lines',
                             name=j.name, marker_color=colors[i]), row=i+1, col=1)
fig.update_xaxes(rangeslider_visible=False,
                 rangeselector=dict(
                     buttons=list([
                         dict(count=6, label="6m", step="month", stepmode="backward"),
                         dict(count=1, label="1y", step="year", stepmode="backward"),
                         dict(count=2, label="2y", step="year", stepmode="backward"),
                         dict(step="all")])),
                 row=1, col=1)
fig.update_layout(template=temp, title='JPX Market Average Stock Return, Closing Price, and Shares Traded',
                  hovermode='x unified', height=700,
                  yaxis1=dict(title='Stock Return', ticksuffix='%'),
                  yaxis2_title='Closing Price', yaxis3_title='Shares Traded',
                  showlegend=False)
fig.show()

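Output:

Figure: overall market

Some references claim that trading volume has been declining steadily, but that impression comes from not adjusting the volume.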

As an aside, saving a plotly figure to an HTML file, example code:

import plotly
plotly.offline.plot(fig, filename='file1.html')

Yearly Return by sector

Prepare the data:

stock_list = pd.read_csv("../jpxd/stock_list.csv")

# Sector name
stock_list['SectorName'] = [i.rstrip().lower().capitalize() for i in stock_list['17SectorName']]
# Stock name
stock_list['Name'] = [i.rstrip().lower().capitalize() for i in stock_list['Name']]
# Merge prices with the stock list
stock_df = stock.merge(stock_list[['SecuritiesCode', 'Name', 'SectorName']], on='SecuritiesCode', how='left')
# Year
stock_df['Year'] = stock_df['Date'].dt.year

# One column of average Target (in percent) per year, indexed by sector
years = {year: pd.DataFrame() for year in stock_df.Year.unique()[::-1]}
for key in years.keys():
    df = stock_df[stock_df.Year == key]
    years[key] = df.groupby('SectorName')['Target'].mean().mul(100).rename("Avg_return_{}".format(key))
df = pd.concat((years[i].to_frame() for i in years.keys()), axis=1)
df = df.sort_values(by="Avg_return_2022")
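The per-year loop above can also be written as a single pivot. The following is a sketch on a small made-up frame, not the competition data; it assumes the same SectorName, Year and Target column names and is meant only to show the shape of the result.

```python
import pandas as pd

# A tiny made-up frame with the same column names as stock_df.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-06", "2020-01-07", "2021-01-05", "2021-01-06"]),
    "SectorName": ["Banks", "Foods", "Banks", "Banks"],
    "Target": [0.01, -0.02, 0.03, 0.01],
})
toy["Year"] = toy["Date"].dt.year

# Rows are sectors, columns are years, values are the mean Target in percent.
yearly = (
    toy.pivot_table(index="SectorName", columns="Year", values="Target", aggfunc="mean")
    .mul(100)
    .add_prefix("Avg_return_")
)
print(yearly)
```

A sector with no rows in a given year shows up as NaN, which is also what the concat-based version produces.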

Plot:

fig = make_subplots(rows=1, cols=6, shared_yaxes=True)

for i, col in enumerate(df.columns):
    x = df[col]
    mask = x >= 0
    # Non-negative averages in red
    fig.add_trace(go.Bar(x=x[mask], y=df.index[mask], orientation='h',
                         text=x[mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Target in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='red', opacity=0.7), name=col[-4:]),
                  row=1, col=i + 1)
    # Negative averages in green
    fig.add_trace(go.Bar(x=x[~mask], y=df.index[~mask], orientation='h',
                         text=x[~mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Target in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='green', opacity=0.7), name=col[-4:]),
                  row=1, col=i + 1)
    fig.update_xaxes(range=(x.min() - .15, x.max() + .15), title='{} Target'.format(col[-4:]),
                     showticklabels=False, row=1, col=i + 1)
fig.update_layout(title='Yearly Average Stock Target by Sector',
                  hovermode='closest', margin=dict(l=250, r=50),
                  showlegend=False)
fig.show()

Output:

Figure: yearly Target by sector

Distribution of Return

Overall distribution of Return

Example code:

fig = go.Figure()
x_hist = stock_df['Target']
fig.add_trace(go.Histogram(x=x_hist*100,
                           marker=dict(color=colors[0], opacity=0.7,
                                       line=dict(width=1, color=colors[0])),
                           xbins=dict(start=-40, end=40, size=1)))
fig.update_layout(template=temp, title='Target Distribution',
                  xaxis=dict(title='Stock Return', ticksuffix='%'), height=450)
fig.show()

Output:

Figure: overall distribution of Target

Distribution of Return within each sector

Example code:

# One color per box, spread evenly around the hue circle
pal = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, 18)]
fig = go.Figure()
for i, sector in enumerate(df.index[::-1]):
    y_data = stock_df[stock_df['SectorName'] == sector]['Target']
    fig.add_trace(go.Box(y=y_data*100, name=sector,
                         marker_color=pal[i], showlegend=False))
fig.update_layout(template=temp, title='Target Distribution by Sector',
                  yaxis=dict(title='Stock Return', ticksuffix='%'),
                  margin=dict(b=150), height=750, width=900)
fig.show()

Output:

Figure: distribution of Target within each sector

The numbers themselves make this clearer. Example code:

stock_df_group = stock_df.groupby('SectorName')['Target'].describe()
stock_df_group

Output:

                                                count      mean       std       min       25%  50%       75%       max
SectorName
Automobiles & transportation equipment 85568.0 0.000025 0.021919 -0.226629 -0.011147 0.0 0.010606 0.349345
Banks 83919.0 -0.000258 0.018002 -0.194805 -0.010256 0.0 0.009404 0.298780
Commercial & wholesale trade 199701.0 0.000429 0.020714 -0.273723 -0.009161 0.0 0.009588 1.119512
Construction & materials 207533.0 0.000296 0.020469 -0.294974 -0.009662 0.0 0.009666 0.304878
Electric appliances & precision instruments 251506.0 0.000512 0.025248 -0.244200 -0.012066 0.0 0.012162 0.420168
Electric power & gas 30716.0 0.000118 0.018744 -0.355000 -0.008626 0.0 0.008333 0.247770
Energy resources 18718.0 0.000348 0.021965 -0.177386 -0.010886 0.0 0.010856 0.242009
Financials (ex banks) 73424.0 0.000262 0.022010 -0.244618 -0.010340 0.0 0.010359 0.317460
Foods 123864.0 0.000167 0.017064 -0.252101 -0.007503 0.0 0.007483 0.264000
It & services, others 592412.0 0.000638 0.028094 -0.578541 -0.012422 0.0 0.012436 0.585366
Machinery 170410.0 0.000317 0.023419 -0.233909 -0.011614 0.0 0.011558 0.263852
Pharmaceutical 60641.0 0.000316 0.027722 -0.524904 -0.011689 0.0 0.011111 0.597907
Raw materials & chemicals 225677.0 0.000301 0.021783 -0.226074 -0.010682 0.0 0.010627 0.322581
Real estate 87896.0 0.000464 0.025345 -0.254939 -0.010880 0.0 0.010978 0.404255
Retail trade 236712.0 0.000228 0.020412 -0.263955 -0.009054 0.0 0.008999 0.407407
Steel & nonferrous metals 58776.0 0.000243 0.023447 -0.226006 -0.011905 0.0 0.011673 0.290070
Transportation & logistics 94693.0 0.000317 0.019579 -0.168594 -0.009346 0.0 0.009116 0.575264

Stocks with the highest and lowest Return by sector

Example code:

stock_data = stock_df.groupby('Name')['Target'].mean().mul(100)
stock_low = stock_data.nsmallest(7)[::-1].rename("Return")
stock_high = stock_data.nlargest(7).rename("Return")
stock_data = pd.concat([stock_high, stock_low], axis=0).reset_index()
stock_data['Sector'] = 'All'
for i in stock_df.SectorName.unique():
    sector = stock_df[stock_df.SectorName == i].groupby('Name')['Target'].mean().mul(100)
    stock_low = sector.nsmallest(7)[::-1].rename("Return")
    stock_high = sector.nlargest(7).rename("Return")
    sector_stock = pd.concat([stock_high, stock_low], axis=0).reset_index()
    sector_stock['Sector'] = i
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    stock_data = pd.concat([stock_data, sector_stock], ignore_index=True)

fig = go.Figure()
buttons = []
for i, sector in enumerate(stock_data.Sector.unique()):

    x = stock_data[stock_data.Sector == sector]['Name']
    y = stock_data[stock_data.Sector == sector]['Return']
    mask = y > 0
    # Positive average returns in green
    fig.add_trace(go.Bar(x=x[mask], y=y[mask], text=y[mask],
                         texttemplate='%{text:.2f}%',
                         textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='green', opacity=0.7)))
    # Zero or negative average returns in red
    fig.add_trace(go.Bar(x=x[~mask], y=y[~mask], text=y[~mask],
                         texttemplate='%{text:.2f}%',
                         textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='red', opacity=0.7)))

    # Each sector owns two traces; the dropdown shows only that pair
    visibility = [False]*2*len(stock_data.Sector.unique())
    visibility[i*2], visibility[i*2+1] = True, True
    button = dict(label=sector,
                  method="update",
                  args=[{"visible": visibility}])
    buttons.append(button)

fig.update_layout(title='Stocks with Highest and Lowest Returns by Sector',
                  template=temp, yaxis=dict(title='Average Return', ticksuffix='%'),
                  updatemenus=[dict(active=0, type="dropdown",
                                    buttons=buttons, xanchor='left',
                                    yanchor='bottom', y=1.01, x=.01)],
                  margin=dict(b=150), showlegend=False, height=700, width=900)
fig.show()

Output:

Figure: stocks with the highest and lowest Target by sector

Candlestick chart by sector

Example code:

sectors = stock_df.SectorName.unique().tolist()
sectors.insert(0, 'All')
open_avg = stock_df.groupby('Date')['Open'].mean()
high_avg = stock_df.groupby('Date')['High'].mean()
low_avg = stock_df.groupby('Date')['Low'].mean()
close_avg = stock_df.groupby('Date')['Close'].mean()
buttons = []

fig = go.Figure()
for i in range(len(sectors)):
    if i != 0:
        open_avg = stock_df[stock_df.SectorName == sectors[i]].groupby('Date')['Open'].mean()
        high_avg = stock_df[stock_df.SectorName == sectors[i]].groupby('Date')['High'].mean()
        low_avg = stock_df[stock_df.SectorName == sectors[i]].groupby('Date')['Low'].mean()
        close_avg = stock_df[stock_df.SectorName == sectors[i]].groupby('Date')['Close'].mean()

    # Use the date-sorted groupby index as x; stock_df.Date.unique() follows row order instead
    fig.add_trace(go.Candlestick(x=open_avg.index, open=open_avg, high=high_avg,
                                 low=low_avg, close=close_avg, name=sectors[i],
                                 visible=(True if i == 0 else False)))

    visibility = [False]*len(sectors)
    visibility[i] = True
    button = dict(label=sectors[i],
                  method="update",
                  args=[{"visible": visibility}])
    buttons.append(button)

fig.update_xaxes(rangeslider_visible=True,
                 rangeselector=dict(
                     buttons=list([
                         dict(count=3, label="3m", step="month", stepmode="backward"),
                         dict(count=6, label="6m", step="month", stepmode="backward"),
                         dict(step="all")]), xanchor='left', yanchor='bottom', y=1.16, x=.01))
fig.update_layout(template=temp, title='Stock Price Movements by Sector',
                  hovermode='x unified', showlegend=False, width=1000,
                  updatemenus=[dict(active=0, type="dropdown",
                                    buttons=buttons, xanchor='left',
                                    yanchor='bottom', y=1.01, x=.01)],
                  yaxis=dict(title='Stock Price'))
fig.show()

Output:

Figure: candlestick chart by sector

Correlation between sectors

Example code:

import plotly.figure_factory as ff

# Keep Date as the index so that corr() only sees the numeric sector columns
df_pivot = stock_df.pivot_table(index='Date', columns='SectorName', values='Close')
corr = df_pivot.corr().round(2)
# Keep the strict lower triangle; 100 marks the cells to discard
mask = np.triu(np.ones_like(corr, dtype=bool))
c_mask = np.where(~mask, corr, 100)
c = []
for i in c_mask.tolist()[1:]:
    c.append([x for x in i if x != 100])

cor = c[::-1]
x = corr.index.tolist()[:-1]
y = corr.columns.tolist()[1:][::-1]
fig = ff.create_annotated_heatmap(z=cor, x=x, y=y,
                                  hovertemplate='Correlation between %{x} and %{y} stocks = %{z}',
                                  colorscale='viridis', name='')
fig.update_layout(template=temp, title='Stock Correlation between Sectors',
                  margin=dict(l=250, t=270), height=800, width=900,
                  yaxis=dict(showgrid=False, autorange='reversed'),
                  xaxis=dict(showgrid=False))
fig.show()

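Output:

Figure: correlation between sectors

Supplementary data

The competition does not allow the notebook (ipynb) to connect to the internet, but it does allow data to be supplemented offline.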

jpx-jquants

jpx-jquants overview

jpx-jquants is the official JPX data API.

Getting a Refresh Token

URL: https://api.jquants.com/v1/token/auth_user

Method: POST

Request parameters:

  • Header: none required
  • Body:
    • mailaddress: email address
    • password: password

Response:

  • Status codes: 200, OK; 400, Bad Request; 403, Forbidden; 500, Internal Server Error.
  • Field: refreshToken

The Refresh Token is valid for one week.

Example code:

import requests
import json

data = {"mailaddress": "[email]", "password": "[password]"}

r_post = requests.post("https://api.jquants.com/v1/token/auth_user", data=json.dumps(data))

r_post.json()

Output:

{'refreshToken': 'XXX'}

Getting an ID Token

URL: https://api.jquants.com/v1/token/auth_refresh

Method: POST

Request parameters:

  • Header: none required
  • Query: refreshtoken

Response:

  • Status codes: 200, OK; 400, Bad Request; 403, Forbidden; 500, Internal Server Error.
  • Field: idToken

The ID Token is valid for 24 hours.
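Because the two tokens expire on different schedules, a long download job has to know which call to repeat. The following is a minimal sketch of that bookkeeping using only the validity periods stated above; the function name next_step and its return labels are hypothetical and not part of the API, and no network request is made.

```python
from datetime import datetime, timedelta

REFRESH_TOKEN_TTL = timedelta(days=7)   # validity stated above
ID_TOKEN_TTL = timedelta(hours=24)      # validity stated above


def next_step(refresh_issued_at, id_issued_at, now):
    """Return which call is needed before the next data request.

    Each *_issued_at argument is a datetime, or None if that token
    has never been obtained.
    """
    if refresh_issued_at is None or now - refresh_issued_at >= REFRESH_TOKEN_TTL:
        return "auth_user"      # log in again to get a new Refresh Token
    if id_issued_at is None or now - id_issued_at >= ID_TOKEN_TTL:
        return "auth_refresh"   # exchange the Refresh Token for a new ID Token
    return "ready"              # the current ID Token can still be used


issued = datetime(2023, 1, 1, 9, 0)
print(next_step(issued, issued, issued + timedelta(hours=30)))
```

In practice it may be safer to refresh a little before the nominal expiry, since the exact server-side cutoff is not described here.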

Example code:

import requests
import json

REFRESH_TOKEN = "XXX"

r_post = requests.post(f"https://api.jquants.com/v1/token/auth_refresh?refreshtoken={REFRESH_TOKEN}")

r_post.json()

Output:

{'idToken': 'XXX'}

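Getting daily quotes

URL: https://api.jquants.com/v1/prices/daily_quotes

Method: GET

Request parameters:

  • Header: Authorization, carrying the idToken.
  • Query:
    • code: stock code.
    • from: start date, for example 20210901 or 2021-09-01.
    • to: end date, for example 20210907 or 2021-09-07.
    • date: one specific day; only takes effect when from and to are not given. For example 20210907 or 2021-09-07.

Response status codes: 200, OK; 400, Bad Request; 401, Unauthorized; 403, Forbidden; 413, Payload Too Large; 500, Internal Server Error.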
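The query rules above (date only counts when from and to are absent) can be wrapped in a small URL builder. This is a sketch with a hypothetical helper name, build_daily_quotes_url; it only assembles the URL and makes no request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.jquants.com/v1/prices/daily_quotes"


def build_daily_quotes_url(code=None, date_from=None, date_to=None, date=None):
    """Assemble the daily_quotes URL from the query parameters listed above."""
    params = {}
    if code is not None:
        params["code"] = str(code)
    if date_from is not None or date_to is not None:
        # A from/to range takes precedence, so `date` is dropped.
        if date_from is not None:
            params["from"] = date_from
        if date_to is not None:
            params["to"] = date_to
    elif date is not None:
        params["date"] = date
    if not params:
        raise ValueError("at least one of code, date_from/date_to or date is needed")
    return f"{BASE_URL}?{urlencode(params)}"


print(build_daily_quotes_url(code=1414, date_from="2022-10-01", date_to="2023-01-31"))
```

Dropping `date` silently mirrors the rule as described; raising an error on the conflicting combination would be an equally reasonable design.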

Response fields:

Variables | Description | Data type | Remark
Date | Date | String | YYYY-MM-DD
Code | Issue code | String |
Open | Open Price (before adjustment) | Number |
High | High price (before adjustment) | Number |
Low | Low price (before adjustment) | Number |
Close | Close price (before adjustment) | Number |
Volume | Trading volume (before Adjustment) | Number |
TurnoverValue | Trading value | Number |
AdjustmentFactor | Adjustment factor | Number | In the case of a two-for-one stock split, "0.5" will be set in the record on the ex-rights date.
AdjustmentOpen | Adjusted open price | Number | ※1
AdjustmentHigh | Adjusted high price | Number | ※1
AdjustmentLow | Adjusted low price | Number | ※1
AdjustmentClose | Adjusted close price | Number | ※1
AdjustmentVolume | Adjusted volume | Number | ※1
MorningOpen | Open price of the morning session (before Adjustment) | Number | ※2
MorningHigh | High price of the morning session (before Adjustment) | Number | ※2
MorningLow | Low price of the morning session (before Adjustment) | Number | ※2
MorningClose | Close price of the morning session (before Adjustment) | Number | ※2
MorningVolume | Trading volume of the morning session (before Adjustment) | Number | ※2
MorningTurnoverValue | Trading value of the morning session | Number | ※2
MorningAdjustmentOpen | Adjusted open price of the morning session | Number | ※1, ※2
MorningAdjustmentHigh | Adjusted high price of the morning session | Number | ※1, ※2
MorningAdjustmentLow | Adjusted low price of the morning session | Number | ※1, ※2
MorningAdjustmentClose | Adjusted close price of the morning session | Number | ※1, ※2
MorningAdjustmentVolume | Adjusted trading volume of the morning session | Number | ※1, ※2
AfternoonOpen | Open price of the afternoon session (before Adjustment) | Number | ※2
AfternoonHigh | High price of the afternoon session (before Adjustment) | Number | ※2
AfternoonLow | Low price of the afternoon session (before Adjustment) | Number | ※2
AfternoonClose | Close price of the afternoon session (before Adjustment) | Number | ※2
AfternoonVolume | Trading volume of the afternoon session (before Adjustment) | Number | ※2
AfternoonAdjustmentOpen | Adjusted open price of the afternoon session | Number | ※1, ※2
AfternoonAdjustmentHigh | Adjusted high price of the afternoon session | Number | ※1, ※2
AfternoonAdjustmentLow | Adjusted low price of the afternoon session | Number | ※1, ※2
AfternoonAdjustmentClose | Adjusted close price of the afternoon session | Number | ※1, ※2
AfternoonAdjustmentVolume | Adjusted trading volume of the afternoon session | Number | ※1, ※2
  • ※1:The item has been adjusted to take into account past divisions, etc.
  • ※2:The item is available only for Premium plan users.

Example code:

import requests
import json

idToken = "XXX"

headers = {'Authorization': 'Bearer {}'.format(idToken)}

r = requests.get("https://api.jquants.com/v1/prices/daily_quotes?code=1414&from=2022-10-01&to=2023-01-31", headers=headers)

r.json()

Output:

{
'daily_quotes': [{
'Date': '2022-10-03',
'Code': '14140',
'Open': 6220.0,
'High': 6230.0,
'Low': 6120.0,
'Close': 6220.0,
'Volume': 121700.0,
'TurnoverValue': 752370000.0,
'AdjustmentFactor': 1.0,
'AdjustmentOpen': 6220.0,
'AdjustmentHigh': 6230.0,
'AdjustmentLow': 6120.0,
'AdjustmentClose': 6220.0,
'AdjustmentVolume': 121700.0
}

[part of the output omitted]

{
'Date': '2023-01-31',
'Code': '14140',
'Open': 5510.0,
'High': 5560.0,
'Low': 5500.0,
'Close': 5530.0,
'Volume': 101500.0,
'TurnoverValue': 561054000.0,
'AdjustmentFactor': 1.0,
'AdjustmentOpen': 5510.0,
'AdjustmentHigh': 5560.0,
'AdjustmentLow': 5500.0,
'AdjustmentClose': 5530.0,
'AdjustmentVolume': 101500.0
}]
}
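To line the response up with the competition's stock_prices.csv, the payload can be turned into a DataFrame and the code mapped back to SecuritiesCode. The sketch below rests on one assumption taken only from the sample output: the API's five-character Code ('14140') is the four-digit SecuritiesCode (1414) with a trailing '0'. The helper name quotes_to_frame is hypothetical.

```python
import pandas as pd


def quotes_to_frame(payload):
    """Convert a daily_quotes response into stock_prices-style columns."""
    df = pd.DataFrame(payload["daily_quotes"])
    df["Date"] = pd.to_datetime(df["Date"])
    # Assumption: the first four characters of Code are the SecuritiesCode.
    df["SecuritiesCode"] = df["Code"].str[:4].astype(int)
    return df[["Date", "SecuritiesCode", "Open", "High", "Low",
               "Close", "Volume", "AdjustmentFactor"]]


# Two records copied from the sample output, trimmed to the fields used here.
sample = {"daily_quotes": [
    {"Date": "2022-10-03", "Code": "14140", "Open": 6220.0, "High": 6230.0,
     "Low": 6120.0, "Close": 6220.0, "Volume": 121700.0, "AdjustmentFactor": 1.0},
    {"Date": "2023-01-31", "Code": "14140", "Open": 5510.0, "High": 5560.0,
     "Low": 5500.0, "Close": 5530.0, "Volume": 101500.0, "AdjustmentFactor": 1.0},
]}
frame = quotes_to_frame(sample)
print(frame)
```

The unadjusted Open, High, Low, Close and Volume are kept on purpose, so that the same adjustment function used for the competition data can be applied afterwards.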

Application in this article

Example code:

import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
import json

idToken = "XXX"

headers = {'Authorization': 'Bearer {}'.format(idToken)}

# Only the stocks in the prediction universe
stock_list = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/stock_list.csv')
stock_list = stock_list[stock_list['Universe0'] == True]
codes = stock_list['SecuritiesCode'].unique()

# One request per stock, then concatenate
all_hist = []
for tick in tqdm(codes):
    r = requests.get(f'https://api.jquants.com/v1/prices/daily_quotes?code={tick}&from=2022-10-01&to=2023-01-31', headers=headers)
    hist = pd.DataFrame(r.json()['daily_quotes'])
    all_hist.append(hist)

df = pd.concat(all_hist)
df.to_csv('df.csv', index=False)

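yfinance

yfinance overview

What is yfinance

Yahoo! Finance originally offered an official free API, but because it was abused, the official API was shut down on May 15, 2017.
yfinance is unofficial. Its data still comes from Yahoo, but it is obtained through a crawler-like technique.

Installation

Install:

pip install yfinance

Using it in mainland China

Yahoo stopped serving mainland China on November 1, 2021, and yfinance does not work around this on its own, so using it from mainland China may produce an error like this:

HTTPError: 403 Client Error: Forbidden for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/600000.SS?modules=summaryProfile%2CfinancialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&ssl=true

To use it from mainland China you therefore need to set up a proxy yourself.

If you see errors like the following:

  • error:No timezone found, symbol may be delisted
  • error:No data found for this date range, symbol may be delisted

the cause is that the requests were blocked by anti-crawling measures.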
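When requests are blocked intermittently, waiting and retrying is a common remedy. The following is a generic sketch, not part of yfinance: call_with_retries and its parameters are hypothetical, and it only helps when the wrapped function raises on failure (a fetch that quietly returns an empty result would need its own check that raises).

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on an exception wait 2s, 4s, 8s, ... and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Usage with a stand-in function that fails twice before succeeding.
state = {"calls": 0}

def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("blocked")
    return "data"

waits = []
result = call_with_retries(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

The `sleep` argument is injectable only so the example runs instantly; a real job would leave it at the default time.sleep.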

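Advantages

The advantages of yfinance:

  • Free
  • No registration and no token required
  • Fine data granularity (1min/2min/5min bars)
  • Returns data directly as a pandas DataFrame or Series.

Disadvantages

The disadvantages of yfinance:

  • yfinance fetches data from Yahoo Finance through a crawler-like technique, so requests have to be throttled to avoid being blocked.
  • Inconvenient to use in mainland China

Modules

yfinance consists of three modules:

  1. Tickers
  2. download
  3. pandas_datareader

Of these, download cannot be used in mainland China (a high-anonymity proxy might help), and pandas_datareader exists for backward compatibility with legacy code. We only discuss Tickers.

Tickers

info: basic stock information

info returns the basic information of a stock, which covers many fields. Example code:

import yfinance as yf

ss600000 = yf.Ticker('600000.SS')
ss600000.info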

Output:

{
'address1': '12 First East Zhongshan Road',
'city': 'Shanghai',
'country': 'China',
'phone': '86 21 6161 8888',
'fax': '86 21 6323 2036',
'website': 'https://www.spdb.com.cn',
'industry': 'Banks—Regional',
'industryDisp': 'Banks—Regional',
'sector': 'Financial Services',
'longBusinessSummary': "Shanghai Pudong Development Bank Co., Ltd. provides commercial banking products and services in the People's Republic of China. The company offers various personal banking services, including savings and account management products; wealth management services comprising open-ended funds, special investments, SPDB structured deposits, security investment custody accounts, individual FX trading, collective security investments, and insurance products distribution; cards; lending services, such as deposit treasury bond pledged loans and car mortgages; online payment and consumption services; and instant messaging services. It also provides corporate banking services comprising cash management, supply chain financing, investment banking, entrusted assets, and occupational pension solutions, as well as offshore banking services; assets custody services; trade finance services; and services to small and medium enterprises, and multi-national companies. In addition, the company offers treasury and market products, such as foreign exchange risk management, interest rate risk management, structured, fixed income, and commodity exchange products; and financial leasing and trust services. The company was incorporated in 1992 and is headquartered in Shanghai, the People's Republic of China.",
'fullTimeEmployees': 64731,
'companyOfficers': [{'maxAge': 1,'name': 'Mr. Yang Zheng','age': 56,'title': 'Exec. Chairman','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Weidong Pan','age': 56,'title': 'Pres & Vice Chairman','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Xinhao Wang','age': 55,'title': 'CFO & VP','yearBorn': 1967,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Bingwen Cui','age': 53,'title': 'Gen. Counsel, VP & Chief Legal Advisor','yearBorn': 1969,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Yiyan Liu','age': 58,'title': 'Chief Risk Officer, VP & Exec. Director','yearBorn': 1964,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': "Mr. Zheng'an Chen",'age': 59,'title': 'Exec. Director','yearBorn': 1963,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Fangping Jiang','age': 56,'title': 'Head of the Discipline Inspection & Supervision Team','yearBorn': 1966,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Wei Xie','age': 51,'title': 'VP & Sec. of the Board','yearBorn': 1971,'exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Lianquan Li','title': 'Head of the Fin. & Accounting Department','exercisedValue': 0,'unexercisedValue': 0},
{'maxAge': 1,'name': 'Mr. Alex Dai','title': 'IR Officer','exercisedValue': 0,'unexercisedValue': 0}],
'auditRisk': 9,
'boardRisk': 8,
'compensationRisk': 3,
'shareHolderRightsRisk': 2,
'overallRisk': 5,
'governanceEpochDate': 1685577600,
'maxAge': 86400,
'priceHint': 2,
'previousClose': 7.57,
'open': 7.57,
'dayLow': 7.53,
'dayHigh': 7.6,
'regularMarketPreviousClose': 7.57,
'regularMarketOpen': 7.57,
'regularMarketDayLow': 7.53,
'regularMarketDayHigh': 7.6,
'dividendRate': 0.41,
'dividendYield': 0.0542,
'exDividendDate': 1658361600,
'payoutRatio': 0.3083,
'fiveYearAvgDividendYield': 4.24,
'beta': 0.492847,
'trailingPE': 5.6842103,
'forwardPE': 4.3953485,
'volume': 22288077,
'regularMarketVolume': 22288077,
'averageVolume': 37246368,
'averageVolume10days': 26097882,
'averageDailyVolume10Day': 26097882,
'bid': 7.55,
'ask': 7.56,
'bidSize': 0,
'askSize': 0,
'marketCap': 221902635008,
'fiftyTwoWeekLow': 6.63,
'fiftyTwoWeekHigh': 8.22,
'priceToSalesTrailing12Months': 2.0457323,
'fiftyDayAverage': 7.4534,
'twoHundredDayAverage': 7.23055,
'trailingAnnualDividendRate': 0.32,
'trailingAnnualDividendYield': 0.042272124,
'currency': 'CNY',
'enterpriseValue': 1083258437632,
'profitMargins': 0.43896,
'floatShares': 9531825231,
'sharesOutstanding': 29352200192,
'heldPercentInsiders': 0.49045,
'heldPercentInstitutions': 0.22323,
'impliedSharesOutstanding': 0,
'bookValue': 24.321,
'priceToBook': 0.31084248,
'lastFiscalYearEnd': 1672444800,
'nextFiscalYearEnd': 1703980800,
'mostRecentQuarter': 1680220800,
'earningsQuarterlyGrowth': -0.183,
'netIncomeToCommon': 41539497984,
'trailingEps': 1.33,
'forwardEps': 1.72,
'pegRatio': -2.68,
'lastSplitFactor': '13:10',
'lastSplitDate': 1495670400,
'enterpriseToRevenue': 9.987,
'52WeekChange': -0.02702701,
'SandP52WeekChange': 0.14647579,
'lastDividendValue': 0.41,
'lastDividendDate': 1658361600,
'exchange': 'SHH',
'quoteType': 'EQUITY',
'symbol': '600000.SS',
'underlyingSymbol': '600000.SS',
'shortName': 'SHANGHAI PUDONG DEVELOPMENT BAN',
'longName': 'Shanghai Pudong Development Bank Co., Ltd.',
'firstTradeDateEpochUtc': 942197400,
'timeZoneFullName': 'Asia/Shanghai',
'timeZoneShortName': 'CST',
'uuid': 'bacd6a72-d76b-3eeb-a7dc-aaaef82cf602',
'messageBoardId': 'finmb_5436383',
'gmtOffSetMilliseconds': 28800000,
'currentPrice': 7.56,
'targetHighPrice': 9.53,
'targetLowPrice': 4.4,
'targetMeanPrice': 7.8,
'targetMedianPrice': 8.56,
'recommendationMean': 2.6,
'recommendationKey': 'hold',
'numberOfAnalystOpinions': 7,
'totalCash': 1347949035520,
'totalCashPerShare': 45.923,
'totalDebt': 2201421086720,
'totalRevenue': 108471001088,
'revenuePerShare': 3.717,
'returnOnAssets': 0.00562,
'returnOnEquity': 0.0685,
'grossProfits': 112019000000,
'operatingCashflow': 400627990528,
'earningsGrowth': -0.19,
'revenueGrowth': -0.101,
'grossMargins': 0.0,
'ebitdaMargins': 0.0,
'operatingMargins': 0.47944,
'financialCurrency': 'CNY',
'trailingPegRatio': None
}

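history: historical data

history returns historical data. The following parameters can be set:

  • period: how far back to fetch. There are 11 choices: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max.
  • interval: the bar size. There are 13 choices: 1m, 2m, 5m, 15m, 30m, 60m, 90m, 1h, 1d, 5d, 1wk, 1mo, 3mo.
    Intraday data (intervals shorter than one day) can be fetched for at most the last 60 days; the finer the interval, the fewer days are available. For example, 1m data only covers the last 7 days.
  • start: if period is not set, start (start date) and end (end date) define the range to fetch.
  • end: see start.
  • prepost: whether to include pre-market and after-hours prices. Default False (not included).
  • auto_adjust: whether to adjust prices automatically (forward adjustment). Default True.
  • actions: whether to include dividend and stock split information. Default True.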
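The interval limits described in the list can be checked before a request is sent. This is a sketch that encodes only the values and limits stated above; the helper max_lookback_days is hypothetical, not a yfinance function, and Yahoo may apply other limits that are not covered here.

```python
VALID_PERIODS = {"1d", "5d", "1mo", "3mo", "6mo", "1y", "2y", "5y", "10y", "ytd", "max"}
VALID_INTERVALS = {"1m", "2m", "5m", "15m", "30m", "60m", "90m", "1h",
                   "1d", "5d", "1wk", "1mo", "3mo"}
INTRADAY_INTERVALS = {"1m", "2m", "5m", "15m", "30m", "60m", "90m", "1h"}


def max_lookback_days(interval):
    """Return the history limit in days for an interval, or None if unlimited."""
    if interval not in VALID_INTERVALS:
        raise ValueError(f"unknown interval: {interval}")
    if interval == "1m":
        return 7        # 1m bars: last 7 days only
    if interval in INTRADAY_INTERVALS:
        return 60       # other intraday bars: last 60 days at most
    return None         # daily and coarser bars: no limit stated


for interval in ("1m", "15m", "1d"):
    print(interval, max_lookback_days(interval))
```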

Example code:

import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

# Prices from yfinance, with auto_adjust switched off
t1414 = yf.Ticker('1414.T')
t1414_y = t1414.history(start="2013-01-01", end='2022-12-01', auto_adjust=False).reset_index()

# Prices from the competition data
df1 = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')
df2 = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/supplemental_files/stock_prices.csv')
df = pd.concat([df1, df2])
df['Date'] = pd.to_datetime(df.Date)
t1414_jpx = df[df['SecuritiesCode'] == 1414]

sub_y = t1414_y.set_index('Date')
sub_y.index = pd.to_datetime(sub_y.index)

sub_jpx = t1414_jpx.set_index('Date')

plt.figure(figsize=(12, 4))
plt.plot(sub_y['Close'], label='Close yfinance')
plt.plot(sub_jpx['Close'], label='Close JPX')
# Show the labels so the two series can be told apart
plt.legend()
plt.show()

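Output:

Figure: historical data from history

Note! auto_adjust=False is supposed to mean no adjustment. As the chart shows, however, even with auto_adjust=False yfinance still adjusts for stock splits; only dividends are left unadjusted.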
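If the raw JPX prices are the reference, the split adjustment in yfinance prices can in principle be reversed with the split ratios. The sketch below assumes that every close before a split date has been divided by that split's ratio; this is an assumption drawn from the behaviour described above, not a documented guarantee. The helper undo_split_adjustment is hypothetical and the prices are synthetic.

```python
import pandas as pd


def undo_split_adjustment(close, splits):
    """Scale closes dated before each split back up by that split's ratio."""
    factor = pd.Series(1.0, index=close.index)
    for split_date, ratio in splits.items():
        if ratio > 0:  # skip 0.0 placeholders such as those in `actions`
            factor.loc[close.index < split_date] *= ratio
    return close * factor


# Synthetic split-adjusted closes around a 2-for-1 split on 2019-06-26.
dates = pd.to_datetime(["2019-06-24", "2019-06-25", "2019-06-26", "2019-06-27"])
adjusted_close = pd.Series([1000.0, 1010.0, 1005.0, 1002.0], index=dates)
split_ratios = pd.Series([2.0], index=pd.to_datetime(["2019-06-26"]))

raw_close = undo_split_adjustment(adjusted_close, split_ratios)
print(raw_close)
```

With real yfinance output the index is timezone-aware, so the split dates and the price index would need the same timezone before they are compared.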

actions: dividends and stock splits

Example code:

t1414.actions

Output:

                           Dividends  Stock Splits
Date
2013-06-26 00:00:00+09:00 1.5 0.0
2013-12-26 00:00:00+09:00 11.0 0.0
2014-06-26 00:00:00+09:00 3.5 0.0
2014-12-26 00:00:00+09:00 1.0 0.0
2015-06-26 00:00:00+09:00 26.5 0.0

[part of the output omitted]

2020-06-29 00:00:00+09:00 44.5 0.0
2020-12-29 00:00:00+09:00 40.0 0.0
2021-06-29 00:00:00+09:00 65.5 0.0
2021-12-29 00:00:00+09:00 50.0 0.0
2022-06-29 00:00:00+09:00 68.0 0.0

dividends: dividends only

Example code:

t1414.dividends

Output:

Date
2013-06-26 00:00:00+09:00 1.5
2013-12-26 00:00:00+09:00 11.0
2014-06-26 00:00:00+09:00 3.5
2014-12-26 00:00:00+09:00 1.0
2015-06-26 00:00:00+09:00 26.5

[part of the output omitted]

2020-06-29 00:00:00+09:00 44.5
2020-12-29 00:00:00+09:00 40.0
2021-06-29 00:00:00+09:00 65.5
2021-12-29 00:00:00+09:00 50.0
2022-06-29 00:00:00+09:00 68.0
Name: Dividends, dtype: float64

splits: stock splits only

Example code:

t1414.splits

Output:

Date
2019-06-26 00:00:00+09:00 2.0
Name: Stock Splits, dtype: float64

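More information

Other available attributes:

  • Major holders: major_holders
  • Major institutional holders: institutional_holders
  • Mutual fund holders: mutualfund_holders
  • Current fiscal calendar information: calendar
  • Annual revenue and after-tax earnings: earnings
  • Quarterly revenue and after-tax earnings: quarterly_earnings
  • Annual detailed financial statements: financials
  • Quarterly detailed financial statements: quarterly_financials
  • Annual balance sheet: balance_sheet
  • Quarterly balance sheet: quarterly_balance_sheet
  • Annual cash flow: cashflow
  • Quarterly cash flow: quarterly_cashflow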

Application in this article

Example code:

import pandas as pd
import numpy as np
import yfinance as yf
from tqdm import tqdm
import time

# Only the stocks in the prediction universe
stock_list = pd.read_csv('./jpx-tokyo-stock-exchange-prediction/stock_list.csv')
stock_list = stock_list[stock_list['Universe0'] == True]
codes = stock_list['SecuritiesCode'].unique()

# Column layout of the competition's stock_prices.csv
cols = ['RowId','Date','SecuritiesCode','Open','High','Low','Close','Volume','AdjustmentFactor','ExpectedDividend','SupervisionFlag','Target']

all_hist = []
for tick in tqdm(codes):
    ticker = yf.Ticker(f"{tick}.T")
    hist = ticker.history(start="2022-10-01", end='2023-01-31', back_adjust=True, auto_adjust=False).reset_index()
    # Reduce Date to a plain YYYY-MM-DD string; yfinance may return it with a time and a timezone
    hist['Date'] = pd.to_datetime(hist['Date']).dt.tz_localize(None).dt.strftime('%Y-%m-%d')
    hist['SecuritiesCode'] = tick
    hist['RowId'] = hist['Date'].str.replace('-', '') + '_' + str(tick)
    for col in ['Open','High','Low','Close','Volume']:
        hist[col] = pd.to_numeric(hist[col], errors='coerce')
    # Target: return from the close of day t+1 to the close of day t+2
    hist['Target'] = (hist['Close'].shift(-2) - hist['Close'].shift(-1))/hist['Close'].shift(-1)
    hist = hist.rename(columns={'Dividends': 'ExpectedDividend'})
    hist['ExpectedDividend'] = hist['ExpectedDividend'].apply(lambda x: x if x != 0 else np.nan)
    # Not available from yfinance
    hist['SupervisionFlag'] = np.nan
    hist['AdjustmentFactor'] = np.nan
    hist = hist[cols]
    all_hist.append(hist)
    # Pause between requests to avoid being blocked
    time.sleep(2)

df = pd.concat(all_hist)
df.to_csv('df.csv', index=False)

Note! time.sleep(2) is essential; without it the requests will be blocked by anti-crawling measures.
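The Target expression in the loop above is easier to read on a tiny series. The sketch below uses four made-up closing prices and also shows an equivalent formulation with pct_change.

```python
import pandas as pd

# Four made-up closing prices for days t = 0, 1, 2, 3.
close = pd.Series([100.0, 110.0, 121.0, 108.9])

# Same expression as in the loop: the return from close[t+1] to close[t+2].
target = (close.shift(-2) - close.shift(-1)) / close.shift(-1)

# Equivalent: the ordinary one-day return, pulled back by two rows.
target_alt = close.pct_change().shift(-2)

print(pd.DataFrame({"Close": close, "Target": target, "TargetAlt": target_alt}))
```

The last two rows have no Target because the closes of the following two days are not known yet.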

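Author: Kaka Wan Yifan
Link: https://kakawanyifan.com/20101
Copyright: All articles on this blog are the property of the author. No organization or individual may reprint, excerpt or copy them in any form without written permission.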
