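Solutions to Technical Problems in Kaggle

Installing Packages Offline

Online Installation

The Missing Package

We take gymnasium as the example, the package discussed in "A Brief Discussion of Reinforcement Learning and Its Python Implementation: 8. Custom Environments (Stock Trading as an Example)" (《强化学习浅谈及其Python实现:8.自定义环境(以股票交易为例)》).

No module named 'gymnasium'

Installation

If the notebook is allowed to access the internet, we can install the package online directly with !pip install gymnasium.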

Example code:

!pip install gymnasium

import gymnasium as gym

env = gym.make('FrozenLake-v1')
print('Observation space = {}'.format(env.observation_space))
print('Action space = {}'.format(env.action_space))

Output:

Collecting gymnasium
Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 925.5/925.5 kB 16.0 MB/s eta 0:00:00
Collecting jax-jumpy>=1.0.0
Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (2.2.1)
Requirement already satisfied: numpy>=1.21.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (1.21.6)
Requirement already satisfied: importlib-metadata>=4.8.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (4.11.4)
Requirement already satisfied: typing-extensions>=4.3.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (4.4.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.8.0->gymnasium) (3.11.0)
Installing collected packages: farama-notifications, jax-jumpy, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.28.1 jax-jumpy-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Observation space = Discrete(16)
Action space = Discrete(4)

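Other Dependencies

In the output above, pay attention to the following lines:

Collecting jax-jumpy>=1.0.0
Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)

These are the two additional dependencies that had to be downloaded and installed in order to install gymnasium in the Kaggle environment.

If we install gymnasium offline, these two packages must be installed offline as well.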
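
Besides reading the pip log, the declared dependencies of an already-installed package can also be queried from Python, for example in a session that still has internet access. The following is only a minimal sketch: the helper name declared_requirements is made up for illustration, and importlib.metadata is part of the standard library from Python 3.8 on (on Python 3.7 the importlib_metadata backport would be needed instead).

```python
from importlib import metadata  # standard library since Python 3.8


def declared_requirements(dist_name):
    """Return the requirement strings a distribution declares, or [] if it is not installed."""
    try:
        # metadata.requires() returns None when a package declares no requirements.
        return list(metadata.requires(dist_name) or [])
    except metadata.PackageNotFoundError:
        return []


# Hypothetical usage: print what each package says it depends on.
for name in ("gymnasium", "pip"):
    print(name, "->", declared_requirements(name))
```

Note that this lists everything the package declares; which of those are actually missing in Kaggle still has to be read from the pip output, as above.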

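.whl

Now suppose the notebook is not allowed to access the internet.

Internet access not allowed

In that case, offline installation is the only option.
For offline installation, packages come in two forms:

  1. .whl: a built distribution (wheel), ready to install
  2. .tar.gz: a source distribution, which still has to be built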
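
The name of a .whl file already tells us which Python version and platform it was built for. The sketch below splits a wheel filename into its parts; parse_wheel_filename is a hypothetical helper written for this article, not part of pip, and it assumes the standard layout distribution-version(-build)-python-abi-platform.whl.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # 5 parts normally, 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


info = parse_wheel_filename("gymnasium-0.28.1-py3-none-any.whl")
# A wheel tagged none/any contains no compiled code and runs on any platform.
is_pure_python = info["abi_tag"] == "none" and info["platform_tag"] == "any"
print(info, is_pure_python)
```

A wheel such as torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl, by contrast, is tied to CPython 3.10 on 64-bit Linux.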

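Downloading the Package

Search for the package at the following address and download it.
https://pypi.org

When downloading, be sure to choose the .whl file listed under Built Distribution, i.e. the prebuilt one.

Built Distribution

Uploading the Package

Click where the arrow points to upload the package.

Uploading the package

A dialog will ask for a name; note that the name is subject to a length limit.

Length limit on the name

We can also click the link on the right and supply the package's URL directly. This avoids downloading and re-uploading, so it is faster.

Getting the Path

Click where the red arrow points to obtain the path.

Getting the path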

Installation

Example code:

!pip install /kaggle/input/package/jax_jumpy-1.0.0-py3-none-any.whl
!pip install /kaggle/input/package/Farama_Notifications-0.0.4-py3-none-any.whl
!pip install /kaggle/input/package/gymnasium-0.28.1-py3-none-any.whl

import gymnasium as gym

env = gym.make('FrozenLake-v1')
print('Observation space = {}'.format(env.observation_space))
print('Action space = {}'.format(env.action_space))

Output:

Processing /kaggle/input/package/jax_jumpy-1.0.0-py3-none-any.whl
Requirement already satisfied: numpy>=1.18.0 in /opt/conda/lib/python3.7/site-packages (from jax-jumpy==1.0.0) (1.21.6)
Installing collected packages: jax-jumpy
Successfully installed jax-jumpy-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Processing /kaggle/input/package/Farama_Notifications-0.0.4-py3-none-any.whl
Installing collected packages: Farama-Notifications
Successfully installed Farama-Notifications-0.0.4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Processing /kaggle/input/package/gymnasium-0.28.1-py3-none-any.whl
Requirement already satisfied: importlib-metadata>=4.8.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (4.11.4)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (2.2.1)
Requirement already satisfied: numpy>=1.21.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (1.21.6)
Requirement already satisfied: typing-extensions>=4.3.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (4.4.0)
Requirement already satisfied: farama-notifications>=0.0.1 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (0.0.4)
Requirement already satisfied: jax-jumpy>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (1.0.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.8.0->gymnasium==0.28.1) (3.11.0)
Installing collected packages: gymnasium
Successfully installed gymnasium-0.28.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Observation space = Discrete(16)
Action space = Discrete(4)

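A Shortcut

There is also a simpler way: instead of giving the path of every package, we only give the path of the folder that contains them. Example code:

!pip install jax-jumpy Farama-Notifications gymnasium --no-index --find-links=file:///kaggle/input/package

Explanation:

  • --no-index: ignore pip's default package index, i.e. do not look for packages on the network.
  • --find-links: look for packages in the specified directory.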
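
If the folder holds many wheels, typing every package name becomes tedious. As a rough sketch, the same command could be assembled from the folder's contents; offline_install_command is a hypothetical helper, and the temporary folder with empty placeholder files below merely stands in for /kaggle/input/package.

```python
import sys
import tempfile
from pathlib import Path


def offline_install_command(wheel_dir):
    """Build a pip command that installs every wheel found in wheel_dir without using the network."""
    wheel_dir = Path(wheel_dir)
    # The distribution name is the part of a wheel filename before the first "-".
    names = sorted({p.name.split("-")[0] for p in wheel_dir.glob("*.whl")})
    return [sys.executable, "-m", "pip", "install",
            "--no-index", "--find-links", str(wheel_dir), *names]


# Demonstration with empty placeholder files instead of real wheels.
with tempfile.TemporaryDirectory() as tmp:
    for fn in ("jax_jumpy-1.0.0-py3-none-any.whl",
               "Farama_Notifications-0.0.4-py3-none-any.whl",
               "gymnasium-0.28.1-py3-none-any.whl"):
        Path(tmp, fn).touch()
    cmd = offline_install_command(tmp)

print(cmd[2:])
```

On Kaggle the resulting list could then be passed to subprocess.run(cmd, check=True), which is equivalent to running the !pip line above.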

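.tar.gz

Checking for Other Dependencies

We take gym, another package related to reinforcement learning, as the example.

First, we install it online to see which other packages it depends on in the Kaggle environment. Example code:

!pip install gym

Output:
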
Requirement already satisfied: gym in /opt/conda/lib/python3.7/site-packages (0.23.1)
Requirement already satisfied: gym-notices>=0.0.4 in /opt/conda/lib/python3.7/site-packages (from gym) (0.0.8)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gym) (2.2.1)
Requirement already satisfied: importlib-metadata>=4.10.0 in /opt/conda/lib/python3.7/site-packages (from gym) (4.11.4)
Requirement already satisfied: numpy>=1.18.0 in /opt/conda/lib/python3.7/site-packages (from gym) (1.21.6)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.10.0->gym) (3.11.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.10.0->gym) (4.4.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Explanation: the output shows that every requirement is already satisfied, so installing gym in the Kaggle environment needs no additional packages.

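Downloading the Source

Note: this time, choose Source Distribution.

Downloading the source

Building and Installing

After the upload, Kaggle extracts the archive automatically. We only need to copy it to the /kaggle/working/ directory and then build and install it there. Note that gym ships no Makefile, so the two make commands below simply fail (as the output shows); the installation itself is performed by python setup.py install. Example code:

%cp -r /kaggle/input/gym-0262targz/gym-0.26.2 /kaggle/working/
%cd /kaggle/working/gym-0.26.2
!make
!make install
!python setup.py install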

Output:

/kaggle/working/gym-0.26.2
make: *** No targets specified and no makefile found. Stop.
make: *** No rule to make target 'install'. Stop.
running install
/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
writing gym.egg-info/PKG-INFO

[part of the output omitted]

Using /opt/conda/lib/python3.7/site-packages
Searching for zipp==3.11.0
Best match: zipp 3.11.0
Adding zipp 3.11.0 to easy-install.pth file

Using /opt/conda/lib/python3.7/site-packages
Finished processing dependencies for gym==0.26.2

Mind the Path

Note: gym-0.26.2 must first be copied to /kaggle/working/, because /kaggle/input/ is read-only.
Otherwise the installation fails with error: [Errno 30] Read-only file system.

Example code:

%cd /kaggle/input/gym-0262targz/gym-0.26.2
!make
!make install
!python setup.py install

Output:

/kaggle/input/gym-0262targz/gym-0.26.2
make: *** No targets specified and no makefile found. Stop.
make: *** No rule to make target 'install'. Stop.
running install
/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
error: [Errno 30] Read-only file system
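
The copy step can also be done from Python. The sketch below is only an illustration: ensure_writable_copy is a hypothetical helper, and two temporary folders stand in for /kaggle/input/ and /kaggle/working/ so that the example runs anywhere.

```python
import os
import shutil
import tempfile


def ensure_writable_copy(src_dir, work_root):
    """Copy src_dir into work_root (unless it is already there) and return the path of the copy."""
    dst = os.path.join(work_root, os.path.basename(os.path.normpath(src_dir)))
    if not os.path.isdir(dst):
        shutil.copytree(src_dir, dst)
    return dst


# Stand-ins for the read-only input folder and the writable working folder.
with tempfile.TemporaryDirectory() as fake_input, tempfile.TemporaryDirectory() as fake_working:
    src = os.path.join(fake_input, "gym-0.26.2")
    os.makedirs(src)
    with open(os.path.join(src, "setup.py"), "w") as f:
        f.write("# placeholder\n")

    copied = ensure_writable_copy(src, fake_working)
    copied_files = sorted(os.listdir(copied))
    is_writable = os.access(copied, os.W_OK)

print(copied_files, is_writable)
```

The build is then run inside the returned folder, exactly as in the working example above.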

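Some Common Packages

TA-Lib

ta-lib-bin

From the following address, choose a prebuilt file.
https://pypi.org/project/ta-lib-bin/#files

Choose the one the red arrow points to

Explanation:

  • The first one is the source package; so far we have not found a way to build and install it in Kaggle.
  • The second one is deprecated; its most recent update dates from 2021.

Note! Kaggle needs the Linux build.
The packages found at the following address are almost all Windows builds.
https://www.lfd.uci.edu/~gohlke/pythonlibs

Checking the Python Version

Check the version of Python in Kaggle. Example code:
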
!python --version

Output:

Python 3.7.12
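
The Python version matters because a compiled wheel carries a tag such as cp37 or cp310 in its filename, and that tag has to match the interpreter. The following is a deliberately simplified sketch of that check; both helper names are made up for illustration, and real tag matching (as done by pip) considers more cases, for example the ABI and platform tags.

```python
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the CPython tag for a version, e.g. (3, 7, 12) -> 'cp37'."""
    return "cp{}{}".format(version_info[0], version_info[1])


def wheel_matches_interpreter(python_tag, version_info=sys.version_info):
    """Simplified check: accept an exact CPython tag or a generic tag such as 'py3'."""
    accepted = {cpython_tag(version_info), "py{}".format(version_info[0])}
    # Compound tags such as 'py2.py3' list several alternatives separated by dots.
    return any(tag in accepted for tag in python_tag.split("."))


print(cpython_tag())                                    # tag of the running interpreter
print(wheel_matches_interpreter("cp310", (3, 7, 12)))   # a cp310 wheel on Python 3.7
```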

As an aside, if installing the TA-Lib package on macOS produces the following error:

talib/_ta_lib.c:526:10: fatal error: 'ta-lib/ta_defs.h' file not found
#include "ta-lib/ta_defs.h"
^~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

it is because the underlying TA-Lib C library has to be installed first, with the following command:

brew install ta-lib

PyTorch-TabNet

Resolving Conflicts

The overall installation procedure is no different, but with Python 3.10 and PyTorch 2.0.0, installing PyTorch-TabNet 4.0 may run into a conflict.
Installing the following versions resolves the conflict.

!pip install ../input/optiver-tabnet-torch/torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl
!pip install ../input/optiver-tabnet-pytorch-tabnet/pytorch_tabnet-4.0-py3-none-any.whl

  • Python: 3.10.10
  • PyTorch: 1.12.1
  • PyTorch-TabNet: 4.0

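Loading the Model

PyTorch-TabNet exports models in .zip format. Kaggle extracts .zip files on upload, which makes the model impossible to load.
There are two ways around this:

  1. Rename the file
  2. Compress it again

Renaming

Before uploading to Kaggle, change or remove the .zip suffix.

Compressing Again

First, copy the extracted folder to the /kaggle/working/ directory. Example code:

%cp -r /kaggle/input/172235/fold0 /kaggle/working/

Then write the files inside the folder into a new archive:

import os
import zipfile

src_dir = '/kaggle/working/fold0'
# Add each file in the folder; writing only the folder path would store an empty directory entry.
with zipfile.ZipFile('fold0.zip', 'w', zipfile.ZIP_DEFLATED) as fold0_zip:
    for name in os.listdir(src_dir):
        fold0_zip.write(os.path.join(src_dir, name), arcname=name)

Explanation: some Linux commands such as gzip and zip cannot be used in Kaggle. Python's zipfile module can be used instead; it is part of the standard library, so nothing extra has to be installed.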
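
As an alternative sketch, the standard library's shutil.make_archive can rebuild the archive in a single call. The example below runs in a temporary folder; the two file names are placeholders chosen for illustration and only stand in for whatever files the extracted model folder actually contains.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the extracted model folder, e.g. /kaggle/working/fold0.
    model_dir = os.path.join(tmp, "fold0")
    os.makedirs(model_dir)
    for fn in ("model_params.json", "network.pt"):  # placeholder file names
        with open(os.path.join(model_dir, fn), "w") as f:
            f.write("placeholder")

    # Creates <tmp>/fold0.zip containing the files inside model_dir.
    archive_path = shutil.make_archive(os.path.join(tmp, "fold0"), "zip", root_dir=model_dir)

    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(zf.namelist())

print(os.path.basename(archive_path), archived_names)
```

Listing the archive with namelist() is a quick way to confirm that the files, and not just an empty folder entry, ended up inside.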

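Importing External .py Files

How It Works

When we write import XXX, the Python interpreter searches the current directory, the standard library, and the installed third-party packages; if the module is found in none of them, an error is raised.
We can inspect sys.path to see which directories the Python interpreter searches in Kaggle.

Example code:
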
import sys
print(sys.path)

Output:

['/kaggle/working', '/kaggle/lib/kagglegym', '/kaggle/lib', '/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/root/.local/lib/python3.10/site-packages', '/opt/conda/lib/python3.10/site-packages', '/src/bq-helper']
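
To see whether a particular module can be found on these paths, and from which file it would be loaded, importlib.util.find_spec from the standard library can be used. This is a small sketch; locate_module is a hypothetical helper name and it is meant for top-level module names only.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


json_origin = locate_module("json")
missing_origin = locate_module("surely_not_installed_module_xyz")

print(json_origin)
print(missing_origin)
```

A result of None corresponds to the situation in which import would fail with "No module named ...".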

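An application of this principle can be seen in the "Importing xtquant" part of "Getting Started with the QMT Quantitative Platform: 2. Mini-QMT" (《QMT量化平台使用入门:2.QMT极简版(Mini-QMT)》).

Method One

sys.path.append() temporarily adds a search path, which lets us import packages and modules that are not in the directories listed above.
We can also add a path with sys.path.insert(0, XXX); the newly added path is then checked by import before all other directories.
Paths added this way are lost once the Python program exits.

Suppose there is a file named my_package.py with the following content:

def ping():
    print('pong')

Upload the file:
my_package

Then add its folder to the search path with sys.path.insert(0, '../input/my-package'). Example code:
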
import sys
sys.path.insert(0, '../input/my-package')
import my_package
my_package.ping()

Output:

pong

Method Two

Copy the .py file from /kaggle/input/ to /kaggle/working/.

It can be copied with a Linux command:

%cp -r /kaggle/input/XXX /kaggle/working/

or with Python code:

import shutil

shutil.copyfile(r"/kaggle/input/codejpx/Features.py", r"./Features.py")

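Reusing Models and Data from Earlier Notebooks

Sometimes, to save time, we directly reuse the models and data produced by an earlier .ipynb.
In that case there is no need to download them first and upload them again.

Find the "Output" section and choose "New dataset" directly.

New dataset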

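Parallel Computing (joblib)

Overview

joblib is a package for parallel computing.

Official documentation: https://joblib.readthedocs.io
GitHub: https://github.com/joblib/joblib

Beyond basic multiprocessing, joblib also provides:

  • automatic task dispatching
  • optimized disk reads
  • optimizations for large NumPy arrays

We discussed multiprocessing in "Getting Started with Backend Development in Python: 4. Multithreading, Coroutines, and Multiprocessing" (《基于Python的后端开发入门:4.多线程、协程、多进程》).
There we noted that Python's built-in multiprocessing works most smoothly on Linux, has limitations on macOS (sharing data between processes did not work), and on Windows only works when the code is launched under if __name__ == '__main__':.

With joblib, however, all three systems are supported.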

A First Example

Example code:

import time, math
from joblib import Parallel, delayed


def my_fun(i):
    time.sleep(1)
    return math.sqrt(i ** 2)


num = 10
start = time.time()
for i in range(num):
    my_fun(i)

end = time.time()

print('{:.4f} s'.format(end - start))

start = time.time()
# n_jobs is the number of parallel jobs
Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num))
end = time.time()
print('{:.4f} s'.format(end-start))

Output:

10.0361 s
5.5562 s

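Run sequentially, the ten one-second tasks take about 10 seconds; run in parallel with two workers, they need only about 5.5 seconds (five seconds of work per worker plus startup overhead).

Note! Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num)) must be written as a single expression. It cannot be split into two lines as below: the first line would only create the Parallel object, and the second line would be a separate generator expression that is never run. To break the line, wrap the whole expression in an extra pair of parentheses.

Parallel(n_jobs=2)
(delayed(my_fun)(i) for i in range(num))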

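Parameters

Let us look closely at this line:

Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num))

  • n_jobs: int, default: None
    The maximum number of concurrently running jobs.
    With backend="multiprocessing" it is the number of Python worker processes; with backend="threading" it is the size of the thread pool.
    For values below 0, (n_cpus + 1 + n_jobs) CPUs are used; so -1 uses all CPUs and -2 leaves one CPU unused.
  • backend: str, ParallelBackendBase instance, or None, default: loky
    Specifies the parallelization backend.
    • loky: used by default.
    • multiprocessing: a backend based on multiprocessing.Pool; less robust than loky.
    • threading: thread-based; affected by Python's global interpreter lock (GIL).
  • verbose: int, optional
    The verbosity level.
    If non-zero, progress messages are printed.
  • timeout: float, optional
    The time limit for each task to complete.
    If any task takes longer than timeout, a TimeOutError is raised.
    Only applies when n_jobs != 1.
  • pre_dispatch: 'all', int, or an expression
    The number of batches of tasks to pre-dispatch.
    It can be an expression such as 3 * n_jobs; the default is 2 * n_jobs.
  • batch_size: int or 'auto', default: 'auto'
    The number of atomic tasks dispatched to each worker at once.

There are many more parameters. We do not list them all here; see the official documentation.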
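
The rule for negative n_jobs values is easy to check with a little arithmetic. The function below is a hypothetical illustration of the formula quoted above, not joblib's own code; clamping the result to at least 1 and rejecting 0 are choices made for this sketch.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Translate an n_jobs value into a worker count using (n_cpus + 1 + n_jobs) for negatives."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Never go below one worker, even for very negative values.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# On a machine with 4 CPUs:
for n_jobs in (2, -1, -2):
    print(n_jobs, "->", effective_workers(n_jobs, n_cpus=4))
```

With 4 CPUs, -1 gives 4 workers and -2 gives 3, which matches the description "-2 leaves one CPU unused".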

A Real-World Example

Here is another example, taken from a real application.

Sequential execution, example code:

import time

t1 = time.time()

df_past_realized_train_f000 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f000',from_second=0)
df_past_realized_train_f100 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f100',from_second=100)
df_past_realized_train_f200 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f200',from_second=200)
df_past_realized_train_f300 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f300',from_second=300)
df_past_realized_train_f400 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f400',from_second=400)
df_past_realized_train_f500 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f500',from_second=500)

print(df_past_realized_train_f000.shape)
print(df_past_realized_train_f100.shape)
print(df_past_realized_train_f200.shape)
print(df_past_realized_train_f300.shape)
print(df_past_realized_train_f400.shape)
print(df_past_realized_train_f500.shape)

print(time.time() - t1)

Output:

(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
443.0049788951874

Converted to parallel computation, example code:

import time
from joblib import Parallel, delayed

t1 = time.time()

df_past_realized_train_f000 = None
df_past_realized_train_f100 = None
df_past_realized_train_f200 = None
df_past_realized_train_f300 = None
df_past_realized_train_f400 = None
df_past_realized_train_f500 = None


def my_func(prediction_column_name_val, from_second_val):
    temp = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
                                              prediction_column_name=prediction_column_name_val,
                                              from_second=from_second_val)
    print(temp.shape)
    # Return the column name together with the result so the caller can tell the results apart.
    return prediction_column_name_val, temp


my_list = [
    ('f000', 0),
    ('f100', 100),
    ('f200', 200),
    ('f300', 300),
    ('f400', 400),
    ('f500', 500),
]

# Run the six tasks in parallel on all available CPUs.
result_list = Parallel(n_jobs=-1)(delayed(my_func)(list_iter[0], list_iter[1]) for list_iter in my_list)

for result_tu in result_list:
    if result_tu[0] == 'f000':
        df_past_realized_train_f000 = result_tu[1]
    elif result_tu[0] == 'f100':
        df_past_realized_train_f100 = result_tu[1]
    elif result_tu[0] == 'f200':
        df_past_realized_train_f200 = result_tu[1]
    elif result_tu[0] == 'f300':
        df_past_realized_train_f300 = result_tu[1]
    elif result_tu[0] == 'f400':
        df_past_realized_train_f400 = result_tu[1]
    elif result_tu[0] == 'f500':
        df_past_realized_train_f500 = result_tu[1]

print(df_past_realized_train_f000.shape)
print(df_past_realized_train_f100.shape)
print(df_past_realized_train_f200.shape)
print(df_past_realized_train_f300.shape)
print(df_past_realized_train_f400.shape)
print(df_past_realized_train_f500.shape)

print(time.time() - t1)

Output:

(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
136.16027998924255
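  • Whatever my_func returns, the call Parallel(n_jobs=-1)(...) always returns a list of those return values.
  • Compared with the sequential run above, the parallel version is markedly faster (about 136 s versus 443 s).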
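
Because each task returns a (name, result) pair, the returned list can also be turned into a dictionary instead of being unpacked with a chain of if/elif branches. The sketch below shows the idea without joblib: fake_task is a hypothetical stand-in for my_func, and the list comprehension stands in for the Parallel(...) call.

```python
my_list = [('f000', 0), ('f100', 100), ('f200', 200)]


def fake_task(prediction_column_name_val, from_second_val):
    # Hypothetical stand-in for my_func: returns (name, result) just like the original.
    return prediction_column_name_val, 'result from second {}'.format(from_second_val)


# Stand-in for:
# result_list = Parallel(n_jobs=-1)(delayed(my_func)(name, sec) for name, sec in my_list)
result_list = [fake_task(name, sec) for name, sec in my_list]

# A list of (key, value) pairs converts directly into a dict.
results_by_name = dict(result_list)

print(results_by_name['f100'])
```

Each DataFrame would then be available as results_by_name['f000'], results_by_name['f100'], and so on.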

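Memory Management

Releasing Memory Explicitly

Variables that are definitely no longer needed can be deleted with the del statement, followed by a call to gc.collect() to make sure the memory is reclaimed.
Example code:

import gc

[part of the code omitted]

# df is definitely no longer needed; delete it.
del df
# Make sure the memory is reclaimed.
gc.collect()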
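
Why call gc.collect() at all? In CPython, del only removes a name; an object that is still part of a reference cycle is not freed until the garbage collector runs. The following sketch makes this visible with a weak reference. The Block class is a made-up stand-in for a large object such as a DataFrame.

```python
import gc
import weakref


class Block:
    """Hypothetical stand-in for a large object such as a DataFrame."""

    def __init__(self):
        self.payload = bytearray(10_000_000)  # roughly 10 MB


big = Block()
big.self_ref = big          # a reference cycle: the object refers to itself
probe = weakref.ref(big)    # lets us observe the object without keeping it alive

del big                     # removes the name, but the cycle still holds the object
gc.collect()                # the collector finds the cycle and frees it

released = probe() is None
print(released)
```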

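Defining dtypes

The data types pandas chooses by default when reading data waste memory.
If we know in advance how much storage each column needs, we can define dtypes to fix the data types up front. Note that float16 keeps only about three significant decimal digits, so it suits columns where that precision is enough.
Example code:
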
dtypes = {
'Id' : 'uint32',
'groupId' : 'uint32',
'matchId' : 'uint16',
'assists' : 'uint8',
'boosts' : 'uint8',
'damageDealt' : 'float16',
'DBNOs' : 'uint8',
'headshotKills' : 'uint8',
'heals' : 'uint8',
'killPlace' : 'uint8',
'killPoints' : 'uint16',
'kills' : 'uint8',
'killStreaks' : 'uint8',
'longestKill' : 'float16',
'maxPlace' : 'uint8',
'numGroups' : 'uint8',
'revives' : 'uint8',
'rideDistance' : 'float16',
'roadKills' : 'uint8',
'swimDistance' : 'float16',
'teamKills' : 'uint8',
'vehicleDestroys' : 'uint8',
'walkDistance' : 'float16',
'weaponsAcquired' : 'uint8',
'winPoints' : 'uint8',
'winPlacePerc' : 'float16'
}
train = pd.read_csv('../input/train.csv', dtype=dtypes)

Compressing a DataFrame

Example code:

import pandas as pd
import numpy as np

# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    # return df

Load a sample file and inspect its memory usage:

df = pd.read_csv('adult.data', header=None)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 32561 non-null int64
1 1 32561 non-null object
2 2 32561 non-null int64
3 3 32561 non-null object
4 4 32561 non-null int64
5 5 32561 non-null object
6 6 32561 non-null object
7 7 32561 non-null object
8 8 32561 non-null object
9 9 32561 non-null object
10 10 32561 non-null int64
11 11 32561 non-null int64
12 12 32561 non-null int64
13 13 32561 non-null object
14 14 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Apply the function and inspect the memory usage again:

reduce_mem_usage(df)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 32561 non-null int8
1 1 32561 non-null object
2 2 32561 non-null int32
3 3 32561 non-null object
4 4 32561 non-null int8
5 5 32561 non-null object
6 6 32561 non-null object
7 7 32561 non-null object
8 8 32561 non-null object
9 9 32561 non-null object
10 10 32561 non-null int32
11 11 32561 non-null int16
12 12 32561 non-null int8
13 13 32561 non-null object
14 14 32561 non-null object
dtypes: int16(1), int32(2), int8(3), object(9)
memory usage: 2.6+ MB

reduce_mem_usage() modifies the DataFrame in place, so there is no need to return it and assign the result to a variable. Simply call reduce_mem_usage([the DataFrame to be compressed]).
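
The core of the integer branch is a range check: pick the narrowest type whose range contains the column's minimum and maximum. The sketch below reproduces that idea in plain Python so it can be run without pandas or NumPy. smallest_signed_int is a hypothetical helper, and unlike the function above it treats the range limits as inclusive.

```python
INT_BITS = (8, 16, 32, 64)


def smallest_signed_int(c_min, c_max):
    """Return the name of the narrowest signed integer type that can hold [c_min, c_max]."""
    for bits in INT_BITS:
        lo = -(2 ** (bits - 1))
        hi = 2 ** (bits - 1) - 1
        if lo <= c_min and c_max <= hi:
            return 'int{}'.format(bits)
    raise OverflowError('range does not fit into a 64-bit signed integer')


# Illustrative ranges (made-up values, not taken from the data set above).
for c_min, c_max in ((17, 90), (0, 99999), (12285, 1484705)):
    print((c_min, c_max), '->', smallest_signed_int(c_min, c_max))
```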

Loading Only What Is Needed

Selecting Rows

Read the first 10,000 rows. Example code:

train = pd.read_csv('../input/train.csv',nrows=10000)

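Skip specified rows. For example, skip the lines with index 1 through 2,999,999 (the line with index 0 is most likely the header; range excludes its end value) and then read the next 10,000 rows. Example code:
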
train = pd.read_csv('../input/train.csv', skiprows=range(1, 3000000), nrows=10000)
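
To make the interplay of skiprows and nrows concrete, the following sketch imitates it on a tiny in-memory CSV using only the standard library. read_window is a hypothetical helper that mimics the behaviour described above; it is not how pandas implements it.

```python
import csv
import io


def read_window(text, skiprows, nrows):
    """Keep the header, drop the 0-based line indices in skiprows, then return up to nrows data rows."""
    skip = set(skiprows)
    kept = [row for idx, row in enumerate(csv.reader(io.StringIO(text))) if idx not in skip]
    header, data = kept[0], kept[1:]
    return header, data[:nrows]


# A header line (index 0) followed by ten data lines (indices 1 to 10).
sample = "id,value\n" + "".join("{},{}\n".format(i, i * 10) for i in range(1, 11))

# range(1, 4) skips line indices 1, 2 and 3, so reading resumes at the line with id 4.
header, rows = read_window(sample, skiprows=range(1, 4), nrows=2)
print(header, rows)
```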

Selecting Columns

columns = ['Id', 'groupId', 'matchId','killPlace','killPoints','kills','killStreaks','longestKill','winPlacePerc']

dtypes = {
    'Id' : 'uint32',
    'groupId' : 'uint32',
    'matchId' : 'uint16',
    'killPlace' : 'uint8',
    'killPoints' : 'uint16',
    'kills' : 'uint8',
    'killStreaks' : 'uint8',
    'longestKill' : 'float16',
    'winPlacePerc' : 'float16'
}
# Pass dtype as well, so that the types defined above are actually applied.
train = pd.read_csv('../input/train.csv', usecols=columns, dtype=dtypes)

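Author: Kaka Wan Yifan
Article link: https://kakawanyifan.com/19909
Copyright notice: All articles on this blog are copyrighted by their author. No organization or individual may reprint, excerpt, or copy them in any form without written permission.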
