

Solutions to Technical Problems on Kaggle

Methods for Installing Packages Offline

Online Installation

The Missing Package

Take gymnasium, the package discussed in 《强化学习浅谈及其Python实现:8.自定义环境(以股票交易为例)》, as an example. Importing it fails with:

No module named 'gymnasium'

Installation

If network access is allowed, we can simply install online with !pip install gymnasium.

Sample code:

!pip install gymnasium

import gymnasium as gym

env = gym.make('FrozenLake-v1')
print('Observation space = {}'.format(env.observation_space))
print('Action space = {}'.format(env.action_space))

Output:

Collecting gymnasium
Downloading gymnasium-0.28.1-py3-none-any.whl (925 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 925.5/925.5 kB 16.0 MB/s eta 0:00:0000:01
Collecting jax-jumpy>=1.0.0
Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (2.2.1)
Requirement already satisfied: numpy>=1.21.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (1.21.6)
Requirement already satisfied: importlib-metadata>=4.8.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (4.11.4)
Requirement already satisfied: typing-extensions>=4.3.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium) (4.4.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.8.0->gymnasium) (3.11.0)
Installing collected packages: farama-notifications, jax-jumpy, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.28.1 jax-jumpy-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Observation space = Discrete(16)
Action space = Discrete(4)

Other Dependencies

In the output above, the following part deserves attention:

Collecting jax-jumpy>=1.0.0
Downloading jax_jumpy-1.0.0-py3-none-any.whl (20 kB)
Collecting farama-notifications>=0.0.1
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)

These two are the additional dependencies that have to be downloaded and installed when installing gymnasium in the Kaggle environment.

If we install gymnasium offline, we also need to install these two offline.
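Incidentally, on a machine that does have internet access, pip can download a package together with all of its dependencies in one step, which makes collecting the wheels to upload much easier. A convenience sketch, not part of the original workflow:

pip download gymnasium -d ./packages

The -d option sets the directory the files are saved into; everything in ./packages can then be uploaded to Kaggle as a dataset.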

.whl

If network access is not allowed, the only option is to install offline.

For offline installation, packages come in two forms:

  1. .whl, already built
  2. .tar.gz, not yet built (source)

Downloading the Package

Search for and download the package at the following address:
https://pypi.org

When downloading, be sure to choose the Built Distribution, i.e. the already-built .whl.


Uploading the Package

Click where the arrow points to upload the package.


A dialog will pop up asking for a name; note the length limit on the name.


We can also click the link on the right and provide the package's URL directly, which skips the download-then-upload step and is faster.

Getting the Path

Click where the red arrow points to get the path.


Installation

Sample code:

!pip install /kaggle/input/package/jax_jumpy-1.0.0-py3-none-any.whl
!pip install /kaggle/input/package/Farama_Notifications-0.0.4-py3-none-any.whl
!pip install /kaggle/input/package/gymnasium-0.28.1-py3-none-any.whl

import gymnasium as gym

env = gym.make('FrozenLake-v1')
print('Observation space = {}'.format(env.observation_space))
print('Action space = {}'.format(env.action_space))

Output:

Processing /kaggle/input/package/jax_jumpy-1.0.0-py3-none-any.whl
Requirement already satisfied: numpy>=1.18.0 in /opt/conda/lib/python3.7/site-packages (from jax-jumpy==1.0.0) (1.21.6)
Installing collected packages: jax-jumpy
Successfully installed jax-jumpy-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Processing /kaggle/input/package/Farama_Notifications-0.0.4-py3-none-any.whl
Installing collected packages: Farama-Notifications
Successfully installed Farama-Notifications-0.0.4
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Processing /kaggle/input/package/gymnasium-0.28.1-py3-none-any.whl
Requirement already satisfied: importlib-metadata>=4.8.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (4.11.4)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (2.2.1)
Requirement already satisfied: numpy>=1.21.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (1.21.6)
Requirement already satisfied: typing-extensions>=4.3.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (4.4.0)
Requirement already satisfied: farama-notifications>=0.0.1 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (0.0.4)
Requirement already satisfied: jax-jumpy>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from gymnasium==0.28.1) (1.0.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.8.0->gymnasium==0.28.1) (3.11.0)
Installing collected packages: gymnasium
Successfully installed gymnasium-0.28.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: There was an error checking the latest version of pip.
Observation space = Discrete(16)
Action space = Discrete(4)

A Simpler Way

There is also a simpler way: instead of specifying the path of every package, we only need to specify the path of the folder. Sample code:

!pip install jax-jumpy Farama-Notifications gymnasium --no-index --find-links=file:///kaggle/input/package

Explanation:

  • --no-index: ignore pip's default package index, i.e. do not look for packages on the network.
  • --find-links: look for packages in the specified directory.

.tar.gz

Checking for Other Dependencies

Let's take gym, another reinforcement-learning-related package, as an example.

First, we install it online to see which other packages it depends on in the Kaggle environment. Sample code:

!pip install gym

Output:

Requirement already satisfied: gym in /opt/conda/lib/python3.7/site-packages (0.23.1)
Requirement already satisfied: gym-notices>=0.0.4 in /opt/conda/lib/python3.7/site-packages (from gym) (0.0.8)
Requirement already satisfied: cloudpickle>=1.2.0 in /opt/conda/lib/python3.7/site-packages (from gym) (2.2.1)
Requirement already satisfied: importlib-metadata>=4.10.0 in /opt/conda/lib/python3.7/site-packages (from gym) (4.11.4)
Requirement already satisfied: numpy>=1.18.0 in /opt/conda/lib/python3.7/site-packages (from gym) (1.21.6)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.10.0->gym) (3.11.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.10.0->gym) (4.4.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Explanation: from the output we can see that installing gym in the Kaggle environment requires no additional packages.

Downloading the Source

Note: this time, be sure to choose the Source Distribution.


Building and Installing

After uploading, Kaggle unpacks the archive automatically; we just need to move it to the /kaggle/working/ directory and build and install it there. (As the output below shows, the two make commands fail harmlessly because gym ships no Makefile; it is python setup.py install that actually performs the installation.) Sample code:

%cp -r /kaggle/input/gym-0262targz/gym-0.26.2 /kaggle/working/
%cd /kaggle/working/gym-0.26.2
!make
!make install
!python setup.py install

Output:

/kaggle/working/gym-0.26.2
make: *** No targets specified and no makefile found. Stop.
make: *** No rule to make target 'install'. Stop.
running install
/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
writing gym.egg-info/PKG-INFO

[part of the output omitted]

Using /opt/conda/lib/python3.7/site-packages
Searching for zipp==3.11.0
Best match: zipp 3.11.0
Adding zipp 3.11.0 to easy-install.pth file

Using /opt/conda/lib/python3.7/site-packages
Finished processing dependencies for gym==0.26.2

Mind the Path

Note: gym-0.26.2 must be moved to /kaggle/working/.
Otherwise you will get an error: [Errno 30] Read-only file system.

Sample code:

%cd /kaggle/input/gym-0262targz/gym-0.26.2
!make
!make install
!python setup.py install

Output:

/kaggle/input/gym-0262targz/gym-0.26.2
make: *** No targets specified and no makefile found. Stop.
make: *** No rule to make target 'install'. Stop.
running install
/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
/opt/conda/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
error: [Errno 30] Read-only file system

Some Common Packages

TA-Lib

ta-lib-bin

From the following address, choose one that is already built:
https://pypi.org/project/ta-lib-bin/#files

Choose the one the red arrow points to.

Explanation:

  • The first one is the source package; I have not yet found a way to build and install it on Kaggle.
  • The second one has been deprecated; its last update was in 2021.

Note! On Kaggle you need the Linux build.
The packages found at the following address are almost all Windows builds:
https://www.lfd.uci.edu/~gohlke/pythonlibs
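Before picking a wheel, it can help to confirm the operating system and architecture of the Kaggle environment. A quick check, not from the original post:

import platform

print(platform.system())   # e.g. Linux
print(platform.machine())  # e.g. x86_64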

Checking the Python Version

Check the Python version in Kaggle. Sample code:

!python --version

Output:

Python 3.7.12

As an aside, if installing the TA-Lib package on macOS produces the following error:

talib/_ta_lib.c:526:10: fatal error: 'ta-lib/ta_defs.h' file not found
#include "ta-lib/ta_defs.h"
^~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

it is because the TA-Lib C library needs to be installed first, with the following command:

brew install ta-lib

PyTorch-TabNet

Resolving Conflicts

The overall installation method is no different, but if the Python version is 3.10 and the PyTorch version is 2.0.0, installing PyTorch-TabNet 4.0 may produce a conflict.
Installing the following versions resolves the conflict:

!pip install ../input/optiver-tabnet-torch/torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl
!pip install ../input/optiver-tabnet-pytorch-tabnet/pytorch_tabnet-4.0-py3-none-any.whl

  • Python: 3.10.10
  • PyTorch: 1.12.1
  • PyTorch-TabNet: 4.0

Importing the Model

PyTorch-TabNet exports models in .zip format, and Kaggle automatically unzips .zip files, which prevents the model from being imported.
There are two solutions:

  1. Rename the file
  2. Zip it again

Renaming

Before uploading to Kaggle, change or remove the .zip suffix.
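The renamed file can then be passed to load_model as-is. A minimal sketch, assuming a hypothetical dataset path /kaggle/input/my-tabnet/ and a regression model:

from pytorch_tabnet.tab_model import TabNetRegressor

model = TabNetRegressor()
# the uploaded file carries no .zip suffix, so Kaggle did not unzip it;
# load_model reads it as a zip archive regardless of the extension
model.load_model('/kaggle/input/my-tabnet/fold0')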

Zipping Again

First, copy it to the /kaggle/working/ directory. Sample code:

%cp -r /kaggle/input/172235/fold0 /kaggle/working/
import zipfile

fold0_zip = zipfile.ZipFile('fold0.zip', 'w')
fold0_zip.write('/kaggle/working/fold0', compress_type=zipfile.ZIP_DEFLATED)
# close the archive so the zip file is fully written out
fold0_zip.close()

Explanation: some Linux commands such as gzip and zip cannot be used in Kaggle; Python's zipfile package can be used instead, and it needs no extra installation. (An alternative that zips a whole directory is sketched below.)
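If fold0 is a directory (which is what Kaggle's automatic unzipping produces), shutil.make_archive is a convenient alternative that zips the whole tree in one call. A hedged sketch, not the original author's code:

import shutil

# creates /kaggle/working/fold0.zip containing the contents of the fold0 directory
shutil.make_archive('/kaggle/working/fold0', 'zip', '/kaggle/working/fold0')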

Importing External .py Files

How It Works

When you write import XXX in Python, the interpreter searches the current directory and the locations of built-in and third-party modules; if the module is found in none of them, an error is raised.
We can inspect sys.path to see which directories the Python interpreter searches in Kaggle.

Sample code:

import sys
print(sys.path)

Output:

['/kaggle/working', '/kaggle/lib/kagglegym', '/kaggle/lib', '/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/root/.local/lib/python3.10/site-packages', '/opt/conda/lib/python3.10/site-packages', '/src/bq-helper']

An application of this principle appears in the "importing xtquant" part of 《QMT量化平台使用入门:2.QMT极简版(Mini-QMT)》.

Method One

sys.path.append() temporarily adds a search path, which makes it possible to import packages and modules that are not in the directories above.
A path can also be added with sys.path.insert(0, XXX); a path added this way is checked by import before the other directories.
Paths added this way are lost once the Python program exits.

Suppose there is a file named my_package.py with the following content:

def ping():
    print('pong')

Upload the file.

Add it to the search path with sys.path.insert(0, '../input/my-package'). Sample code:

import sys
sys.path.insert(0, '../input/my-package')
import my_package
my_package.ping()

Output:

pong

Method Two

Copy the .py file located in /kaggle/input/ to /kaggle/working/.

It can be copied with a Linux command:

%cp -r /kaggle/input/XXX /kaggle/working/

Or with Python code:

import shutil

shutil.copyfile(r"/kaggle/input/codejpx/Features.py", r"./Features.py")

Uploading Models and Data from a Previous Step

Sometimes, to save time, we directly reuse the models and data produced by a previous .ipynb.
In that case there is no need to download them first and upload them again.

Find the "Output" section and simply select "New dataset".


Parallel Computing (joblib)

Overview

joblib is a package for parallel computing.

Official documentation: https://joblib.readthedocs.io
GitHub: https://github.com/joblib/joblib

Besides basic multiprocessing functionality, joblib also implements:

  • Automatic task dispatching
  • Optimized disk reading
  • Optimizations for large NumPy arrays

In 《基于Python的后端开发入门:4.多线程、协程、多进程》 we discussed multiprocessing.
There we noted that Python's multiprocessing fully supports only Linux, partially supports macOS (sharing data between processes is not supported), and barely supports Windows (unless started under if __name__ == '__main__':).

joblib, however, supports all three systems.

Getting Started

Sample code:

import time, math
from joblib import Parallel, delayed


def my_fun(i):
    time.sleep(1)
    return math.sqrt(i ** 2)


num = 10
start = time.time()
for i in range(num):
    my_fun(i)

end = time.time()

print('{:.4f} s'.format(end - start))

start = time.time()
# n_jobs is the number of parallel jobs
Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num))
end = time.time()
print('{:.4f} s'.format(end - start))

Output:

10.0361 s
5.5562 s

We can see that sequential execution takes about 10 seconds, whereas parallel execution takes only about 5 seconds: ten one-second tasks spread over two workers.

Note! Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num)) must be written as one expression. It cannot be split into two statements like the following, because the second line is then just a generator expression that is never executed:

Parallel(n_jobs=2)
(delayed(my_fun)(i) for i in range(num))
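If you do want two lines, bind the Parallel object to a name first and then call it; this is equivalent and does execute:

parallel = Parallel(n_jobs=2)
results = parallel(delayed(my_fun)(i) for i in range(num))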

Parameter Explanations

Let's focus on this line:

Parallel(n_jobs=2)(delayed(my_fun)(i) for i in range(num))

  • n_jobs: int, default: None.
    The maximum number of concurrently running jobs.
    With backend="multiprocessing" it is the number of Python worker processes; with backend="threading" it is the size of the thread pool.
    For values below (but not equal to) 0, (n_cpus + 1 + n_jobs) CPUs are used; so -1 means use all CPUs, and -2 means leave one CPU unused.
  • backend: str, a ParallelBackendBase instance, or None, default: loky.
    Specifies the parallelization backend implementation.
    • loky: used by default.
    • multiprocessing: a backend based on multiprocessing.Pool; less robust than loky.
    • threading: threads, which are affected by Python's Global Interpreter Lock.
  • verbose: int, optional.
    The verbosity level.
    If nonzero, progress messages are printed.
  • timeout: float, optional.
    A time limit for each task to complete.
    If any task takes longer than timeout, a TimeoutError is raised.
    Only effective when n_jobs != 1.
  • pre_dispatch: 'all', an int, or an expression such as '3 * n_jobs'; default: '2 * n_jobs'.
    The number of batches of tasks to pre-dispatch.
  • batch_size: int or 'auto', default: 'auto'.
    The number of atomic tasks to dispatch at once to each worker.

Beyond these, there are many more parameters; rather than list them all, we refer to the official documentation. A small sketch combining a few of the parameters above follows.
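For illustration, a sketch with a made-up workload:

import time
from joblib import Parallel, delayed


def work(i):
    time.sleep(0.5)
    return i * i


# thread pool of 4, progress messages on, 30-second limit per task
results = Parallel(n_jobs=4, backend='threading', verbose=10, timeout=30)(
    delayed(work)(i) for i in range(8)
)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]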

A Real-World Example

Let's look at another example, from a real application.

Sequential execution (past_realized_volatility_per_stock and list_order_book_file_train are defined elsewhere in the notebook). Sample code:

import time

t1 = time.time()

df_past_realized_train_f000 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f000',from_second=0)
df_past_realized_train_f100 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f100',from_second=100)
df_past_realized_train_f200 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f200',from_second=200)
df_past_realized_train_f300 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f300',from_second=300)
df_past_realized_train_f400 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f400',from_second=400)
df_past_realized_train_f500 = past_realized_volatility_per_stock(list_file=list_order_book_file_train,prediction_column_name='f500',from_second=500)

print(df_past_realized_train_f000.shape)
print(df_past_realized_train_f100.shape)
print(df_past_realized_train_f200.shape)
print(df_past_realized_train_f300.shape)
print(df_past_realized_train_f400.shape)
print(df_past_realized_train_f500.shape)

print(time.time() - t1)

Output:

(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
443.0049788951874

Rewritten as a parallel computation. Sample code:

import time
from joblib import Parallel, delayed

t1 = time.time()

df_past_realized_train_f000 = None
df_past_realized_train_f100 = None
df_past_realized_train_f200 = None
df_past_realized_train_f300 = None
df_past_realized_train_f400 = None
df_past_realized_train_f500 = None


def my_func(prediction_column_name_val, from_second_val):
    temp = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
                                              prediction_column_name=prediction_column_name_val,
                                              from_second=from_second_val)
    print(temp.shape)
    # return the column name along with the result so the caller can tell
    # which task produced which DataFrame
    return prediction_column_name_val, temp


my_list = [
    ('f000', 0),
    ('f100', 100),
    ('f200', 200),
    ('f300', 300),
    ('f400', 400),
    ('f500', 500),
]

# run the six computations in parallel, using all CPUs
result_list = Parallel(n_jobs=-1)(delayed(my_func)(list_iter[0], list_iter[1]) for list_iter in my_list)

for result_tu in result_list:
    if result_tu[0] == 'f000':
        df_past_realized_train_f000 = result_tu[1]
    elif result_tu[0] == 'f100':
        df_past_realized_train_f100 = result_tu[1]
    elif result_tu[0] == 'f200':
        df_past_realized_train_f200 = result_tu[1]
    elif result_tu[0] == 'f300':
        df_past_realized_train_f300 = result_tu[1]
    elif result_tu[0] == 'f400':
        df_past_realized_train_f400 = result_tu[1]
    elif result_tu[0] == 'f500':
        df_past_realized_train_f500 = result_tu[1]

print(df_past_realized_train_f000.shape)
print(df_past_realized_train_f100.shape)
print(df_past_realized_train_f200.shape)
print(df_past_realized_train_f300.shape)
print(df_past_realized_train_f400.shape)
print(df_past_realized_train_f500.shape)

print(time.time() - t1)

Output:

(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
(19150, 32)
136.16027998924255
Explanation:

  • Whatever my_func is defined to return, the return value of Parallel(n_jobs=-1)(...) is always a list.
  • Compared with the sequential execution above, the parallel computation is significantly faster.

Memory Management

Releasing Memory Proactively

Variables that are confirmed to be no longer in use can be deleted with del, followed by gc.collect() to make sure the memory is reclaimed.
Sample code:

import gc

# [some code omitted]

# df is confirmed to be no longer in use; delete it
del df
# make sure the memory is reclaimed
gc.collect()

Defining dtypes

With the default data types pandas uses when reading data, memory is wasted.
If we know in advance which types each column needs, we can define dtypes to pre-specify the data types.
Sample code:

dtypes = {
    'Id': 'uint32',
    'groupId': 'uint32',
    'matchId': 'uint16',
    'assists': 'uint8',
    'boosts': 'uint8',
    'damageDealt': 'float16',
    'DBNOs': 'uint8',
    'headshotKills': 'uint8',
    'heals': 'uint8',
    'killPlace': 'uint8',
    'killPoints': 'uint16',
    'kills': 'uint8',
    'killStreaks': 'uint8',
    'longestKill': 'float16',
    'maxPlace': 'uint8',
    'numGroups': 'uint8',
    'revives': 'uint8',
    'rideDistance': 'float16',
    'roadKills': 'uint8',
    'swimDistance': 'float16',
    'teamKills': 'uint8',
    'vehicleDestroys': 'uint8',
    'walkDistance': 'float16',
    'weaponsAcquired': 'uint8',
    'winPoints': 'uint8',
    'winPlacePerc': 'float16'
}
train = pd.read_csv('../input/train.csv', dtype=dtypes)
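To see how much the pre-defined dtypes actually save, the memory footprint can be inspected directly. A quick check, not from the original post:

# deep=True also counts the memory held by object (string) columns
print('{:.2f} MB'.format(train.memory_usage(deep=True).sum() / 1024 ** 2))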

Compressing a DataFrame

Sample code:

import pandas as pd
import numpy as np

# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # start_mem = df.memory_usage().sum() / 1024**2
    # print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    # end_mem = df.memory_usage().sum() / 1024**2
    # print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    # print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    # return df
df = pd.read_csv('adult.data', header=None)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 32561 non-null int64
1 1 32561 non-null object
2 2 32561 non-null int64
3 3 32561 non-null object
4 4 32561 non-null int64
5 5 32561 non-null object
6 6 32561 non-null object
7 7 32561 non-null object
8 8 32561 non-null object
9 9 32561 non-null object
10 10 32561 non-null int64
11 11 32561 non-null int64
12 12 32561 non-null int64
13 13 32561 non-null object
14 14 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
reduce_mem_usage(df)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 32561 non-null int8
1 1 32561 non-null object
2 2 32561 non-null int32
3 3 32561 non-null object
4 4 32561 non-null int8
5 5 32561 non-null object
6 6 32561 non-null object
7 7 32561 non-null object
8 8 32561 non-null object
9 9 32561 non-null object
10 10 32561 non-null int32
11 11 32561 non-null int16
12 12 32561 non-null int8
13 13 32561 non-null object
14 14 32561 non-null object
dtypes: int16(1), int32(2), int8(3), object(9)
memory usage: 2.6+ MB

The reduce_mem_usage() function does not actually need a return value captured in a new variable: assigning to df[col] mutates the DataFrame that was passed in, so simply calling reduce_mem_usage() on the DataFrame to be compressed is enough.

Importing Only What Is Needed

Specifying Rows

Read only the first 10000 rows. Sample code:

train = pd.read_csv('../input/train.csv', nrows=10000)

Skip a specified range of rows, for example the rows at index 1 through 2999999 (the row at index 0, most likely the header, is kept). Sample code:

train = pd.read_csv('../input/train.csv', skiprows=range(1, 3000000), nrows=10000)
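Along the same lines, pandas can also stream a file in chunks so that only one chunk is in memory at a time. An additional technique, not in the original post:

# count the rows without ever holding the full file in memory
total_rows = 0
for chunk in pd.read_csv('../input/train.csv', chunksize=100000):
    total_rows += len(chunk)
print(total_rows)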

Specifying Columns

columns = ['Id', 'groupId', 'matchId', 'killPlace', 'killPoints', 'kills', 'killStreaks', 'longestKill', 'winPlacePerc']

dtypes = {
    'Id': 'uint32',
    'groupId': 'uint32',
    'matchId': 'uint16',
    'killPlace': 'uint8',
    'killPoints': 'uint8',
    'kills': 'uint8',
    'killStreaks': 'uint8',
    'longestKill': 'float16',
    'winPlacePerc': 'float16'
}
# read only the needed columns, with the pre-defined dtypes
# (the original called read_csv with usecols only, leaving dtypes unused)
train = pd.read_csv('../input/train.csv', usecols=columns, dtype=dtypes)