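Machine learning is a composite discipline, and data is one of its most important ingredients; as the saying goes, "without data, even the best algorithm is useless." The data we actually get, however, is rarely in ideal shape. We therefore begin with the methodology of data preprocessing: feature engineering.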

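Feature engineering can be divided into three parts:

Feature extraction
Feature preprocessing
Dimensionality reduction

We start with feature extraction.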

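Introduction to scikit-learn

scikit-learn is a Python-based machine learning package.

Built on NumPy, SciPy, and matplotlib
Open source and commercially usable (BSD license)

This package runs through almost every post in this series of notes.

For the scikit-learn user guide and API reference, see these two addresses: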
User Guide: https://scikit-learn.org/stable/user_guide.html
API: https://scikit-learn.org/stable/modules/classes.html

As its official website shows, scikit-learn covers six major areas:

Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing (feature engineering)

Note in particular: the package is imported with import sklearn, not import scikit-learn.

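Feature extraction

In practice, our data is not always conveniently numeric. It comes in many forms: it may contain Chinese text, it may be a whole passage of prose, or it may be images. Feature extraction is the process of turning such data into purely numeric form.
Here we introduce two common kinds of feature extraction:

Feature extraction from dictionaries
Feature extraction from text

Feature extraction from dictionaries

Dictionary data

Suppose we have the following data: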

city	temperature	weather
上海	22	雨
南昌	24	阴

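In Python, each row of this table can be represented as a dictionary, and the whole table as a list of dictionaries, usually written as

[{'city': '上海', 'temperature': 22,'weather':'雨'}, {'city': '南昌', 'temperature': 24,'weather':'阴'}]

This data structure goes by different names in different programming languages: it is a dictionary in Python and C#, and a Map in Java.
We discuss it in a dedicated article, 《算法入门经典(Java与Python描述)：7.哈希表》, but it is not the focus here, so we will not go into it.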

DictVectorizer

from sklearn.feature_extraction import DictVectorizer

The DictVectorizer class has many methods. We cover only two here:

fit_transform
get_feature_names

Along the way, we also touch on two concepts:

Sparse matrices
One-Hot encoding

Example code:

from sklearn.feature_extraction import DictVectorizer
# Named dv rather than dict, so the built-in dict type is not shadowed
dv = DictVectorizer()
data = [{'city': '上海', 'temperature': 22,'weather':'雨'}, {'city': '南昌', 'temperature': 24,'weather':'阴'}]
print(dv.fit_transform(data))
print(dv.get_feature_names())

Output:

(0, 0)	1.0
(0, 2)	22.0
(0, 4)	1.0
(1, 1)	1.0
(1, 2)	24.0
(1, 3)	1.0
['city=上海', 'city=南昌', 'temperature', 'weather=阴', 'weather=雨']

This matrix looks different from the matrices we usually see. It is called a sparse matrix.
If we change

dv = DictVectorizer()

in the code above to

dv = DictVectorizer(sparse=False)

the output becomes:

[[ 1.  0. 22.  0.  1.]
 [ 0.  1. 24.  1.  0.]]
['city=上海', 'city=南昌', 'temperature', 'weather=阴', 'weather=雨']

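The matrix above is a NumPy ndarray. This format looks much more like the matrices we are used to.
Comparing the two makes the sparse matrix easy to read:

(0, 0) 1.0 means the value at row 0, column 0 is 1.0
(0, 2) 22.0 means the value at row 0, column 2 is 22.0
The other entries are read the same way.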
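
To make the coordinate notation concrete, here is a minimal pure-Python sketch that expands (row, column, value) triples into a dense list of lists. The triples are copied from the sparse output above; the helper name to_dense is an illustrative assumption and is not part of scikit-learn or SciPy.

```python
# (row, column, value) triples copied from the sparse output above
triples = [
    (0, 0, 1.0), (0, 2, 22.0), (0, 4, 1.0),
    (1, 1, 1.0), (1, 2, 24.0), (1, 3, 1.0),
]


def to_dense(triples, n_rows, n_cols):
    """Expand coordinate triples into a dense matrix; unlisted cells are 0.0."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for row, col, value in triples:
        dense[row][col] = value
    return dense


dense = to_dense(triples, n_rows=2, n_cols=5)
for row in dense:
    print(row)
```

Only 6 of the 10 cells need to be stored as triples here, and the saving grows as the share of zeros grows.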

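The advantages of a sparse matrix are:

It saves memory
It is convenient to read

The extracted result is also easy to interpret.

The first column is city=上海: the value is 1 (True) in the first row and 0 (False) in the second row.
The second column is city=南昌: the value is 0 (False) in the first row and 1 (True) in the second row.
The third column is temperature: the value is 22.0 in the first row and 24.0 in the second row.
The remaining columns are read the same way.

Here 0 stands for False and 1 for True, marking whether a row is city=上海 or city=南昌. This encoding is called One-Hot encoding.

In machine learning, One-Hot encoding is commonly used to preprocess categorical data, and it is widely used in classification problems.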
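
The idea behind One-Hot encoding can be sketched without scikit-learn. The function below is a hypothetical helper (one_hot_records is not a library API) that mimics the behavior seen above under two assumptions: string values become key=value indicator columns, and numeric values are passed through unchanged. It is a sketch of the idea, not a description of how DictVectorizer is implemented.

```python
def one_hot_records(records):
    """Hypothetical helper: one-hot encode string values, pass numbers through."""

    def feature_name(key, value):
        # String values get their own "key=value" column; numbers keep the key.
        return f"{key}={value}" if isinstance(value, str) else key

    names = sorted({feature_name(k, v) for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            column = index[feature_name(key, value)]
            row[column] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return names, rows


data = [
    {'city': '上海', 'temperature': 22, 'weather': '雨'},
    {'city': '南昌', 'temperature': 24, 'weather': '阴'},
]
names, rows = one_hot_records(data)
print(names)
for row in rows:
    print(row)
```

Each categorical key contributes exactly one 1.0 per row, which is where the name One-Hot comes from.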

Feature extraction from text

Text data

As the name suggests, data like the following is called text data.

前尘往事成云烟，消散在彼此眼前。
Hiding from the rain and snow,Trying to forget but I won't let go.
只是因为在人群中多看了你一眼，再也没能忘掉你容颜。
In that misty morning when I saw your smiling face,You only looked at me and I was yours.

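For text data, we introduce two feature extraction methods:

Count
tf-idf

Count

Feature extraction from English text

CountVectorizer

First difference from dictionary feature extraction
Dictionary feature extraction uses DictVectorizer, but text feature extraction does not use a TextVectorizer as one might guess. Instead, it uses the text submodule.

The text submodule contains several classes, for example CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

As before, we use only two methods of CountVectorizer:

fit_transform
get_feature_names

Example code: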

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
data = ["Hiding from the rain and snow,Trying to forget but I won't let go.","In that misty morning when I saw your smiling face,You only looked at me and I was yours."]
print(cv.fit_transform(data))
print(cv.get_feature_names())

Output:

  (0, 7)	1
  (0, 5)	1
  (0, 20)	1
  (0, 15)	1
  (0, 0)	1
  (0, 18)	1
  (0, 22)	1
  (0, 21)	1
  (0, 4)	1
  (0, 2)	1
  (0, 25)	1
  (0, 9)	1
  (0, 6)	1
  (1, 0)	1
  (1, 8)	1
  (1, 19)	1
  (1, 12)	1
  (1, 13)	1
  (1, 24)	1
  (1, 16)	1
  (1, 27)	1
  (1, 17)	1
  (1, 3)	1
  (1, 26)	1
  (1, 14)	1
  (1, 10)	1
  (1, 1)	1
  (1, 11)	1
  (1, 23)	1
  (1, 28)	1
['and', 'at', 'but', 'face', 'forget', 'from', 'go', 'hiding', 'in', 'let', 'looked', 'me', 'misty', 'morning', 'only', 'rain', 'saw', 'smiling', 'snow', 'that', 'the', 'to', 'trying', 'was', 'when', 'won', 'you', 'your', 'yours']

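In newer versions of sklearn, get_feature_names() has been replaced by get_feature_names_out().

toarray()

The result above is again a sparse matrix. Let us try the same change as before, replacing

cv = CountVectorizer()

with

cv = CountVectorizer(sparse=False)

This produces the following error:

TypeError: __init__() got an unexpected keyword argument 'sparse'

Second difference from dictionary feature extraction
CountVectorizer has no sparse parameter.

However, scikit-learn is built on NumPy, SciPy, and matplotlib, and the sparse matrix returned by fit_transform is a SciPy object that can be converted to a dense array with toarray().
So the fix is to change

print(cv.fit_transform(data))

to

print(cv.fit_transform(data).toarray())

Output: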

[[1 0 1 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0]
 [1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1]]
['and', 'at', 'but', 'face', 'forget', 'from', 'go', 'hiding', 'in', 'let', 'looked', 'me', 'misty', 'morning', 'only', 'rain', 'saw', 'smiling', 'snow', 'that', 'the', 'to', 'trying', 'was', 'when', 'won', 'you', 'your', 'yours']

Assembling a DataFrame

The result can be assembled into a DataFrame. Example code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data = ["Hiding from the rain and snow,Trying to forget but I won't let go.",
        "In that misty morning when I saw your smiling face,You only looked at me and I was yours."]

data = cv.fit_transform(data)

df = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
print(df)

Output:

   and  at  but  face  forget  from  ...  was  when  won  you  your  yours
0    1   0    1     0       1     1  ...    0     0    1    0     0      0
1    1   1    0     1       0     0  ...    1     1    0    1     1      1

[2 rows x 29 columns]

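Notes

A few points about this method deserve attention:

Single-letter words are not counted. The default token_pattern of CountVectorizer only matches tokens of two or more characters. This is probably because text feature extraction is mainly used for text classification and sentiment analysis, and single-letter words generally do not reflect the category or sentiment of a text.
The 0 and 1 values above are not False and True; they are counts.

For example, if we change

I won't let go.

to

I won't let go go go,ale ale ale.

the output becomes: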

[[3 1 0 1 0 1 1 3 1 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0]
 [0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1]]
['ale', 'and', 'at', 'but', 'face', 'forget', 'from', 'go', 'hiding', 'in', 'let', 'looked', 'me', 'misty', 'morning', 'only', 'rain', 'saw', 'smiling', 'snow', 'that', 'the', 'to', 'trying', 'was', 'when', 'won', 'you', 'your', 'yours']
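
Both notes can be reproduced with the standard library alone. The sketch below assumes the regular expression \b\w\w+\b (the default token_pattern of CountVectorizer, minus its inline (?u) flag, which Python 3 applies to str patterns anyway) and lowercases the text first; the names tokenize, vocabulary, and matrix are illustrative, not scikit-learn APIs.

```python
import re
from collections import Counter

# Tokens must have at least two word characters, so "I" and the "t" of "won't" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


docs = [
    "Hiding from the rain and snow,Trying to forget but I won't let go go go,ale ale ale.",
    "In that misty morning when I saw your smiling face,You only looked at me and I was yours.",
]

counts = [Counter(tokenize(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))
# A Counter returns 0 for missing keys, so absent words become 0 in the row.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The entries for ale and go in the first row are 3, confirming that the values are counts rather than booleans.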

Feature extraction from Chinese text

Chinese without word segmentation

We follow the same approach as for English text.
Example code:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
data = ["前尘往事成云烟，消散在彼此眼前。","只是因为在人群中多看了你一眼，再也没能忘掉你容颜。"]
print(cv.fit_transform(data).toarray())
print(cv.get_feature_names())

Output:

[[0 1 0 1]
 [1 0 1 0]]
['再也没能忘掉你容颜', '前尘往事成云烟', '只是因为在人群中多看了你一眼', '消散在彼此眼前']

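This result is clearly not what we want. The reason is that sklearn does not segment Chinese text into words: written Chinese has no spaces between words, so each run of characters between punctuation marks is treated as a single token.

Manual word segmentation

We change

data = ["前尘往事成云烟，消散在彼此眼前。","只是因为在人群中多看了你一眼，再也没能忘掉你容颜。"]

to

data = ["前尘 往事 成 云烟，消散 在 彼此 眼前。","只是 因为 在 人群中 多看了 你 一眼，再 也 没 能 忘掉 你 容颜。"]

Adding spaces segments the Chinese text into words.
Output: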

[[0 1 0 1 0 0 0 0 1 1 0 1 1]
 [1 0 1 0 1 1 1 1 0 0 1 0 0]]
['一眼', '云烟', '人群中', '前尘', '只是', '因为', '多看了', '容颜', '彼此', '往事', '忘掉', '消散', '眼前']
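
Why spaces matter can be seen with the tokenizing regular expression alone. This is a minimal sketch assuming the pattern \b\w\w+\b (the default token_pattern of CountVectorizer); the hand-spaced string is one possible segmentation chosen for illustration.

```python
import re

# Chinese characters count as word characters (\w), while "，" and "。" do not.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

unsegmented = "前尘往事成云烟，消散在彼此眼前。"
segmented = "前尘 往事 成 云烟，消散 在 彼此 眼前。"  # spaces added by hand

tokens_unsegmented = TOKEN_PATTERN.findall(unsegmented)
tokens_segmented = TOKEN_PATTERN.findall(segmented)

print(tokens_unsegmented)  # whole clauses become single tokens
print(tokens_segmented)    # single-character words such as 成 and 在 are dropped
```

Without spaces, only punctuation separates tokens, so each clause is kept whole; with spaces, every word of two or more characters becomes its own feature.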

jieba

In practice, however, we cannot segment long articles by hand. Here we introduce a tool for this: jieba.
Example code:

from sklearn.feature_extraction.text import CountVectorizer
import jieba
cv = CountVectorizer()
con1 = jieba.cut("前尘往事成云烟，消散在彼此眼前。")
con2 = jieba.cut("只是因为在人群中多看了你一眼，再也没能忘掉你容颜。")
# Convert the generators to lists
content1 = list(con1)
content2 = list(con2)
# Join each list into a space-separated string
c1 = ' '.join(content1)
c2 = ' '.join(content2)
data = [c1,c2]
print(cv.fit_transform(data).toarray())
print(cv.get_feature_names())

Output:

[[0 0 1 0 0 1 0 0 0 1 0 1 1]
 [1 1 0 1 1 0 1 1 1 0 1 0 0]]
['一眼', '中多', '云烟', '人群', '再也', '前尘往事', '只是', '因为', '容颜', '彼此', '忘掉', '消散', '眼前']

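We will not go further into jieba here. For more on its usage, see 《用Python分析数据的方法和技巧：4.jieba和wordcloud》.

tf-idf

The weakness of Count

With the methods above we have extracted features from text. Once the features are extracted, we can classify the text.
For example:
Suppose there are two articles, and feature extraction gives us the frequency of each word in each of them.

From the relative weight of these words, we could easily classify the two articles.
But this is an idealized situation.
Real articles contain large numbers of words such as 因为 (because), 所以 (so), and 但是 (but), along with many modal particles and auxiliary words (for example 的). These words, present in large numbers, interfere with classifying articles and with comparing the similarity of two articles.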
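
A small illustration of this interference, using only the standard library. The two sentences are made up for this sketch and stand in for articles on different topics; the tokenizing pattern \b\w\w+\b is the same assumption used for Count above.

```python
import re
from collections import Counter

# Two made-up sentences on different topics, used only for illustration.
docs = [
    "the cat and the dog and the bird",
    "the stock and the bond and the fund",
]

top_words = []
for doc in docs:
    counts = Counter(re.findall(r"\b\w\w+\b", doc.lower()))
    # The two most frequent words and their counts in this document
    top_words.append(counts.most_common(2))

for entry in top_words:
    print(entry)
```

In both sentences the highest counts belong to "the" and "and", which say nothing about the topic, while the words that distinguish the topics each appear only once.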

The tf-idf formula

To deal with this, we extract features with tf-idf instead of Count.

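$\text{tf-idf}(t,d)=\text{tf}(t,d) \times \text{idf}(t)$

$\text{tf}(t,d)$ is the number of times term $t$ appears in document $d$.
$\text{idf}(t)$ is defined as:

$\text{idf}(t) = \ln \frac{1+n_d}{1+\text{df}(d,t)} + 1$

$n_d$: the total number of documents
$\text{df}(d,t)$: the number of documents that contain the term $t$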
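
The formula can be evaluated by hand with a few lines of Python. This is a minimal sketch, assuming the natural logarithm, whitespace tokenization, and a two-document toy corpus; the function names idf and tf_idf are illustrative.

```python
import math

# Toy corpus: "ee" appears only in the first document, the other terms in both.
docs = ["aa bb cc dd ee aa aa aa", "aa bb cc dd"]
tokenized = [doc.split() for doc in docs]
n_d = len(tokenized)  # total number of documents


def idf(term):
    """idf(t) = ln((1 + n_d) / (1 + df)) + 1, with df = documents containing t."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_d) / (1 + df)) + 1


def tf_idf(term, tokens):
    """Raw (unnormalized) tf-idf: term count in the document times idf."""
    return tokens.count(term) * idf(term)


print(tf_idf("aa", tokenized[0]))            # 4 * (ln(3/3) + 1)
print(round(tf_idf("ee", tokenized[0]), 6))  # 1 * (ln(3/2) + 1)
```

A term that appears in every document gets idf 1, so its weight is just its count; a term found in fewer documents gets a larger idf.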

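At this point there is still a problem: if one article is particularly long, its term counts, and therefore its tf-idf values, tend to be larger simply because of its length.

So the final step is to normalize each row.

$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$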
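
The effect of this step can be checked with a short pure-Python sketch. The two vectors are made-up rows standing for a short and a long document with the same word proportions; l2_normalize is an illustrative helper, not a library function.

```python
import math


def l2_normalize(vector):
    """Divide each component by the Euclidean (L2) length of the vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # an all-zero row is left unchanged
    return [x / norm for x in vector]


short_doc = [1.0, 1.0, 0.0]    # hypothetical weights for a short document
long_doc = [10.0, 10.0, 0.0]   # same proportions, ten times the length

normalized_short = l2_normalize(short_doc)
normalized_long = l2_normalize(long_doc)
print(normalized_short)
print(normalized_long)
```

After normalization both rows have length 1 and point in the same direction, so the length of the document no longer affects the comparison.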

TfidfVectorizer

The code for TfidfVectorizer is almost identical to that for CountVectorizer; just replace CountVectorizer with TfidfVectorizer.

We can verify the formula above. Example code:

import math

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(norm=None)
data = ["aa bb cc dd ee aa aa aa", "aa bb cc dd"]

df = pd.DataFrame(data=tf.fit_transform(data).toarray(), columns=tf.get_feature_names_out())

print(df)

# aa in document 0: tf = 4, n_d = 2, df = 2
print(4 * (math.log((1 + 2) / (1 + 2)) + 1))

# ee in document 0: tf = 1, n_d = 2, df = 1
print(round(1 * (math.log((1 + 2) / (1 + 1)) + 1), 6))

Output:

    aa   bb   cc   dd        ee
0  4.0  1.0  1.0  1.0  1.405465
1  1.0  1.0  1.0  1.0  0.000000
4.0
1.405465

Notes:

norm=None disables normalization; by default, L2 normalization is applied.

Example code:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
con1 = jieba.cut("前尘往事成云烟，消散在彼此眼前。")
con2 = jieba.cut("只是因为在人群中多看了你一眼，再也没能忘掉你容颜。")
# Convert the generators to lists
content1 = list(con1)
content2 = list(con2)
# Join each list into a space-separated string
c1 = ' '.join(content1)
c2 = ' '.join(content2)
data = [c1,c2]
print(tf.fit_transform(data).toarray())
print(tf.get_feature_names())

Output:

[[0.         0.         0.4472136  0.         0.         0.4472136
  0.         0.         0.         0.4472136  0.         0.4472136
  0.4472136 ]
 [0.35355339 0.35355339 0.         0.35355339 0.35355339 0.
  0.35355339 0.35355339 0.35355339 0.         0.35355339 0.
  0.        ]]
['一眼', '中多', '云烟', '人群', '再也', '前尘往事', '只是', '因为', '容颜', '彼此', '忘掉', '消散', '眼前']

In practice, when doing natural language processing, we use neither Count nor tf-idf. There are better methods, which are discussed in 《深度学习初步及其Python实现：8.循环神经网络》.