Pandas库的应用

Pandas库

引子

Numpy 在向量化的数值计算中表现优异

但是在处理更灵活、复杂的数据任务:

如为数据添加标签、处理缺失值、分组和透视表等方面

Numpy显得力不从心

而基于Numpy构建的Pandas库,提供了使得数据分析变得更快更简单的高级数据结构和操作工具

12.1 对象创建

12.1.1 Pandas Series对象

Series 是带标签数据的一维数组

Series对象的创建

通用结构: pd.Series(data, index=index, dtype=dtype)

data:数据,可以是列表,字典或Numpy数组

index:索引,为可选参数

dtype: 数据类型,为可选参数

1、用列表创建

  • index缺省,默认为整数序列
import pandas as pd

data = pd.Series([1.5, 3, 4.5, 6])
data
0    1.5
1    3.0
2    4.5
3    6.0
dtype: float64
  • 增加index
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a    1.5
b    3.0
c    4.5
d    6.0
dtype: float64
  • 增加数据类型 缺省则从传入的数据自动判断
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])    
data
a    1
b    2
c    3
d    4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64

注意:数据支持多种类型

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a    1
b    2
c    3
d    4
dtype: object
data["a"]
1
data["c"]
'3'

数据类型可被强制改变

data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
data["c"]
3.0
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data

2、用一维numpy数组创建

import numpy as np

x = np.arange(5)
pd.Series(x)
0    0
1    1
2    2
3    3
4    4
dtype: int32

3、用字典创建

  • 默认以键为index 值为data
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict)    
population
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
dtype: int64
  • 字典创建,如果指定index,则会到字典的键中筛选,找不到的,值设为NaN
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])    
population
BeiJing     2154.0
HangZhou     981.0
c              NaN
d              NaN
dtype: float64

4、data为标量的情况

pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64

12.1.2 Pandas DataFrame对象

DataFrame 是带标签数据的多维数组

DataFrame对象的创建

通用结构: pd.DataFrame(data, index=index, columns=columns)

data:数据,可以是列表,字典或Numpy数组

index:索引,为可选参数

columns: 列标签,为可选参数

1、通过Series对象创建

population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }

population = pd.Series(population_dict)    
pd.DataFrame(population)
0
BeiJing2154
ShangHai2424
ShenZhen1303
HangZhou981
pd.DataFrame(population, columns=["population"])
population
BeiJing2154
ShangHai2424
ShenZhen1303
HangZhou981

2、通过Series对象字典创建

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }

GDP = pd.Series(GDP_dict)
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
pd.DataFrame({"population": population,
              "GDP": GDP})
populationGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

注意:数量不够的会自动补齐

pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})
populationGDPcountry
BeiJing215430320China
ShangHai242432680China
ShenZhen130324222China
HangZhou98113468China

3、通过字典列表对象创建

  • 字典索引作为index,字典键作为columns
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
ab
000
112
224
data1 = data["a"].copy()
data1
0    0
1    1
2    2
Name: a, dtype: int64
data1[0] = 10
data1
0    10
1     1
2     2
Name: a, dtype: int64
data
ab
000
112
224
  • 不存在的键,会默认值为NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)
abc
01.01NaN
1NaN34.0

4、通过Numpy二维数组创建

data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
       [2, 9],
       [4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
foobar
a16
b29
c40

12.2 DataFrame性质

1、属性

data = pd.DataFrame({"pop": population, "GDP": GDP})
data
popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

(1)df.values 返回numpy数组表示的数据

data.values
array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

(2)df.index 返回行索引

data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

(3)df.columns 返回列索引

data.columns
Index(['pop', 'GDP'], dtype='object')

(4)df.shape 形状

data.shape
(4, 2)

(5) pd.size 大小

data.size
8

(6)pd.dtypes 返回每列数据类型

data.dtypes
pop    int64
GDP    int64
dtype: object

2、索引

data
popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468

(1)获取列

  • 字典式
data["pop"]
BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64
data[["GDP", "pop"]]
GDPpop
BeiJing303202154
ShangHai326802424
ShenZhen242221303
HangZhou13468981
  • 对象属性式
data.GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

(2)获取行

  • 绝对索引 df.loc
data.loc["BeiJing"]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]
popGDP
BeiJing215430320
HangZhou98113468
  • 相对索引 df.iloc
data
popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468
data.iloc[0]
pop     2154
GDP    30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]
popGDP
ShangHai242432680
HangZhou98113468

(3)获取标量

data
popGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468
data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320

(4)Series对象的索引

type(data.GDP)
pandas.core.series.Series
GDP
BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64
GDP["BeiJing"]
30320

3、切片

dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342

(1)行切片

df["2019-01-01": "2019-01-03"]
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
df.loc["2019-01-01": "2019-01-03"]
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
df.iloc[0: 3]
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726

(2)列切片

df
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342
df.loc[:, "A": "C"]
ABC
2019-01-01-0.935378-0.1907420.925984
2019-01-02-0.234414-1.1946741.080779
2019-01-03-0.1415720.0581181.102248
2019-01-040.3050880.535920-0.978434
2019-01-050.3133830.2340410.163155
2019-01-060.250613-0.904400-0.858240
df.iloc[:, 0: 3]
ABC
2019-01-01-0.935378-0.1907420.925984
2019-01-02-0.234414-1.1946741.080779
2019-01-03-0.1415720.0581181.102248
2019-01-040.3050880.535920-0.978434
2019-01-050.3133830.2340410.163155
2019-01-060.250613-0.904400-0.858240

(3)多种多样的取值

df
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342
  • 行、列同时切片
df.loc["2019-01-02": "2019-01-03", "C":"D"]
CD
2019-01-021.080779-2.294395
2019-01-031.1022481.207726
df.iloc[1: 3, 2:]
CD
2019-01-021.080779-2.294395
2019-01-031.1022481.207726
  • 行切片,列分散取值
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
AC
2019-01-040.305088-0.978434
2019-01-050.3133830.163155
2019-01-060.250613-0.858240
df.iloc[3:, [0, 2]]
AC
2019-01-040.305088-0.978434
2019-01-050.3133830.163155
2019-01-060.250613-0.858240
  • 行分散取值,列切片
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
df.iloc[[1, 5], 0: 3]
ABC
2019-01-02-0.234414-1.1946741.080779
2019-01-060.250613-0.904400-0.858240
  • 行、列均分散取值
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
df.iloc[[1, 5], [0, 3]]
AD
2019-01-02-0.234414-2.294395
2019-01-060.250613-1.573342

4、布尔索引

df
ABCD
2019-01-01-0.935378-0.1907420.925984-0.818969
2019-01-02-0.234414-1.1946741.080779-2.294395
2019-01-03-0.1415720.0581181.1022481.207726
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342
df > 0
ABCD
2019-01-01FalseFalseTrueFalse
2019-01-02FalseFalseTrueFalse
2019-01-03FalseTrueTrueTrue
2019-01-04TrueTrueFalseTrue
2019-01-05TrueTrueTrueFalse
2019-01-06TrueFalseFalseFalse
df[df > 0]
ABCD
2019-01-01NaNNaN0.925984NaN
2019-01-02NaNNaN1.080779NaN
2019-01-03NaN0.0581181.1022481.207726
2019-01-040.3050880.535920NaN0.177251
2019-01-050.3133830.2340410.163155NaN
2019-01-060.250613NaNNaNNaN
df.A > 0
2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06     True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
ABCD
2019-01-040.3050880.535920-0.9784340.177251
2019-01-050.3133830.2340410.163155-0.296649
2019-01-060.250613-0.904400-0.858240-1.573342
  • isin()方法
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
ABCDE
2019-01-01-0.935378-0.1907420.925984-0.818969one
2019-01-02-0.234414-1.1946741.080779-2.294395one
2019-01-03-0.1415720.0581181.1022481.207726two
2019-01-040.3050880.535920-0.9784340.177251three
2019-01-050.3133830.2340410.163155-0.296649four
2019-01-060.250613-0.904400-0.858240-1.573342three
ind = df2["E"].isin(["two", "four"])
ind     
2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool
df2[ind]
ABCDE
2019-01-03-0.1415720.0581181.1022481.207726two
2019-01-050.3133830.2340410.163155-0.296649four

(5)赋值

df
  • DataFrame 增加新列
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64
df["E"] = s1
df
ABCDE
2019-01-01-0.935378-0.1907420.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426
  • 修改赋值
df.loc["2019-01-01", "A"] = 0
df
ABCDE
2019-01-010.000000-0.1907420.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426
df.iloc[0, 1] = 0
df
ABCDE
2019-01-010.0000000.0000000.925984-0.8189691
2019-01-02-0.234414-1.1946741.080779-2.2943952
2019-01-03-0.1415720.0581181.1022481.2077263
2019-01-040.3050880.535920-0.9784340.1772514
2019-01-050.3133830.2340410.163155-0.2966495
2019-01-060.250613-0.904400-0.858240-1.5733426
df["D"] = np.array([5]*len(df))   # 可简化成df["D"] = 5
df
ABCDE
2019-01-010.0000000.0000000.92598451
2019-01-02-0.234414-1.1946741.08077952
2019-01-03-0.1415720.0581181.10224853
2019-01-040.3050880.535920-0.97843454
2019-01-050.3133830.2340410.16315555
2019-01-060.250613-0.904400-0.85824056
  • 修改index和columns
df.index = [i for i in range(len(df))]
df
ABCDE
00.0000000.0000000.92598451
1-0.234414-1.1946741.08077952
2-0.1415720.0581181.10224853
30.3050880.535920-0.97843454
40.3133830.2340410.16315555
50.250613-0.904400-0.85824056
df.columns = [i for i in range(df.shape[1])]
df
01234
00.0000000.0000000.92598451
1-0.234414-1.1946741.08077952
2-0.1415720.0581181.10224853
30.3050880.535920-0.97843454
40.3133830.2340410.16315555
50.250613-0.904400-0.85824056

12.3 数值运算及统计分析

1、数据的查看

import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

(1)查看前面的行

df.head()    # 默认5行
ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
df.head(2)
ABCD
2019-01-01-0.8540430.412345-2.296051-0.048964
2019-01-021.371364-0.121454-0.2996531.095375

(2)查看后面的行

df.tail()    # 默认5行
ABCD
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402
df.tail(3) 
ABCD
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402

(3)查看总体信息

df.iloc[0, 3] = np.nan
df
ABCD
2019-01-01-0.8540430.412345-2.296051NaN
2019-01-021.371364-0.121454-0.2996531.095375
2019-01-03-0.714591-1.1032240.9792500.319455
2019-01-04-1.3975570.4260080.233861-1.651887
2019-01-050.4340260.459830-0.0954441.220302
2019-01-06-0.1338760.074500-1.0281470.605402
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

2、Numpy通用函数同样适用于Pandas

(1)向量化运算

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
0123
00123
x+5
0123
05678
np.exp(x)
0123
01.02.7182827.38905620.085537
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
0123
04567
x*y
0123
0051221

(2)矩阵化运算

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
0123456789...20212223242526272829
06374692674...4095809263
18242648613...2031731559
23519193768...6870772072
32049698687...0242049668
49926033466...9686008838
52657840297...2040700115
66400214956...5085233292
72363807617...3010446882
82237570730...1152830304
93776200256...4232004528
104704203460...5619190708
115696921879...6528959950
123955407446...0729694946
138409901587...5840349946
143046995431...6103712002
154200791212...6394173848
163948720231...8003852038
172863294428...6942618990
185679819144...3525699262
191937860280...4322381800
204552689757...3508043251
212481971467...0182046504
224524644499...1769915521
230548064412...8507692043
249709037415...3782219224
254195450489...9307023759
266719726261...0659803839
272813517702...8045455637
286862274375...1792459532
293030095432...1304808756

30 rows × 30 columns

  • 转置
z = x.T
z
0123456789...20212223242526272829
06832926223...4240946263
13250964327...5455717880
27414250637...5824091163
34299670376...2148959320
46616082852...6960047520
59499341070...8746352179
62838304700...9144706745
76676429672...7444442734
87168695135...5691186073
94387676706...7792591252
103871353075...4026419910
117140676835...7505105835
127916287855...9041292431
132846530172...3185882557
145977105635...9001698359
154494907927...7421686904
161182894281...9931584176
177387433624...1802759759
185605561980...4501376521
191782315810...8073708487
204260925314...3018390811
210082600012...5175736073
229374848153...0860805490
235102605022...8297279524
248770072480...0096208448
250374003430...4419120550
269129803604...3652933598
272506812835...2550278657
286576319802...5024253335
293928852248...1413499726

30 rows × 30 columns

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
0123456789...20212223242526272829
05895001769...1706997691
10188398736...9204927798
26937745936...7711308645
36257844774...0198231272
46092662770...1540789570
59391446889...1870342035
61243060728...4336735324
74033835675...1731669696
80029606703...6795495256
96877726052...7062436763
100647629599...4939125408
112399448216...0598660473
120160616425...8807207119
135159649875...2432004250
140385314732...8557591393
153361305052...7177383063
160659646622...3686513263
176728018600...5625430621
189440987761...7997114656
194115126233...0009859340
209863990816...2901394888
212864905561...6756874240
220359036511...6253939519
237708612044...1960283725
246042310570...1127529473
255021494693...5535927416
269818162618...2512533618
271864695472...9315117126
280774327852...0283739238
298026836497...6785725345

30 rows × 30 columns

x.dot(y)
0123456789...20212223242526272829
0616560723739612457681799575590...523739613580668602733585657700
1520438691600612455666764707592...555681503679641506779494633590
2557570786807690469804828704573...563675712758793672754550756638
3605507664701660496698806651575...582685668586629534678484591626
4599681753873721563754770620654...633747661677726649716610735706
5422354602627613396617627489423...456572559537499384589436574507
6359446599599481357577572451464...449550495532633554663476565602
7531520698590607537665696571472...576588551665652527742528650599
8449322547533593399584638587424...402596523523447362561386529484
9373433525601522345551521434447...508498438478459418488407503496
10500427574607667477652656615477...622702531610558532598471582561
11664694772841779574730810711608...591760616638721676846678754708
12545547687701721576689724710532...674684648694710564757571671656
13574586723750691494696787667523...618681568682715644756557690604
14502382645557570403538677500501...369650507576546531554437616463
15510505736651649510719733694557...605717574642678576755455598654
16567376614612643514598724547464...456639520560569442596517659532
17626716828765740603809852692591...664716655721742612819593744712
18600559667664641556624815638564...581701559677710554748597614657
19445431661681641552690719602474...515637576620572512599455622538
20523569784725713501740772638640...589775664686726672747548723645
21487465553639517449592609454398...492567534404554417561466498492
22479449574686583377566614563455...453539491501596520722478565501
23483386476526550426492585536482...322541438456487408502426474481
24523551658767537444663731576577...522590525664691548635526641538
25652656738753853508752815669576...694833693606575616704559728672
26578577744856699497779800733587...630754704834760680765592731629
27554494665689630574695703636599...554685532658649554693577634668
28498552659784552492690775544551...567636518599742521733533605604
29513491563642477367589647516484...428574504548553483540407547455

30 rows × 30 columns

%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • 执行相同运算,Numpy与Pandas的对比
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res          
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

一般来说,纯粹的计算在Numpy里执行的更快

Numpy更侧重于计算,Pandas更侧重于数据处理

(3)广播运算

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
ABC
0637
1469
2267
  • 按行广播
x.iloc[0]
A    6
B    3
C    7
Name: 0, dtype: int32
x/x.iloc[0]
ABC
01.0000001.01.000000
10.6666672.01.285714
20.3333332.01.000000
  • 按列广播
x.A
0    6
1    4
2    2
Name: A, dtype: int32
x.div(x.A, axis=0)             # add sub div mul
ABC
01.00.51.166667
11.01.52.250000
21.03.03.500000
x.div(x.iloc[0], axis=1)
ABC
01.0000001.01.000000
10.6666672.01.285714
20.3333332.01.000000

3、新的用法

(1)索引对齐

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
AB
037
121
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
ABC
0751
1409
2580
  • pandas会自动对齐两个对象的索引,没有的值用np.nan表示
A+B
ABC
010.012.0NaN
16.01.0NaN
2NaNNaNNaN
  • 缺省值也可用fill_value来填充
A.add(B, fill_value=0)
ABC
010.012.01.0
16.01.09.0
25.08.00.0
A*B
ABC
021.035.0NaN
18.00.0NaN
2NaNNaNNaN

(2)统计相关

  • 数据种类统计
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
A
02
12
22
31
42
51
61
72
81
92
102
110
122
130
142
152
160
170
182
191
np.unique(y1)
array([0, 1, 2])
y1["A"].value_counts()
2    11
1     5
0     4
Name: A, dtype: int64
  • 产生新的结果,并进行排序
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
populationGDP
BeiJing215430320
ShangHai242432680
ShenZhen130324222
HangZhou98113468
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
populationGDPper_GDP
BeiJing21543032014.076137
ShangHai24243268013.481848
ShenZhen13032422218.589409
HangZhou9811346813.728848

递增排序

city_info.sort_values(by="per_GDP")
populationGDPper_GDP
ShangHai24243268013.481848
HangZhou9811346813.728848
BeiJing21543032014.076137
ShenZhen13032422218.589409

递减排序

city_info.sort_values(by="per_GDP", ascending=False)
populationGDPper_GDP
ShenZhen13032422218.589409
BeiJing21543032014.076137
HangZhou9811346813.728848
ShangHai24243268013.481848

按轴进行排序

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
CBAD
2313178
1119146
0117142

行排序

data.sort_index()
CBAD
0117142
1119146
2313178

列排序

data.sort_index(axis=1)
ABCD
2171338
1141916
0147112
data.sort_index(axis=1, ascending=False)
DCBA
2831317
1611914
0211714
  • 统计方法
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588

非空个数

df.count()
A    6
B    6
C    6
D    6
dtype: int64

求和

df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum(axis=1)
0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

最大值 最小值

df.min()
A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64
df.max(axis=1)
0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64
df
ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588
df.idxmax()
A    1
B    1
C    4
D    5
dtype: int64

均值

df.mean()
A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

方差

df.var()
A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

标准差

df.std()
A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

中位数

df.median()
A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

众数

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
AB
042
132
220
324
420
541
620
711
834
920
data.mode()
AB
020

75%分位数

df.quantile(0.75)
A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

一网打尽

df.describe()
ABCD
count6.0000006.0000006.0000006.000000
mean2.3787041.5889160.6552314.283124
std5.9144494.3715744.3529472.593603
min-3.386712-4.333177-4.032613-0.152567
25%-1.256706-1.265251-2.9259103.158284
50%1.1267671.5317430.1761725.273518
75%3.5392025.0525944.5397406.157738
max13.1132526.7745595.5773296.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
ABCD
0aacd
1cacb
2aadc
data_2.describe()
ABCD
count3333
unique2123
topaacd
freq2321

相关性系数和协方差

df.corr()
ABCD
A1.0000000.8310630.3310600.510821
B0.8310631.0000000.1792440.719112
C0.3310600.1792441.000000-0.450365
D0.5108210.719112-0.4503651.000000
df.corrwith(df["A"])
A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

自定义输出

apply(method)的用法:使用method方法默认对每一列进行相应的操作

df
ABCD
01.0821983.557396-3.0604766.367969
113.1132526.7745592.8745535.527044
2-2.036341-4.3331775.094802-0.152567
3-3.386712-1.522365-2.5222092.537716
44.3284915.5509945.5773295.019991
51.171336-0.493910-4.0326136.398588
df.apply(np.cumsum)
ABCD
01.0821983.557396-3.0604766.367969
114.19545010.331955-0.18592311.895013
212.1591095.9987784.90887811.742447
38.7723974.4764132.38666914.280162
413.10088810.0274067.96399919.300153
514.2722249.5334973.93138525.698741
df.apply(np.cumsum, axis=1)
ABCD
01.0821984.6395941.5791177.947086
113.11325219.88781122.76236428.289408
2-2.036341-6.369518-1.274717-1.427283
3-3.386712-4.909077-7.431287-4.893571
44.3284919.87948515.45681420.476806
51.1713360.677427-3.3551863.043402
df.apply(sum)
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.sum()
A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64
def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
ABCD
Count6.0000006.0000006.0000006.000000
mean2.3787041.5889160.6552314.283124
max13.1132526.7745595.5773296.398588
idxmin3.0000002.0000005.0000002.000000
std5.9144494.3715744.3529472.593603

12.4 缺失值处理

1、发现缺失值

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data
ABC
01NaN2
1NaN34
256None

注意:有None、字符串等,数据类型全部变为object,它比int和float更消耗资源

data.dtypes
A    object
B    object
C    object
dtype: object
data.isnull()
ABC
0FalseTrueFalse
1TrueFalseFalse
2FalseFalseTrue
data.notnull()
ABC
0TrueFalseTrue
1FalseTrueTrue
2TrueTrueFalse

2、删除缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
ABCD
01.0NaN2.03.0
1NaN4.05.06.0
27.08.0NaN9.0
310.011.012.013.0

注意:np.nan是一种特殊的浮点数

data.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object

(1)删除整行

data.dropna()
ABCD
310.011.012.013.0

(2)删除整列

data.dropna(axis="columns")
D
03.0
16.0
29.0
313.0
data["D"] = np.nan
data
ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN
310.011.012.0NaN
data.dropna(axis="columns", how="all")
ABC
01.0NaN2.0
1NaN4.05.0
27.08.0NaN
310.011.012.0
data.dropna(axis="columns", how="any")
0
1
2
3
data.loc[3] = np.nan
data
ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN
3NaNNaNNaNNaN
data.dropna(how="all")
ABCD
01.0NaN2.0NaN
1NaN4.05.0NaN
27.08.0NaNNaN

3、填充缺失值

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
ABCD
01.0NaN2.03.0
1NaN4.05.06.0
27.08.0NaN9.0
310.011.012.013.0
data.fillna(value=5)
ABCD
01.05.02.03.0
15.04.05.06.0
27.08.05.09.0
310.011.012.013.0
  • 用均值进行替换
fill = data.mean()
fill
A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64
data.fillna(value=fill)
ABCD
01.07.6666672.0000003.0
16.04.0000005.0000006.0
27.08.0000006.3333339.0
310.011.00000012.00000013.0
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
ABCD
01.07.02.03.0
17.04.05.06.0
27.08.07.09.0
310.011.012.013.0

12.5 合并数据

  • 构造一个生产DataFrame的函数
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "一个简单的DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))
ABC
0A0B0C0
1A1B1C1
2A2B2C2
  • 垂直合并
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
pd.concat([df_1, df_2])
AB
1A1B1
2A2B2
3A3B3
4A4B4
  • 水平合并
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
pd.concat([df_3, df_4], axis=1)
ABCD
0A0B0C0D0
1A1B1C1D1
  • 索引重叠

行重叠

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6])
AB
1A1B1
2A2B2
1A1B1
2A2B2
pd.concat([df_5, df_6],ignore_index=True)
AB
0A1B1
1A2B2
2A1B1
3A2B2

列重叠

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2
pd.concat([df_7, df_8], axis=1)
ABCBCD
1A1B1C1B1C1D1
2A2B2C2B2C2D2
pd.concat([df_7, df_8],axis=1, ignore_index=True)
012345
1A1B1C1B1C1D1
2A2B2C2B2C2D2
  • 对齐合并merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2
pd.merge(df_9, df_10)
ABC
0A1B1C1
1A2B2C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1
pd.merge(df_9, df_10)
ABC
0A1B1C1
1A2B2C2

【例】 合并城市信息

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
citypop
0BeiJing2154
1HangZhou981
2ShenZhen1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
cityGDP
0BeiJing30320
1ShangHai32680
2HangZhou13468
city_info = pd.merge(population, GDP)
city_info
citypopGDP
0BeiJing215430320
1HangZhou98113468
city_info = pd.merge(population, GDP, how="outer")
city_info
citypopGDP
0BeiJing2154.030320.0
1HangZhou981.013468.0
2ShenZhen1303.0NaN
3ShangHaiNaN32680.0

12.6 分组和数据透视表

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                  "data1": range(6),
                  "data2": np.random.randint(0, 10, size=6)})
df
keydata1data2
0A01
1B14
2C29
3C39
4B41
5A59

(1)分组

  • 延迟计算
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
df.groupby("key").sum()
data1data2
key
A510
B56
C511
df.groupby("key").mean()
data1data2
key
A2.55.0
B2.53.0
C2.55.5
for i in df.groupby("key"):
    print(str(i))
('A',   key  data1  data2
0   A      0      2
5   A      5      8)
('B',   key  data1  data2
1   B      1      2
4   B      4      4)
('C',   key  data1  data2
2   C      2      8
3   C      3      3)
  • 按列取值
df.groupby("key")["data2"].sum()
key
A    10
B     6
C    11
Name: data2, dtype: int32
  • 按组迭代
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
  • 调用方法
df.groupby("key")["data1"].describe()
countmeanstdmin25%50%75%max
key
A2.02.53.5355340.01.252.53.755.0
B2.02.52.1213201.01.752.53.254.0
C2.02.50.7071072.02.252.52.753.0
  • 支持更复杂的操作
df.groupby("key").aggregate(["min", "median", "max"])
data1data2
minmedianmaxminmedianmax
key
A02.5525.08
B12.5423.04
C22.5335.58
  • 过滤
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A    4.242641
B    1.414214
C    3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
keydata1data2
0A02
2C28
3C33
5A58
  • 转换
df
keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58
df.groupby("key").transform(lambda x: x-x.mean())
data1data2
0-2.5-3.0
1-1.5-1.0
2-0.52.5
30.5-2.5
41.51.0
52.53.0
df
keydata1data2
0A01
1B14
2C29
3C39
4B41
5A59
df.groupby("key").apply(lambda x: x-x.mean())
data1data2
0-2.5-4.0
1-1.51.5
2-0.50.0
30.50.0
41.5-1.5
52.54.0
  • apply()方法
df
keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
keydata1data2
0A0.0000002
1B0.1666672
2C0.1818188
3C0.2727273
4B0.6666674
5A0.5000008
  • 将列表、数组设为分组键
L = [0, 1, 0, 1, 2, 0]
df
keydata1data2
0A02
1B12
2C28
3C33
4B44
5A58
df.groupby(L).sum()
data1data2
0718
145
244
  • 用字典将索引映射到分组
df2 = df.set_index("key")
df2
data1data2
key
A02
B12
C28
C33
B44
A58
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
data1data2
constant1017
first510
  • 任意Python函数
df2.groupby(str.lower).mean()
data1data2
a2.55.0
b2.53.0
c2.55.5
  • 多个有效值组成的列表
df2.groupby([str.lower, mapping]).mean()
data1data2
afirst2.55.0
bconstant2.53.0
cconstant2.55.5

【例1】 行星观测数据处理

import seaborn as sns

planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009
planets.describe()
numberorbital_periodmassdistanceyear
count1035.000000992.000000513.000000808.0000001035.000000
mean1.7855072002.9175962.638161264.0692822009.070531
std1.24097626014.7283043.818617733.1164933.972567
min1.0000000.0907060.0036001.3500001989.000000
25%1.0000005.4425400.22900032.5600002007.000000
50%1.00000039.9795001.26000055.2500002010.000000
75%2.000000526.0050003.040000178.5000002012.000000
max7.000000730000.00000025.0000008500.0000002014.000000
planets.head()
methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009
decade = 10 * (planets["year"] // 10)
decade.head()
0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object
planets.head()
methodnumberorbital_periodmassdistanceyear
0Radial Velocity1269.3007.1077.402006
1Radial Velocity1874.7742.2156.952008
2Radial Velocity1763.0002.6019.842011
3Radial Velocity1326.03019.40110.622007
4Radial Velocity1516.22010.50119.472009
planets.groupby(["method", decade]).sum()
numberorbital_periodmassdistanceyear
methoddecade
Astrometry2010s21.262360e+030.0000035.754023
Eclipse Timing Variations2000s51.930800e+046.05000261.446025
2010s102.345680e+044.200001000.0012065
Imaging2000s291.350935e+060.00000956.8340139
2010s216.803750e+040.000001210.0836208
Microlensing2000s121.732500e+040.000000.0020070
2010s154.750000e+030.0000041440.0026155
Orbital Brightness Modulation2010s52.127920e+000.000002360.006035
Pulsar Timing1990s91.900153e+020.000000.005978
2000s13.652500e+040.000000.002003
2010s19.070629e-020.000001200.002011
Pulsation Timing Variations2000s11.170000e+030.000000.002007
Radial Velocity1980s18.388800e+0111.6800040.571989
1990s521.091561e+0468.17820723.7155943
2000s4752.633526e+05945.3192815201.16619775
2010s4241.809630e+05316.4789011382.67432451
Transit2000s642.897102e+020.0000031823.31124462
2010s7128.087813e+031.47000102419.46673999
Transit Timing Variations2010s92.393505e+020.000003313.008050
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
number
decade1980s1990s2000s2010s
method
Astrometry0.00.00.02.0
Eclipse Timing Variations0.00.05.010.0
Imaging0.00.029.021.0
Microlensing0.00.012.015.0
Orbital Brightness Modulation0.00.00.05.0
Pulsar Timing0.09.01.01.0
Pulsation Timing Variations0.00.01.00.0
Radial Velocity1.052.0475.0424.0
Transit0.00.064.0712.0
Transit Timing Variations0.00.00.09.0

(2)数据透视表

【例2】泰坦尼克号乘客数据分析

import seaborn as sns

titanic = sns.load_dataset("titanic")
titanic.head()
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealone
003male22.0107.2500SThirdmanTrueNaNSouthamptonnoFalse
111female38.01071.2833CFirstwomanFalseCCherbourgyesFalse
213female26.0007.9250SThirdwomanFalseNaNSouthamptonyesTrue
311female35.01053.1000SFirstwomanFalseCSouthamptonyesFalse
403male35.0008.0500SThirdmanTrueNaNSouthamptonnoTrue
T = titanic[titanic.age.notnull()].copy()
T.age.apply(lambda x: 60 if x>=60 else x)
T.age.value_counts()
24.00    30
22.00    27
60.00    26
18.00    26
28.00    25
30.00    25
19.00    25
21.00    24
25.00    23
36.00    22
29.00    20
35.00    18
32.00    18
27.00    18
26.00    18
31.00    17
16.00    17
34.00    15
20.00    15
33.00    15
23.00    15
39.00    14
40.00    13
17.00    13
42.00    13
45.00    12
38.00    11
4.00     10
50.00    10
2.00     10
         ..
8.00      4
5.00      4
11.00     4
6.00      3
7.00      3
46.00     3
30.50     2
57.00     2
0.83      2
55.00     2
10.00     2
59.00     2
13.00     2
28.50     2
40.50     2
45.50     2
0.75      2
32.50     2
34.50     1
55.50     1
0.92      1
36.50     1
12.00     1
53.00     1
14.50     1
0.67      1
20.50     1
23.50     1
24.50     1
0.42      1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20    220
30    167
10    102
40     89
0      62
50     48
60     26
Name: age, dtype: int64
Age.astype(str)+"s"
0      20s
1      30s
2      20s
3      30s
4      30s
6      50s
7       0s
8      20s
9      10s
10      0s
11     50s
12     20s
13     30s
14     10s
15     50s
16      0s
18     30s
20     30s
21     30s
22     10s
23     20s
24      0s
25     30s
27     10s
30     40s
33     60s
34     20s
35     40s
37     20s
38     10s
      ... 
856    40s
857    50s
858    20s
860    40s
861    20s
862    40s
864    20s
865    40s
866    20s
867    30s
869     0s
870    20s
871    40s
872    30s
873    40s
874    20s
875    10s
876    20s
877    10s
879    50s
880    20s
881    30s
882    20s
883    20s
884    20s
885    30s
886    20s
887    10s
889    20s
890    30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
age0102030405060
sex
female0.6333330.7555560.7222220.8333330.6875000.8888891.000000
male0.5937500.1228070.1689190.2149530.2105260.1333330.136364
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age0102030405060
sex
female0.6333330.7555560.7222220.8333330.6875000.8888891.000000
male0.5937500.1228070.1689190.2149530.2105260.1333330.136364
titanic.describe()
survivedpclassagesibspparchfare
count891.000000891.000000714.000000891.000000891.000000891.000000
mean0.3838382.30864229.6991180.5230080.38159432.204208
std0.4865920.83607114.5264971.1027430.80605749.693429
min0.0000001.0000000.4200000.0000000.0000000.000000
25%0.0000002.00000020.1250000.0000000.0000007.910400
50%0.0000003.00000028.0000000.0000000.00000014.454200
75%1.0000003.00000038.0000001.0000000.00000031.000000
max1.0000003.00000080.0000008.0000006.000000512.329200
titanic.groupby("sex")[["survived"]].mean()
survived
sex
female0.742038
male0.188908
titanic.groupby("sex")["survived"].mean()
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
classFirstSecondThird
sex
female0.9680850.9210530.500000
male0.3688520.1574070.135447
  • 数据透视表
titanic.pivot_table("survived", index="sex", columns="class")
classFirstSecondThird
sex
female0.9680850.9210530.500000
male0.3688520.1574070.135447
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)
classFirstSecondThirdAll
sex
female0.9680850.9210530.5000000.742038
male0.3688520.1574070.1354470.188908
All0.6296300.4728260.2423630.383838
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})
faresurvived
classFirstSecondThirdFirstSecondThird
sex
female106.12579821.97012116.118810917072
male67.22612719.74178212.661633451747

12.7 其他

(1)向量化字符串操作

(2) 处理时间序列

(3) 多级索引:用于多维数据

base_data = np.array([[1771, 11115 ],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]]) 
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
                                     , [2008, 2018]*4], columns=["population", "GDP"])
data
populationGDP
BeiJing2008177111115
2018215430320
ShangHai2008214114070
2018242432680
ShenZhen200810777806
2018130324222
HangZhou20087984789
201898113468
data.index.names = ["city", "year"]
data
populationGDP
cityyear
BeiJing2008177111115
2018215430320
ShangHai2008214114070
2018242432680
ShenZhen200810777806
2018130324222
HangZhou20087984789
201898113468
data["GDP"]
city      year
BeiJing   2008    11115
          2018    30320
ShangHai  2008    14070
          2018    32680
ShenZhen  2008     7806
          2018    24222
HangZhou  2008     4789
          2018    13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008    14070
2018    32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680

(4) 高性能的Pandas:eval()

df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • 减少了复合代数式计算中间过程的内存分配
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
  • 实现列间运算
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
ABC
00.4180710.3818360.500556
10.0594320.7490660.302429
20.4891470.7391530.777161
30.1754410.0165560.348979
40.7665340.5592520.310635
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199
  • 使用局部变量
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0    0.342788
1    0.047409
2   -0.387501
3    0.236956
4    0.694839
dtype: float64

(4) 高性能的Pandas:query()

df.head()
ABCD
00.4180710.3818360.500556-1.601593
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
30.1754410.0165560.348979-0.294917
40.7665340.5592520.310635-1.923199
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
ABCD
10.0594320.7490660.302429-1.159019
20.4891470.7391530.777161-5.512052
70.0739500.7301440.646190-2.272672
100.3932000.6104670.697096-3.313485
110.0657340.7646990.179380-1.011958
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True

(5)eval()和query()的使用时机

小数组时,普通方法反而更快

df.values.nbytes
32000
df1.values.nbytes
8000000
温馨提示:本文最后更新于2024-05-26,若文件或内容有错误或已失效,请在下方留言
© 版权声明
THE END
喜欢就支持一下吧
点赞13赞赏 分享
评论 抢沙发

请登录后发表评论

    暂无评论内容