Python Pandas 库使用指北
Pandas
是 Python 下的高性能的数据管理工具与数据分析工具。
本文中介绍了 Pandas
库中的一些常用类,然后记录了一些简单的筛选,切片等等用法。
基本概念
- (主要数据结构)Pandas 的主要数据结构是 Series(一维数据)与 DataFrame(二维数据)。
- (大小不变)Pandas 所有数据结构的值都是可变的,但数据结构的大小并非都是可变的,比如,Series 的长度不可改变,但 DataFrame 里就可以插入列。
- (倾向于复制)Pandas 里,绝大多数方法都不改变原始的输入数据,而是复制数据,生成新的对象。 一般来说,原始输入数据不变更稳妥。
快速入门
>>> import numpy as np
>>> import pandas as pd
生成 Series 与 DataFrame
- Series 的生成:默认使用整数索引
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
- DataFrame 的生成:① 指定数据,索引,列名与是否为拷贝
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
>>> dates = pd.date_range('20220101', periods=6)
>>> dates
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2022-01-01 -0.165424 -0.095599 0.940562 -0.629898
2022-01-02 0.237563 0.252783 1.362851 -1.525588
2022-01-03 2.222287 -0.591918 -1.450194 0.860104
2022-01-04 -0.560238 0.548346 -0.691656 -0.810484
2022-01-05 1.600348 0.242665 0.836329 0.372371
2022-01-06 -0.332250 -0.941505 0.110009 -1.478538
>>> a1 = {"A": 1, "B": 2, "c": 3}
>>> a2 = {"A": 1, "B": 3, "C": 5}
>>> pd1 = pd.DataFrame([a1, a2])
>>> pd1
A B c C
0 1 2 3.0 NaN
1 1 3 NaN 5.0
- DataFrame 的生成:② read_csv
task = pd.read_csv(f'./Data/{camp_name}/task.csv')
tid name score
0 0001-01-A 任务 1-A 5
1 0001-01-B 任务 1-B 7
2 0001-01-C 任务 1-C 2
3 0001-02-A 任务 2-A 5
4 0001-02-B 任务 2-B 5
5 0001-99-A Bonus 1
- DataFrame 的生成:③ data 中传入 Series,利用其所有的 index。如果不同列的 index 不 match,那么报错。
# Example
df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
# Test
df3 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': pd.Series(12, index=list(range(6)), dtype='float32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 464, in dict_to_mgr
return arrays_to_mgr(
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
index = _extract_index(arrays)
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 649, in _extract_index
raise ValueError(msg)
ValueError: array length 4 does not match index length 6
# Another Test
df3 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': pd.Series(12, index=['a', 'b', 'c', 3], dtype='float32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 464, in dict_to_mgr
return arrays_to_mgr(
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
index = _extract_index(arrays)
File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 649, in _extract_index
raise ValueError(msg)
ValueError: array length 4 does not match index length 7
# Another Test
df3 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': pd.Series(12, index=[0,1,2,3], dtype='float32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
>>> df3
A B C D E F
0 1.0 2013-01-02 1.0 12.0 test foo
1 1.0 2013-01-02 1.0 12.0 train foo
2 1.0 2013-01-02 1.0 12.0 test foo
3 1.0 2013-01-02 1.0 12.0 train foo
查看数据
- 列数据类型
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
- 预览数据
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2.head() # 默认最多显示 5 条数据
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2.head(2)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.tail(2)
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
- 显示索引与列
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
- 转换成 numpy 对象
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2022-01-01 -0.451372 1.581206 -0.499837 -0.549320
2022-01-02 0.508692 -0.304325 1.995154 -0.895727
2022-01-03 0.681460 1.214341 -0.608446 -1.054553
2022-01-04 -1.032206 0.237051 1.502665 -0.048824
2022-01-05 1.707183 0.382228 1.335121 -1.099418
2022-01-06 0.895205 0.982582 0.188397 0.023960
>>> df.to_numpy() # 纯 float 类型,转换速度较快
array([[-0.45137221, 1.58120564, -0.49983697, -0.54931982],
[ 0.50869221, -0.30432474, 1.99515425, -0.89572709],
[ 0.68146026, 1.21434138, -0.60844611, -1.05455281],
[-1.03220563, 0.23705134, 1.50266499, -0.04882351],
[ 1.70718305, 0.38222751, 1.33512092, -1.09941788],
[ 0.89520526, 0.98258218, 0.18839673, 0.02396049]])
>>> df2.to_numpy() # 复合数据类型,转换速度慢
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
- 数据统计信息
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.384827 0.682181 0.652176 -0.603980
std 0.981800 0.698997 1.106773 0.497813
min -1.032206 -0.304325 -0.608446 -1.099418
25% -0.211356 0.273345 -0.327779 -1.014846
50% 0.595076 0.682405 0.761759 -0.722523
75% 0.841769 1.156402 1.460779 -0.173948
max 1.707183 1.581206 1.995154 0.023960
数据操作
- 转置
>>> df.T
2022-01-01 2022-01-02 2022-01-03 2022-01-04 2022-01-05 2022-01-06
A -0.451372 0.508692 0.681460 -1.032206 1.707183 0.895205
B 1.581206 -0.304325 1.214341 0.237051 0.382228 0.982582
C -0.499837 1.995154 -0.608446 1.502665 1.335121 0.188397
D -0.549320 -0.895727 -1.054553 -0.048824 -1.099418 0.023960
- 按索引排序
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
- kind 可以取’mergesort’,以达到稳定的目的
- axis 取 0 按行排序,取 1 按列排序
>>> df4 = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
... columns=['A'])
>>> df4
A
100 1
29 2
234 3
1 4
150 5
>>> df4.sort_index()
A
1 4
29 2
100 1
150 5
234 3
>>> df4 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], index=[100, 29, 234, 1, 150], columns=['B', 'A'])
>>> df4
B A
100 1 2
29 3 4
234 5 6
1 7 8
150 9 10
>>> df4.sort_index()
B A
1 7 8
29 3 4
100 1 2
150 9 10
234 5 6
>>> df4.sort_index(axis=1)
A B
100 2 1
29 4 3
234 6 5
1 8 7
150 10 9
- 按某列排序
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
- na_position 可以是
first
和last
- 如果 ignore_index 为真,那么结果的索引会从 0 标号至 n-1
- key: 接受一个 Series 作为输出,返回一个 shape 相同的 Series,然后使用后者进行排序操作
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... 'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3 col4
1 A 1 1 B
0 A 2 0 a
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
df.sort_values(by='col4', key=lambda col: col.str.lower())
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
选择
- 获取单列:使用
df.A
或者df['A']
>>> df
A B C D
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712
2022-01-02 -2.059045 1.715020 0.607552 1.145638
2022-01-03 -1.398741 1.538408 1.459097 0.491077
2022-01-04 0.642856 0.670310 0.378368 -0.167852
2022-01-05 0.288924 1.131819 0.532734 0.893374
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592
>>> df.A
2022-01-01 -0.514649
2022-01-02 -2.059045
2022-01-03 -1.398741
2022-01-04 0.642856
2022-01-05 0.288924
2022-01-06 -0.604422
Freq: D, Name: A, dtype: float64
>>> df['A']
2022-01-01 -0.514649
2022-01-02 -2.059045
2022-01-03 -1.398741
2022-01-04 0.642856
2022-01-05 0.288924
2022-01-06 -0.604422
Freq: D, Name: A, dtype: float64
- 获取多行:使用
[]
做 Slice
>>> df[1:4]
A B C D
2022-01-02 -2.059045 1.715020 0.607552 1.145638
2022-01-03 -1.398741 1.538408 1.459097 0.491077
2022-01-04 0.642856 0.670310 0.378368 -0.167852
>>> df['20220102':]
A B C D
2022-01-02 -2.059045 1.715020 0.607552 1.145638
2022-01-03 -1.398741 1.538408 1.459097 0.491077
2022-01-04 0.642856 0.670310 0.378368 -0.167852
2022-01-05 0.288924 1.131819 0.532734 0.893374
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592
- 按标签索引:使用
.loc[]
>>> df.loc[ : , : ]
A B C D
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712
2022-01-02 -2.059045 1.715020 0.607552 1.145638
2022-01-03 -1.398741 1.538408 1.459097 0.491077
2022-01-04 0.642856 0.670310 0.378368 -0.167852
2022-01-05 0.288924 1.131819 0.532734 0.893374
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592
>>> df.loc[ '20220101' , : ]
A -0.514649
B 2.198378
C -0.986735
D -0.368712
Name: 2022-01-01 00:00:00, dtype: float64
>>> df.loc[ '20220101' , 'C' ]
-0.9867347785391464
>>> df.loc[ '20220101' : '20220104' , 'C' ]
2022-01-01 -0.986735
2022-01-02 0.607552
2022-01-03 1.459097
2022-01-04 0.378368
Freq: D, Name: C, dtype: float64
>>> df.loc[ '20220101' : '20220104' , ['C', 'B', 'B', 'D'] ]
C B B D
2022-01-01 -0.986735 2.198378 2.198378 -0.368712
2022-01-02 0.607552 1.715020 1.715020 1.145638
2022-01-03 1.459097 1.538408 1.538408 0.491077
2022-01-04 0.378368 0.670310 0.670310 -0.167852
- 按整数索引
>>> df.iloc[ 1:3 , [2, 1, 1, 3] ]
C B B D
2022-01-02 0.607552 1.715020 1.715020 1.145638
2022-01-03 1.459097 1.538408 1.538408 0.491077
- Bool 索引
>>> df.B
2022-01-01 2.198378
2022-01-02 1.715020
2022-01-03 1.538408
2022-01-04 0.670310
2022-01-05 1.131819
2022-01-06 -0.483053
Freq: D, Name: B, dtype: float64
>>> df.B > 1
2022-01-01 True
2022-01-02 True
2022-01-03 True
2022-01-04 False
2022-01-05 True
2022-01-06 False
Freq: D, Name: B, dtype: bool
>>> df.loc[df.B > 1]
A B C D
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712
2022-01-02 -2.059045 1.715020 0.607552 1.145638
2022-01-03 -1.398741 1.538408 1.459097 0.491077
2022-01-05 0.288924 1.131819 0.532734 0.893374
- 用
isin()
筛选
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
- 赋值
>>> df['E'] = [1,2,3,4,4,4]
>>> df
A B C D E
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4
2022-01-05 0.288924 1.131819 0.532734 0.893374 4
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4
>>> s = pd.Series([6,5,4,3,2,1,], index=pd.date_range('20220103', periods=6))
>>> s
2022-01-03 6
2022-01-04 5
2022-01-05 4
2022-01-06 3
2022-01-07 2
2022-01-08 1
Freq: D, dtype: int64
>>> df.F = s
>>> df
A B C D E
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4
2022-01-05 0.288924 1.131819 0.532734 0.893374 4
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4
>>> df['F'] = s
>>> df
A B C D E F
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1 NaN
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4 5.0
2022-01-05 0.288924 1.131819 0.532734 0.893374 4 4.0
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4 3.0
>>> df.at['20220105', 'A'] = 114514
>>> df
A B C D E F
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1 NaN
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4 5.0
2022-01-05 114514.000000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4 3.0
>>> df.at['20220105', 'A'] = 114514
>>> df
A B C D E F
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1 NaN
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4 5.0
2022-01-05 114514.000000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4 3.0
>>> df
A B C D E F
2022-01-01 -0.514649 2.198378 -0.986735 -0.368712 1 NaN
2022-01-02 -2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 -1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 -0.167852 4 5.0
2022-01-05 -0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 -0.604422 -0.483053 1.557318 -0.158592 4 3.0
>>> df[df < 0]
A B C D E F
2022-01-01 -0.514649 NaN -0.986735 -0.368712 NaN NaN
2022-01-02 -2.059045 NaN NaN NaN NaN NaN
2022-01-03 -1.398741 NaN NaN NaN NaN NaN
2022-01-04 NaN NaN NaN -0.167852 NaN NaN
2022-01-05 -0.012000 NaN NaN NaN NaN NaN
2022-01-06 -0.604422 -0.483053 NaN -0.158592 NaN NaN
>>> df[df < 0] = -df
>>> df
A B C D E F
2022-01-01 0.514649 2.198378 0.986735 0.368712 1 NaN
2022-01-02 2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
缺失值
- 删除带有缺省值的行
>>> df1 = df.copy()
>>> df1
A B C D E F
2022-01-01 0.514649 2.198378 0.986735 0.368712 1 NaN
2022-01-02 2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> df1.dropna(how='any')
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> df1
A B C D E F
2022-01-01 0.514649 2.198378 0.986735 0.368712 1 NaN
2022-01-02 2.059045 1.715020 0.607552 1.145638 2 NaN
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
- 填充缺省值
>>> df1.fillna(value=5)
A B C D E F
2022-01-01 0.514649 2.198378 0.986735 0.368712 1 5.0
2022-01-02 2.059045 1.715020 0.607552 1.145638 2 5.0
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
- 获取是否为 Nan
>>> pd.isna(df1)
A B C D E F
2022-01-01 False False False False False True
2022-01-02 False False False False False True
2022-01-03 False False False False False False
2022-01-04 False False False False False False
2022-01-05 False False False False False False
2022-01-06 False False False False False False
运算
Pandas 默认用 np.nan
表示缺失数据。下列计算时,默认不包含缺失值。
- 求平均值
>>> df = df1.dropna(how='any')
>>> df
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> df.mean()
A 0.664505
B 0.955897
C 0.981879
D 0.427724
E 3.750000
F 4.500000
dtype: float64
>>> df.mean(1)
2022-01-03 2.314554
2022-01-04 1.809898
2022-01-05 1.761655
2022-01-06 1.633897
Freq: D, dtype: float64
- 作差
>>> df
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> s
2022-01-03 6
2022-01-04 5
2022-01-05 4
2022-01-06 3
2022-01-07 2
2022-01-08 1
Freq: D, dtype: int64
>>> df.sub(s)
2022-01-03 00:00:00 2022-01-04 00:00:00 ... E F
2022-01-03 NaN NaN ... NaN NaN
2022-01-04 NaN NaN ... NaN NaN
2022-01-05 NaN NaN ... NaN NaN
2022-01-06 NaN NaN ... NaN NaN
[4 rows x 12 columns]
>>> df.sub(s, axis='index')
A B C D E F
2022-01-03 -4.601259 -4.461592 -4.540903 -5.508923 -3.0 0.0
2022-01-04 -4.357144 -4.329690 -4.621632 -4.832148 -1.0 0.0
2022-01-05 -3.988000 -2.868181 -3.467266 -3.106626 0.0 0.0
2022-01-06 -2.395578 -2.516947 -1.442682 -2.841408 1.0 0.0
2022-01-07 NaN NaN NaN NaN NaN NaN
2022-01-08 NaN NaN NaN NaN NaN NaN
- Apply 函数
>>> df
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> df.apply(np.cumsum)
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 2.041597 2.208717 1.837465 0.658930 7 11.0
2022-01-05 2.053597 3.340537 2.370199 1.552304 11 15.0
2022-01-06 2.658018 3.823590 3.927517 1.710896 15 18.0
>>> df.apply(lambda x: x.max() - x.min())
A 1.386741
B 1.055355
C 1.178950
D 0.734782
E 1.000000
F 3.000000
dtype: float64
- Series 的频数统计
>>> df
A B C D E F
2022-01-03 1.398741 1.538408 1.459097 0.491077 3 6.0
2022-01-04 0.642856 0.670310 0.378368 0.167852 4 5.0
2022-01-05 0.012000 1.131819 0.532734 0.893374 4 4.0
2022-01-06 0.604422 0.483053 1.557318 0.158592 4 3.0
>>> df.E
2022-01-03 3
2022-01-04 4
2022-01-05 4
2022-01-06 4
Freq: D, Name: E, dtype: int64
>>> df.E.value_counts()
4 3
3 1
Name: E, dtype: int64
合并 Merge
Concat
>>> df = pd.DataFrame(np.random.randn(10, 4))
>>> df1 = df[:3]
>>> df2 = df[7:]
>>> df
0 1 2 3
0 -0.453289 0.892104 1.891981 0.485892
1 -2.703414 -0.685189 -0.903685 -0.137809
2 0.500884 0.831745 -0.757333 1.997601
3 0.939272 1.907135 -0.528942 -0.724512
4 0.796412 0.958696 1.352733 1.805198
5 -1.141725 1.591892 -0.171487 -0.289578
6 0.155887 -0.646798 1.151169 1.051582
7 -0.917709 -0.156452 0.578088 0.639791
8 0.570383 -0.513202 0.891358 -0.567285
9 1.858860 1.628878 -1.269917 -0.396636
>>> pd.concat([df1, df2])
0 1 2 3
0 -0.453289 0.892104 1.891981 0.485892
1 -2.703414 -0.685189 -0.903685 -0.137809
2 0.500884 0.831745 -0.757333 1.997601
7 -0.917709 -0.156452 0.578088 0.639791
8 0.570383 -0.513202 0.891358 -0.567285
9 1.858860 1.628878 -1.269917 -0.396636
Join
SQL 风格的合并。
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
>>> left
key lval
0 foo 1
1 foo 2
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
追加 Append
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> s = df.iloc[5]
>>> df
A B C D
0 -0.019536 1.644399 1.250679 0.140385
1 0.770806 1.496631 0.216141 0.265806
2 1.166252 0.415610 -1.166003 1.359695
3 -0.360562 0.463504 0.507450 -0.651677
4 -0.459843 -2.204788 1.381087 -0.988501
5 0.505702 -0.213073 -1.264545 1.884786
6 -0.955691 0.130868 -0.722569 -0.514278
7 0.768887 0.195819 -0.997779 -0.707758
>>> s
A 0.505702
B -0.213073
C -1.264545
D 1.884786
Name: 5, dtype: float64
>>> df.append(s, ignore_index=True)
A B C D
0 -0.019536 1.644399 1.250679 0.140385
1 0.770806 1.496631 0.216141 0.265806
2 1.166252 0.415610 -1.166003 1.359695
3 -0.360562 0.463504 0.507450 -0.651677
4 -0.459843 -2.204788 1.381087 -0.988501
5 0.505702 -0.213073 -1.264545 1.884786
6 -0.955691 0.130868 -0.722569 -0.514278
7 0.768887 0.195819 -0.997779 -0.707758
8 0.505702 -0.213073 -1.264545 1.884786
分组 Grouping
“group by” 指的是涵盖下列一项或多项步骤的处理流程:
- 分割:按条件把数据分割成多组;
- 应用:为每组单独应用函数;
- 组合:将处理结果组合成一个数据结构。
>>> df
A B C D E
0 -0.019536 1.644399 1.250679 0.140385 1
1 0.770806 1.496631 0.216141 0.265806 1
2 1.166252 0.415610 -1.166003 1.359695 1
3 -0.360562 0.463504 0.507450 -0.651677 3
4 -0.459843 -2.204788 1.381087 -0.988501 4
5 0.505702 -0.213073 -1.264545 1.884786 3
6 -0.955691 0.130868 -0.722569 -0.514278 2
7 0.768887 0.195819 -0.997779 -0.707758 2
>>> df.groupby('E')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EE0AF69FD0>
>>> df.groupby('E').sum()
A B C D
E
1 1.917522 3.556640 0.300816 1.765886
2 -0.186804 0.326687 -1.720349 -1.222036
3 0.145139 0.250431 -0.757095 1.233109
4 -0.459843 -2.204788 1.381087 -0.988501