Pandas 是 Python 下的高性能的数据管理工具与数据分析工具。

本文中介绍了 Pandas 库中的一些常用类,然后记录了一些简单的筛选,切片等等用法。

基本概念

  • (主要数据结构)Pandas 的主要数据结构是 Series(一维数据)与 DataFrame(二维数据)。
  • (大小不变)Pandas 所有数据结构的值都是可变的,但数据结构的大小并非都是可变的,比如,Series 的长度不可改变,但 DataFrame 里就可以插入列。
  • (倾向于复制)Pandas 里,绝大多数方法都不改变原始的输入数据,而是复制数据,生成新的对象。 一般来说,原始输入数据不变更稳妥。

快速入门

>>> import numpy as np
>>> import pandas as pd

生成 Series 与 DataFrame

  • Series 的生成:默认使用整数索引
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
  • DataFrame 的生成:① 指定数据,索引,列名与是否为拷贝
    • class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
>>> dates = pd.date_range('20220101', periods=6)
>>> dates
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')

>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2022-01-01 -0.165424 -0.095599  0.940562 -0.629898
2022-01-02  0.237563  0.252783  1.362851 -1.525588
2022-01-03  2.222287 -0.591918 -1.450194  0.860104
2022-01-04 -0.560238  0.548346 -0.691656 -0.810484
2022-01-05  1.600348  0.242665  0.836329  0.372371
2022-01-06 -0.332250 -0.941505  0.110009 -1.478538

>>> a1 = {"A": 1, "B": 2, "c": 3}
>>> a2 = {"A": 1, "B": 3, "C": 5}
>>> pd1 = pd.DataFrame([a1, a2])
>>> pd1
   A  B    c    C
0  1  2  3.0  NaN
1  1  3  NaN  5.0
  • DataFrame 的生成:② read_csv
task = pd.read_csv(f'./Data/{camp_name}/task.csv')

         tid     name   score
0  0001-01-A   任务 1-A       5
1  0001-01-B   任务 1-B       7
2  0001-01-C   任务 1-C       2
3  0001-02-A   任务 2-A       5
4  0001-02-B   任务 2-B       5
5  0001-99-A    Bonus       1
  • DataFrame 的生成:③ data 中传入 Series,利用其所有的 index。如果不同列的 index 不 match,那么报错。
# Example
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})
 

>>> df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


# Test
df3 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': pd.Series(12, index=list(range(6)), dtype='float32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 464, in dict_to_mgr
    return arrays_to_mgr(
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 649, in _extract_index
    raise ValueError(msg)
ValueError: array length 4 does not match index length 6
        

# Another Test
df3 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': pd.Series(12, index=['a', 'b', 'c', 3], dtype='float32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 464, in dict_to_mgr
    return arrays_to_mgr(
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/home/c7w/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 649, in _extract_index
    raise ValueError(msg)
ValueError: array length 4 does not match index length 7
        
        
# Another Test
df3 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': pd.Series(12, index=[0,1,2,3], dtype='float32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})
>>> df3
     A          B    C     D      E    F
0  1.0 2013-01-02  1.0  12.0   test  foo
1  1.0 2013-01-02  1.0  12.0  train  foo
2  1.0 2013-01-02  1.0  12.0   test  foo
3  1.0 2013-01-02  1.0  12.0  train  foo

查看数据

  • 列数据类型
>>> df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
  • 预览数据
>>> df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

>>> df2.head() # 默认最多显示 5 条数据
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

>>> df2.head(2)
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo

>>> df2.tail(2)
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
  • 显示索引与列
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')

>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
  • 转换成 numpy 对象

>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

>>> df
                   A         B         C         D
2022-01-01 -0.451372  1.581206 -0.499837 -0.549320
2022-01-02  0.508692 -0.304325  1.995154 -0.895727
2022-01-03  0.681460  1.214341 -0.608446 -1.054553
2022-01-04 -1.032206  0.237051  1.502665 -0.048824
2022-01-05  1.707183  0.382228  1.335121 -1.099418
2022-01-06  0.895205  0.982582  0.188397  0.023960

>>> df.to_numpy() # 纯 float 类型,转换速度较快
array([[-0.45137221,  1.58120564, -0.49983697, -0.54931982],
       [ 0.50869221, -0.30432474,  1.99515425, -0.89572709],
       [ 0.68146026,  1.21434138, -0.60844611, -1.05455281],
       [-1.03220563,  0.23705134,  1.50266499, -0.04882351],
       [ 1.70718305,  0.38222751,  1.33512092, -1.09941788],
       [ 0.89520526,  0.98258218,  0.18839673,  0.02396049]])

>>> df2.to_numpy() # 复合数据类型,转换速度慢
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
  • 数据统计信息
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.384827  0.682181  0.652176 -0.603980
std    0.981800  0.698997  1.106773  0.497813
min   -1.032206 -0.304325 -0.608446 -1.099418
25%   -0.211356  0.273345 -0.327779 -1.014846
50%    0.595076  0.682405  0.761759 -0.722523
75%    0.841769  1.156402  1.460779 -0.173948
max    1.707183  1.581206  1.995154  0.023960

数据操作

  • 转置
>>> df.T
   2022-01-01  2022-01-02  2022-01-03  2022-01-04  2022-01-05  2022-01-06
A   -0.451372    0.508692    0.681460   -1.032206    1.707183    0.895205
B    1.581206   -0.304325    1.214341    0.237051    0.382228    0.982582
C   -0.499837    1.995154   -0.608446    1.502665    1.335121    0.188397
D   -0.549320   -0.895727   -1.054553   -0.048824   -1.099418    0.023960
  • 按索引排序
    • DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
    • kind 可以取’mergesort’,以达到稳定的目的
    • axis 取 0 按行排序,取 1 按列排序
>>> df4 = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df4
     A
100  1
29   2
234  3
1    4
150  5
>>> df4.sort_index()
     A
1    4
29   2
100  1
150  5
234  3


>>> df4 = pd.DataFrame([[1, 2],  [3, 4], [5, 6], [7, 8], [9, 10]], index=[100, 29, 234, 1, 150], columns=['B', 'A'])
>>> df4
     B   A
100  1   2
29   3   4
234  5   6
1    7   8
150  9  10
>>> df4.sort_index()
     B   A
1    7   8
29   3   4
100  1   2
150  9  10
234  5   6
>>> df4.sort_index(axis=1)
      A  B
100   2  1
29    4  3
234   6  5
1     8  7
150  10  9
  • 按某列排序
    • DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
    • na_position 可以是 firstlast
    • 如果 ignore_index 为真,那么结果的索引会从 0 标号至 n-1
    • key: 接受一个 Series 作为输出,返回一个 shape 相同的 Series,然后使用后者进行排序操作
>>> df = pd.DataFrame({
...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2': [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
...     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })

>>> df
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

>>> df.sort_values(by=['col1'])
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

>>> df.sort_values(by=['col1', 'col2'])
  col1  col2  col3 col4
1    A     1     1    B
0    A     2     0    a
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

df.sort_values(by='col4', key=lambda col: col.str.lower())
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

选择

  • 获取单列:使用 df.A 或者 df['A']
>>> df
                   A         B         C         D
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712
2022-01-02 -2.059045  1.715020  0.607552  1.145638
2022-01-03 -1.398741  1.538408  1.459097  0.491077
2022-01-04  0.642856  0.670310  0.378368 -0.167852
2022-01-05  0.288924  1.131819  0.532734  0.893374
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592

>>> df.A
2022-01-01   -0.514649
2022-01-02   -2.059045
2022-01-03   -1.398741
2022-01-04    0.642856
2022-01-05    0.288924
2022-01-06   -0.604422
Freq: D, Name: A, dtype: float64
            
>>> df['A']
2022-01-01   -0.514649
2022-01-02   -2.059045
2022-01-03   -1.398741
2022-01-04    0.642856
2022-01-05    0.288924
2022-01-06   -0.604422
Freq: D, Name: A, dtype: float64
  • 获取多行:使用 [] 做 Slice
>>> df[1:4]
                   A         B         C         D
2022-01-02 -2.059045  1.715020  0.607552  1.145638
2022-01-03 -1.398741  1.538408  1.459097  0.491077
2022-01-04  0.642856  0.670310  0.378368 -0.167852

>>> df['20220102':]
                   A         B         C         D
2022-01-02 -2.059045  1.715020  0.607552  1.145638
2022-01-03 -1.398741  1.538408  1.459097  0.491077
2022-01-04  0.642856  0.670310  0.378368 -0.167852
2022-01-05  0.288924  1.131819  0.532734  0.893374
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592
  • 按标签索引:使用 .loc[]
>>> df.loc[ : , : ]
                   A         B         C         D
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712
2022-01-02 -2.059045  1.715020  0.607552  1.145638
2022-01-03 -1.398741  1.538408  1.459097  0.491077
2022-01-04  0.642856  0.670310  0.378368 -0.167852
2022-01-05  0.288924  1.131819  0.532734  0.893374
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592

>>> df.loc[ '20220101' , : ]
A   -0.514649
B    2.198378
C   -0.986735
D   -0.368712
Name: 2022-01-01 00:00:00, dtype: float64

>>> df.loc[ '20220101' , 'C' ]
-0.9867347785391464

>>> df.loc[ '20220101' : '20220104' , 'C' ]
2022-01-01   -0.986735
2022-01-02    0.607552
2022-01-03    1.459097
2022-01-04    0.378368
Freq: D, Name: C, dtype: float64

>>> df.loc[ '20220101' : '20220104' , ['C', 'B', 'B', 'D'] ]
                   C         B         B         D
2022-01-01 -0.986735  2.198378  2.198378 -0.368712
2022-01-02  0.607552  1.715020  1.715020  1.145638
2022-01-03  1.459097  1.538408  1.538408  0.491077
2022-01-04  0.378368  0.670310  0.670310 -0.167852
  • 按整数索引
>>> df.iloc[ 1:3 , [2, 1, 1, 3] ]
                   C         B         B         D
2022-01-02  0.607552  1.715020  1.715020  1.145638
2022-01-03  1.459097  1.538408  1.538408  0.491077
  • Bool 索引
>>> df.B
2022-01-01    2.198378
2022-01-02    1.715020
2022-01-03    1.538408
2022-01-04    0.670310
2022-01-05    1.131819
2022-01-06   -0.483053
Freq: D, Name: B, dtype: float64
            
>>> df.B > 1
2022-01-01     True
2022-01-02     True
2022-01-03     True
2022-01-04    False
2022-01-05     True
2022-01-06    False

Freq: D, Name: B, dtype: bool
>>> df.loc[df.B > 1]
                   A         B         C         D
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712
2022-01-02 -2.059045  1.715020  0.607552  1.145638
2022-01-03 -1.398741  1.538408  1.459097  0.491077
2022-01-05  0.288924  1.131819  0.532734  0.893374
  • isin() 筛选
In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [43]: df2
Out[43]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
  • 赋值
>>> df['E'] = [1,2,3,4,4,4]
>>> df
                   A         B         C         D  E
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712  1
2022-01-02 -2.059045  1.715020  0.607552  1.145638  2
2022-01-03 -1.398741  1.538408  1.459097  0.491077  3
2022-01-04  0.642856  0.670310  0.378368 -0.167852  4
2022-01-05  0.288924  1.131819  0.532734  0.893374  4
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592  4

>>> s = pd.Series([6,5,4,3,2,1,], index=pd.date_range('20220103', periods=6))
>>> s
2022-01-03    6
2022-01-04    5
2022-01-05    4
2022-01-06    3
2022-01-07    2
2022-01-08    1
Freq: D, dtype: int64
        
>>> df.F = s
>>> df
                   A         B         C         D  E
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712  1
2022-01-02 -2.059045  1.715020  0.607552  1.145638  2
2022-01-03 -1.398741  1.538408  1.459097  0.491077  3
2022-01-04  0.642856  0.670310  0.378368 -0.167852  4
2022-01-05  0.288924  1.131819  0.532734  0.893374  4
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592  4

>>> df['F'] = s
>>> df
                   A         B         C         D  E    F
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712  1  NaN
2022-01-02 -2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03 -1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368 -0.167852  4  5.0
2022-01-05  0.288924  1.131819  0.532734  0.893374  4  4.0
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592  4  3.0
>>> df.at['20220105', 'A'] = 114514
>>> df
                        A         B         C         D  E    F
2022-01-01      -0.514649  2.198378 -0.986735 -0.368712  1  NaN
2022-01-02      -2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03      -1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04       0.642856  0.670310  0.378368 -0.167852  4  5.0
2022-01-05  114514.000000  1.131819  0.532734  0.893374  4  4.0
2022-01-06      -0.604422 -0.483053  1.557318 -0.158592  4  3.0

>>> df.at['20220105', 'A'] = 114514
>>> df
                        A         B         C         D  E    F
2022-01-01      -0.514649  2.198378 -0.986735 -0.368712  1  NaN
2022-01-02      -2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03      -1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04       0.642856  0.670310  0.378368 -0.167852  4  5.0
2022-01-05  114514.000000  1.131819  0.532734  0.893374  4  4.0
2022-01-06      -0.604422 -0.483053  1.557318 -0.158592  4  3.0
>>> df
                   A         B         C         D  E    F
2022-01-01 -0.514649  2.198378 -0.986735 -0.368712  1  NaN
2022-01-02 -2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03 -1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368 -0.167852  4  5.0
2022-01-05 -0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06 -0.604422 -0.483053  1.557318 -0.158592  4  3.0

>>> df[df < 0]
                   A         B         C         D   E   F
2022-01-01 -0.514649       NaN -0.986735 -0.368712 NaN NaN
2022-01-02 -2.059045       NaN       NaN       NaN NaN NaN
2022-01-03 -1.398741       NaN       NaN       NaN NaN NaN
2022-01-04       NaN       NaN       NaN -0.167852 NaN NaN
2022-01-05 -0.012000       NaN       NaN       NaN NaN NaN
2022-01-06 -0.604422 -0.483053       NaN -0.158592 NaN NaN

>>> df[df < 0] = -df
>>> df
                   A         B         C         D  E    F
2022-01-01  0.514649  2.198378  0.986735  0.368712  1  NaN
2022-01-02  2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

缺失值

  • 删除带有缺省值的行
>>> df1 = df.copy()
>>> df1
                   A         B         C         D  E    F
2022-01-01  0.514649  2.198378  0.986735  0.368712  1  NaN
2022-01-02  2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> df1.dropna(how='any')
                   A         B         C         D  E    F
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> df1
                   A         B         C         D  E    F
2022-01-01  0.514649  2.198378  0.986735  0.368712  1  NaN
2022-01-02  2.059045  1.715020  0.607552  1.145638  2  NaN
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0
  • 填充缺省值
>>> df1.fillna(value=5)
                   A         B         C         D  E    F
2022-01-01  0.514649  2.198378  0.986735  0.368712  1  5.0
2022-01-02  2.059045  1.715020  0.607552  1.145638  2  5.0
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0
  • 获取是否为 Nan
>>> pd.isna(df1)
                A      B      C      D      E      F
2022-01-01  False  False  False  False  False   True
2022-01-02  False  False  False  False  False   True
2022-01-03  False  False  False  False  False  False
2022-01-04  False  False  False  False  False  False
2022-01-05  False  False  False  False  False  False
2022-01-06  False  False  False  False  False  False

运算

Pandas 默认用 np.nan 表示缺失数据。下列计算时,默认不包含缺失值。

  • 求平均值
>>> df = df1.dropna(how='any')
>>> df
                   A         B         C         D  E    F
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> df.mean()
A    0.664505
B    0.955897
C    0.981879
D    0.427724
E    3.750000
F    4.500000
dtype: float64
    
>>> df.mean(1)
2022-01-03    2.314554
2022-01-04    1.809898
2022-01-05    1.761655
2022-01-06    1.633897
Freq: D, dtype: float64
  • 作差
>>> df
                   A         B         C         D  E    F
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> s
2022-01-03    6
2022-01-04    5
2022-01-05    4
2022-01-06    3
2022-01-07    2
2022-01-08    1
Freq: D, dtype: int64

        >>> df.sub(s)
            2022-01-03 00:00:00  2022-01-04 00:00:00  ...   E   F
2022-01-03                  NaN                  NaN  ... NaN NaN
2022-01-04                  NaN                  NaN  ... NaN NaN
2022-01-05                  NaN                  NaN  ... NaN NaN
2022-01-06                  NaN                  NaN  ... NaN NaN

[4 rows x 12 columns]

>>> df.sub(s, axis='index')
                   A         B         C         D    E    F
2022-01-03 -4.601259 -4.461592 -4.540903 -5.508923 -3.0  0.0
2022-01-04 -4.357144 -4.329690 -4.621632 -4.832148 -1.0  0.0
2022-01-05 -3.988000 -2.868181 -3.467266 -3.106626  0.0  0.0
2022-01-06 -2.395578 -2.516947 -1.442682 -2.841408  1.0  0.0
2022-01-07       NaN       NaN       NaN       NaN  NaN  NaN
2022-01-08       NaN       NaN       NaN       NaN  NaN  NaN
  • Apply 函数
>>> df
                   A         B         C         D  E    F
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> df.apply(np.cumsum)
                   A         B         C         D   E     F
2022-01-03  1.398741  1.538408  1.459097  0.491077   3   6.0
2022-01-04  2.041597  2.208717  1.837465  0.658930   7  11.0
2022-01-05  2.053597  3.340537  2.370199  1.552304  11  15.0
2022-01-06  2.658018  3.823590  3.927517  1.710896  15  18.0

>>> df.apply(lambda x: x.max() - x.min())
A    1.386741
B    1.055355
C    1.178950
D    0.734782
E    1.000000
F    3.000000
dtype: float64
  • Series 的频数统计
>>> df
                   A         B         C         D  E    F
2022-01-03  1.398741  1.538408  1.459097  0.491077  3  6.0
2022-01-04  0.642856  0.670310  0.378368  0.167852  4  5.0
2022-01-05  0.012000  1.131819  0.532734  0.893374  4  4.0
2022-01-06  0.604422  0.483053  1.557318  0.158592  4  3.0

>>> df.E
2022-01-03    3
2022-01-04    4
2022-01-05    4
2022-01-06    4
Freq: D, Name: E, dtype: int64
            
>>> df.E.value_counts()
4    3
3    1
Name: E, dtype: int64

合并 Merge

Concat

>>> df = pd.DataFrame(np.random.randn(10, 4))
>>> df1 = df[:3]
>>> df2 = df[7:]

>>> df
          0         1         2         3
0 -0.453289  0.892104  1.891981  0.485892
1 -2.703414 -0.685189 -0.903685 -0.137809
2  0.500884  0.831745 -0.757333  1.997601
3  0.939272  1.907135 -0.528942 -0.724512
4  0.796412  0.958696  1.352733  1.805198
5 -1.141725  1.591892 -0.171487 -0.289578
6  0.155887 -0.646798  1.151169  1.051582
7 -0.917709 -0.156452  0.578088  0.639791
8  0.570383 -0.513202  0.891358 -0.567285
9  1.858860  1.628878 -1.269917 -0.396636
>>> pd.concat([df1, df2])
          0         1         2         3
0 -0.453289  0.892104  1.891981  0.485892
1 -2.703414 -0.685189 -0.903685 -0.137809
2  0.500884  0.831745 -0.757333  1.997601
7 -0.917709 -0.156452  0.578088  0.639791
8  0.570383 -0.513202  0.891358 -0.567285
9  1.858860  1.628878 -1.269917 -0.396636

Join

SQL 风格的合并。

>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

>>> left
   key  lval
0  foo     1
1  foo     2

>>> right
   key  rval
0  foo     4
1  foo     5

>>> pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

追加 Append

>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> s = df.iloc[5]

>>> df
          A         B         C         D
0 -0.019536  1.644399  1.250679  0.140385
1  0.770806  1.496631  0.216141  0.265806
2  1.166252  0.415610 -1.166003  1.359695
3 -0.360562  0.463504  0.507450 -0.651677
4 -0.459843 -2.204788  1.381087 -0.988501
5  0.505702 -0.213073 -1.264545  1.884786
6 -0.955691  0.130868 -0.722569 -0.514278
7  0.768887  0.195819 -0.997779 -0.707758

>>> s
A    0.505702
B   -0.213073
C   -1.264545
D    1.884786
Name: 5, dtype: float64
        
>>> df.append(s, ignore_index=True)
          A         B         C         D
0 -0.019536  1.644399  1.250679  0.140385
1  0.770806  1.496631  0.216141  0.265806
2  1.166252  0.415610 -1.166003  1.359695
3 -0.360562  0.463504  0.507450 -0.651677
4 -0.459843 -2.204788  1.381087 -0.988501
5  0.505702 -0.213073 -1.264545  1.884786
6 -0.955691  0.130868 -0.722569 -0.514278
7  0.768887  0.195819 -0.997779 -0.707758
8  0.505702 -0.213073 -1.264545  1.884786

分组 Grouping

“group by” 指的是涵盖下列一项或多项步骤的处理流程:

  • 分割:按条件把数据分割成多组;
  • 应用:为每组单独应用函数;
  • 组合:将处理结果组合成一个数据结构。
>>> df
          A         B         C         D  E
0 -0.019536  1.644399  1.250679  0.140385  1
1  0.770806  1.496631  0.216141  0.265806  1
2  1.166252  0.415610 -1.166003  1.359695  1
3 -0.360562  0.463504  0.507450 -0.651677  3
4 -0.459843 -2.204788  1.381087 -0.988501  4
5  0.505702 -0.213073 -1.264545  1.884786  3
6 -0.955691  0.130868 -0.722569 -0.514278  2
7  0.768887  0.195819 -0.997779 -0.707758  2

>>> df.groupby('E')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EE0AF69FD0>

>>> df.groupby('E').sum()
          A         B         C         D
E
1  1.917522  3.556640  0.300816  1.765886
2 -0.186804  0.326687 -1.720349 -1.222036
3  0.145139  0.250431 -0.757095  1.233109
4 -0.459843 -2.204788  1.381087 -0.988501

Reference