pandas dataframe 多条件过滤

刘超 9天前 ⋅ 772 阅读   编辑

一、多条件过滤

  使用query方法

df_filtered = df.query('a == 4 & b != 2')

  注意:等于过滤,是两个==

  使用==

data[(data['A']==0)&(data['B']==1)] 

  使用loc函数

>>> data.loc[(data['A']==0)&(data['B']==1)] # 提取data数据(多个筛选条件)
   A  B  C  D
a  0  1  2  3

二、范围过滤

// query函数
// query <
rpt.query('60000 < STK_ID < 70000')
//  query in 
rpt.query('STK_ID in (600809,600141,600329)')

// isin函数
// 筛选出dataframe中有某一个或某几个字符串的列:
list=['key1','key2']
df = df[df['one'].isin(list)]
// data[(data['A'].isin([0]))&(data['B'].isin([1]))] # isin函数
// 筛选出dataframe中不含某一个或某几个字符串的列,相当于反选
df = df[~df['one'].isin(list)]

三、有级联关系的过滤,比如20201101有两个advertiser_id(adv1044525491840、adv1049003362112),20201102有一个(adv1049003362112),直接通过not in &实现不了,如下

// 预期结果如下
      advertiser_id       day  id
0  adv1044525491840  20201101   1
1  adv1049003362112  20201101   2

>>> import pandas as pd
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}
>>> advertiserDF = pd.DataFrame(data2)
>>> adDF = patchDF.query("day not in (%s) & advertiser_id not in (%s)"%(advertiserDF['day'].tolist(),advertiserDF['advertiser_id'].tolist()))
>>> adDF
      advertiser_id       day  id
0  adv1044525491840  20201101   1

// 实际返回如上

  () not in ((),())写法,这种写法不支持,如下

>>> import pandas as pd
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}
>>> advertiserDF = pd.DataFrame(data2)
>>> adDF = patchDF.query("(day,advertiser_id) not in ((20201102,'adv1049003362112'))")
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2850, in query
    new_data = self.loc[res]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1478, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1912, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 140, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2987, in xs
    loc = self.index.get_loc(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 159, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 120, in pandas._libs.index.Int64Engine._check_type
KeyError: True

  left join方式

>>> import pandas as pd
>>> import numpy as np
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112'],'id':[1]}
>>> advertiserDF = pd.DataFrame(data2)
>>> mergeDF = pd.merge(patchDF, advertiserDF, how='left', on=['day', 'advertiser_id'],suffixes=('_patch', '_advertiser'))
>>> adDF = mergeDF[np.isnan(mergeDF['id_advertiser'])]
>>> adDF
      advertiser_id       day  id_patch  id_advertiser
0  adv1044525491840  20201101         1            NaN
1  adv1049003362112  20201101         2            NaN

// 求非nan,大数据量可能报错https://itdiandi.net/view/2874
// >>> adDF = mergeDF[~np.isnan(mergeDF['id_advertiser'])]
// >>> adDF
//       advertiser_id       day  id_patch  id_advertiser
// 2  adv1049003362112  20201102         3            1.0
// 使用pd.notna判断
adDF = mergeDF[pd.notna(mergeDF['id_advertiser'])]

四、参考文章
1、https://stackoverflow.com/questions/22591174/pandas-multiple-conditions-while-indexing-data-frame-unexpected-behavior
2、https://stackoverflow.com/questions/15315452/selecting-with-complex-criteria-from-pandas-dataframe


注意:本文归作者所有,未经作者允许,不得转载

全部评论: 0

    我有话说: