pandas Tips (continuously updated)

Author: YXN-python  Views: 49  Published: 2023-03-25

Update 1

1. Create a DataFrame from a dictionary

import numpy as np
import pandas as pd

df_dict = {'name': ['Alice_001', 'Bob_002', 'Cindy_003', 'Eric_004', 'Helen_005', 'Grace_006'],
           'sex': ['female', 'male', 'female', 'male', 'female', 'male'],
           'math': [90, 89, 99, 78, 97, 93],
           'english': [95, 94, 80, 94, 94, 90]}
# [1] Pass the dictionary as the first positional argument
df = pd.DataFrame(df_dict)
# [2] Pass the dictionary through the data keyword (same result)
df = pd.DataFrame(data=df_dict)

2. Split a column (split/extract)

Split the name column into separate name and id columns.

Split on a delimiter:

df1 = df.copy()
df1[['name', 'id']] = df1['name'].str.split('_', n=1, expand=True)

Split with a regular expression:

df2 = df.copy()
df2['name2'] = df2['name'].str.extract(r'([A-Z]+[a-z]+)', expand=False)
df2['id2'] = df2['name'].str.extract(r'(\d+)', expand=False)
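
As a minimal self-contained sketch of both approaches (the frame `demo` and its values are made up for illustration), the following splits on the first underscore only and uses named groups so that `str.extract` returns both parts in one call:

```python
import pandas as pd

# Hypothetical sample data in the same "Name_id" shape as above
demo = pd.DataFrame({'name': ['Alice_001', 'Bob_002', 'Cindy_003']})

# n=1 splits on the first underscore only, so exactly two columns come back
parts = demo['name'].str.split('_', n=1, expand=True)
parts.columns = ['name', 'id']

# Named groups become the column names of the extracted frame
extracted = demo['name'].str.extract(r'(?P<name>[A-Za-z]+)_(?P<id>\d+)')

print(parts)
print(extracted)
```

Note that the ids stay strings ('001', not 1), so leading zeros are preserved.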

3. Concatenate columns (cat)

With a custom separator:

df1["name_id"] = df1["name"].str.cat(df1["id"],sep='_'*3)

Join all values of a single column into one string:

df1["name"].str.cat(sep='*'*5)

4. Left/right padding (pad)

Pad on the left:

df1["id"] = df1["id"].str.pad(10,fillchar="*")
# Equivalent to rjust(): side defaults to "left"
df1["id"] = df1["id"].str.rjust(10,fillchar="*")

Pad on the right:

df1["id"] = df1["id"].str.pad(10,side="right",fillchar="*")

Pad on both sides:

df1["id"] = df1["id"].str.pad(10,side="both",fillchar="*")
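
To compare the three sides next to each other, here is a small sketch on a throwaway Series (the name `ids` and the width of 6 are arbitrary choices for illustration). Each call starts from the unpadded values, because a string that is already at least as long as the requested width is returned unchanged:

```python
import pandas as pd

# Hypothetical ids, padded to a total width of 6 for readability
ids = pd.Series(['001', '002'])

left = ids.str.pad(6, side='left', fillchar='*')    # same as str.rjust(6, fillchar='*')
right = ids.str.pad(6, side='right', fillchar='*')  # same as str.ljust(6, fillchar='*')
both = ids.str.pad(6, side='both', fillchar='*')    # same as str.center(6, fillchar='*')

print(left.tolist())   # ['***001', '***002']
print(right.tolist())  # ['001***', '002***']
print(both.tolist())
```

This also means that running the left, right and both examples above one after another on the same column only shows the effect of the first one.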

5. Select columns by dtype (select_dtypes)

Select numeric columns:

df1.select_dtypes(include=['float64', 'int64'])

Select object columns:

df1.select_dtypes(include=['object'])
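
A possible variation, sketched on a made-up frame `demo`: the generic `'number'` selector picks up every numeric column without listing exact dtypes, and `exclude` returns the complement.

```python
import pandas as pd

# Hypothetical frame with one text, one integer and one float column
demo = pd.DataFrame({'name': ['Alice', 'Bob'],
                     'math': [90, 89],
                     'ratio': [0.5, 0.25]})

numeric = demo.select_dtypes(include='number')      # all integer and float columns
non_numeric = demo.select_dtypes(exclude='number')  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```

Because `'number'` covers all integer and float widths, it also catches int32 or float32 columns that an explicit ['float64', 'int64'] list would miss.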

6. Ranking (rank)

Rank by English score:

df1['e_rank'] = df1['english'].rank(method='min',ascending=False)
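
The `method` argument decides how ties are ranked. As a short sketch using the English scores from the first example (three students share 94), `'min'` gives all tied rows the lowest rank and leaves a gap afterwards, while `'dense'` leaves no gap:

```python
import pandas as pd

# English scores from the sample data; 94 appears three times
english = pd.Series([95, 94, 80, 94, 94, 90])

rank_min = english.rank(method='min', ascending=False)      # ties share the lowest rank
rank_dense = english.rank(method='dense', ascending=False)  # next rank follows without a gap

print(rank_min.tolist())    # [1.0, 2.0, 6.0, 2.0, 2.0, 5.0]
print(rank_dense.tolist())  # [1.0, 2.0, 4.0, 2.0, 2.0, 3.0]
```

The ranks come back as floats; append `.astype(int)` if whole numbers are preferred.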

Update 2

Sample data:

df2 = pd.DataFrame({'id':['a','b','a','c'], 'data_1':[3,7,[1,4,5],9], 'data_2':[1,1,1,1]})

1. Expand one row into multiple rows

Expand a list-valued cell so that each element gets its own row (explode):

df2.explode('data_1').reset_index(drop=True)

2. Collapse multiple rows into one

Merge the rows that share the same id into a single row:

# Collect each group's data_1 values into a list (a dict key can appear only once)
df2.groupby('id').agg({'data_1': lambda x: list(x)}).reset_index()
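
If both a list and a joined string are wanted for the same column, named aggregation is one way to get them side by side. The sketch below rebuilds the sample frame, explodes it first, and then collapses it again; the output column names `as_list` and `as_text` are made up for illustration:

```python
import pandas as pd

df2 = pd.DataFrame({'id': ['a', 'b', 'a', 'c'],
                    'data_1': [3, 7, [1, 4, 5], 9],
                    'data_2': [1, 1, 1, 1]})

# One row per list element; scalar cells are kept as they are
exploded = df2.explode('data_1').reset_index(drop=True)

# Named aggregation: each keyword becomes one output column
merged = (exploded.groupby('id')
          .agg(as_list=('data_1', list),
               as_text=('data_1', lambda s: ', '.join(map(str, s))))
          .reset_index())

print(merged)
```

The values are converted with `str` before joining, because `', '.join` only accepts strings.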

3. Cumulative count

Running total of a column within each group (cumsum), similar to a check-in log:

df2['data_cumsum'] = df2.groupby('id')['data_2'].cumsum()
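
When every row simply counts as one event, `cumcount` gives the same numbering without needing a column of ones. A small sketch on a hypothetical check-in log (the frame `log` and its column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical check-in log: one row per check-in
log = pd.DataFrame({'id': ['a', 'b', 'a', 'c', 'a'],
                    'checked_in': [1, 1, 1, 1, 1]})

# Running total of the checked_in column within each id
log['running_total'] = log.groupby('id')['checked_in'].cumsum()

# cumcount numbers the rows of each group from 0, so add 1 for a 1-based count
log['nth_visit'] = log.groupby('id').cumcount() + 1

print(log)
```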

4. Group statistics

Group by id and build a new DataFrame from the result:

df2.groupby('id')['data_2'].count().to_frame('count').reset_index()

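5. Insert a column at a given position (insert)

Insert a new column at position 1, so that it becomes the second column (positions are counted from 0):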

new_col = np.random.randint(1, 10, size=len(df2))  # one value per row of df2
df2.insert(1, 'data_0', new_col)

6. Conditional replacement in a column (where)

Replace the values in a column that are not greater than 5 with 0:

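# data_1 must hold only numbers here (e.g. after explode); the raw sample column still contains a list
df2['data_1'] = df2['data_1'].where(df2['data_1'] > 5 , 0)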
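
`where` keeps the values for which the condition is true and replaces all the others, which is easy to read the wrong way round. A short sketch on a throwaway Series, next to `mask`, which replaces the values for which the condition is true:

```python
import pandas as pd

s = pd.Series([3, 7, 5, 9])

kept = s.where(s > 5, 0)   # keep values > 5, replace the rest (including 5) with 0
masked = s.mask(s < 5, 0)  # replace only values < 5, so 5 itself survives

print(kept.tolist())    # [0, 7, 0, 9]
print(masked.tolist())  # [0, 7, 5, 9]
```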

Update 3

Sample data:

# Column names: '日期' = date, '盈利' = profit
times = pd.date_range('20210101', '20210110')
datas = np.random.randint(1000,5000,10)
df = pd.DataFrame({'日期':times, '盈利':datas})

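1. Percentage change between the current element and the previous one (pct_change)

1.1. On the 盈利 (profit) column, stored as 盈利比 (profit ratio):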

df['盈利比'] = df['盈利'].pct_change()

1.2. Format as a percentage (apply/format)

df = df.fillna(0)
df['盈利比'] = df['盈利比'].apply(lambda x: format(x, '.2%'))
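
To make the calculation concrete, here is a small sketch with fixed numbers instead of random ones (the Series `profit` is made up for illustration). `pct_change` corresponds to dividing each value by the previous one and subtracting 1, and the first row has no predecessor, which is why it is NaN before `fillna(0)`:

```python
import pandas as pd

# Hypothetical daily profits
profit = pd.Series([1000, 1500, 1200])

change = profit.pct_change()           # NaN, 0.5, -0.2
manual = profit / profit.shift(1) - 1  # the same calculation written out

# Fill the leading NaN, then render each value as a percentage string
formatted = change.fillna(0).map(lambda x: format(x, '.2%'))

print(formatted.tolist())  # ['0.00%', '50.00%', '-20.00%']
```

After formatting, the column holds strings, so any further arithmetic should be done on the unformatted values.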

2. Add a table caption (set_caption)

Caption the table "每日盈利表单" (daily profit sheet):

df.style.set_caption("每日盈利表单").format({"盈利": "￥{:.2f}"})

3. Hide the index (hide_index, or hide(axis="index") in pandas 1.4 and later)

df.style.set_caption("每日盈利表单").format({"盈利": "￥{:.2f}"}).hide(axis="index")  # hide_index() was removed in pandas 2.0

4. Background color (background_gradient)

On the 盈利 (profit) column:

df.style.set_caption("每日盈利表单").format({"盈利": "￥{:.2f}"}).hide_index().back

5. Inline style settings (set_properties)

Width and font size:

df.style.set_properties(**{'width': '100px', 'font-size': '14px'})

7. Other style settings (**)

7.1. Set the 盈利比 (profit ratio) column to red:

df.style.set_properties(subset=['盈利比'], **{'color': 'red'})

7.2. Yellow background for the whole table:

df.style.set_properties(**{'background-color': 'yellow'})

7.3. Black background, lawn-green values, white borders:

# hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
df.style.set_properties(**{'background-color': 'black',
                           'color': 'lawngreen',
                           'border-color': 'white'}).hide(axis="index")
