沧海拾珠

Removing Duplicate Data with Pandas

1. Find duplicate rows in a dataset with df.duplicated()

import pandas as pd
from pandas import DataFrame

# Build a small example frame; rows 1 and 2 are identical, as are rows 3 and 4
df = DataFrame({'A': ['Britain', 'USA', 'USA', 'China', 'China'],
                'B': ['BBC', 'NPR', 'NPR', 'CCTV', 'CCTV'],
                'C': ['good', 'bad', 'bad', 'great', 'great']})
df
         A     B      C
0  Britain   BBC   good
1      USA   NPR    bad
2      USA   NPR    bad
3    China  CCTV  great
4    China  CCTV  great

Find the duplicated rows. The result is a boolean Series in which True marks a row that repeats an earlier row.

df.duplicated()
0    False
1    False
2     True
3    False
4     True
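By default, duplicated() treats the first occurrence as the original and flags only the later copies. The sketch below rebuilds the same example frame (so that it runs on its own) and shows how the keep parameter changes which rows are flagged. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Britain', 'USA', 'USA', 'China', 'China'],
    'B': ['BBC', 'NPR', 'NPR', 'CCTV', 'CCTV'],
    'C': ['good', 'bad', 'bad', 'great', 'great'],
})

# Default keep='first': every occurrence after the first is flagged
first_kept = df.duplicated()

# keep='last': every occurrence before the last is flagged
last_kept = df.duplicated(keep='last')

# keep=False: every member of a duplicate group is flagged
all_marked = df.duplicated(keep=False)

# The boolean mask can be used to select the duplicated rows or to count them
dup_rows = df[all_marked]
n_dupes = int(first_kept.sum())
```

Using keep=False is handy when you want to inspect every row involved in a duplication before deciding which copy to keep.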

2. Drop duplicate rows

df.drop_duplicates()
         A     B      C
0  Britain   BBC   good
1      USA   NPR    bad
3    China  CCTV  great
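Note that drop_duplicates() returns a new DataFrame and keeps the original index labels (0, 1, 3 above). A minimal sketch of two common variations follows, assuming pandas 1.0 or later for the ignore_index parameter. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Britain', 'USA', 'USA', 'China', 'China'],
    'B': ['BBC', 'NPR', 'NPR', 'CCTV', 'CCTV'],
    'C': ['good', 'bad', 'bad', 'great', 'great'],
})

# Default behaviour: keep the first copy, preserve the original index labels
deduped = df.drop_duplicates()

# ignore_index=True renumbers the result 0..n-1
renumbered = df.drop_duplicates(ignore_index=True)

# keep='last' retains the last copy of each duplicate group instead
last_copies = df.drop_duplicates(keep='last')

# df itself is unchanged because inplace defaults to False
```

On older pandas versions, df.drop_duplicates().reset_index(drop=True) gives the same renumbered result.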

3. Drop rows that repeat a value in a given column

df = DataFrame({'A': [1, 3, 3, 4, 4],
                'B': ['BBC', 'NPR', 'NPR', 'CCTV', 'CCTV'],
                'C': ['good', 'bad', 'bad', 'medium', 'great']})
# Compare column C only: rows 1 and 2 both hold 'bad', so row 2 is dropped
df.drop_duplicates(['C'])
   A     B       C
0  1   BBC    good
1  3   NPR     bad
3  4  CCTV  medium
4  4  CCTV   great
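The list passed to drop_duplicates() is its subset parameter, and it may name more than one column. As a rough sketch on the same data, the following compares rows on columns A and B together and combines subset with keep. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 3, 4, 4],
    'B': ['BBC', 'NPR', 'NPR', 'CCTV', 'CCTV'],
    'C': ['good', 'bad', 'bad', 'medium', 'great'],
})

# Same call as above, with the parameter named explicitly
by_c = df.drop_duplicates(subset=['C'])

# Rows count as duplicates when both A and B match; C is ignored
by_ab = df.drop_duplicates(subset=['A', 'B'])

# Keep the last row of each (A, B) group instead of the first
by_ab_last = df.drop_duplicates(subset=['A', 'B'], keep='last')

# keep=False drops every row whose (A, B) pair occurs more than once
unique_only = df.drop_duplicates(subset=['A', 'B'], keep=False)
```

Rows 3 and 4 differ in column C but share the same A and B values, which is why they survive the column C comparison and collapse to one row in the (A, B) comparison.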