Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Pandas drop duplicates based on multiple columns but keeping one column unique

I have a DataFrame that looks like this:

colA colB colC specialCol
 1     2   3      old
 1     2   3      old
 1     1   2      old
 1     2   3      new
 2     3   4      new

I want to drop duplicates based on colA, colB and colC, but without ignoring specialCol: it should end up unique rather than duplicated. My intended output is:

colA colB colC specialCol
 1     2   3      old
 1     1   2      old
 2     3   4      new

Only one 1 2 3 row should be removed between old and new, and it should not compare two olds or two news with each other. How can I achieve this in pandas?

EDIT:

Maybe the following would also work for me. I have two DataFrames that I am merging, and the merge should remove duplicates. However, for each duplicate row in frame 1 it should remove only one matching row in frame 2, like below:

df1:

colA colB colC
1    2    3
1    5    4
1    1    2

df2:

colA colB colC
1    2    3
1    2    3
1    1    2
1    1    2

Result of df1 & df2 merge:

colA colB colC
1    2    3
1    5    4
1    1    2

Note that only one row each of 1 2 3 and 1 1 2 is removed, because only one copy of each is available in df1. If there were two copies in df1, both copies of 1 2 3 would be removed.
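One possible reading of the cancellation rule described in this EDIT can be sketched as follows. This is only an assumption about the intent: each row in df1 cancels at most one identical row in df2. The helper column name copy and the variable names left, right, marked and df2_remaining are made up for illustration. The sketch covers only the cancellation step, not how the leftover rows are combined into the final result.

```python
import pandas as pd

cols = ["colA", "colB", "colC"]
df1 = pd.DataFrame([[1, 2, 3], [1, 5, 4], [1, 1, 2]], columns=cols)
df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 2], [1, 1, 2]], columns=cols)

# Number the copies of each distinct row within its own frame:
# 0 for the first copy, 1 for the second, and so on.
# "copy" is a hypothetical helper column, not part of the original data.
left = df2.assign(copy=df2.groupby(cols).cumcount())
right = df1.assign(copy=df1.groupby(cols).cumcount())

# A df2 row is cancelled only when df1 holds a row with the same values
# AND the same copy number, so one df1 row cancels exactly one df2 row.
marked = left.merge(right, on=cols + ["copy"], how="left", indicator=True)

# Rows flagged "left_only" had no partner in df1 and therefore survive.
df2_remaining = marked.loc[marked["_merge"] == "left_only", cols].reset_index(drop=True)

print(df2_remaining)
```

With the frames above, one copy each of 1 2 3 and 1 1 2 is left in df2_remaining, because df1 holds only one copy of each.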

1 Answer

You can call drop_duplicates on just the key columns to get the index labels of the rows that are kept, and then select those rows from the full DataFrame using loc:

import pandas as pd

# Sample data from the question, with the columns renamed Col1..Col4
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1, 2],
    'Col2': [2, 2, 1, 2, 3],
    'Col3': [3, 3, 2, 3, 4],
    'Col4': ['old', 'old', 'old', 'new', 'new'],
})

>>> df
   Col1  Col2  Col3 Col4
0     1     2     3  old
1     1     2     3  old
2     1     1     2  old
3     1     2     3  new
4     2     3     4  new

# Drop duplicates on the key columns, keeping the first occurrence of each

>>> idx_map = df[['Col1','Col2','Col3']].drop_duplicates().index

>>> df.loc[idx_map]
   Col1  Col2  Col3 Col4
0     1     2     3  old
2     1     1     2  old
4     2     3     4  new
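The same result can probably be obtained in one step with the subset parameter of drop_duplicates, which restricts the comparison to the listed columns while keeping all columns in the output. The following is a minimal sketch on the sample data above. The names key_cols, deduped and via_index are illustrative only, and keep="first" is spelled out even though it is the default.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 1, 1, 1, 2],
    "Col2": [2, 2, 1, 2, 3],
    "Col3": [3, 3, 2, 3, 4],
    "Col4": ["old", "old", "old", "new", "new"],
})

# Columns that decide whether two rows count as duplicates.
key_cols = ["Col1", "Col2", "Col3"]

# Compare only key_cols, but return full rows; the first occurrence wins.
deduped = df.drop_duplicates(subset=key_cols, keep="first")

# The index-based approach from the answer, for comparison.
via_index = df.loc[df[key_cols].drop_duplicates().index]

print(deduped)
print(deduped.equals(via_index))
```

Because Col4 is not part of the comparison, the value that survives for each key is whatever the first matching row holds, so row order matters if a particular specialCol value should be preferred.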
