Pandas drop_duplicates() Method
Pandas, the powerful data manipulation library in Python, provides a variety of methods to clean and manipulate data efficiently. One such method, drop_duplicates(), allows us to eliminate duplicate rows from a DataFrame.
We will learn about the drop_duplicates() method in detail with all its parameters and examples.
Table of Contents
- drop_duplicates() Method
- Examples
- Conclusion
1. drop_duplicates() Method
The drop_duplicates() method is used to remove duplicate rows from a DataFrame. It takes a few parameters to customize the behavior of the method.
It operates based on the values in one or more columns, providing flexibility in identifying and eliminating duplicates.
1.1 Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Here is the description of the parameters:
Parameter | Description |
---|---|
subset | The column label or list of column labels to consider when identifying duplicate rows. If omitted, all columns are considered. |
keep | Which occurrence of a duplicated row to keep. It can take the following values: 'first' (default) keeps the first occurrence, 'last' keeps the last occurrence, and False drops every occurrence. |
inplace | Whether to modify the original DataFrame or return a new one. It can take the following values: False (default) returns a new DataFrame, and True modifies the original DataFrame and returns None. |
ignore_index | Whether to relabel the index after the duplicate rows are dropped. It can take the following values: False (default) keeps the original index labels, and True relabels the rows 0, 1, ..., n-1. |
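To make the keep options concrete, here is a minimal sketch on a small made-up DataFrame (the Name and Score data are illustrative only, not from a real dataset). It compares keep='first', keep='last', and keep=False by looking at which original index labels survive.

```python
import pandas as pd

# Illustrative data: 'Alice' and 'Bob' each appear twice, 'Charlie' once
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'Score': [1, 2, 3, 4, 5],
})

# Keep the first occurrence of each name (the default behavior)
first = df.drop_duplicates(subset='Name', keep='first')

# Keep the last occurrence of each name
last = df.drop_duplicates(subset='Name', keep='last')

# Drop every row whose name appears more than once
no_dupes = df.drop_duplicates(subset='Name', keep=False)

print(first.index.tolist())     # [0, 1, 4]
print(last.index.tolist())      # [2, 3, 4]
print(no_dupes.index.tolist())  # [4]
```

Note that keep=False is the only option that removes all copies of a duplicated row, so it suits cases where any repetition makes a record suspect.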
1.2 Return Value
It returns a new DataFrame with the duplicate rows removed when inplace=False (the default). When inplace=True, it modifies the original DataFrame and returns None.
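The difference in return value is easy to trip over, so the following short sketch (using a small made-up DataFrame) shows both modes side by side.

```python
import pandas as pd

# Illustrative data with one repeated row
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice']})

# Default (inplace=False): a new DataFrame is returned, df is left untouched
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# inplace=True: df itself is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None; use one style or the other, not both.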
2. Examples
Going through examples will help us understand the method deeply.
Example 1: Removing complete duplicate rows
Calling drop_duplicates() with no arguments removes rows that are duplicated across every column. The first occurrence of each row is kept.
import pandas as pd
# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
})
# Drop rows that are duplicated across all columns
df_unique = df.drop_duplicates()
print("DataFrame after dropping duplicates:")
print(df_unique)
Output:
DataFrame after dropping duplicates:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
Example 2: Removing partial duplicate rows
Suppose some rows share the same value in only a few columns while their other columns differ. In this case, use the subset parameter and pass the label of the column whose values should be unique.
import pandas as pd
# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 44],  # Age differs between the two 'Alice' rows
    'City': ['NY', 'LA', 'SF', 'NY']
})
# Drop duplicate rows based on the 'Name' column only
df_single_column = df.drop_duplicates(subset=['Name'])
print("DataFrame after dropping duplicates:")
print(df_single_column)
Output:
DataFrame after dropping duplicates:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
Example 3: Drop Duplicate Rows based on Multiple Columns
If we want the combination of values in several columns to be unique throughout the DataFrame, we can pass a list of column names to the subset parameter.
For each repeated combination of values in the specified columns, only one row is kept (the first by default, or the last with keep='last') and the others are removed.
import pandas as pd
# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
    'B': ['B1', 'B1', 'B1', 'B1', 'B2', 'B2'],
    'C': ['C1', 'C2', 'C2', 'C2', 'C3', 'C3'],
})
# Drop duplicate rows based on the 'A' and 'B' columns
df_multiple_columns = df.drop_duplicates(subset=['A', 'B'])
print("DataFrame after dropping duplicates based on 'A' and 'B' columns:")
print(df_multiple_columns)
# Drop duplicate rows based on the 'A' and 'C' columns, keeping the last occurrence
df_multiple_columns = df.drop_duplicates(subset=['A', 'C'], keep='last')
print("DataFrame after dropping duplicates based on 'A' and 'C' columns:")
print(df_multiple_columns)
Output:
DataFrame after dropping duplicates based on 'A' and 'B' columns:
    A   B   C
0  A1  B1  C1
2  A2  B1  C2
4  A3  B2  C3
DataFrame after dropping duplicates based on 'A' and 'C' columns:
    A   B   C
0  A1  B1  C1
1  A1  B1  C2
3  A2  B1  C2
5  A3  B2  C3
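The surviving rows keep their original index labels, which leaves gaps after duplicates are dropped. If a contiguous index is preferred, ignore_index=True relabels the result. A minimal sketch, using a two-column DataFrame with the same 'A' and 'B' values as the example:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
    'B': ['B1', 'B1', 'B1', 'B1', 'B2', 'B2'],
})

# Default: the original index labels of the kept rows are preserved
kept = df.drop_duplicates(subset=['A', 'B'])
print(kept.index.tolist())        # [0, 2, 4]

# ignore_index=True: the result is relabeled 0, 1, ..., n-1
renumbered = df.drop_duplicates(subset=['A', 'B'], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Keeping the original labels is useful when you need to trace rows back to the source data; relabeling is tidier when the result is the start of a new pipeline.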
Example 4: Drop Duplicate Columns
Duplicate columns, not just rows, can also be removed with the drop_duplicates() method. To do this, transpose the DataFrame (so that columns become rows), remove the duplicate rows, and then transpose the result back.
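The transpose approach can be sketched as follows. The DataFrame here is made up for illustration, with 'A_copy' holding the same values as 'A'. All columns share one dtype on purpose: transposing a DataFrame with mixed dtypes converts the values to object dtype, so treat this as a sketch for simple cases rather than a general recipe.

```python
import pandas as pd

# Illustrative data: 'A_copy' duplicates the values of 'A'
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'A_copy': [1, 2, 3],
})

# Transpose so columns become rows, drop duplicate rows, transpose back
df_unique_cols = df.T.drop_duplicates().T

print(df_unique_cols.columns.tolist())  # ['A', 'B']
```

Since keep defaults to 'first', the leftmost of each set of identical columns is the one retained.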
Conclusion
Armed with this knowledge, you can confidently tackle duplicate data in your datasets. Whether you need to remove duplicate rows, remove duplicate columns, or keep the last occurrence instead of the first, the drop_duplicates() method is a versatile tool in your data manipulation arsenal.