Pandas Drop Duplicate Rows
When dealing with a large dataset, you will often encounter duplicate rows, whether from data entry errors or from merging datasets from multiple sources. Identifying these duplicates is crucial for maintaining data integrity and conducting accurate analyses.
In this tutorial, you will learn how to remove these duplicate rows from a DataFrame.
Table of Contents
- Dropping Duplicate Rows
- Keeping the First Occurrence
- Keeping the Last Occurrence
- Conclusion
1. Dropping Duplicate Rows
To drop duplicate rows, we can use the drop_duplicates() method provided by Pandas. This method identifies and removes rows with identical values across all columns.
The following example shows how to drop duplicate rows from a DataFrame.
import pandas as pd
# Creating a DataFrame with duplicate rows
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'Age': [25, 30, 35, 25],
'City': ['NY', 'LA', 'SF', 'NY']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop duplicate rows
df = df.drop_duplicates()
print("\nDataFrame after dropping duplicate rows:")
print(df)
Output:
Original DataFrame:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
3    Alice   25   NY

DataFrame after dropping duplicate rows:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
As you can see, the second occurrence of the Alice row (index 3), which matched row 0 in every column, has been removed.
There are a few parameters you can use to customize the behavior of this method. Learn more about them in the Pandas drop_duplicates() tutorial.
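As a brief illustration of two of those parameters, the sketch below uses subset to compare rows on selected columns only, and ignore_index to renumber the result. The sample data and the variable names (people, by_name, renumbered) are made up for this example and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share a Name but differ in Age and City
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 26, 35],
    'City': ['NY', 'LA', 'Boston', 'SF'],
})

# No row matches another in every column, so the default call removes nothing
all_columns = people.drop_duplicates()

# Compare rows on the Name column only; the second Alice (index 2) is dropped
by_name = people.drop_duplicates(subset=['Name'])

# ignore_index=True relabels the remaining rows 0..n-1
renumbered = people.drop_duplicates(subset=['Name'], ignore_index=True)

print(by_name)
print(renumbered)
```

Without ignore_index, the surviving rows keep their original index labels (0, 1, 3 here), which is usually what you want if you need to trace rows back to the source data.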
2. Keeping the First Occurrence
By default, the drop_duplicates() method keeps the first occurrence of each duplicate row and removes the rest. You can also state this behavior explicitly with the keep='first' parameter, which produces the same result as calling the method with no arguments.
# Keep the first occurrence of duplicate rows
df = df.drop_duplicates(keep='first')
print("DataFrame keeping the first occurrence:")
print(df)
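Note that the first example reassigned df to the deduplicated result, so running the snippet above directly afterwards has nothing left to remove. The following self-contained sketch rebuilds the sample data and checks that keep='first' matches the default; the names default_result and first_result are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the tutorial's sample data so the duplicate row is present again
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY'],
})

default_result = df.drop_duplicates()
first_result = df.drop_duplicates(keep='first')

# Both calls return the same rows with the same index labels
print(first_result.equals(default_result))
print(first_result.index.tolist())
```

The retained index labels are 0, 1 and 2: the Alice row at index 0 is kept and the one at index 3 is dropped.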
3. Keeping the Last Occurrence
Similarly, you can use the keep='last' parameter to keep the last occurrence of each duplicate row and remove the earlier ones.
# Keep the last occurrence of duplicate rows
df = df.drop_duplicates(keep='last')
print("DataFrame keeping the last occurrence:")
print(df)
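To see the effect of keep='last', the duplicate row has to be present in the input. A minimal sketch, assuming the same sample data as the first example rebuilt from scratch (last_result is a name chosen for this illustration):

```python
import pandas as pd

# Rebuild the tutorial's sample data, including the duplicate Alice row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY'],
})

# Keep the last occurrence: the Alice row at index 0 is dropped, index 3 stays
last_result = df.drop_duplicates(keep='last')

print(last_result)
print(last_result.index.tolist())
```

The row values are the same as with keep='first', but the surviving Alice row now carries index label 3 and appears at the end, which matters when row order or the original index is meaningful (for example, when later rows are more recent).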
Conclusion
Handling duplicate rows in a Pandas DataFrame is essential for maintaining data quality and ensuring accurate analyses. Whether dropping duplicates, keeping the first occurrence, or keeping the last occurrence, Pandas provides flexible methods to suit your specific needs.
Apply these techniques to keep your DataFrames clean and efficient in your Python data analysis workflows.