Pandas Drop Duplicate Columns
Pandas, the versatile data manipulation library in Python, provides efficient methods for handling duplicate columns in a DataFrame.
Duplicate columns in a DataFrame can arise from various scenarios, such as merging DataFrames or loading data from different sources. Identifying these duplicates is crucial for maintaining data integrity and avoiding redundancy.
In this tutorial, we will learn how to identify and drop duplicate columns, ensuring your DataFrame remains clean and optimized.
Table of Contents
- Dropping Duplicate Columns
- Keeping the First Occurrence
- Keeping the Last Occurrence
- Conclusion
1. Dropping Duplicate Columns
There are multiple ways to remove duplicate columns from a DataFrame. Two of the most efficient methods are discussed below.
1.1 Using Transpose and drop_duplicates()
To drop duplicate columns, we can use the T (transpose) property of the DataFrame along with the drop_duplicates() method. The T property swaps the rows and columns of the DataFrame, and the drop_duplicates() method drops duplicate rows. The idea is to transpose the DataFrame so that the columns become rows, remove the duplicate rows, and then transpose the DataFrame back to its original form.
Let's illustrate this with an example.
import pandas as pd
# Create a list of lists with duplicate column names
data = [
['Alice', 25, 'NY', 25], # Each list represents a row
['Bob', 30, 'LA', 30],
['Charlie', 35, 'SF', 35]
]
# Provide explicit column names, including duplicates
column_names = ['Name', 'Age', 'City', 'Age'] # Duplicate 'Age' column
# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
print("Original DataFrame:")
print(df)
# Drop the duplicate column: transpose, drop duplicate rows, transpose back
df = df.T.drop_duplicates().T
print("\nDataFrame with duplicate columns dropped:")
print(df)
Output:
Original DataFrame:
      Name  Age City  Age
0    Alice   25   NY   25
1      Bob   30   LA   30
2  Charlie   35   SF   35

DataFrame with duplicate columns dropped:
      Name Age City
0    Alice  25  NY
1      Bob  30  LA
2  Charlie  35  SF
By default, this method retains only the first occurrence of each duplicated column. Note that drop_duplicates() compares column values, not column names, so any two columns with identical contents are treated as duplicates even if their names differ.
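One practical caveat, not visible in the output above: transposing a DataFrame with mixed dtypes upcasts every column to object, so the numeric columns come back as object after the round trip. Below is a minimal sketch of one way to restore them, using the standard infer_objects() method.
import pandas as pd

df = pd.DataFrame(
    [['Alice', 25, 'NY', 25], ['Bob', 30, 'LA', 30]],
    columns=['Name', 'Age', 'City', 'Age']
)

# Transposing a mixed-dtype DataFrame produces object columns,
# so re-infer the dtypes after transposing back
deduped = df.T.drop_duplicates().T.infer_objects()

print(deduped.dtypes)  # 'Age' is inferred back to an integer dtype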
1.2 Using loc
Another way to drop duplicate column names is to use the loc property of the DataFrame. The loc property is used to access a group of rows and columns by label or with a boolean mask. To drop duplicate columns, we can combine loc with df.columns.duplicated(), which marks repeated column names; negating that mask selects all columns except the duplicates, keeping only the first occurrence of each name.
import pandas as pd
# Create a list of lists with duplicate column names
data = [
['Alice', 25, 'NY', 25], # Each list represents a row
['Bob', 30, 'LA', 30],
['Charlie', 35, 'SF', 35]
]
# Provide explicit column names, including duplicates
column_names = ['Name', 'Age', 'City', 'Age'] # Duplicate 'Age' column
# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
print("Original DataFrame:")
print(df)
# Keep only the columns whose names are not flagged as duplicates
df = df.loc[:, ~df.columns.duplicated()]
print("\nDataFrame with duplicate columns dropped:")
print(df)
Output:
Original DataFrame:
      Name  Age City  Age
0    Alice   25   NY   25
1      Bob   30   LA   30
2  Charlie   35   SF   35

DataFrame with duplicate columns dropped:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
2. Keeping the First Occurrence
By default, both of the above methods keep the first occurrence of each duplicate column. However, we can explicitly specify the keep='first' argument to ensure that only the first occurrence of each duplicate column is retained.
# Drop duplicate columns, keeping the first occurrence (transpose method)
df = df.T.drop_duplicates(keep='first').T

# Drop duplicate columns, keeping the first occurrence (loc method)
df = df.loc[:, ~df.columns.duplicated(keep='first')]
3. Keeping the Last Occurrence
Similarly, we can specify the keep='last' argument to keep the last occurrence of each duplicate column.
# Drop duplicate columns, keeping the last occurrence (transpose method)
df = df.T.drop_duplicates(keep='last').T

# Drop duplicate columns, keeping the last occurrence (loc method)
df = df.loc[:, ~df.columns.duplicated(keep='last')]
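To make the difference concrete, here is a short sketch (the values are made up for illustration) using the loc approach, where the two 'Age' columns hold different data, so keep='first' and keep='last' select different columns. Note that the transpose approach would not treat these columns as duplicates at all, since drop_duplicates() compares values rather than names.
import pandas as pd

# Two 'Age' columns with different contents, so the keep argument matters
df = pd.DataFrame(
    [['Alice', 25, 'NY', 26], ['Bob', 30, 'LA', 31]],
    columns=['Name', 'Age', 'City', 'Age']
)

first = df.loc[:, ~df.columns.duplicated(keep='first')]  # keeps the Age column with 25, 30
last = df.loc[:, ~df.columns.duplicated(keep='last')]    # keeps the Age column with 26, 31

print(first)
print(last)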
Conclusion
Handling duplicate columns in a Pandas DataFrame is essential for maintaining data quality and ensuring optimal performance.
Whether dropping duplicates, keeping the first occurrence, or keeping the last occurrence, Pandas provides flexible methods to suit your specific needs.
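As a convenience, the loc-based approach can be wrapped in a small reusable helper. The function below is our own sketch, not a built-in Pandas API:
import pandas as pd

def drop_duplicate_columns(df: pd.DataFrame, keep: str = 'first') -> pd.DataFrame:
    # Drop columns with duplicate names, keeping the first or last occurrence
    return df.loc[:, ~df.columns.duplicated(keep=keep)]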