Merge Two DataFrames Pandas π€
In the realm of data manipulation, merging two DataFrames is a common task that unlocks new insights and enhances analysis.
In this article, we will look at different ways to merge two DataFrames in Pandas.
- How to Merge Pandas DataFrames
- Applying Optional Parameters
- Merge based on different column names
- Change the merge type
- Conclusion
Table of Contents
How to Merge Pandas DataFrames
Merging Pandas DataFrames is like blending ingredients to create a delicious recipe. Each method and parameter adds its unique flavor to the mix, resulting in a harmonious combination of data.
Let's start with the basics and merge two DataFrames using the merge() method.
1. Using merge()
The merge() method is the most powerful and commonly used method to merge two DataFrames in Pandas. It is a versatile method that can merge DataFrames based on a single or multiple columns.
Let's see how to merge two DataFrames based on a single column.
Example 1: Merge two DataFrames based on a single column
Suppose we have two DataFrames, df1 and df2, and we want to merge them based on the id column. Here is how we can do this:
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['John', 'Mike', 'Sarah']})
# create second dataframe
df2 = pd.DataFrame({'id': [1, 2, 3],
'age': [22, 25, 21]})
# π merge the two dataframes
df3 = pd.merge(df1, df2, on='id')
print(df3)
Output:
id name age 0 1 John 22 1 2 Mike 25 2 3 Sarah 21
Example 2: Merge two DataFrames based on multiple columns
The following example merges two DataFrames based on the id and name columns.
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['John', 'Bob', 'Sarah']})
# create second dataframe
df2 = pd.DataFrame({'id': [1, 2, 3],
'name': ['John', 'Mike', 'Sarah'],
'age': [22, 25, 21]})
# π merge the two dataframes
df3 = pd.merge(df1, df2, on=['id', 'name'])
print(df3)
Output:
id name age 0 1 John 22 1 3 Sarah 21
2. Using join()
The join() method elegantly merges DataFrames based on their indices, providing a seamless integration of data.
LThe following example merges two DataFrames based on their indices.
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'name': ['John', 'Bob', 'Sarah'],
'age': [22, 25, 21]})
# create second dataframe
df2 = pd.DataFrame({'id': [1, 2, 3],
'city': ['London', 'Berlin', 'New York'],
'age': [22, 25, 21]})
# π merge the two dataframes using join()
df3 = df1.join(df2, lsuffix='_caller', rsuffix='_other')
print(df3)
Output:
name age_caller id city age_other 0 John 22 1 London 22 1 Bob 25 2 Berlin 25 2 Sarah 21 3 New York 21
3. Using concat()
The concat() function offers flexibility in combining datasets along specified axes, allowing for versatile data integration.
The following example merges two DataFrames along the columns axis (by default).
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'name': ['John', 'Bob', 'Sarah'],
'age': [22, 25, 21]})
# create second dataframe
df2 = pd.DataFrame({'name': ['Mike', 'Tom', 'Harry'],
'age': [27, 21, 30]})
# π merge the two dataframes using concat()
df3 = pd.concat([df1, df2])
print(df3)
Output:
name age 0 John 22 1 Bob 25 2 Sarah 21 0 Mike 27 1 Tom 21 2 Harry 30
Applying Optional Parameters
To avoide duplicate columns in the merged DataFrame, we can use the suffixes parameter to specify the suffixes to be used for overlapping columns.
To avoide confusion you can pass a list of strings as suffixes, where the first element is appended to the overlapping columns from the left DataFrame, and the second element is appended to the overlapping columns from the right DataFrame.
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['John', 'Bob', 'Sarah']})
# create second dataframe
df2 = pd.DataFrame({'id': [1, 2, 3],
'name': ['Adam', 'Smith', 'Dave'],
'age': [22, 25, 21]})
# π merge the two dataframes with suffixes
df3 = pd.merge(df1, df2, on='id', suffixes=['_caller', '_other'])
print(df3)
Output:
id name_caller name_other age 0 1 John Adam 22 1 2 Bob Smith 25 2 3 Sarah Dave 21
Merge based on different column names
Merging DataFrames based on different column names is a common task in data analysis. For example, you may have a column named id in one DataFrame and user_id in another DataFrame and you want to merge them based on these columns.
import pandas as pd
# create first dataframe
df1 = pd.DataFrame({'id': [1, 2, 3],
'name': ['John', 'Bob', 'Sarah']})
# create second dataframe
df2 = pd.DataFrame({'user_id': [1, 2, 3],
'age': [22, 25, 21]})
# π merge the two dataframes based on 2 different columns
df3 = pd.merge(df1, df2, left_on='id', right_on='user_id')
print(df3)
Output:
id name user_id age 0 1 John 1 22 1 2 Bob 2 25 2 3 Sarah 3 21
Change the merge type
By default, the merge() method performs an inner join on the DataFrames. However, we can change the merge type by using the how parameter.
Allowed value for how parameter are inner, outer, left, and right joins tailored to your needs.
# Change merge type to outer
merged_outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(merged_outer_df)
Conclusion
Merging Pandas DataFrames is an essential skill for any data wrangler. By mastering the various methods, optional parameters, and best practices discussed here, you're well-equipped to blend disparate datasets into a cohesive whole, unlocking valuable insights and unleashing the full potential of your data analysis endeavors.
Happy merging! πβ¨