Pandas Select Rows by Condition
Pandas is most commonly used Python library for data analysis. It gives us options to select rows based on conditions, which is a very common operation in data analysis.
In this article, we will look at various ways to select rows based on conditions in a Pandas DataFrame.
Our Test Data
Let's create a DataFrame to work with.
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City 0 Alice 25 NY 1 Bob 30 LA 2 Charlie 35 SF 3 David 40 NY 4 Emma 45 LA
Select Rows based on Condition
Let's say we want to select all the rows where the Age
is greater than 30.
We can do this by passing the condition inside the square brackets []
as shown below.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# Selecting rows based on condition
df = df[df['Age'] > 30]
print(df)
Output:
Name Age City 2 Charlie 35 SF 3 David 40 NY 4 Emma 45 LA
Here we are passing the condition df['Age'] > 30
inside the square brackets []
which will return a boolean series with True
for all the rows where the Age
is greater than 30 and False
for all the rows where the Age
is less than or equal to 30.
Now, when we pass this boolean series inside the square brackets []
of the DataFrame, it will return all the rows where the boolean series has True
value.
So, the above code will return all the rows where the Age
is greater than 30.
Select Rows based on Multiple Conditions
You can apply multiple conditions using logical operators like &
(and), |
(or), ~
(not) etc.
Let's say we want to select all the rows where the Age
is greater than 30 and the City
is NY
.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# Selecting rows based on multiple conditions
df = df[(df['Age'] > 30) & (df['City'] == 'NY')]
print(df)
Output:
Name Age City 3 David 40 NY
Negate a Condition
Let's say we want to select all the rows where the Age
is not greater than 30.
For this, we can use the ~
(not) operator.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# Selecting rows based on negation of condition
df = df[~(df['Age'] > 30)]
print(df)
Output:
Name Age City 0 Alice 25 NY 1 Bob 30 LA
Select Rows based on Condition in List
Let's say we want to select all the rows where the City
is either NY
or SF
.
We can do this by passing the condition inside the square brackets []
as shown below.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# Selecting rows based on condition in list
df = df[df['City'].isin(['NY', 'SF'])]
print(df)
Output:
Name Age City 0 Alice 25 NY 2 Charlie 35 SF 3 David 40 NY
Other Methods
Here are few more methods for you to explore for selecting rows based on conditions.
1. Using query()
method
The query()
is an inbuilt method in Pandas which executes an expression string on the DataFrame.
You can pass single or multiple conditions inside it in the form of a string.
For example:
df.query('Age > 30')
will return all the rows where theAge
is greater than 30.df.query('Age > 30 & City == "NY"')
will return all the rows where theAge
is greater than 30 and theCity
isNY
.df.query('Age > 30 & City in ["NY", "SF"]')
will return all the rows where theAge
is greater than 30 and theCity
is eitherNY
orSF
.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# using query() method
df = df.query('Age > 30 & City in ["NY", "SF"]')
print(df)
Output:
Name Age City 2 Charlie 35 SF 3 David 40 NY
2. Using loc[]
method
The loc[]
method is used to access a group of rows and columns by label(s) or a boolean array.
For example:
df.loc[df['Age'] > 30]
will return all the rows where theAge
is greater than 30.df.loc[(df['Age'] > 30) & (df['City'] == 'NY')]
will return all the rows where theAge
is greater than 30 and theCity
isNY
.df.loc[(df['Age'] > 30) & (df['City'].isin(['NY', 'SF']))]
will return all the rows where theAge
is greater than 30 and theCity
is eitherNY
orSF
.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['NY', 'LA', 'SF', 'NY', 'LA']}
df = pd.DataFrame(data)
# using loc[] method
df = df.loc[(df['Age'] > 30) & (df['City'].isin(['NY', 'SF']))]
print(df)
Output:
Name Age City 2 Charlie 35 SF 3 David 40 NY
Conclusion
We have covered various examples and methods to select rows based on conditions in a Pandas DataFrame.
For more information, refer to the official documentation.