Pandas read text file into DataFrame
As a data analyst, you will often need to read data from external sources to perform analysis.
In this tutorial, we will discuss how to read a text file into a DataFrame using pandas, a popular data manipulation library in Python.
We will cover various methods that can be used to read text files, including CSV files, TSV files, and fixed-width files.
By the end of this tutorial, you will be able to read text files and manipulate data using Pandas with ease.
Read text file into DataFrame
The first step in analyzing text data is to read it into a pandas DataFrame.
Pandas provides several functions for reading text files such as CSV, TSV, and fixed-width TXT files. The most common are:
- read_csv()
- read_table()
- read_fwf()
Let's see how to use these functions to read text files into a DataFrame.
1. read_csv()
The read_csv() function is used to read a comma-separated values (CSV) file into a DataFrame.
It is the most commonly used function to read text files into a DataFrame. It can handle different delimiters, headers, and encoding formats.
Let's say we have a CSV file named data.csv with the following content:
id,name,age
1,John,25
2,Smith,30
3,David,28
We can read this file into a DataFrame using the read_csv() function as follows:
import pandas as pd
# read the file into a DataFrame
df = pd.read_csv('data.csv')
print(df)
The output will be:
   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28
As you can see, the read_csv() function parses the file and returns a DataFrame.
By default, it uses the first row of the file as the column names.
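read_csv() expects UTF-8 text by default, and its encoding parameter covers files saved in another encoding. The following is a minimal sketch under stated assumptions: the sample names and the Latin-1 encoding are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data, encoded as Latin-1 instead of UTF-8
raw_bytes = "id,name\n1,José\n2,Zoë\n".encode("latin-1")

# Pass the matching encoding so the accented characters decode correctly
df_latin = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df_latin)
```

With a real file, the same call would take the file path in place of the buffer.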
The read_table() and read_fwf() functions can also read text files into a DataFrame: read_table() assumes a tab delimiter by default, and read_fwf() reads files whose columns have fixed widths.
However, read_csv() remains the most commonly used of the three.
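As a rough illustration of the other two functions, the sketch below reads a tab-separated sample with read_table() and a fixed-width sample with read_fwf(). The sample strings and the column widths are assumptions made for this example, and in-memory buffers stand in for files on disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample mirroring data.csv
tsv_text = "id\tname\tage\n1\tJohn\t25\n2\tSmith\t30\n3\tDavid\t28\n"

# read_table() uses a tab as its default delimiter
df_tsv = pd.read_table(io.StringIO(tsv_text))

# Hypothetical fixed-width sample: columns are 3, 6, and 3 characters wide
fwf_text = (
    "id name  age\n"
    "1  John  25\n"
    "2  Smith 30\n"
    "3  David 28\n"
)

# widths gives the number of characters in each column
df_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[3, 6, 3])

print(df_tsv)
print(df_fwf)
```

Both calls should produce the same three-column DataFrame as the read_csv() example.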
Read text file with custom delimiter
By default, the read_csv() function assumes that the delimiter is a comma (,).
However, if the file uses a different delimiter, we can specify it using the sep parameter.
Let's say we have a file named data.txt with the following content:
id|name|age
1|John|25
2|Smith|30
3|David|28
We can read this file into a DataFrame using the read_csv() function as follows:
import pandas as pd
# read the file into a DataFrame
df = pd.read_csv('data.txt', sep='|')
print(df)
The output will be:
   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28
In the code above, sep='|' tells read_csv() that the file uses a pipe (|) as its delimiter.
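The sep parameter also accepts a regular expression, which helps when columns are separated by a variable number of spaces. The sketch below is an illustration only: the sample text is hypothetical, and an in-memory buffer stands in for a file.

```python
import io
import pandas as pd

# Hypothetical sample where columns are separated by one or more spaces
spaced_text = "id name age\n1 John   25\n2 Smith  30\n3 David  28\n"

# r"\s+" matches any run of whitespace between values
df_spaced = pd.read_csv(io.StringIO(spaced_text), sep=r"\s+")
print(df_spaced)
```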
Read text file with no header
By default, the read_csv() function assumes that the first row of the file contains the column names.
However, if the file does not contain a header row, we can indicate this by passing header=None.
Let's say we have a file named data.txt with the following content:
1,John,25
2,Smith,30
3,David,28
We can read this file into a DataFrame using the read_csv() function as follows:
import pandas as pd
# read the file into a DataFrame
df = pd.read_csv('data.txt', header=None)
print(df)
The output will be:
   0      1   2
0  1   John  25
1  2  Smith  30
2  3  David  28
In the code above, header=None tells read_csv() that the file does not contain a header row.
When there is no header, read_csv() labels the columns with integers: 0, 1, 2, and so on.
However, we can specify the column names using the names parameter.
Let's say we want to assign the column names as id, name, and age.
We can do that as follows:
import pandas as pd
# read the file into a DataFrame
df = pd.read_csv('data.txt', header=None, names=['id', 'name', 'age'])
print(df)
The output will be:
   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28
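The names parameter can also replace a header that already exists in the file. In that case, pass header=0 so that the original header row is discarded instead of being read as data. The replacement column names below are hypothetical, chosen only for this sketch.

```python
import io
import pandas as pd

# Sample with a header row, as in data.csv
csv_with_header = "id,name,age\n1,John,25\n2,Smith,30\n3,David,28\n"

# header=0 marks the first row as the header, and names overrides it
df_renamed = pd.read_csv(
    io.StringIO(csv_with_header),
    header=0,
    names=["user_id", "first_name", "years"],  # hypothetical names
)
print(df_renamed)
```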
Manipulating Data in Pandas DataFrame
Once we have the data in a pandas DataFrame, we can perform various operations on it like selecting, filtering, sorting, grouping, and aggregating the data.
Let's explore some of the most common operations:
- Selecting Columns - We can select one or more columns from a DataFrame using the column name(s). Here is an example:
import pandas as pd
df = pd.read_csv('filename.csv')
# Selecting one column
df['column_name']
# Selecting multiple columns
df[['column_name_1', 'column_name_2']]
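- Filtering Rows - We can filter rows with a boolean condition built from comparison operators like ==, !=, >, <, >=, and <=. To combine conditions, use the element-wise operators & (and), | (or), and ~ (not), with each condition wrapped in parentheses; Python's and, or, and not keywords do not work on a Series. Here is an example:
import pandas as pd
df = pd.read_csv('filename.csv')
# Filtering rows based on a condition
df[df['column_name'] > 10]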
- Sorting Data - We can sort the DataFrame based on one or more columns using the sort_values() method.
import pandas as pd
df = pd.read_csv('filename.csv')
# Sorting by one column
df.sort_values('column_name')
# Sorting by multiple columns
df.sort_values(['column_name_1', 'column_name_2'])
- Grouping Data - We can group the DataFrame by one or more columns using the groupby() method.
import pandas as pd
df = pd.read_csv('filename.csv')
# numeric_only=True skips text columns, which raise an error in pandas 2.0+
# Grouping by one column
df.groupby('column_name').mean(numeric_only=True)
# Grouping by multiple columns
df.groupby(['column_name_1', 'column_name_2']).mean(numeric_only=True)
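To make these four operations concrete, here is a small worked sketch on the sample data from earlier. The team column is a hypothetical addition so that there is something to group by, and an in-memory buffer stands in for a file.

```python
import io
import pandas as pd

# Sample data from this tutorial plus a hypothetical "team" column
csv_text = (
    "id,name,age,team\n"
    "1,John,25,A\n"
    "2,Smith,30,B\n"
    "3,David,28,A\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Selecting: keep only the name and age columns
names_and_ages = df[["name", "age"]]

# Filtering: combine conditions with &, each wrapped in parentheses
team_a_over_24 = df[(df["age"] > 24) & (df["team"] == "A")]

# Sorting: oldest first
by_age_desc = df.sort_values("age", ascending=False)

# Grouping: mean age per team
mean_age_by_team = df.groupby("team")["age"].mean()

print(team_a_over_24)
print(by_age_desc)
print(mean_age_by_team)
```

Selecting the age column before calling mean() keeps the aggregation limited to numeric data.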
Conclusion
In this tutorial, we have learned how to read data from a text file into a pandas DataFrame.
We have also learned how to manipulate the data in a DataFrame.
Use the read_csv() function to read data from a CSV file into a DataFrame.
Happy Learning!😇