Pandas read text file into DataFrame

As a data analyst, you will often need to read data from external sources to perform analysis.

In this tutorial, we will discuss how to read a text file into a DataFrame using pandas, a popular data manipulation library in Python.

We will cover various methods that can be used to read text files, including CSV files, TSV files, and fixed-width files.

By the end of this tutorial, you will be able to read text files and manipulate data using Pandas with ease.

Read text file into DataFrame

The first step in analyzing text data is to read it into a pandas DataFrame.

Pandas provides various functions to read data from text files such as CSV, TSV, and plain TXT files. Let's explore the most common ones for delimited and fixed-width text:

read_csv()
read_table()
read_fwf()

Let's see how to use these functions to read text files into a DataFrame.

1. read_csv()

The read_csv() function is used to read a comma-separated values (CSV) file into a DataFrame.

It is the most commonly used function to read text files into a DataFrame. It can handle different delimiters, header layouts, and file encodings.

Let's say we have a CSV file named data.csv with the following content:

id,name,age
1,John,25
2,Smith,30
3,David,28

We can read this file into a DataFrame using the read_csv() function as follows:

import pandas as pd

# read the file into a DataFrame
df = pd.read_csv('data.csv')

print(df)

The output will be:

   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28

As you can see, the read_csv() function automatically reads the file and converts it into a DataFrame.

It also automatically uses the first row of the file as the column names.

Other functions like read_table() and read_fwf() can also read text files into a DataFrame. read_table() works like read_csv() but uses a tab as its default delimiter, and read_fwf() reads files whose columns have fixed widths.

However, read_csv() is the function you will use most often.
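
As a minimal sketch of the other two functions, the example below reads the same sample records from a tab-separated string and from a fixed-width string. The data is made up for illustration, and io.StringIO stands in for a file on disk so the snippet runs without creating any files. The colspecs positions are an assumption that matches this particular layout.

```python
import io

import pandas as pd

# Tab-separated sample data (made up for illustration).
tsv_text = "id\tname\tage\n1\tJohn\t25\n2\tSmith\t30\n3\tDavid\t28\n"

# read_table() uses a tab as its default separator.
df_tsv = pd.read_table(io.StringIO(tsv_text))

# Fixed-width sample data: each column occupies a fixed range of characters.
fwf_text = (
    "id name  age\n"
    " 1 John   25\n"
    " 2 Smith  30\n"
    " 3 David  28\n"
)

# colspecs gives each column as a (start, end) pair of character positions,
# with the end position excluded.
df_fwf = pd.read_fwf(io.StringIO(fwf_text), colspecs=[(0, 2), (3, 8), (9, 12)])

print(df_tsv)
print(df_fwf)
```

With a real file, you would pass its path in place of the io.StringIO object. If colspecs is omitted, read_fwf() tries to infer the column boundaries from the first rows of the data.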

Read text file with custom delimiter

By default, the read_csv() function assumes that the delimiter is a comma (,).

However, if the file uses a different delimiter, we can specify it using the sep parameter.

Let's say we have a file named data.txt with the following content:

id|name|age
1|John|25
2|Smith|30
3|David|28

We can read this file into a DataFrame using the read_csv() function as follows:

import pandas as pd

# read the file into a DataFrame
df = pd.read_csv('data.txt', sep='|')

print(df)

The output will be:

   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28

In the code above, we set the sep parameter to '|' to indicate that the file uses a pipe (|) as its delimiter.
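
The sep parameter also accepts a regular expression, which helps when columns are separated by a varying number of spaces. The sketch below assumes a made-up, space-aligned version of the same data and uses io.StringIO in place of a real file.

```python
import io

import pandas as pd

# Sample data whose columns are separated by one or more spaces
# (made up for illustration).
text = (
    "id   name   age\n"
    "1    John   25\n"
    "2    Smith  30\n"
    "3    David  28\n"
)

# r"\s+" matches any run of whitespace, so the uneven spacing is
# treated as a single delimiter.
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(df)
```

This only works when the values themselves contain no spaces. A name such as "John Smith" would be split into two columns.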

Read text file with no header

By default, the read_csv() function assumes that the first row of the file contains the column names.

However, if the file does not contain a header row, we can indicate this with the header parameter.

Let's say we have a file named data.txt with the following content:

1,John,25
2,Smith,30
3,David,28

We can read this file into a DataFrame using the read_csv() function as follows:

import pandas as pd

# read the file into a DataFrame
df = pd.read_csv('data.txt', header=None)

print(df)

The output will be:

   0      1   2
0  1   John  25
1  2  Smith  30
2  3  David  28

In the code above, we set the header parameter to None to indicate that the file does not contain a header row.

When there is no header, the read_csv() function names the columns 0, 1, 2, and so on.

However, we can specify the column names using the names parameter.

Let's say we want to assign the column names as id, name, and age.

We can do that as follows:

import pandas as pd

# read the file into a DataFrame
df = pd.read_csv('data.txt', header=None, names=['id', 'name', 'age'])

print(df)

The output will be:

   id   name  age
0   1   John   25
1   2  Smith   30
2   3  David   28
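
The names parameter can also replace a header that already exists in the file. In that case, pass header=0 so that the first row is treated as a header and discarded instead of being read as data. The sketch below uses io.StringIO in place of a real file, and the new column names are hypothetical.

```python
import io

import pandas as pd

# Sample data that already has a header row (made up for illustration).
csv_text = "id,name,age\n1,John,25\n2,Smith,30\n3,David,28\n"

# header=0 marks the first row as the header to be replaced;
# names supplies the new (hypothetical) column names.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    names=["user_id", "full_name", "years"],
)

print(df)
```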

Manipulating Data in Pandas DataFrame

Once we have the data in a pandas DataFrame, we can perform various operations on it like selecting, filtering, sorting, grouping, and aggregating the data.

Let's explore some of the most common operations:

Selecting Columns - We can select one or more columns from a DataFrame using the column name(s). Here is an example:

import pandas as pd

df = pd.read_csv('filename.csv')

# Selecting one column
df['column_name']

# Selecting multiple columns
df[['column_name_1', 'column_name_2']]
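
As a concrete sketch using the sample data from earlier in this tutorial (loaded from a string via io.StringIO rather than a file), note that selecting a single column returns a Series, while selecting with a list of names returns a DataFrame:

```python
import io

import pandas as pd

# Sample data from the earlier examples, read from a string.
csv_text = "id,name,age\n1,John,25\n2,Smith,30\n3,David,28\n"
df = pd.read_csv(io.StringIO(csv_text))

# A single column name returns a pandas Series.
names = df["name"]

# A list of column names returns a DataFrame with only those columns.
subset = df[["name", "age"]]

print(type(names).__name__)
print(type(subset).__name__)
print(subset)
```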

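Filtering Rows - We can filter rows based on conditions built with comparison operators like ==, !=, >, <, >=, and <=. To combine conditions, use the element-wise operators & (and), | (or), and ~ (not) rather than Python's and, or, and not keywords, and wrap each condition in parentheses. Here is an example: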
```
import pandas as pd

df = pd.read_csv('filename.csv')

# Filtering rows based on a condition
df[df['column_name'] > 10]
```
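
When a filter needs more than one condition, pandas expects the element-wise operators &, |, and ~, with each condition in its own parentheses. The sketch below applies them to the sample data from earlier, read from a string via io.StringIO. The variable names are only illustrative.

```python
import io

import pandas as pd

# Sample data from the earlier examples, read from a string.
csv_text = "id,name,age\n1,John,25\n2,Smith,30\n3,David,28\n"
df = pd.read_csv(io.StringIO(csv_text))

# AND: both conditions must hold for a row to be kept.
between_25_and_30 = df[(df["age"] > 25) & (df["age"] < 30)]

# OR: a row is kept if either condition holds.
john_or_30_plus = df[(df["name"] == "John") | (df["age"] >= 30)]

# NOT: ~ inverts a condition.
not_john = df[~(df["name"] == "John")]

print(between_25_and_30)
print(john_or_30_plus)
print(not_john)
```

The parentheses are required because & and | bind more tightly than the comparison operators in Python.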

Sorting Data - We can sort the DataFrame based on one or more columns using the sort_values() function.

import pandas as pd

df = pd.read_csv('filename.csv')

# Sorting by one column
df.sort_values('column_name')

# Sorting by multiple columns
df.sort_values(['column_name_1', 'column_name_2'])
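
By default, sort_values() sorts in ascending order and returns a new DataFrame, leaving the original unchanged. The sketch below uses made-up sample data (with a repeated name so the second sort key matters) to show the ascending parameter for a single column and for several columns.

```python
import io

import pandas as pd

# Sample data with a repeated name (made up for illustration).
csv_text = "name,age\nJohn,25\nSmith,30\nDavid,28\nJohn,31\n"
df = pd.read_csv(io.StringIO(csv_text))

# Sort by age from highest to lowest.
by_age_desc = df.sort_values("age", ascending=False)

# Sort by name A to Z, then by age from highest to lowest within each name.
by_name_then_age = df.sort_values(["name", "age"], ascending=[True, False])

print(by_age_desc)
print(by_name_then_age)
```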

Grouping Data - We can group the DataFrame by one or more columns using the groupby() function.

import pandas as pd

df = pd.read_csv('filename.csv')

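# Grouping by one column and averaging the numeric columns
# (numeric_only=True skips text columns, which cannot be averaged)
df.groupby('column_name').mean(numeric_only=True)

# Grouping by multiple columns and averaging the numeric columns
df.groupby(['column_name_1', 'column_name_2']).mean(numeric_only=True)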
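
As a concrete sketch, the example below groups made-up sample data by a hypothetical city column, first averaging a single column and then computing several aggregates at once with agg(). The output column names people and oldest are only illustrative.

```python
import io

import pandas as pd

# Sample data with a hypothetical city column (made up for illustration).
csv_text = "name,city,age\nJohn,Paris,25\nSmith,Paris,30\nDavid,Rome,28\n"
df = pd.read_csv(io.StringIO(csv_text))

# Average age per city; selecting the column first avoids
# averaging the text columns.
mean_age = df.groupby("city")["age"].mean()

# Several aggregates at once: each keyword names an output column
# and maps it to a (source column, aggregation) pair.
summary = df.groupby("city").agg(
    people=("name", "count"),
    oldest=("age", "max"),
)

print(mean_age)
print(summary)
```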

Conclusion

In this tutorial, we have learned how to read data from a text file into a pandas DataFrame.

We have also learned how to manipulate the data in a DataFrame.

Use the read_csv() function to read data from a CSV file into a DataFrame.

Happy Learning!😇