Pandas Dataframe [Complete Guide]


In this article at OpenGenus, we will understand the basic concepts of Pandas Dataframe and its usage with various examples.

Table of contents:

  1. Introduction to Pandas Dataframe
  2. Creating a Pandas Dataframe
  3. Basic operations with Pandas Dataframe
  4. Best Practices and Tips

Introduction to Pandas Dataframe

What is a Pandas Dataframe

The Pandas DataFrame is a two-dimensional data structure with labelled axes (rows and columns). It is widely used in data science, machine learning, scientific computing, and other data-intensive fields.

A DataFrame can be created using various input sources like CSV, Excel, SQL database, or from existing Python data structures like lists and dictionaries. The main components of a DataFrame are data, rows, and columns.

A pandas DataFrame is a two-dimensional, heterogeneous, table-like data structure with rows and columns. It is provided by pandas, a powerful and widely used data manipulation library for the Python programming language. In pandas, data is aligned in a tabular fashion, i.e., in rows and columns.

Each column of a pandas DataFrame can have a different data type (e.g., numeric, string, boolean, etc.). The rows of a pandas DataFrame are identified by an index, which can be numeric or a label that uniquely identifies each row.

Why use a Pandas Dataframe

Pandas Dataframe is a powerful tool for data manipulation and analysis. It provides various functionalities for handling large, complex data sets with ease, including indexing, filtering, aggregation, and pivoting. The following are some of the reasons why Pandas Dataframe is widely used:
1. Flexibility
2. Efficiency
3. Data Cleaning
4. Data Manipulation

Installation and Setup of Pandas Library

To install the Pandas library, you need to have Python installed on your system. Once you have Python installed, you can install Pandas using the following command:

pip install pandas

After installing the Pandas library, you can import it in your Python code using the following command:

import pandas as pd
(This will allow you to use all the functionalities provided by the Pandas library in your Python code.)

Creating a Pandas Dataframe

Creating a Dataframe from scratch

Creating a DataFrame from scratch refers to the process of constructing a DataFrame object without using any existing data sources like CSV, Excel, or SQL databases. Instead, you can create a DataFrame using various Python data structures such as lists, dictionaries, or arrays.

The DataFrame() function in Pandas is commonly used to create a DataFrame. It takes various inputs like lists, dictionaries, or NumPy arrays and converts them into a tabular structure. For example:
import pandas as pd

data = {'Name': ['John', 'Mike', 'Sarah'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

(This creates a DataFrame with three columns: Name, Age, and Salary, using a dictionary as the input)

Another approach is to create a DataFrame by creating separate lists for each column and then combining them into a dictionary. Here's an example:

import pandas as pd
people = ["Tom", "Jerry", "Alice"]
sales = [1000, 1500, 1200]
data = {'People': people, 'Sales': sales}
df = pd.DataFrame(data)

(This creates a DataFrame with two columns: People and Sales, based on the different lists.)
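
The DataFrame() constructor also accepts a list of lists or a NumPy array together with an explicit columns argument. The following is a minimal sketch (the row values and column names are just illustrative):

import pandas as pd
import numpy as np

rows = [['John', 25, 50000], ['Mike', 30, 60000], ['Sarah', 35, 70000]]
df_from_lists = pd.DataFrame(rows, columns=['Name', 'Age', 'Salary'])

arr = np.array([[25, 50000], [30, 60000], [35, 70000]])
df_from_array = pd.DataFrame(arr, columns=['Age', 'Salary'])

(Each inner list or array row becomes one row of the DataFrame, and the columns argument supplies the column labels.)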

Creating a Dataframe from different data sources like CSV, Excel, SQL databases, etc.

Pandas provides convenient methods to read data from various data sources. Here are some examples:

Reading a Dataframe from a file
1.Reading from a CSV file: You can read data from a CSV file and create a DataFrame using the read_csv() function. For instance:

import pandas as pd
df = pd.read_csv('data.csv')

(This reads the data from the 'data.csv' file and creates a DataFrame.)

2.Reading from an Excel file: Pandas allows you to read data from an Excel file and create a DataFrame using the read_excel() function. Here's an example:

import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

(This reads the data from the 'Sheet1' in the 'data.xlsx' file and creates a DataFrame.)

3.Reading from a SQL database: Pandas provides functionality to connect to a SQL database and retrieve data as a DataFrame. You can use the read_sql() function by specifying the SQL query or table name. Here's an example using SQLite:

import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
query = "SELECT * FROM table_name"
df = pd.read_sql(query, conn)
conn.close()

(This retrieves data from the specified table in the SQLite database and creates a DataFrame.)

These are just a few examples of how you can create a DataFrame from different data sources using Pandas. The library provides additional methods to read from various file formats and databases, including JSON, HTML, and more.
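
For instance, JSON files and HTML tables can be read in much the same way. A small sketch, where the file name and URL are placeholders and read_html() additionally needs an HTML parser such as lxml installed:

import pandas as pd

df_json = pd.read_json('data.json')
(Reads a JSON file into a DataFrame.)

tables = pd.read_html('https://example.com/page.html')
(read_html() returns a list of DataFrames, one per HTML table found on the page.)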

Basic operations with Pandas Dataframe

Viewing Dataframes

To view a DataFrame in Pandas, you can use several methods such as head(), tail(), iloc[], and loc[].

1.head(): The head() function allows you to see the first few rows of a DataFrame. By default, it displays the first 5 rows, but you can specify the number of rows to be shown. For example:
df.head()
(Display the first 5 rows of the DataFrame by default.)

2.tail(): The tail() function shows the last few rows of a DataFrame. Similar to head(), it displays the last 5 rows by default, but you can specify the number of rows to be shown. For instance:
df.tail()
(Display the last 5 rows of the DataFrame by default.)

3.iloc[]: The iloc[] indexer is used for integer-location based indexing. It allows you to select rows and columns by their integer positions. You can pass either single integers, lists, or slices to the iloc[] indexer. Here's an example:
df.iloc[2:5, 1:3]
(Select the rows at positions 2 through 4 and the columns at positions 1 and 2; the end of each slice is exclusive.)

4.loc[]: The loc[] indexer is used for label-based indexing. It allows you to select rows and columns by their labels or boolean conditions. You can pass labels or boolean arrays to the loc[] indexer. For example:
df.loc[df['Age'] > 30, ['Name', 'Salary']]
(Select rows where Age is greater than 30 and columns Name and Salary.)
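
Putting these methods together on the small Name/Age/Salary DataFrame created earlier gives a feel for how they behave. This is only a sketch, assuming df is that three-row example:

import pandas as pd

data = {'Name': ['John', 'Mike', 'Sarah'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

print(df.head(2))                                   # first 2 rows
print(df.tail(1))                                   # last row
print(df.iloc[0:2, 0:2])                            # rows at positions 0-1, columns 'Name' and 'Age'
print(df.loc[df['Age'] > 28, ['Name', 'Salary']])   # rows where Age > 28, columns selected by label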

Basic statistical operations

Pandas provides various methods to perform basic statistical operations on a DataFrame. Some commonly used methods include:
1.count(): The count() method returns the number of non-null observations for each column in the DataFrame.
2.sum(): The sum() method calculates the sum of values in each column.
3.min(): The min() method returns the minimum value in each column.
4.max(): The max() method returns the maximum value in each column.
5.mean(): The mean() method calculates the arithmetic mean of values in each column.

Here's an example of using these methods:
df.count()
(Count non-null values in each column.)

df.sum()
(Calculate the sum of values in each column.)

df.min()
(Find the minimum value in each column.)

df.max()
(Find the maximum value in each column.)

df.mean(numeric_only=True)
(Calculate the mean of the values in each numeric column.)

These methods provide useful insights into the statistical properties of the DataFrame.
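
As a small, self-contained sketch, the same example DataFrame can be summarised as follows (mean() is restricted to the numeric columns here, since 'Name' holds strings):

import pandas as pd

data = {'Name': ['John', 'Mike', 'Sarah'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

print(df.count())                    # non-null observations per column
print(df[['Age', 'Salary']].sum())   # sum of each numeric column
print(df[['Age', 'Salary']].min())   # minimum of each numeric column
print(df[['Age', 'Salary']].max())   # maximum of each numeric column
print(df.mean(numeric_only=True))    # mean of the numeric columns only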

Data Cleaning

Data cleaning involves handling null values, removing duplicates, and replacing values within a dataset to ensure its quality and reliability.

1.Handling Null Values: Null values, also known as missing values, can be dealt with in various ways. One approach is to remove records or columns that contain null values, but this may result in data loss. Alternatively, you can fill null values with appropriate replacements. Pandas provides the .fillna() method, which allows you to fill missing values with specific values or statistical measures like the mean or median (see the sketch after this list).
Examples:
df.isnull()
(Checking for null value.)

df.dropna()
(Dropping rows with null values.)

df.fillna(0)
(Replacing null values with 0.)

2.Removing Duplicates: Duplicate data can adversely affect analysis and modeling. Pandas offers the .drop_duplicates() method to identify and remove duplicate rows from a DataFrame. It compares the values in all columns by default, but you can specify columns to consider for duplicate detection.
Examples:
df.duplicated()
(Checking for duplicate rows.)

df.drop_duplicates()
(Dropping duplicate rows.)

3.Replacing Values: Sometimes, you need to replace specific values in a DataFrame. Pandas provides the .replace() method, allowing you to replace values or patterns with new values. This can be useful for standardizing data or correcting erroneous entries.
Examples:
df.replace({'Gender': {'Male': 'M', 'Female': 'F'}})
(Replacing 'Male' with 'M' and 'Female' with 'F'.)
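
As mentioned in point 1, null values can also be filled with a statistical measure rather than a constant. A minimal sketch of mean imputation, with illustrative column names:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 35], 'Salary': [50000, 60000, np.nan]})

df['Age'] = df['Age'].fillna(df['Age'].mean())
(Fills the missing Age value with the mean of the non-null Age values.)

df = df.fillna(df.mean(numeric_only=True))
(Fills any remaining numeric nulls with the mean of their respective columns.)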

Manipulating Dataframes

Manipulating DataFrames involves various operations such as renaming columns, dropping columns, and adding new columns. Pandas provides several methods to perform these tasks:

1.Renaming Columns: To rename columns in a DataFrame, you can use the .rename() method. It allows you to specify a dictionary mapping old column names to new names. For example:
df.rename(columns={'old_name': 'new_name'}, inplace=True)
(This renames the column 'old_name' to 'new_name' in the DataFrame df.)

df.rename(columns={'OldName1': 'NewName1', 'OldName2': 'NewName2'}, inplace=True)
(Renaming multiple columns)

2.Dropping Columns: To remove specific columns from a DataFrame, you can use the .drop() method. It allows you to specify the column(s) to drop. For example:
df.drop(columns=['column1', 'column2'], inplace=True)
(This removes 'column1' and 'column2' from the DataFrame df.)

df.drop(['ColumnName1', 'ColumnName2'], axis=1, inplace=True)
(Dropping multiple columns.)

3.Adding New Columns: You can add a new column by assigning a list or array of values, or an expression built from existing columns, directly to a new column name.
Examples:
df['NewColumn'] = df['Column1'] + df['Column2']
(This adds a new column named 'NewColumn' to the DataFrame df, whose values are the element-wise sum of 'Column1' and 'Column2'.)
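
A new column can also be filled from a plain list (or array), as long as its length matches the number of rows. A small sketch, assuming the three-row DataFrame from earlier and an illustrative column name:

df['Department'] = ['HR', 'IT', 'Finance']
(Adds a 'Department' column with one value per row; the list length must equal the number of rows in df.)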

These operations enable you to modify the structure and content of DataFrames to suit your analysis requirements. These are just a few examples of data cleaning and DataFrame manipulation operations provided by Pandas. The library offers many more functions and methods to handle various data transformation tasks.
Remember, data cleaning and manipulation should be performed with caution and based on the specific characteristics and requirements of your dataset.

Best Practices and Tips

  • Pandas provides methods to optimize memory usage, such as using appropriate data types for columns and dropping unnecessary columns to reduce the memory footprint.
  • Pandas offers a wide range of built-in functions and methods that are optimized for performance. Utilizing these functions can often be faster than writing custom functions.
  • Avoid using the assignment operator for copying DataFrames: when creating a copy of a DataFrame, use the df.copy() method instead of the assignment operator (=) to avoid modifying the original DataFrame unintentionally (see the sketch below).
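
The following sketch illustrates the first and last points, with illustrative column names; astype() with smaller or categorical dtypes is a common way to shrink memory, and copy() prevents accidental mutation of the original:

import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'City': ['NY', 'LA', 'NY']})

# Use smaller or categorical dtypes to reduce the memory footprint.
df['Age'] = df['Age'].astype('int32')
df['City'] = df['City'].astype('category')
print(df.memory_usage(deep=True))

# Copy explicitly; plain assignment would only create another reference to the same data.
df_copy = df.copy()
df_copy['Age'] = 0
print(df['Age'])
(df remains unchanged because df_copy is an independent copy.)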

With this article at OpenGenus, you must have a complete understanding of the Pandas DataFrame.
