Pandas Library
Pandas is an open-source Python library that provides powerful and flexible data structures and data analysis tools. It is widely used in data manipulation, data cleaning, data transformation, and data analysis tasks. The name “Pandas” is derived from “Panel Data,” which refers to multi-dimensional structured data.
Key features of Pandas:
- DataFrame: The DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table. It consists of rows and columns, where each column can hold a different data type, so a single DataFrame can store heterogeneous data.
- Series: A Series is a one-dimensional labeled array that can hold data of any type, similar to a column in a DataFrame.
- Data Input/Output: Pandas supports reading and writing data from/to various file formats like CSV, Excel, JSON, SQL databases, and more.
- Data Cleaning and Transformation: Pandas provides functions to handle missing data, remove duplicates, rename columns, and reshape data.
- Data Selection and Filtering: You can use boolean indexing, label-based indexing (`.loc`), and position-based indexing (`.iloc`) to select and filter data in DataFrames.
- Grouping and Aggregation: Pandas allows you to group data based on certain criteria and perform aggregations like sum, mean, max, etc. on the grouped data.
- Merging and Joining: You can merge or join multiple DataFrames based on common columns.
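Several of the features listed above can be sketched in a few lines. The following is a minimal illustration rather than a definitive recipe: the `employees` and `departments` tables and all their values are made up for this sketch, and filling the missing salary with the column mean is just one possible choice.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
employees = pd.DataFrame({
    "name": ["John", "Alice", "Alice", "Bob"],
    "dept_id": [1, 2, 2, 1],
    "salary": [80000.0, 60000.0, 60000.0, None],
})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "department": ["Engineering", "Sales"],
})

# Series: a single column of a DataFrame is a Series
salaries = employees["salary"]

# Cleaning: drop the duplicated row, fill the missing salary, rename a column
deduped = employees.drop_duplicates().reset_index(drop=True)
deduped["salary"] = deduped["salary"].fillna(deduped["salary"].mean())
clean = deduped.rename(columns={"name": "employee"})

# Selection: .loc works with labels and boolean masks, .iloc with integer positions
high_paid = clean.loc[clean["salary"] > 65000, ["employee", "salary"]]
first_row = clean.iloc[0]

# Merging: join the two tables on their common column
merged = clean.merge(departments, on="dept_id", how="left")

print(type(salaries).__name__)
print(clean)
print(high_paid)
print(first_row["employee"])
print(merged)
```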
Let’s see some practical examples to understand how Pandas works with data.
Example 1
Suppose you have a CSV file named `employee_salary.csv` with the following data:
Name,Age,Department,Salary
John,35,Engineering,80000
Alice,28,Sales,60000
Bob,42,Engineering,90000
Eve,31,Marketing,75000
You can create a Python script like this:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('employee_salary.csv')
# Display the DataFrame
print("Employee Salary Data:")
print(df)
# Calculate the average salary
average_salary = df['Salary'].mean()
# Display the average salary
print("\nAverage Salary:", average_salary)
# Filter employees who are older than 30 and work in the Engineering department
filtered_df = df[(df['Age'] > 30) & (df['Department'] == 'Engineering')]
# Display the filtered DataFrame
print("\nEmployees older than 30 in Engineering Department:")
print(filtered_df)
When you run this script, it will output:
Employee Salary Data:
Name Age Department Salary
0 John 35 Engineering 80000
1 Alice 28 Sales 60000
2 Bob 42 Engineering 90000
3 Eve 31 Marketing 75000
Average Salary: 76250.0
Employees older than 30 in Engineering Department:
Name Age Department Salary
0 John 35 Engineering 80000
2 Bob 42 Engineering 90000
In summary, the code reads employee salary data from a CSV file into a Pandas DataFrame, calculates the average salary of all employees, and filters the DataFrame to include only employees older than 30 who work in the Engineering department. This demonstrates how Pandas allows you to handle and analyze tabular data with ease, making it a powerful library for data manipulation in Python.
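To try this example without creating a file first, the same CSV text can be read from an in-memory buffer. The sketch below assumes the four rows shown above and uses `io.StringIO` from the standard library. It also writes the filter with `DataFrame.query`, which is one alternative to boolean indexing, and shows that a filtered DataFrame keeps its original row labels unless you reset them.

```python
import io

import pandas as pd

# The same four rows as employee_salary.csv, held in memory for this sketch
csv_text = """Name,Age,Department,Salary
John,35,Engineering,80000
Alice,28,Sales,60000
Bob,42,Engineering,90000
Eve,31,Marketing,75000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Same filter as in the script above, written as a query string
filtered_df = df.query("Age > 30 and Department == 'Engineering'")

# Filtering keeps the original row labels (0 and 2 here);
# reset_index(drop=True) renumbers them from 0
renumbered = filtered_df.reset_index(drop=True)

print(filtered_df.index.tolist())
print(renumbered)
```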
Example 2
import pandas as pd
# Create a dictionary of data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 22, 35, 28],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Seattle']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Display the DataFrame
print(df, end='\n\n')
# Select specific columns
print(df[['Name', 'Age']], end='\n\n')
# Filter data based on a condition
print(df[df['Age'] > 25], end='\n\n')
# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M', 'F']
print(df, end='\n\n')
# Group data and calculate the average age for each gender
grouped_data = df.groupby('Gender')['Age'].mean()
print(grouped_data, end='\n')
When you run this script, it will output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Los Angeles
3 David 35 Chicago
4 Eva 28 Seattle
Name Age
0 Alice 25
1 Bob 30
2 Charlie 22
3 David 35
4 Eva 28
Name Age City
1 Bob 30 San Francisco
3 David 35 Chicago
4 Eva 28 Seattle
Name Age City Gender
0 Alice 25 New York F
1 Bob 30 San Francisco M
2 Charlie 22 Los Angeles M
3 David 35 Chicago M
4 Eva 28 Seattle F
Gender
F 26.5
M 29.0
Name: Age, dtype: float64
The provided code demonstrates some of the fundamental operations you can perform with Pandas DataFrames, such as creating a DataFrame, selecting specific columns, filtering data based on conditions, adding new columns, and performing group-wise calculations. Pandas is a powerful library that simplifies data manipulation and analysis in Python, making it a popular choice for data-related tasks.
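The group-wise calculation is not limited to a single statistic. As a small sketch using the same names, ages, and genders as above, `agg` can compute several aggregations per group in one call:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 22, 35, 28],
    "Gender": ["F", "M", "M", "M", "F"],
})

# Several aggregations of Age per gender in one call;
# the result has one row per gender and one column per statistic
age_stats = df.groupby("Gender")["Age"].agg(["mean", "min", "max", "count"])

print(age_stats)
```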
Example 3
Assume you have a CSV file named `example.csv` with the following data:
Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,22,Los Angeles
You can create a Python script like this:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('example.csv')
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Filter the data to show only people aged 25 or older
filtered_df = df[df['Age'] >= 25]
# Display the filtered DataFrame
print("\nFiltered DataFrame:")
print(filtered_df)
# Calculate the average age
average_age = df['Age'].mean()
# Display the average age
print("\nAverage Age:", average_age)
When you run this script, it will output:
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Los Angeles
Filtered DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
Average Age: 25.666666666666668
In summary, the code reads a CSV file into a Pandas DataFrame, displays the original DataFrame, filters the data to show only people aged 25 or older, and calculates and displays the average age of all individuals in the dataset. Pandas provides an efficient and intuitive way to handle and analyze tabular data, making it a powerful tool for data manipulation tasks in Python.
Example 4
Assume you have a CSV file named `scores.csv` with the following data:
Name,Math,English,Science
Alice,90,85,92
Bob,80,75,88
Charlie,78,92,80
You can create a Python script like this:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('scores.csv')
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Calculate the average score for each student
df['Average'] = df[['Math', 'English', 'Science']].mean(axis=1)
# Display the DataFrame with the average column
print("\nDataFrame with Average:")
print(df)
When you run this script, it will output:
Original DataFrame:
Name Math English Science
0 Alice 90 85 92
1 Bob 80 75 88
2 Charlie 78 92 80
DataFrame with Average:
Name Math English Science Average
0 Alice 90 85 92 89.000000
1 Bob 80 75 88 81.000000
2 Charlie 78 92 80 83.333333
The output will be two tabular representations: one with the original scores and another with the ‘Average’ column appended, showing the calculated average score for each student.
Overall, this code is a basic example of using Pandas to read data from a CSV file, perform some data manipulation (calculating the average score), and display the results in an organized tabular format. Pandas’ ease of use and powerful functionalities make it a popular choice for data analysis tasks in Python.
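The `axis` argument decides the direction of the calculation. The sketch below rebuilds the same scores in memory (so it does not need `scores.csv`) and contrasts `axis=1`, which gives one average per student, with `axis=0`, which gives one average per subject. Sorting by the new column is added here only as an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Math": [90, 80, 78],
    "English": [85, 75, 92],
    "Science": [92, 88, 80],
})
subjects = ["Math", "English", "Science"]

# axis=1 averages across the columns of each row: one value per student
df["Average"] = df[subjects].mean(axis=1)

# axis=0 (the default) averages down each column: one value per subject
subject_means = df[subjects].mean(axis=0)

# Order the students by their average, highest first
ranked = df.sort_values("Average", ascending=False)

print(subject_means)
print(ranked[["Name", "Average"]])
```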
Example 5
Suppose you have data on the monthly sales of products in a retail store, and you want to analyze the data using Pandas. Assume you have a CSV file named `sales_data.csv` with the following data:
Product,Month,Sales
A,Jan,100
B,Jan,150
A,Feb,120
B,Feb,180
A,Mar,90
B,Mar,160
You can create a Python script like this:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')
# Display the DataFrame
print("Sales Data:")
print(df)
# Calculate the total sales for each product
total_sales = df.groupby('Product')['Sales'].sum().reset_index()
# Display the total sales
print("\nTotal Sales:")
print(total_sales)
# Calculate the average sales per month
average_sales_per_month = df.groupby('Month')['Sales'].mean().reset_index()
# Display the average sales per month
print("\nAverage Sales per Month:")
print(average_sales_per_month)
When you run this script, it will output:
Sales Data:
Product Month Sales
0 A Jan 100
1 B Jan 150
2 A Feb 120
3 B Feb 180
4 A Mar 90
5 B Mar 160
Total Sales:
Product Sales
0 A 310
1 B 490
Average Sales per Month:
Month Sales
0 Feb 150.0
1 Jan 125.0
2 Mar 125.0
Overall, the code demonstrates how Pandas simplifies the process of reading data from a CSV file, performing grouping and aggregation operations, and displaying the results. It enables easy data manipulation and analysis tasks, making it a powerful tool for data handling in Python.
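Notice that the months in the output appear as Feb, Jan, Mar: `groupby` sorts the group keys, and month names are plain strings, so they are sorted alphabetically. One possible way to keep the order in which the months first appear is `sort=False`. This sketch reads the same rows from an in-memory string and assumes the rows are already in calendar order, as they are here.

```python
import io

import pandas as pd

# The same rows as sales_data.csv, held in memory for this sketch
csv_text = """Product,Month,Sales
A,Jan,100
B,Jan,150
A,Feb,120
B,Feb,180
A,Mar,90
B,Mar,160
"""

df = pd.read_csv(io.StringIO(csv_text))

# sort=False keeps the groups in order of first appearance
# instead of sorting the month names alphabetically
average_sales_per_month = (
    df.groupby("Month", sort=False)["Sales"].mean().reset_index()
)

print(average_sales_per_month)
```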
Example 6
Suppose we have a CSV file named `sales_data.csv` with the following data:
Product,Month,Sales
A,Jan,100
B,Jan,150
A,Feb,120
B,Feb,180
A,Mar,90
B,Mar,160
You can create a Python script like this:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')
# Display the DataFrame
print("Sales Data:")
print(df)
# Calculate the total sales for each product
total_sales = df.groupby('Product')['Sales'].sum().reset_index()
# Display the total sales
print("\nTotal Sales:")
print(total_sales)
# Calculate the average sales for each month
average_sales_per_month = df.groupby('Month')['Sales'].mean().reset_index()
# Display the average sales per month
print("\nAverage Sales per Month:")
print(average_sales_per_month)
# Calculate the maximum sales value
max_sales = df['Sales'].max()
print("\nMaximum Sales Value:", max_sales)
# Calculate the minimum sales value
min_sales = df['Sales'].min()
print("\nMinimum Sales Value:", min_sales)
# Calculate the total number of sales
total_sales_count = df['Sales'].count()
print("\nTotal Number of Sales:", total_sales_count)
# Calculate the total revenue
total_revenue = df['Sales'].sum()
print("\nTotal Revenue:", total_revenue)
# Calculate each row's share of the total sales as a percentage
df['Sales Percentage'] = (df['Sales'] / df['Sales'].sum()) * 100
print("\nSales Data with Percentage:")
print(df)
When you run this script, it will output:
Sales Data:
Product Month Sales
0 A Jan 100
1 B Jan 150
2 A Feb 120
3 B Feb 180
4 A Mar 90
5 B Mar 160
Total Sales:
Product Sales
0 A 310
1 B 490
Average Sales per Month:
Month Sales
0 Feb 150.0
1 Jan 125.0
2 Mar 125.0
Maximum Sales Value: 180
Minimum Sales Value: 90
Total Number of Sales: 6
Total Revenue: 800
Sales Data with Percentage:
Product Month Sales Sales Percentage
0 A Jan 100 12.50
1 B Jan 150 18.75
2 A Feb 120 15.00
3 B Feb 180 22.50
4 A Mar 90 11.25
5 B Mar 160 20.00
In this example, we used DataFrame computations to:
- Calculate the total sales for each product and the average sales for each month.
- Find the maximum and minimum sales values in the dataset.
- Calculate the total number of sales and the total revenue.
- Compute each row's percentage share of the total revenue (one row is one product in one month).
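The ‘Sales Percentage’ column in the script is calculated row by row, so each product appears three times, once per month. If a single share per product is wanted instead, one possible sketch is to total the sales per product first and then divide by the overall total. The data is the same as above, rebuilt in memory.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "A", "B", "A", "B"],
    "Month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "Sales": [100, 150, 120, 180, 90, 160],
})

# Total sales per product, then each product's share of the overall total
product_totals = df.groupby("Product")["Sales"].sum()
product_share = product_totals / product_totals.sum() * 100

print(product_share.round(2))
```

With this data, product A accounts for 310 of 800 and product B for 490 of 800.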
Correlation in Pandas
In Pandas, correlation refers to the statistical relationship between two variables, which indicates how they change together. The correlation coefficient is a measure of the strength and direction of the linear relationship between two numerical variables. It quantifies the degree to which the variables move together and ranges from -1 to 1.
Pandas provides the `corr()` method to compute the correlation between columns in a DataFrame. Its `method` parameter selects the coefficient, for example the Pearson correlation coefficient (the default) or the Spearman rank correlation coefficient.
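The difference between the two coefficients shows up when a relationship is consistent but not linear. The sketch below uses a made-up two-column DataFrame (the names `x` and `y` are hypothetical) in which `y` always grows with `x`, but along a curve: the Spearman coefficient, which works on ranks, is 1, while the Pearson coefficient stays below 1.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print("Pearson: ", pearson.loc["x", "y"])
print("Spearman:", spearman.loc["x", "y"])
```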
Here’s a practical example to demonstrate how to calculate the correlation in Pandas:
import pandas as pd
# Sample data
data = {
'Age': [25, 30, 22, 35, 28],
'Income': [50000, 60000, 45000, 80000, 55000],
'Sales': [1000, 1200, 800, 1500, 1100]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print("Data:")
print(df)
# Calculate the correlation matrix using Pearson method
correlation_matrix = df.corr(method='pearson')
# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)
When you run this script, it will output:
Data:
Age Income Sales
0 25 50000 1000
1 30 60000 1200
2 22 45000 800
3 35 80000 1500
4 28 55000 1100
Correlation Matrix:
Age Income Sales
Age 1.000000 0.972073 0.995153
Income 0.972073 1.000000 0.979471
Sales 0.995153 0.979471 1.000000
In this example, we created a DataFrame with columns ‘Age’, ‘Income’, and ‘Sales’, representing the age of individuals, their income, and their sales. We then used the `corr()` method with the `method='pearson'` parameter to calculate the correlation matrix using the Pearson correlation coefficient. The resulting correlation matrix displays the pairwise correlations between the columns.
A correlation coefficient of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship between the variables. In the output, ‘Age’ and ‘Sales’ have a strong positive correlation of approximately 0.995, indicating that they tend to increase together. Similarly, ‘Age’ and ‘Income’ have a strong positive correlation of approximately 0.972, and ‘Income’ and ‘Sales’ have one of approximately 0.979.
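A single pair of columns can also be checked directly. The sketch below rebuilds the same sample data, uses `Series.corr` for ‘Age’ and ‘Sales’, and recomputes the value from the definition of the Pearson coefficient (the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations). Both results should agree with the ‘Age’/‘Sales’ entry of the matrix above.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 22, 35, 28],
    "Income": [50000, 60000, 45000, 80000, 55000],
    "Sales": [1000, 1200, 800, 1500, 1100],
})

# Pearson correlation between two columns (Pearson is the default method)
age_sales = df["Age"].corr(df["Sales"])

# The same coefficient computed from its definition
age_dev = df["Age"] - df["Age"].mean()
sales_dev = df["Sales"] - df["Sales"].mean()
manual = (age_dev * sales_dev).sum() / (
    (age_dev ** 2).sum() * (sales_dev ** 2).sum()
) ** 0.5

print("Series.corr:", round(age_sales, 6))
print("By hand:    ", round(manual, 6))
```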
To explore more libraries, you can refer to our other blog posts on Python libraries.