Python Pandas Library

August 4, 2023

Pandas Library

Pandas is an open-source Python library that provides powerful and flexible data structures and data analysis tools. It is widely used for data manipulation, cleaning, transformation, and analysis. The name “Pandas” is derived from “panel data,” an econometrics term for multi-dimensional structured data sets.

Key features of Pandas:

DataFrame: The DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table. It consists of rows and columns, where each column can hold different data types. DataFrames are very versatile and can handle heterogeneous data effectively.

Series: A Series is a one-dimensional labeled array that can hold data of any type, similar to a column in a DataFrame.

Data Input/Output: Pandas supports reading and writing data from/to various file formats like CSV, Excel, JSON, SQL databases, and more.

Data Cleaning and Transformation: Pandas provides functions to handle missing data, remove duplicates, rename columns, and reshape data.

Data Selection and Filtering: You can use boolean indexing, label-based indexing (`.loc`), and position-based indexing (`.iloc`) to select and filter data in DataFrames.

Grouping and Aggregation: Pandas allows you to group data based on certain criteria and perform aggregations like sum, mean, max, etc. on the grouped data.

Merging and Joining: You can merge or join multiple DataFrames based on common columns.
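Some of the features listed above (Series, cleaning, `.loc`/`.iloc`, and merging) can be sketched in a few lines. The `employees` and `departments` tables, their column names, and their values are made up for this illustration and are not part of the examples in this article.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
employees = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Bob'],
    'DeptId': [1, 2, 1, 1],
    'Salary': [80000, None, 90000, 90000],  # one missing value, one duplicate row
})
departments = pd.DataFrame({
    'DeptId': [1, 2],
    'Department': ['Engineering', 'Sales'],
})

# Series: a single column of a DataFrame is a Series
salaries = employees['Salary']
print(type(salaries).__name__)

# Cleaning: drop the duplicated row, then fill the missing salary with 0
cleaned = employees.drop_duplicates().fillna({'Salary': 0})

# Label-based selection: rows with index labels 0 and 1, column 'Name'
print(cleaned.loc[[0, 1], 'Name'])

# Position-based selection: first two rows, first column
print(cleaned.iloc[:2, 0])

# Merging: attach the department name through the shared 'DeptId' column
merged = cleaned.merge(departments, on='DeptId', how='left')
print(merged)
```

Filling a missing salary with 0 is only a placeholder choice here; in real data, dropping the row or using a group average may be more appropriate.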

Let’s see some practical examples to understand how Pandas works with data.

Example 1

Suppose you have a CSV file named `employee_salary.csv` with the following data:

Name,Age,Department,Salary
John,35,Engineering,80000
Alice,28,Sales,60000
Bob,42,Engineering,90000
Eve,31,Marketing,75000

You can create a Python script like this:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('employee_salary.csv')

# Display the DataFrame
print("Employee Salary Data:")
print(df)

# Calculate the average salary
average_salary = df['Salary'].mean()

# Display the average salary
print("\nAverage Salary:", average_salary)

# Filter employees who are older than 30 and work in the Engineering department
filtered_df = df[(df['Age'] > 30) & (df['Department'] == 'Engineering')]

# Display the filtered DataFrame
print("\nEmployees older than 30 in Engineering Department:")
print(filtered_df)

When you run this script, it will output:

Employee Salary Data:
    Name  Age   Department  Salary
0   John   35  Engineering   80000
1  Alice   28        Sales   60000
2    Bob   42  Engineering   90000
3    Eve   31    Marketing   75000

Average Salary: 76250.0

Employees older than 30 in Engineering Department:
   Name  Age   Department  Salary
0  John   35  Engineering   80000
2   Bob   42  Engineering   90000


In summary, the code reads employee salary data from a CSV file into a Pandas DataFrame, calculates the average salary of all employees, and filters the DataFrame to include only employees older than 30 who work in the Engineering department. This demonstrates how Pandas allows you to handle and analyze tabular data with ease, making it a powerful library for data manipulation in Python.
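To try Example 1 without creating a file first, the same rows can be passed to `read_csv` from an in-memory string. This is only a sketch: the `csv_text` and `mask` variables and the `.loc` column selection are additions that do not appear in the original script.

```python
import io

import pandas as pd

# The same rows as employee_salary.csv, held in a string (an assumption for this sketch)
csv_text = """Name,Age,Department,Salary
John,35,Engineering,80000
Alice,28,Sales,60000
Bob,42,Engineering,90000
Eve,31,Marketing,75000
"""

# read_csv accepts any file-like object, so StringIO stands in for the file
df = pd.read_csv(io.StringIO(csv_text))

# Build the boolean mask once, then use .loc to pick rows and columns together
mask = (df['Age'] > 30) & (df['Department'] == 'Engineering')
names_and_salaries = df.loc[mask, ['Name', 'Salary']]

print(names_and_salaries)
print("Average salary of this group:", names_and_salaries['Salary'].mean())
```

Keeping the mask in its own variable makes the condition reusable, for example to compute a statistic for the filtered group only.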

Example 2

You can also build a DataFrame directly from a Python dictionary instead of reading a file:

import pandas as pd

# Create a dictionary of data
data = {
   'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
   'Age': [25, 30, 22, 35, 28],
   'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Seattle']
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df, end='\n\n')

# Select specific columns
print(df[['Name', 'Age']], end='\n\n')

# Filter data based on a condition
print(df[df['Age'] > 25], end='\n\n')

# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M', 'F']
print(df, end='\n\n')

# Group data and calculate the average age for each gender
grouped_data = df.groupby('Gender')['Age'].mean()
print(grouped_data, end='\n')

When you run this script, it will output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   35        Chicago
4      Eva   28        Seattle

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
3    David   35
4      Eva   28

    Name  Age           City
1    Bob   30  San Francisco
3  David   35        Chicago
4    Eva   28        Seattle

      Name  Age           City Gender
0    Alice   25       New York      F
1      Bob   30  San Francisco      M
2  Charlie   22    Los Angeles      M
3    David   35        Chicago      M
4      Eva   28        Seattle      F

Gender
F    26.5
M    29.0
Name: Age, dtype: float64


The provided code demonstrates some of the fundamental operations you can perform with Pandas DataFrames, such as creating a DataFrame, selecting specific columns, filtering data based on conditions, adding new columns, and performing group-wise calculations. Pandas is a powerful library that simplifies data manipulation and analysis in Python, making it a popular choice for data-related tasks.
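The group-wise calculation in Example 2 returns a single statistic. As a possible extension, `agg` can compute several statistics per group in one call. The sketch below rebuilds only the columns it needs from Example 2; the `summary` variable name is an assumption.

```python
import pandas as pd

# Subset of the Example 2 data, including the Gender column added there
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 22, 35, 28],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
})

# One row per gender, one column per aggregation
summary = df.groupby('Gender')['Age'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```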

Example 3

Assume you have a CSV file named `example.csv` with the following data:

Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,22,Los Angeles

You can create a Python script like this:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('example.csv')

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Filter the data to show only people aged 25 or older
filtered_df = df[df['Age'] >= 25]

# Display the filtered DataFrame
print("\nFiltered DataFrame:")
print(filtered_df)

# Calculate the average age
average_age = df['Age'].mean()

# Display the average age
print("\nAverage Age:", average_age)

When you run this script, it will output:

Original DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles

Filtered DataFrame:
    Name  Age           City
0  Alice   25       New York
1    Bob   30  San Francisco

Average Age: 25.666666666666668


In summary, the code reads a CSV file into a Pandas DataFrame, displays the original DataFrame, filters the data to show only people aged 25 or older, and calculates and displays the average age of all individuals in the dataset. Pandas provides an efficient and intuitive way to handle and analyze tabular data, making it a powerful tool for data manipulation tasks in Python.

Example 4

Assume you have a CSV file named `scores.csv` with the following data:

Name,Math,English,Science
Alice,90,85,92
Bob,80,75,88
Charlie,78,92,80

You can create a Python script like this:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('scores.csv')

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Calculate the average score for each student
df['Average'] = df[['Math', 'English', 'Science']].mean(axis=1)

# Display the DataFrame with the average column
print("\nDataFrame with Average:")
print(df)

When you run this script, it will output:

Original DataFrame:
      Name  Math  English  Science
0    Alice    90       85       92
1      Bob    80       75       88
2  Charlie    78       92       80

DataFrame with Average:
      Name  Math  English  Science    Average
0    Alice    90       85       92  89.000000
1      Bob    80       75       88  81.000000
2  Charlie    78       92       80  83.333333


The output will be two tabular representations: one with the original scores and another with the ‘Average’ column appended, showing the calculated average score for each student.

Overall, this code is a basic example of using Pandas to read data from a CSV file, perform some data manipulation (calculating the average score), and display the results in an organized tabular format. Pandas’ ease of use and powerful functionalities make it a popular choice for data analysis tasks in Python.
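The `axis=1` argument in Example 4 is what makes `mean` work across columns, giving one value per student. A short sketch of the contrast with `axis=0`, which averages down the rows and gives one value per subject, is shown below. The data is the same as `scores.csv`, built in memory; the `per_student` and `per_subject` names are assumptions.

```python
import pandas as pd

# Same values as scores.csv, built in memory for this sketch
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [90, 80, 78],
    'English': [85, 75, 92],
    'Science': [92, 88, 80],
})
subjects = ['Math', 'English', 'Science']

# axis=1: average across the subject columns, one value per row (student)
per_student = df[subjects].mean(axis=1)

# axis=0 (the default): average down the rows, one value per column (subject)
per_subject = df[subjects].mean(axis=0)

print(per_student)
print(per_subject)
```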

Example 5

Suppose you have data on the monthly sales of products in a retail store, and you want to analyze the data using Pandas. Assume you have a CSV file named `sales_data.csv` with the following data:

Product,Month,Sales
A,Jan,100
B,Jan,150
A,Feb,120
B,Feb,180
A,Mar,90
B,Mar,160

You can create a Python script like this:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')

# Display the DataFrame
print("Sales Data:")
print(df)

# Calculate the total sales for each product
total_sales = df.groupby('Product')['Sales'].sum().reset_index()

# Display the total sales
print("\nTotal Sales:")
print(total_sales)

# Calculate the average sales per month
average_sales_per_month = df.groupby('Month')['Sales'].mean().reset_index()

# Display the average sales per month
print("\nAverage Sales per Month:")
print(average_sales_per_month)

When you run this script, it will output:

Sales Data:
  Product Month  Sales
0       A   Jan    100
1       B   Jan    150
2       A   Feb    120
3       B   Feb    180
4       A   Mar     90
5       B   Mar    160

Total Sales:
  Product  Sales
0       A    310
1       B    490

Average Sales per Month:
  Month  Sales
0   Feb  150.0
1   Jan  125.0
2   Mar  125.0


Overall, the code demonstrates how Pandas simplifies the process of reading data from a CSV file, performing grouping and aggregation operations, and displaying the results. It enables easy data manipulation and analysis tasks, making it a powerful tool for data handling in Python.
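Note that the months in the "Average Sales per Month" output appear as Feb, Jan, Mar: `groupby` sorts string keys alphabetically by default. One way to get calendar order, sketched here under the assumption that only these three months occur, is to convert the column to an ordered categorical before grouping.

```python
import pandas as pd

# Same values as sales_data.csv, built in memory for this sketch
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [100, 150, 120, 180, 90, 160],
})

# Declare the calendar order explicitly; groupby then sorts by this order
df['Month'] = pd.Categorical(
    df['Month'], categories=['Jan', 'Feb', 'Mar'], ordered=True
)

# observed=True keeps only the categories that actually occur in the data
average_sales_per_month = (
    df.groupby('Month', observed=True)['Sales'].mean().reset_index()
)
print(average_sales_per_month)
```

Passing `sort=False` to `groupby` is a lighter alternative when the rows already arrive in calendar order, since it keeps groups in order of first appearance.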

Example 6

Suppose we have a CSV file named `sales_data.csv` with the following data (the same data as in Example 5):

Product,Month,Sales
A,Jan,100
B,Jan,150
A,Feb,120
B,Feb,180
A,Mar,90
B,Mar,160

You can create a Python script like this:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('sales_data.csv')

# Display the DataFrame
print("Sales Data:")
print(df)

# Calculate the total sales for each product
total_sales = df.groupby('Product')['Sales'].sum().reset_index()

# Display the total sales
print("\nTotal Sales:")
print(total_sales)

# Calculate the average sales for each month
average_sales_per_month = df.groupby('Month')['Sales'].mean().reset_index()

# Display the average sales per month
print("\nAverage Sales per Month:")
print(average_sales_per_month)

# Calculate the maximum sales value
max_sales = df['Sales'].max()
print("\nMaximum Sales Value:", max_sales)

# Calculate the minimum sales value
min_sales = df['Sales'].min()
print("\nMinimum Sales Value:", min_sales)

# Count the number of sales records (non-null values in the Sales column)
total_sales_count = df['Sales'].count()
print("\nTotal Number of Sales:", total_sales_count)

# Calculate the total revenue
total_revenue = df['Sales'].sum()
print("\nTotal Revenue:", total_revenue)

# Calculate each row's share of total sales as a percentage
df['Sales Percentage'] = (df['Sales'] / df['Sales'].sum()) * 100
print("\nSales Data with Percentage:")
print(df)

When you run this script, it will output:

Sales Data:
  Product Month  Sales
0       A   Jan    100
1       B   Jan    150
2       A   Feb    120
3       B   Feb    180
4       A   Mar     90
5       B   Mar    160

Total Sales:
  Product  Sales
0       A    310
1       B    490

Average Sales per Month:
  Month  Sales
0   Feb  150.0
1   Jan  125.0
2   Mar  125.0

Maximum Sales Value: 180

Minimum Sales Value: 90

Total Number of Sales: 6

Total Revenue: 800

Sales Data with Percentage:
  Product Month  Sales  Sales Percentage
0       A   Jan    100             12.50
1       B   Jan    150             18.75
2       A   Feb    120             15.00
3       B   Feb    180             22.50
4       A   Mar     90             11.25
5       B   Mar    160             20.00


In this example, we used DataFrame computations to:

Calculate the total sales for each product and the average sales for each month.
Find the maximum and minimum sales values in the dataset.
Count the number of sales records and sum the Sales column to get the total revenue.
Compute each row's share of the total as a percentage (one value per product and month, not one per product).
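The percentage column in Example 6 is computed row by row, so each product appears three times. If a single share per product is wanted instead, one possible approach is to aggregate by product first and then divide by the overall total. The `product_totals` and `product_share` names are assumptions introduced for this sketch.

```python
import pandas as pd

# Same values as sales_data.csv, built in memory for this sketch
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [100, 150, 120, 180, 90, 160],
})

# Total sales per product (a Series indexed by product)
product_totals = df.groupby('Product')['Sales'].sum()

# Each product's share of the overall total, as a percentage
product_share = (product_totals / product_totals.sum() * 100).round(2)
print(product_share)
```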

Correlation in Pandas

In Pandas, correlation refers to the statistical relationship between two variables, which indicates how they change together. The correlation coefficient is a measure of the strength and direction of the linear relationship between two numerical variables. It quantifies the degree to which the variables move together and ranges from -1 to 1.

Pandas provides the `corr()` method to compute the pairwise correlation between columns in a DataFrame. Its `method` parameter selects the coefficient: Pearson (the default), Kendall tau, or Spearman rank.
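The difference between the Pearson and Spearman coefficients is easiest to see on data that rises steadily but not in a straight line. The sketch below uses made-up columns `x` and `y` (with `y` equal to `x` cubed), which are not part of this article's data.

```python
import pandas as pd

# Made-up data: y always increases with x, but not linearly
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

# Pearson measures linear association, so it falls short of 1 here
pearson = df.corr(method='pearson').loc['x', 'y']

# Spearman works on ranks, so a perfectly monotonic relationship scores 1
spearman = df.corr(method='spearman').loc['x', 'y']

print("Pearson: ", round(pearson, 4))
print("Spearman:", round(spearman, 4))
```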

Here’s a practical example to demonstrate how to calculate the correlation in Pandas:

import pandas as pd

# Sample data
data = {
    'Age': [25, 30, 22, 35, 28],
    'Income': [50000, 60000, 45000, 80000, 55000],
    'Sales': [1000, 1200, 800, 1500, 1100]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Data:")
print(df)

# Calculate the correlation matrix using Pearson method
correlation_matrix = df.corr(method='pearson')

# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)

When you run this script, it will output:

Data:
   Age  Income  Sales
0   25   50000   1000
1   30   60000   1200
2   22   45000    800
3   35   80000   1500
4   28   55000   1100

Correlation Matrix:
             Age    Income     Sales
Age     1.000000  0.972073  0.995153
Income  0.972073  1.000000  0.979471
Sales   0.995153  0.979471  1.000000


In this example, we created a DataFrame with columns ‘Age’, ‘Income’, and ‘Sales’, representing the age of individuals, their income, and their sales. We then used the corr() method with the method='pearson' parameter to calculate the correlation matrix using the Pearson correlation coefficient. The resulting correlation matrix displays the pairwise correlations between the columns.

A correlation coefficient of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship between the variables. In the output, ‘Age’ and ‘Sales’ have a strong positive correlation of approximately 0.995, indicating that they tend to increase together. Similarly, ‘Age’ and ‘Income’ have a strong positive correlation of approximately 0.972, and ‘Income’ and ‘Sales’ of approximately 0.979.
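When only one pair of columns is of interest, `Series.corr` returns a single coefficient instead of a full matrix. The sketch below also recomputes the Pearson coefficient by hand, as the sum of the products of the deviations divided by the square root of the product of the summed squared deviations, to show where the number comes from. The variable names are assumptions.

```python
import pandas as pd

# Same sample data as in the correlation example
df = pd.DataFrame({
    'Age': [25, 30, 22, 35, 28],
    'Income': [50000, 60000, 45000, 80000, 55000],
    'Sales': [1000, 1200, 800, 1500, 1100],
})

# Built-in: Pearson correlation between two columns (Pearson is the default)
r_builtin = df['Age'].corr(df['Sales'])

# By hand: deviations from the mean for each column
age_dev = df['Age'] - df['Age'].mean()
sales_dev = df['Sales'] - df['Sales'].mean()

# Sum of products of deviations over the root of the product of squared sums
r_manual = (age_dev * sales_dev).sum() / (
    (age_dev ** 2).sum() * (sales_dev ** 2).sum()
) ** 0.5

print(round(r_builtin, 6), round(r_manual, 6))
```

Both values should agree with the Age/Sales entry of the correlation matrix printed earlier.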

To explore more libraries, refer to our other blog posts on Python libraries.


Copyright © 2023 Tech Amplifiers
