How to analyze San Francisco city employee salary data with Pandas
By Deepnote team
•Updated on November 23, 2023
Analyze San Francisco city employee salary data using the Pandas library in Python.
Welcome to a tutorial designed to help you practice your pandas skills! In this blog post, we will be using the SF Salaries Dataset from Kaggle. The tasks will guide you step-by-step through the data analysis process using Pandas, starting with basic operations and advancing to more complex queries.
Let's get started!
Import pandas
The first step in any data analysis task involving Pandas is to import the library. We give it an alias 'pd' for convenience.
import pandas as pd
Read Salaries.csv as a dataframe called sal
Pandas provides an easy way to read data from a CSV file and store it in a DataFrame object. Here's how you do it:
sal = pd.read_csv('Salaries.csv')
Check the head of the DataFrame
Once the data is loaded into the DataFrame, it's often useful to check the first few rows of the DataFrame to get an understanding of the data and its structure.
sal.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN
Use the .info() method to find out how many entries there are
The .info()
method is a quick way to get an overview of the DataFrame, such as the number of entries, the data types of the columns, and which columns have missing values.
sal.info() # Expected output: 148654 Entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id 148654 non-null int64
EmployeeName 148654 non-null object
JobTitle 148654 non-null object
BasePay 148045 non-null float64
OvertimePay 148650 non-null float64
OtherPay 148650 non-null float64
Benefits 112491 non-null float64
TotalPay 148654 non-null float64
TotalPayBenefits 148654 non-null float64
Year 148654 non-null int64
Notes 0 non-null float64
Agency 148654 non-null object
Status 0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
What is the average base pay?
To find out the average BasePay, you can use the .mean()
method on the 'BasePay' column.
sal['BasePay'].mean()
66325.44884050643
What is the highest amount of OvertimePay in the dataset?
Similarly, to find out the maximum overtime pay, use the .max()
method.
sal['OvertimePay'].max()
245131.88
What is the job title of Joseph Driscoll?
We can filter the data for 'Joseph Driscoll' and then select the 'JobTitle' column.
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']
24 CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
How much does Joseph Driscoll make (including benefits)?
Similarly, to find out how much Joseph Driscoll makes, we need to include the benefits, so we select the 'TotalPayBenefits' column.
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']
24 270324.91
Name: TotalPayBenefits, dtype: float64
What is the name of the highest-paid person (including benefits)?
To find the highest-paid person, we can sort by the 'TotalPayBenefits' column and then get the first entry.
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName']
What is the name of the lowest-paid person (including benefits)?
Similar to the previous query, except here we look for the minimum.
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].min()]['EmployeeName']
You may find something strange about this entry as it can have negative pay!
What was the average (mean) BasePay of all employees per year? (2011-2014)
You can group by the 'Year' and then calculate the mean of the BasePay for each group.
sal.groupby('Year').mean()['BasePay']
How many unique job titles are there?
The .nunique()
method gives us the number of unique entries in the 'JobTitle' column.
sal['JobTitle'].nunique()
What are the top 5 most common jobs?
Using .value_counts()
will give us the frequency of each job title. We then take the top 5 with .head(5)
.
sal['JobTitle'].value_counts().head(5)
How many job titles were represented by only one person in 2013?
We filter the data to include only the year 2013 and then apply a similar technique as before, but this time we check for job titles that appear exactly once.
sum(sal[sal['Year'] == 2013]['JobTitle'].value_counts() == 1)
How many people have the word Chief in their job title?
This requires a custom function that checks if 'chief' is in the job title (case insensitive), and we then apply this function to each job title and sum up the True values.
def chief_string(title):
if 'chief' in title.lower():
return True
else:
return False
sum(sal['JobTitle'].apply(lambda x: chief_string(x)))
Bonus: is there a correlation between length of the job title string and salary?
First, we calculate the length of each job title, then we check the correlation with TotalPayBenefits.
sal['title_len'] = sal['JobTitle'].apply(len)
sal[['title_len', 'TotalPayBenefits']].corr() # No significant correlation
title_len TotalPayBenefits
title_len 1.000000 -0.036878
TotalPayBenefits -0.036878 1.000000
Summary
In this blog post, we went through a series of tasks involving the San Francisco Salaries dataset to practice data manipulation and analysis using Pandas. We covered fundamental techniques such as reading data, selecting columns, filtering, grouping, and applying functions. These operations are essential for any data analyst or data scientist working with tabular data in Python.