Whether you are just getting started in data science or are a seasoned expert, one of the most critical activities is exploratory data analysis (EDA). EDA allows you to develop an intimate understanding of your data that is foundational to downstream modeling and analysis. It is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with summary statistics and graphical representations.
EDA is a crucial initial step when working with any new dataset. In this post, we’ll explore how to perform effective EDA using Python and the powerful Pandas data analysis library. Pandas provides data structures and tools that make data exploration seamless, and we are going to walk through some of the activities and methods that can be used to do EDA.
1. What is EDA?
EDA involves activities aimed at detecting patterns, identifying anomalies, testing assumptions, and conducting preliminary feature engineering and data preparation. Key aspects of EDA include:
- Importing and formatting raw datasets
- Conducting integrity checks and dealing with missing values
- Generating summary statistics on both numeric and categorical variables
- Creating visualizations such as histograms, scatter plots, and bar charts
- Stratified analysis by segmenting data into subgroups
- Identifying promising relationships and correlations to formulate hypotheses
- Cleaning, transforming, and preparing data for modelling
Reference: https://datos.gob.es/en/documentacion/practical-introductory-guide-exploratory-data-analysis
2. Why perform EDA?
Thorough EDA is critical because it:
- Builds understanding and intuition about the data.
- Reveals issues that guide data cleaning and preprocessing.
- Provides insights that lead to testable hypotheses about the data.
- Identifies relationships between variables to model.
- Surfaces gaps that highlight areas for additional data collection.
- Improves the quality of downstream analyses and models.
3. Conducting EDA with Python and Pandas
Python and Pandas provide a flexible, powerful environment for exploring datasets. Key aspects include:
- Easily importing data from various sources like CSV and databases
- Dataframe structure for storing tabular data
- Fast summary stats generation with .describe()
- Built-in methods for handling missing data
- Plotting functions for quick data visualization
- Filtering, grouping, and combining datasets for segmented analysis
- Merging and joining data from different sources
- Easy handling of large datasets not possible in Excel
By leveraging Python and Pandas for thorough EDA, you gain key insights into your data that enable building effective machine learning models and analytics systems. Make EDA a priority step in your data science projects!
Let’s do some EDA with Python and Pandas!
Importing libraries
As a data coach, I always emphasize to my students and clients the value of Python’s extensive open source libraries for exploratory data analysis. These libraries provide pre-built tools that are optimized for data tasks, saving you time and effort compared to coding from scratch. Leveraging them follows best practices refined by the Python data community over years. For example, Pandas DataFrames handle messy real-world data better than native Python objects, while Matplotlib and Seaborn build meaningful graphics tailored for statistical analysis.
These libraries come with batteries included – they have pre-built tools and best practices so you don’t have to reinvent the wheel. Leveraging Python libraries accelerates your EDA, avoids re-coding common tasks, and follows established conventions, allowing you to focus on uncovering insights rather than low-level programming.
For this exercise we are going to import and use pandas.
Documentation: https://pandas.pydata.org/docs/reference/frame.html
Note: Place the files “customer.csv” and “payment.csv” in the same folder where this notebook is located
# Import the pandas package
import pandas as pd
## Read the customer table and assign it the label 'customer' so we can refer to it later
customer = pd.read_csv('customer.csv')
Let’s break down this line to see what’s happening:
'customer.csv' is a string. In Python, text is stored as a string: a sequence of characters enclosed in 'single quotes', "double quotes", or """triple quotes""".
Everything in Python is an object, and every object in Python has a type. Computer programs typically keep track of a range of data types. For example, 1.5 is a floating point number, while 1 is an integer. Programs need to distinguish between these two types for various reasons:
- They are stored in memory differently.
- Their arithmetic operations are different.
Some of the basic numerical types in Python include:
- int (integer; a whole number with no decimal place), e.g. 10, -3
- float (a number that has a decimal place), e.g. 7.41, -0.006
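To make this concrete, here is a minimal sketch you can run in any Python session to check the type of a value (the example values are purely illustrative):
# Check the type of a few example values
type(1)               # int
type(1.5)             # float
type('customer.csv')  # str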
pd.read_csv is a function in the Pandas module. Remember from SQL: Server.Database.Schema.Table.Column. Dots (.) are used to denote nested objects; read_csv is a Pandas function, so it is part of pd.
customer = assigns our object to the customer variable. Our customer table is given the label customer; these variables are like the aliases we saw in SQL.
In Python, a variable is a name you specify in your code that maps to a particular object, object instance, or value.
By defining variables, we can refer to things by names that make sense to us. Names for variables can only contain letters, underscores (_), or numbers (no spaces, dashes, or other characters). Variable names must start with a letter or underscore – good practice is to start with a letter.
We assign a label to an object with a single = sign.
4. Exploring the data
Next, we will explore the data by using the .head() method to see the top rows and the columns attribute to see the names of all the columns.
- df.head() – View first 5 rows
- df.tail() – View last 5 rows
- df.shape – Number of rows and columns
- df.columns – Return the names of the columns
- df.dtypes – Data type of each column
- df.info() – Index, datatype and memory information
- df.describe() – Summary statistics for numeric columns
Let’s explore each of these methods, starting with “head”. The syntax for using it is:
DataFrame.head()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset we loaded above.
## Calling the dataset on its own will display its first 5 and last 5 rows
customer
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | MARY | SMITH | MARY.SMITH@sakilacustomer.org | 5 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
1 | 2 | 1 | PATRICIA | JOHNSON | PATRICIA.JOHNSON@sakilacustomer.org | 6 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
2 | 3 | 1 | LINDA | WILLIAMS | LINDA.WILLIAMS@sakilacustomer.org | 7 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
4 | 5 | 1 | ELIZABETH | BROWN | ELIZABETH.BROWN@sakilacustomer.org | 9 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
… | … | … | … | … | … | … | … | … | … | … |
594 | 595 | 1 | TERRENCE | GUNDERSON | TERRENCE.GUNDERSON@sakilacustomer.org | 601 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
595 | 596 | 1 | ENRIQUE | FORSYTHE | ENRIQUE.FORSYTHE@sakilacustomer.org | 602 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
596 | 597 | 1 | FREDDIE | DUGGAN | FREDDIE.DUGGAN@sakilacustomer.org | 603 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
597 | 598 | 1 | WADE | DELVALLE | WADE.DELVALLE@sakilacustomer.org | 604 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
598 | 599 | 2 | AUSTIN | CINTRON | AUSTIN.CINTRON@sakilacustomer.org | 605 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
599 rows × 10 columns
## Now let's try head:
customer.head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | MARY | SMITH | MARY.SMITH@sakilacustomer.org | 5 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
1 | 2 | 1 | PATRICIA | JOHNSON | PATRICIA.JOHNSON@sakilacustomer.org | 6 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
2 | 3 | 1 | LINDA | WILLIAMS | LINDA.WILLIAMS@sakilacustomer.org | 7 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
4 | 5 | 1 | ELIZABETH | BROWN | ELIZABETH.BROWN@sakilacustomer.org | 9 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
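head() also accepts the number of rows to display; for example, to look at the first 10 rows instead of the default 5:
## Show the first 10 rows instead of the default 5
customer.head(10)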
Let’s explore the “tail” method. The syntax for using it is:
DataFrame.tail()
In this example DataFrame is the name of the DataFrame; again, in our case, the customer dataset.
##Tail:
customer.tail()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
594 | 595 | 1 | TERRENCE | GUNDERSON | TERRENCE.GUNDERSON@sakilacustomer.org | 601 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
595 | 596 | 1 | ENRIQUE | FORSYTHE | ENRIQUE.FORSYTHE@sakilacustomer.org | 602 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
596 | 597 | 1 | FREDDIE | DUGGAN | FREDDIE.DUGGAN@sakilacustomer.org | 603 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
597 | 598 | 1 | WADE | DELVALLE | WADE.DELVALLE@sakilacustomer.org | 604 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
598 | 599 | 2 | AUSTIN | CINTRON | AUSTIN.CINTRON@sakilacustomer.org | 605 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
Let’s look at “shape”. Note that shape is an attribute rather than a method, so it is used without parentheses:
DataFrame.shape
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
##Shape, number of rows and number of columns:
customer.shape
(599, 10)
Let’s look at the “columns” attribute. The syntax for using it is:
DataFrame.columns
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.columns
Index(['customer_id', 'store_id', 'first_name', 'last_name', 'email', 'address_id', 'activebool', 'create_date', 'last_update', 'active'], dtype='object')
Let’s look at “dtypes”, which returns the data type of each column. The syntax for using it is:
DataFrame.dtypes
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.dtypes
customer_id     int64
store_id        int64
first_name      object
last_name       object
email           object
address_id      int64
activebool      bool
create_date     object
last_update     object
active          int64
dtype: object
df.info() should become standard practice when loading in new data in Python. It provides an informative data quality “dashboard” to guide deeper investigation or cleaning as needed. Catching potential problems early mitigates headaches later in analysis. The ease and speed of df.info() allows quick iterations to check data manipulations. Leveraging this single line of code accelerates EDA and sets the stage for reliable data science workflows.
- Provides concise summary of DataFrame contents – shape, data types, memory usage
- Identifies any incorrect or unexpected data types that could cause errors later
- Easily spots missing values represented as NaN with a count per column
- Can be used to compare before and after DataFrame changes during data cleaning
Let’s explore “info”. The syntax for using it is:
DataFrame.info()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  599 non-null    int64
 1   store_id     599 non-null    int64
 2   first_name   599 non-null    object
 3   last_name    599 non-null    object
 4   email        599 non-null    object
 5   address_id   599 non-null    int64
 6   activebool   599 non-null    bool
 7   create_date  599 non-null    object
 8   last_update  599 non-null    object
 9   active       599 non-null    int64
dtypes: bool(1), int64(4), object(5)
memory usage: 42.8+ KB
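This dataset has no missing values (every column shows 599 non-null), but when a column’s non-null count is lower than the number of rows, a quick per-column count of missing values helps quantify the gaps. A minimal sketch:
## Count missing values per column (all zeros for this dataset)
customer.isnull().sum()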
df.describe() is an extremely useful method in exploratory data analysis for summarizing numerical data. It rapidly provides an insightful overview of key statistics, and the output reveals details about quality, variance, and anomalies that warrant further examination. Here are some key reasons it provides value:
- Calculates standard summary statistics like count, mean, std dev, min, max, and quartiles. Gives overview of distribution.
- Statistics help identify outliers, skewness, and shape characteristics.
- Count of non-null values highlights missing data if lower than total rows.
- Summary stats can reveal insights like high standard deviations and long tails.
- Provides a “first look” at numeric attributes to guide more advanced analysis.
Let’s explore the “describe” method. The syntax for using it is:
DataFrame.describe()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.describe()
 | customer_id | store_id | address_id | active |
---|---|---|---|---|
count | 599.000000 | 599.000000 | 599.000000 | 599.000000 |
mean | 300.000000 | 1.455760 | 304.724541 | 0.974958 |
std | 173.060683 | 0.498455 | 173.698609 | 0.156382 |
min | 1.000000 | 1.000000 | 5.000000 | 0.000000 |
25% | 150.500000 | 1.000000 | 154.500000 | 1.000000 |
50% | 300.000000 | 1.000000 | 305.000000 | 1.000000 |
75% | 449.500000 | 2.000000 | 454.500000 | 1.000000 |
max | 599.000000 | 2.000000 | 605.000000 | 1.000000 |
In summary, Pandas provides many useful functions to explore and summarize key details about an unfamiliar dataset. Methods like df.head(), df.tail(), df.shape, df.columns, df.dtypes, and df.info() quickly provide an overview of the structure, size, and data types.
df.describe() generates descriptive statistics to study numeric columns. Taken together, these methods help you efficiently inspect the contents, completeness, and integrity of the data.
5. Selecting and Analysing Columns
Filtering columns is an important technique in exploratory data analysis to focus on relevant subsets of the data. If we only want some parts of a DataFrame, we need to be able to specify the columns we are interested in, i.e. choose specific columns from our DataFrame.
Note: A column returned on its own is returned as a Series (a pandas data type)
The syntax for choosing a Column from the DataFrame and returning it as a Series is:
DataFrame['Column']
The syntax for choosing more than one column from the DataFrame, returning a DataFrame, is:
DataFrame[['Column 1', 'Column 2']]
The syntax for choosing a Column from the DataFrame and returning a DataFrame is:
DataFrame[['Column']]
In these examples, DataFrame is the name of the DataFrame and Column is the name of the Column.
E.g. Single column:
customer['first_name']
0         MARY
1     PATRICIA
2        LINDA
3      BARBARA
4    ELIZABETH
         ...
594   TERRENCE
595    ENRIQUE
596    FREDDIE
597       WADE
598     AUSTIN
Name: first_name, Length: 599, dtype: object
E.g. Multiple columns:
customer[['first_name', 'email']]
 | first_name | email |
---|---|---|
0 | MARY | MARY.SMITH@sakilacustomer.org |
1 | PATRICIA | PATRICIA.JOHNSON@sakilacustomer.org |
2 | LINDA | LINDA.WILLIAMS@sakilacustomer.org |
3 | BARBARA | BARBARA.JONES@sakilacustomer.org |
4 | ELIZABETH | ELIZABETH.BROWN@sakilacustomer.org |
… | … | … |
594 | TERRENCE | TERRENCE.GUNDERSON@sakilacustomer.org |
595 | ENRIQUE | ENRIQUE.FORSYTHE@sakilacustomer.org |
596 | FREDDIE | FREDDIE.DUGGAN@sakilacustomer.org |
597 | WADE | WADE.DELVALLE@sakilacustomer.org |
598 | AUSTIN | AUSTIN.CINTRON@sakilacustomer.org |
599 rows × 2 columns
We can use the methods we saw before for analysis on a subset of data, e.g.:
customer[['first_name', 'last_name']].describe()
 | first_name | last_name |
---|---|---|
count | 599 | 599 |
unique | 591 | 599 |
top | JESSIE | SMITH |
freq | 2 | 1 |
6. Unique
The .unique() method in Pandas is very useful in exploratory data analysis for understanding the distinct values present in a column, while .nunique() gives a quick sense of the variability/cardinality of a column. Both can be combined with .value_counts() to get the frequency of each unique value.
Let’s explore “unique”, “nunique” and “value_counts”. The syntax for using them is:
DataFrame['Column'].unique()
DataFrame['Column'].nunique()
DataFrame['Column'].value_counts()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
E.g.: Let’s analyse how many stores are in the dataset and their frequency:
customer['store_id'].unique()
array([1, 2], dtype=int64)
customer['store_id'].nunique()
2
customer['store_id'].value_counts()
1    326
2    273
Name: store_id, dtype: int64
We can see that there are 2 stores, and that 326 customers are registered in store 1 and 273 in store 2.
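If proportions are more useful than raw counts, value_counts() can also return relative frequencies; a small sketch:
## Share of customers per store instead of raw counts
customer['store_id'].value_counts(normalize=True)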
7. Sorting
By reordering the rows of a DataFrame based on one or more columns, sort_values() reveals patterns, relationships, trends, and anomalies that may be obscured in unsorted data. Sorting numeric data into ascending or descending order highlights outliers and distributions. Chronologically sorting date columns provides temporal context.
Next, we will sort the customers by their first name.
The syntax for sorting the data is:
# Sort the data from lowest to highest
DataFrame.sort_values(by='Column')
In this example DataFrame is the name of the DataFrame, Column is the name of the Column we want to sort by.
customer.sort_values(by='first_name').head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
374 | 375 | 2 | AARON | SELBY | AARON.SELBY@sakilacustomer.org | 380 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
366 | 367 | 1 | ADAM | GOOCH | ADAM.GOOCH@sakilacustomer.org | 372 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
524 | 525 | 2 | ADRIAN | CLARY | ADRIAN.CLARY@sakilacustomer.org | 531 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
216 | 217 | 2 | AGNES | BISHOP | AGNES.BISHOP@sakilacustomer.org | 221 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
388 | 389 | 1 | ALAN | KAHN | ALAN.KAHN@sakilacustomer.org | 394 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
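sort_values() also accepts ascending=False for descending order and a list of columns for sorting on several keys; for example, a quick sketch on the same customer table:
# Sort from highest to lowest customer_id
customer.sort_values(by='customer_id', ascending=False).head()

# Sort by last name, then first name
customer.sort_values(by=['last_name', 'first_name']).head()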
8. Filtering Rows
By selectively removing rows based on logical criteria, you can focus on specific observations and patterns. For example, filtering numeric columns by thresholds reveals outliers and anomalies. Filtering categorical data on classes studies their distributions separately. Extracting rows with missing values helps quantify and understand NaN. Row filtering enables segmenting data like customers by region or users by age group.
If we wish to look at subsets of the data, we will need to filter or group it. Let’s start by learning to filter it.
To do so, we need to be able to choose a specific row from our table.
The syntax for filtering a DataFrame is:
# Filter the DataFrame
DataFrame[DataFrame['Column'] == 'value']
In this example DataFrame is the name of the DataFrame, Column is the name of the column we want to filter on and value is the value we’re interested in.
Let’s make a subset of store 2:
customer[customer['store_id'] == 2].head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
5 | 6 | 2 | JENNIFER | DAVIS | JENNIFER.DAVIS@sakilacustomer.org | 10 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
7 | 8 | 2 | SUSAN | WILSON | SUSAN.WILSON@sakilacustomer.org | 12 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
8 | 9 | 2 | MARGARET | MOORE | MARGARET.MOORE@sakilacustomer.org | 13 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
10 | 11 | 2 | LISA | ANDERSON | LISA.ANDERSON@sakilacustomer.org | 15 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
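Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, a sketch that keeps only the inactive customers of store 2, using the active column seen above:
## Inactive customers from store 2
customer[(customer['store_id'] == 2) & (customer['active'] == 0)].head()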
9. Aggregating data
Fundamentally, groupby() enables dividing up heterogeneous data into homogeneous chunks to ask focused, comparative questions. It facilitates studying inter-group variability and intra-group consistency. Any aggregates, transformations, and visualizations applied within groups can reveal insights not visible in the full data. If we want to look at a customer’s aggregate transactions, we can use the .groupby() method to aggregate the data.
The syntax for aggregating the DataFrame is:
# Aggregate the DataFrame by group
DataFrame.groupby('group')
We can choose what kind of aggregate output we want for the columns by appending the appropriate method, e.g. sum():
# Aggregate the DataFrame by group and sum each column
DataFrame.groupby('group').sum()
If we want to apply a different aggregation method to each column, we can use a slightly different syntax:
# Aggregate the DataFrame by group, with a different aggregation per column
DataFrame.groupby('group').agg({
    'Column1': 'sum',   # take the sum of Column1
    'Column2': 'mean',  # take the mean of Column2
    'Column3': 'max'    # take the max of Column3
})
In this example, group is the name of the Column we want to group by and sum/mean/max are the aggregation methods.
Here is the groupby documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
For the next example, we would like to replicate the distribution of customers per store (we already did this above with value_counts ;):
customer.groupby('store_id').count()
 | customer_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|
store_id | |||||||||
1 | 326 | 326 | 326 | 326 | 326 | 326 | 326 | 326 | 326 |
2 | 273 | 273 | 273 | 273 | 273 | 273 | 273 | 273 | 273 |
We can see that the output returns all the columns of the dataset. To clean it up, we can select a single column (or multiple columns if needed) with the following syntax:
customer[['store_id', 'customer_id']].groupby('store_id').count()
 | customer_id |
---|---|
store_id | |
1 | 326 |
2 | 273 |
Let’s use a different dataset to better illustrate the groupby method; for this, we are going to load the payment dataset:
payment = pd.read_csv('payment.csv')
payment.head()
 | payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|---|
0 | 16050 | 269 | 2 | 7 | 1.99 | 2007-01-24 21:40:19.996577 |
1 | 16051 | 269 | 1 | 98 | 0.99 | 2007-01-25 15:16:50.996577 |
2 | 16052 | 269 | 2 | 678 | 6.99 | 2007-01-28 21:44:14.996577 |
3 | 16053 | 269 | 2 | 703 | 0.99 | 2007-01-29 00:58:02.996577 |
4 | 16054 | 269 | 1 | 750 | 4.99 | 2007-01-29 08:10:06.996577 |
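Note that payment_date is read in by read_csv as a plain object (text) column. If you plan to do any time-based analysis, a common follow-up is to convert it to a proper datetime type; a minimal sketch:
## Convert payment_date from text to a datetime column
payment['payment_date'] = pd.to_datetime(payment['payment_date'])
payment.dtypes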
Using groupby, we can compute different aggregations (like counts, sums, min, and max) on the amount feature. In this case we analyse each staff member and the sales they made:
payment[['staff_id', 'amount']].groupby('staff_id').sum()
 | amount |
---|---|
staff_id | |
1 | 33489.47 |
2 | 33927.04 |
The mean can also be analysed:
payment[['staff_id', 'amount']].groupby('staff_id').mean()
 | amount |
---|---|
staff_id | |
1 | 4.156568 |
2 | 4.245125 |
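Using the .agg() syntax shown earlier, we can also compute several statistics on the amount column in one pass; for example:
# Count, total, mean, and maximum payment per staff member
payment.groupby('staff_id').agg({'amount': ['count', 'sum', 'mean', 'max']})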
Here is the list of aggregation functions available in Pandas:
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html
10. Plotting
Visual representation of the data enables detection of patterns, relationships, trends, and anomalies that may not be observable in the raw data alone. Fundamentally, graphical exploration leads to new hypotheses, questions, and insights that drive the analysis forward. Generating effective visualizations during EDA is both an art and a science. Master the key plot types, know their strengths and weaknesses, and enrich your analysis with impactful, intuitive graphics. Visual data exploration should be a core part of any data scientist’s workflow. Here are some examples of visualisations and their benefit:
- Plots such as histograms, scatter plots, and box plots provide summaries of distributions that reveal insights into the shape, skew, outliers, and clusters within the data.
- Visualizing data categorized by color or facet makes comparisons across groups explicit.
- Time series plots surface seasonalities and temporal patterns.
- Correlation plots highlight relationships between variables.
Pandas allows us to plot data directly from a DataFrame, and it covers most of the standard plot types you will want.
Let’s analyse the distribution of payment amounts in the dataset with a histogram.
The syntax to make a histogram is:
DataFrame.hist(column='Column')
In this example DataFrame is the name of the DataFrame, and Column is the name of the numerical column you would like to make the histogram of.
payment.hist(column='amount');
We can identify three peaks in the distribution of payments: the first around a value of 1 with about 3,000 occurrences, the second around a value of 3 with about 3,500, and the highest around a value of 5 with more than 5,000 occurrences; not many payments are made above a monetary value of 6.
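Histograms are only one option. Pandas can also draw bar charts, box plots, and scatter plots directly from a DataFrame; for example, here is a quick sketch of a bar chart of customers per store, reusing the groupby count from earlier (this assumes matplotlib is installed, since Pandas uses it as its plotting backend):
# Bar chart of the number of customers per store
customer[['store_id', 'customer_id']].groupby('store_id').count().plot(kind='bar');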
Conclusion
Exploratory data analysis is a crucial first step in any data science undertaking, and Python’s Pandas library provides the perfect tools to perform effective EDA. By dedicating time to activities like data importing, cleaning, visualization, segmentation, and hypothesis generation using Pandas’ versatile built-in capabilities, you gain intimate understanding and derive insights that enable building better models and analytics systems downstream.
Thorough EDA leveraging Python and Pandas leads to more informed conclusions and impactful data-driven decision-making by uncovering key findings, relationships and anomalies that would otherwise go unseen. The power of these flexible data analysis tools makes EDA an indispensable part of the data science process.
I hope you enjoyed this quick introduction to this interesting subject. Make EDA a priority step in your data science projects!