Whether you are just getting started in data science or are a seasoned expert, one of the most critical activities is exploratory data analysis (EDA). EDA allows you to develop an intimate understanding of your data that is foundational to downstream modeling and analysis. It is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with summary statistics and graphical representations.
EDA is a crucial initial step when working with any new dataset. In this post, we’ll explore how to perform effective EDA using Python and the powerful Pandas data analysis library. Pandas provides data structures and tools that make data exploration seamless, and we are going to walk through some of the activities and methods that can be used to do EDA.
1. What is EDA?
EDA involves activities aimed at detecting patterns, identifying anomalies, testing assumptions, and conducting preliminary feature engineering and data preparation. Key aspects of EDA include:
- Importing and formatting raw datasets
- Conducting integrity checks and dealing with missing values
- Generating summary statistics on both numeric and categorical variables
- Creating visualizations such as histograms, scatter plots, and bar charts
- Stratified analysis by segmenting data into subgroups
- Identifying promising relationships and correlations to formulate hypotheses
- Cleaning, transforming, and preparing data for modelling
Reference: https://datos.gob.es/en/documentacion/practical-introductory-guide-exploratory-data-analysis
2. Why perform EDA?
Thorough EDA is critical because it:
- Builds understanding and intuition about the data.
- Reveals issues that guide data cleaning and preprocessing.
- Provides insights that lead to testable hypotheses about the data.
- Identifies relationships between variables to model.
- Surfaces gaps that highlight areas for additional data collection.
- Improves the quality of downstream analyses and models.
3. Conducting EDA with Python and Pandas
Python and Pandas provide a flexible, powerful environment for exploring datasets. Key aspects include:
- Easily importing data from various sources like CSV and databases
- Dataframe structure for storing tabular data
- Fast summary stats generation with .describe()
- Built-in methods for handling missing data
- Plotting functions for quick data visualization
- Filtering, grouping, and combining datasets for segmented analysis
- Merging and joining data from different sources
- Easy handling of large datasets not possible in Excel
By leveraging Python and Pandas for thorough EDA, you gain key insights into your data that enable building effective machine learning models and analytics systems. Make EDA a priority step in your data science projects!
Let’s do some EDA with Python and Pandas!
Importing libraries
As a data coach, I always emphasize to my students and clients the value of Python’s extensive open source libraries for exploratory data analysis. These libraries provide pre-built tools that are optimized for data tasks, saving you time and effort compared to coding from scratch. Leveraging them follows best practices refined by the Python data community over years. For example, Pandas DataFrames handle messy real-world data better than native Python objects, while Matplotlib and Seaborn build meaningful graphics tailored for statistical analysis.
These libraries come with batteries included – they have pre-built tools and best practices so you don’t have to reinvent the wheel. Leveraging Python libraries accelerates your EDA, avoids re-coding common tasks, and follows established conventions, allowing you to focus on uncovering insights rather than low-level programming.
For this exercise we are going to import and use pandas.
Documentation: https://pandas.pydata.org/docs/reference/frame.html
Note: Place the files “customer.csv” and “payment.csv” in the same folder where this notebook is located
# Import the pandas package
import pandas as pd
## Read the customer table and assign it the label 'customer' so we can refer to it later
customer = pd.read_csv('customer.csv')
Let’s break down this line to see what’s happening:
'customer.csv' is a string. In Python, text is stored as a string: a sequence of characters enclosed in 'single quotes', "double quotes", or """triple quotes""".
Everything in Python is an object, and every object in Python has a type. Computer programs typically keep track of a range of data types. For example, 1.5 is a floating point number, while 1 is an integer. Programs need to distinguish between these two types for various reasons:
- They are stored in memory differently.
- Their arithmetic operations are different.
Some of the basic numerical types in Python include:
- int (integer; a whole number with no decimal place), e.g. 10, -3
- float (a number that has a decimal place), e.g. 7.41, -0.006
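To make this concrete, here is a minimal sketch you can run in any Python session to check the type of a value (the example values are purely illustrative):
# Check the type of a few example values
type(1)               # int
type(1.5)             # float
type('customer.csv')  # str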
pd.read_csv is a function in the Pandas module. Remember from SQL: Server.Database.Schema.Table.Column. Dots (.) are used to denote nested objects; read_csv is a Pandas function, so it is part of pd.
customer = assigns our object to the customer variable. Our customer table is given the label customer; these variables are like the aliases we saw in SQL.
In Python, a variable is a name you specify in your code that maps to a particular object, object instance, or value.
By defining variables, we can refer to things by names that make sense to us. Names for variables can only contain letters, underscores (_), or numbers (no spaces, dashes, or other characters). Variable names must start with a letter or underscore – good practice is to start with a letter.
We assign a label to an object with a single = sign.
4. Exploring the data
Next, we will explore the data by using the .head() method to see the top rows and the columns attribute to see the names of all the columns.
- df.head() – View first 5 rows
- df.tail() – View last 5 rows
- df.shape – Number of rows and columns
- df.columns – Return the names of the columns
- df.dtypes – Data type of each column
- df.info() – Index, datatype and memory information
- df.describe() – Summary statistics for numeric columns
Let’s explore each of these methods, starting with “head”. The syntax for using it is:
DataFrame.head()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset we loaded above.
## Calling the dataset on its own will display its first 5 and last 5 rows
customer
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | MARY | SMITH | MARY.SMITH@sakilacustomer.org | 5 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
1 | 2 | 1 | PATRICIA | JOHNSON | PATRICIA.JOHNSON@sakilacustomer.org | 6 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
2 | 3 | 1 | LINDA | WILLIAMS | LINDA.WILLIAMS@sakilacustomer.org | 7 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
4 | 5 | 1 | ELIZABETH | BROWN | ELIZABETH.BROWN@sakilacustomer.org | 9 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
… | … | … | … | … | … | … | … | … | … | … |
594 | 595 | 1 | TERRENCE | GUNDERSON | TERRENCE.GUNDERSON@sakilacustomer.org | 601 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
595 | 596 | 1 | ENRIQUE | FORSYTHE | ENRIQUE.FORSYTHE@sakilacustomer.org | 602 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
596 | 597 | 1 | FREDDIE | DUGGAN | FREDDIE.DUGGAN@sakilacustomer.org | 603 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
597 | 598 | 1 | WADE | DELVALLE | WADE.DELVALLE@sakilacustomer.org | 604 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
598 | 599 | 2 | AUSTIN | CINTRON | AUSTIN.CINTRON@sakilacustomer.org | 605 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
599 rows × 10 columns
## Now let's try head:
customer.head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | MARY | SMITH | MARY.SMITH@sakilacustomer.org | 5 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
1 | 2 | 1 | PATRICIA | JOHNSON | PATRICIA.JOHNSON@sakilacustomer.org | 6 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
2 | 3 | 1 | LINDA | WILLIAMS | LINDA.WILLIAMS@sakilacustomer.org | 7 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
4 | 5 | 1 | ELIZABETH | BROWN | ELIZABETH.BROWN@sakilacustomer.org | 9 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
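head() also accepts the number of rows to display; for example, to look at the first 10 rows instead of the default 5:
## Show the first 10 rows instead of the default 5
customer.head(10)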
Let’s explore the “tail” method. The syntax for using it is:
DataFrame.tail()
In this example DataFrame is the name of the DataFrame; again, in our case, the customer dataset.
##Tail:
customer.tail()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
594 | 595 | 1 | TERRENCE | GUNDERSON | TERRENCE.GUNDERSON@sakilacustomer.org | 601 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
595 | 596 | 1 | ENRIQUE | FORSYTHE | ENRIQUE.FORSYTHE@sakilacustomer.org | 602 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
596 | 597 | 1 | FREDDIE | DUGGAN | FREDDIE.DUGGAN@sakilacustomer.org | 603 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
597 | 598 | 1 | WADE | DELVALLE | WADE.DELVALLE@sakilacustomer.org | 604 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
598 | 599 | 2 | AUSTIN | CINTRON | AUSTIN.CINTRON@sakilacustomer.org | 605 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
Let’s look at “shape”. Note that shape is an attribute rather than a method, so it is used without parentheses:
DataFrame.shape
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
##Shape, number of rows and number of columns:
customer.shape
(599, 10)
Let’s look at the “columns” attribute. The syntax for using it is:
DataFrame.columns
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.columns
Index(['customer_id', 'store_id', 'first_name', 'last_name', 'email', 'address_id', 'activebool', 'create_date', 'last_update', 'active'], dtype='object')
Let’s look at “dtypes”, which returns the data type of each column. The syntax for using it is:
DataFrame.dtypes
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.dtypes
customer_id     int64
store_id        int64
first_name      object
last_name       object
email           object
address_id      int64
activebool      bool
create_date     object
last_update     object
active          int64
dtype: object
df.info() should become standard practice when loading in new data in Python. It provides an informative data quality “dashboard” to guide deeper investigation or cleaning as needed. Catching potential problems early mitigates headaches later in analysis. The ease and speed of df.info() allows quick iterations to check data manipulations. Leveraging this single line of code accelerates EDA and sets the stage for reliable data science workflows.
- Provides concise summary of DataFrame contents – shape, data types, memory usage
- Identifies any incorrect or unexpected data types that could cause errors later
- Easily spots missing values represented as NaN with a count per column
- Can be used to compare before and after DataFrame changes during data cleaning
Let’s explore “info”. The syntax for using it is:
DataFrame.info()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  599 non-null    int64
 1   store_id     599 non-null    int64
 2   first_name   599 non-null    object
 3   last_name    599 non-null    object
 4   email        599 non-null    object
 5   address_id   599 non-null    int64
 6   activebool   599 non-null    bool
 7   create_date  599 non-null    object
 8   last_update  599 non-null    object
 9   active       599 non-null    int64
dtypes: bool(1), int64(4), object(5)
memory usage: 42.8+ KB
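This dataset has no missing values (every column shows 599 non-null), but when a column’s non-null count is lower than the number of rows, a quick per-column count of missing values helps quantify the gaps. A minimal sketch:
## Count missing values per column (all zeros for this dataset)
customer.isnull().sum()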
df.describe() is an extremely useful method in exploratory data analysis for summarizing numerical data. It rapidly provides an insightful overview of key statistics, and the output reveals details about quality, variance, and anomalies that warrant further examination. Here are some key reasons it provides value:
- Calculates standard summary statistics like count, mean, std dev, min, max, and quartiles. Gives overview of distribution.
- Statistics help identify outliers, skewness, and shape characteristics.
- Count of non-null values highlights missing data if lower than total rows.
- Summary stats can reveal insights like high standard deviations and long tails.
- Provides a “first look” at numeric attributes to guide more advanced analysis.
Let’s explore the “describe” method. The syntax for using it is:
DataFrame.describe()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
customer.describe()
 | customer_id | store_id | address_id | active |
---|---|---|---|---|
count | 599.000000 | 599.000000 | 599.000000 | 599.000000 |
mean | 300.000000 | 1.455760 | 304.724541 | 0.974958 |
std | 173.060683 | 0.498455 | 173.698609 | 0.156382 |
min | 1.000000 | 1.000000 | 5.000000 | 0.000000 |
25% | 150.500000 | 1.000000 | 154.500000 | 1.000000 |
50% | 300.000000 | 1.000000 | 305.000000 | 1.000000 |
75% | 449.500000 | 2.000000 | 454.500000 | 1.000000 |
max | 599.000000 | 2.000000 | 605.000000 | 1.000000 |
In summary, Pandas provides many useful functions to explore and summarize key details about an unfamiliar dataset. Methods like df.head(), df.tail(), df.shape, df.columns, df.dtypes, and df.info() quickly provide an overview of the structure, size, and data types.
df.describe() generates descriptive statistics to study numeric columns. Taken together, these methods help you efficiently inspect the contents, completeness, and integrity of the data.
5. Selecting and Analysing Columns
Filtering columns is an important technique in exploratory data analysis to focus on relevant subsets of the data. If we only want some parts of a DataFrame, we need to be able to specify the columns we are interested in, i.e. choose specific columns from our DataFrame.
Note: A column returned on its own is returned as a Series (a pandas data type)
The syntax for choosing a Column from the DataFrame and returning it as a Series is:
DataFrame['Column']
The syntax for choosing more than one column from the DataFrame, returning a DataFrame, is:
DataFrame[['Column 1', 'Column 2']]
The syntax for choosing a Column from the DataFrame and returning a DataFrame is:
DataFrame[['Column']]
In these examples, DataFrame is the name of the DataFrame and Column is the name of the Column.
E.g. Single column:
customer['first_name']
0         MARY
1     PATRICIA
2        LINDA
3      BARBARA
4    ELIZABETH
         ...
594   TERRENCE
595    ENRIQUE
596    FREDDIE
597       WADE
598     AUSTIN
Name: first_name, Length: 599, dtype: object
E.g. Multiple columns:
customer[['first_name', 'email']]
 | first_name | email |
---|---|---|
0 | MARY | MARY.SMITH@sakilacustomer.org |
1 | PATRICIA | PATRICIA.JOHNSON@sakilacustomer.org |
2 | LINDA | LINDA.WILLIAMS@sakilacustomer.org |
3 | BARBARA | BARBARA.JONES@sakilacustomer.org |
4 | ELIZABETH | ELIZABETH.BROWN@sakilacustomer.org |
… | … | … |
594 | TERRENCE | TERRENCE.GUNDERSON@sakilacustomer.org |
595 | ENRIQUE | ENRIQUE.FORSYTHE@sakilacustomer.org |
596 | FREDDIE | FREDDIE.DUGGAN@sakilacustomer.org |
597 | WADE | WADE.DELVALLE@sakilacustomer.org |
598 | AUSTIN | AUSTIN.CINTRON@sakilacustomer.org |
599 rows × 2 columns
We can use the methods we saw before for analysis on a subset of data, e.g.:
customer[['first_name', 'last_name']].describe()
 | first_name | last_name |
---|---|---|
count | 599 | 599 |
unique | 591 | 599 |
top | JESSIE | SMITH |
freq | 2 | 1 |
6. Unique
The .unique() method in Pandas is very useful in exploratory data analysis for understanding the distinct values present in a column, while .nunique() gives a quick sense of the variability/cardinality of a column. Both can be combined with .value_counts() to get the frequency of each unique value.
Let’s explore “unique”, “nunique” and “value_counts”. The syntax for using them is:
DataFrame['Column'].unique()
DataFrame['Column'].nunique()
DataFrame['Column'].value_counts()
In this example DataFrame is the name of the DataFrame; in our case, the customer dataset.
E.g.: Let’s analyse how many stores are in the dataset and their frequency:
customer['store_id'].unique()
array([1, 2], dtype=int64)
customer['store_id'].nunique()
2
customer['store_id'].value_counts()
1    326
2    273
Name: store_id, dtype: int64
We can see that there are 2 stores, and that 326 customers are registered in store 1 and 273 in store 2.
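If proportions are more useful than raw counts, value_counts() can also return relative frequencies; a small sketch:
## Share of customers per store instead of raw counts
customer['store_id'].value_counts(normalize=True)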
7. Sorting
By reordering the rows of a DataFrame based on one or more columns, sort_values() reveals patterns, relationships, trends, and anomalies that may be obscured in unsorted data. Sorting numeric data into ascending or descending order highlights outliers and distributions. Chronologically sorting date columns provides temporal context.
Next, we will sort the customers by their first name.
The syntax for sorting the data is:
# Sort the data from lowest to highest
DataFrame.sort_values(by='Column')
In this example DataFrame is the name of the DataFrame, Column is the name of the Column we want to sort by.
customer.sort_values(by='first_name').head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
374 | 375 | 2 | AARON | SELBY | AARON.SELBY@sakilacustomer.org | 380 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
366 | 367 | 1 | ADAM | GOOCH | ADAM.GOOCH@sakilacustomer.org | 372 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
524 | 525 | 2 | ADRIAN | CLARY | ADRIAN.CLARY@sakilacustomer.org | 531 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
216 | 217 | 2 | AGNES | BISHOP | AGNES.BISHOP@sakilacustomer.org | 221 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
388 | 389 | 1 | ALAN | KAHN | ALAN.KAHN@sakilacustomer.org | 394 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
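sort_values() also accepts ascending=False for descending order and a list of columns for sorting on several keys; for example, a quick sketch on the same customer table:
# Sort from highest to lowest customer_id
customer.sort_values(by='customer_id', ascending=False).head()

# Sort by last name, then first name
customer.sort_values(by=['last_name', 'first_name']).head()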
8. Filtering Rows
By selectively removing rows based on logical criteria, you can focus on specific observations and patterns. For example, filtering numeric columns by thresholds reveals outliers and anomalies. Filtering categorical data on classes studies their distributions separately. Extracting rows with missing values helps quantify and understand NaN. Row filtering enables segmenting data like customers by region or users by age group.
If we wish to look at subsets of the data, we will need to filter or group it. Let’s start by learning to filter it.
To do so, we need to be able to choose a specific row from our table.
The syntax for filtering a DataFrame is:
# Filter the DataFrame
DataFrame[DataFrame['Column'] == 'value']
In this example DataFrame is the name of the DataFrame, Column is the name of the column we want to filter on and value is the value we’re interested in.
Let’s make a subset of store 2:
customer[customer['store_id'] == 2].head()
 | customer_id | store_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
5 | 6 | 2 | JENNIFER | DAVIS | JENNIFER.DAVIS@sakilacustomer.org | 10 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
7 | 8 | 2 | SUSAN | WILSON | SUSAN.WILSON@sakilacustomer.org | 12 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
8 | 9 | 2 | MARGARET | MOORE | MARGARET.MOORE@sakilacustomer.org | 13 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
10 | 11 | 2 | LISA | ANDERSON | LISA.ANDERSON@sakilacustomer.org | 15 | True | 2006-02-14 | 2006-02-15 09:57:20 | 1 |
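Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, a sketch that keeps only the inactive customers of store 2, using the active column seen above:
## Inactive customers from store 2
customer[(customer['store_id'] == 2) & (customer['active'] == 0)].head()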
9. Aggregating data
Fundamentally, groupby() enables dividing up heterogeneous data into homogeneous chunks to ask focused, comparative questions. It facilitates studying inter-group variability and intra-group consistency. Any aggregates, transformations, and visualizations applied within groups can reveal insights not visible in the full data. If we want to look at a customer’s aggregate transactions, we can use the .groupby() method to aggregate the data.
The syntax for aggregating the DataFrame is:
# Aggregate the DataFrame by group
DataFrame.groupby('group')
We can choose what kind of aggregate output we want for the columns by appending the appropriate method, e.g. sum():
# Aggregate the DataFrame by group and sum each column
DataFrame.groupby('group').sum()
If we want to apply a different aggregation method to each column, we can use a slightly different syntax:
# Aggregate the DataFrame by group, with a different aggregation per column
DataFrame.groupby('group').agg({
    'Column1': 'sum',   # take the sum of Column1
    'Column2': 'mean',  # take the mean of Column2
    'Column3': 'max'    # take the max of Column3
})
In this example, group is the name of the Column we want to group by and sum/mean/max are the aggregation methods.
Here is the groupby documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
For the next example, we would like to replicate the distribution of customers per store (we already did this above with value_counts ;):
customer.groupby('store_id').count()
 | customer_id | first_name | last_name | email | address_id | activebool | create_date | last_update | active |
---|---|---|---|---|---|---|---|---|---|
store_id | |||||||||
1 | 326 | 326 | 326 | 326 | 326 | 326 | 326 | 326 | 326 |
2 | 273 | 273 | 273 | 273 | 273 | 273 | 273 | 273 | 273 |
We can see that the output returns all the columns of the dataset. To clean it up, we can select a single column (or multiple columns if needed) with the following syntax:
customer[['store_id', 'customer_id']].groupby('store_id').count()
 | customer_id |
---|---|
store_id | |
1 | 326 |
2 | 273 |
Let’s use a different dataset to better illustrate the groupby method; for this, we are going to load the payment dataset:
payment = pd.read_csv('payment.csv')
payment.head()
 | payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|---|
0 | 16050 | 269 | 2 | 7 | 1.99 | 2007-01-24 21:40:19.996577 |
1 | 16051 | 269 | 1 | 98 | 0.99 | 2007-01-25 15:16:50.996577 |
2 | 16052 | 269 | 2 | 678 | 6.99 | 2007-01-28 21:44:14.996577 |
3 | 16053 | 269 | 2 | 703 | 0.99 | 2007-01-29 00:58:02.996577 |
4 | 16054 | 269 | 1 | 750 | 4.99 | 2007-01-29 08:10:06.996577 |
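Note that payment_date is read in by read_csv as a plain object (text) column. If you plan to do any time-based analysis, a common follow-up is to convert it to a proper datetime type; a minimal sketch:
## Convert payment_date from text to a datetime column
payment['payment_date'] = pd.to_datetime(payment['payment_date'])
payment.dtypes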
Using groupby, we can compute different aggregations (like counts, sums, min, and max) on the amount feature. In this case we analyse each staff member and the sales they made:
payment[['staff_id', 'amount']].groupby('staff_id').sum()
 | amount |
---|---|
staff_id | |
1 | 33489.47 |
2 | 33927.04 |
The mean can also be analysed:
payment[['staff_id', 'amount']].groupby('staff_id').mean()
 | amount |
---|---|
staff_id | |
1 | 4.156568 |
2 | 4.245125 |
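Using the .agg() syntax shown earlier, we can also compute several statistics on the amount column in one pass; for example:
# Count, total, mean, and maximum payment per staff member
payment.groupby('staff_id').agg({'amount': ['count', 'sum', 'mean', 'max']})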
Here is the list of aggregation functions available in Pandas:
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html
10. Plotting
Visual representation of the data enables detection of patterns, relationships, trends, and anomalies that may not be observable in the raw data alone. Fundamentally, graphical exploration leads to new hypotheses, questions, and insights that drive the analysis forward. Generating effective visualizations during EDA is both an art and a science. Master the key plot types, know their strengths and weaknesses, and enrich your analysis with impactful, intuitive graphics. Visual data exploration should be a core part of any data scientist’s workflow. Here are some examples of visualisations and their benefit:
- Plots such as histograms, scatter plots, and box plots provide summaries of distributions that reveal insights into the shape, skew, outliers, and clusters within the data.
- Visualizing data categorized by color or facet makes comparisons across groups explicit.
- Time series plots surface seasonalities and temporal patterns.
- Correlation plots highlight relationships between variables.
Pandas allows us to plot data directly from a DataFrame, and it covers most of the standard plot types you will want.
Let’s analyse the distribution of payment amounts in the dataset with a histogram.
The syntax to make a histogram is:
DataFrame.hist(column='Column')
In this example DataFrame is the name of the DataFrame, and Column is the name of the numerical column you would like to make the histogram of.
payment.hist(column='amount');
We can identify three peaks in the distribution of payments: the first around a value of 1 with about 3,000 occurrences, the second around a value of 3 with about 3,500, and the highest around a value of 5 with more than 5,000 occurrences; not many payments are made above a monetary value of 6.
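Histograms are only one option. Pandas can also draw bar charts, box plots, and scatter plots directly from a DataFrame; for example, here is a quick sketch of a bar chart of customers per store, reusing the groupby count from earlier (this assumes matplotlib is installed, since Pandas uses it as its plotting backend):
# Bar chart of the number of customers per store
customer[['store_id', 'customer_id']].groupby('store_id').count().plot(kind='bar');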
Conclusion
Exploratory data analysis is a crucial first step in any data science undertaking, and Python’s Pandas library provides the perfect tools to perform effective EDA. By dedicating time to activities like data importing, cleaning, visualization, segmentation, and hypothesis generation using Pandas’ versatile built-in capabilities, you gain intimate understanding and derive insights that enable building better models and analytics systems downstream.
Thorough EDA leveraging Python and Pandas leads to more informed conclusions and impactful data-driven decision-making by uncovering key findings, relationships and anomalies that would otherwise go unseen. The power of these flexible data analysis tools makes EDA an indispensable part of the data science process.
I hope you enjoyed this quick introduction to this interesting subject. Make EDA a priority step in your data science projects!