
21 Nov '23

Exploratory Data Analysis with Python and Pandas

by Miguel Angel Sanchez Razo

Whether you are just getting started in data science or are a seasoned expert, one of the most critical activities is exploratory data analysis (EDA). EDA allows you to develop an intimate understanding of your data that is foundational to downstream modelling and analysis. It is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with summary statistics and graphical representations.

EDA is a crucial initial step when working with any new dataset. In this post, we’ll explore how to perform effective EDA using Python and the powerful Pandas data analysis library. Pandas provides data structures and tools that make data exploration seamless, and in this blog we are going to explore some of the activities and tools that can be used to do EDA.

Table of Contents

  • 1. What is EDA?
  • 2. Why Perform EDA?
  • 3. Conducting EDA with Python and Pandas
  • 4. Exploring the data
  • 5. Selecting and Analysing Columns
  • 6. Unique
  • 7. Sorting
  • 8. Filtering Rows
  • 9. Aggregating the data
  • 10. Plotting
  • Conclusion

Datasets

  • payment.csv
  • customer.csv

1. What is EDA?

EDA involves activities aimed at detecting patterns, identifying anomalies, testing assumptions, and conducting preliminary feature engineering and data preparation. Key aspects of EDA include:

  • Importing and formatting raw datasets
  • Conducting integrity checks and dealing with missing values (a short sketch follows this list)
  • Generating summary statistics on both numeric and categorical variables
  • Creating visualizations such as histograms, scatter plots, and bar charts
  • Stratified analysis by segmenting data into subgroups
  • Identifying promising relationships and correlations to formulate hypotheses
  • Cleaning, transforming, and preparing data for modelling
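
A minimal sketch of the integrity-check and missing-value steps mentioned above, assuming pandas is installed and using the customer.csv file listed in the Datasets section (only standard isna, duplicated, and dropna calls are used):

import pandas as pd

# Load the raw dataset (assumes customer.csv sits next to this notebook)
customer = pd.read_csv('customer.csv')

# Integrity checks: count missing values per column and fully duplicated rows
print(customer.isna().sum())
print(customer.duplicated().sum())

# One simple way of dealing with missing values: drop rows that contain any
customer_clean = customer.dropna()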

[Figure: the exploratory data analysis process]
Reference: https://datos.gob.es/en/documentacion/practical-introductory-guide-exploratory-data-analysis

2. Why perform EDA?

Thorough EDA is critical because it:

  • Builds understanding and intuition about the data.
  • Reveals issues that guide data cleaning and preprocessing.
  • Provides insights that lead to testable hypotheses about the data.
  • Identifies relationships between variables to model.
  • Surfaces gaps that highlight areas for additional data collection.
  • Improves the quality of downstream analyses and models.

3. Conducting EDA with Python and Pandas

Python and Pandas provide a flexible, powerful environment for exploring datasets. Key aspects include:

  • Easily importing data from various sources like CSV and databases
  • Dataframe structure for storing tabular data
  • Fast summary stats generation with .describe()
  • Built-in methods for handling missing data
  • Plotting functions for quick data visualization
  • Filtering, grouping, and combining datasets for segmented analysis
  • Merging and joining data from different sources
  • Easy handling of large datasets that would be impractical in Excel

By leveraging Python and Pandas for thorough EDA, you gain key insights into your data that enable building effective machine learning models and analytics systems. Make EDA a priority step in your data science projects!
Let’s do some EDA with Python and Pandas!

Importing libraries

As a data coach, I always emphasize to my students and clients the value of Python’s extensive open source libraries for exploratory data analysis. The libraries provide pre-built tools that are optimized for data tasks, saving you time and effort compared to coding from scratch, and leveraging them follows best practices refined by the Python data community over years. For example, Pandas DataFrames handle messy real-world data better than native Python objects, while Matplotlib and Seaborn build meaningful graphics tailored for statistical analysis. These libraries come with batteries included, so you don’t have to reinvent the wheel. Leveraging them accelerates your EDA, avoids re-coding common tasks, and follows established conventions, allowing you to focus on uncovering insights rather than low-level programming. For this exercise we are going to import and use pandas.
Documentation: https://pandas.pydata.org/docs/reference/frame.html
Note: place the files “customer.csv” and “payment.csv” in the same folder where this notebook is located.

In [1]:

# Import the pandas package
import pandas as pd

In [2]:

## Read the customer table and assign it the label 'customer' so we can refer to it later
customer = pd.read_csv('customer.csv')

Let’s break down this line to see what’s happening:

'customer.csv' is a string

In Python, text is stored as a string: a sequence of characters enclosed in ‘single quotes’, “double quotes”, or “””triple quotes”””. Everything in Python is an object, and every object in Python has a type. Computer programs typically keep track of a range of data types. For example, 1.5 is a floating point number, while 1 is an integer. Programs need to distinguish between these two types for various reasons:

  • They are stored in memory differently.
  • Their arithmetic operations are different

Some of the basic numerical types in Python include:

  • int (integer; a whole number with no decimal place), e.g. 10, -3
  • float (floating point; a number that has a decimal place), e.g. 7.41, -0.006
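
As a quick, hypothetical illustration, the built-in type() function shows which type Python assigns to a value:

print(type(10))      # <class 'int'>
print(type(-3))      # <class 'int'>
print(type(7.41))    # <class 'float'>
print(type('MARY'))  # <class 'str'>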

pd.read_csv is a function in the Pandas module

Remember from SQL that dotted names like Server.Database.Schema.Table.Column are used to denote nested objects. read_csv is a Pandas function, so it is part of pd and is called as pd.read_csv.

customer = assigns our object to the customer variable

Our customer table is given the label customer. These variables are like the aliases we saw in SQL. In Python, a variable is a name you specify in your code that maps to a particular object, object instance, or value. By defining variables, we can refer to things by names that make sense to us. Names for variables can only contain letters, underscores (_), or numbers (no spaces, dashes, or other characters) and must start with a letter or underscore; good practice is to start with a letter. We assign a label to an object with a single =.
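
A few hypothetical assignments that follow (and break) the naming rules described above:

# Valid variable names: letters, underscores, and numbers, starting with a letter or underscore
store_count = 2
_total_customers = 599
customer2 = 'PATRICIA'

# Invalid names (these would raise a SyntaxError if uncommented):
# 2nd_store = 1   # cannot start with a number
# store-id = 1    # dashes are not allowed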


4. Exploring the data

Next, we will explore the data by using the .head() method to see the top rows and the columns attribute to see the names of all the columns.

  • df.head() – View first 5 rows
  • df.tail() – View last 5 rows
  • df.shape – Number of rows and columns
  • df.columns – Return the names of the columns
  • df.dtypes – Data type of each column
  • df.info() – Index, datatype and memory information
  • df.describe() – Summary statistics for numeric columns

Let’s explore each of these methods, starting with head. The syntax for using it is:

DataFrame.head()

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The head documentation

In [3]:

## Displaying the dataset returns the first 5 and last 5 rows of the dataset.
customer

Out[3]:

customer_id store_id first_name last_name email address_id activebool create_date last_update active
0 1 1 MARY SMITH MARY.SMITH@sakilacustomer.org 5 True 2006-02-14 2006-02-15 09:57:20 1
1 2 1 PATRICIA JOHNSON PATRICIA.JOHNSON@sakilacustomer.org 6 True 2006-02-14 2006-02-15 09:57:20 1
2 3 1 LINDA WILLIAMS LINDA.WILLIAMS@sakilacustomer.org 7 True 2006-02-14 2006-02-15 09:57:20 1
3 4 2 BARBARA JONES BARBARA.JONES@sakilacustomer.org 8 True 2006-02-14 2006-02-15 09:57:20 1
4 5 1 ELIZABETH BROWN ELIZABETH.BROWN@sakilacustomer.org 9 True 2006-02-14 2006-02-15 09:57:20 1
… … … … … … … … … … …
594 595 1 TERRENCE GUNDERSON TERRENCE.GUNDERSON@sakilacustomer.org 601 True 2006-02-14 2006-02-15 09:57:20 1
595 596 1 ENRIQUE FORSYTHE ENRIQUE.FORSYTHE@sakilacustomer.org 602 True 2006-02-14 2006-02-15 09:57:20 1
596 597 1 FREDDIE DUGGAN FREDDIE.DUGGAN@sakilacustomer.org 603 True 2006-02-14 2006-02-15 09:57:20 1
597 598 1 WADE DELVALLE WADE.DELVALLE@sakilacustomer.org 604 True 2006-02-14 2006-02-15 09:57:20 1
598 599 2 AUSTIN CINTRON AUSTIN.CINTRON@sakilacustomer.org 605 True 2006-02-14 2006-02-15 09:57:20 1

599 rows × 10 columns

In [4]:

## Now let's try head:
customer.head()

Out[4]:

customer_id store_id first_name last_name email address_id activebool create_date last_update active
0 1 1 MARY SMITH MARY.SMITH@sakilacustomer.org 5 True 2006-02-14 2006-02-15 09:57:20 1
1 2 1 PATRICIA JOHNSON PATRICIA.JOHNSON@sakilacustomer.org 6 True 2006-02-14 2006-02-15 09:57:20 1
2 3 1 LINDA WILLIAMS LINDA.WILLIAMS@sakilacustomer.org 7 True 2006-02-14 2006-02-15 09:57:20 1
3 4 2 BARBARA JONES BARBARA.JONES@sakilacustomer.org 8 True 2006-02-14 2006-02-15 09:57:20 1
4 5 1 ELIZABETH BROWN ELIZABETH.BROWN@sakilacustomer.org 9 True 2006-02-14 2006-02-15 09:57:20 1

Let’s explore the tail method. The syntax for using it is:

DataFrame.tail()

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The tail documentation

In [5]:

##Tail:
customer.tail()

Out[5]:

customer_id store_id first_name last_name email address_id activebool create_date last_update active
594 595 1 TERRENCE GUNDERSON TERRENCE.GUNDERSON@sakilacustomer.org 601 True 2006-02-14 2006-02-15 09:57:20 1
595 596 1 ENRIQUE FORSYTHE ENRIQUE.FORSYTHE@sakilacustomer.org 602 True 2006-02-14 2006-02-15 09:57:20 1
596 597 1 FREDDIE DUGGAN FREDDIE.DUGGAN@sakilacustomer.org 603 True 2006-02-14 2006-02-15 09:57:20 1
597 598 1 WADE DELVALLE WADE.DELVALLE@sakilacustomer.org 604 True 2006-02-14 2006-02-15 09:57:20 1
598 599 2 AUSTIN CINTRON AUSTIN.CINTRON@sakilacustomer.org 605 True 2006-02-14 2006-02-15 09:57:20 1

Let’s look at the shape attribute. The syntax for using it is:

DataFrame.shape

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The shape documentation

In [6]:

##Shape, number of rows and number of columns:
customer.shape

Out[6]:

(599, 10)

Let’s look at the columns attribute. The syntax for using it is:

DataFrame.columns

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The columns documentation

In [7]:

customer.columns

Out[7]:

Index(['customer_id', 'store_id', 'first_name', 'last_name', 'email',
       'address_id', 'activebool', 'create_date', 'last_update', 'active'],
      dtype='object')

Let’s look at the dtypes attribute. The syntax for using it is:

DataFrame.dtypes

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The dtypes documentation

In [8]:

customer.dtypes

Out[8]:

customer_id     int64
store_id        int64
first_name     object
last_name      object
email          object
address_id      int64
activebool       bool
create_date    object
last_update    object
active          int64
dtype: object

df.info() should become standard practice when loading in new data in Python. It provides an informative data quality “dashboard” to guide deeper investigation or cleaning as needed. Catching potential problems early mitigates headaches later in analysis. The ease and speed of df.info() allows quick iterations to check data manipulations. Leveraging this single line of code accelerates EDA and sets the stage for reliable data science workflows.

  • Provides concise summary of DataFrame contents – shape, data types, memory usage
  • Identifies any incorrect or unexpected data types that could cause errors later
  • Easily spots missing values via the non-null count per column
  • Can be used to compare before and after DataFrame changes during data cleaning

Let’s explore the info method. The syntax for using it is:

DataFrame.info()

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The info documentation

In [20]:

customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  599 non-null    int64 
 1   store_id     599 non-null    int64 
 2   first_name   599 non-null    object
 3   last_name    599 non-null    object
 4   email        599 non-null    object
 5   address_id   599 non-null    int64 
 6   activebool   599 non-null    bool  
 7   create_date  599 non-null    object
 8   last_update  599 non-null    object
 9   active       599 non-null    int64 
dtypes: bool(1), int64(4), object(5)
memory usage: 42.8+ KB

df.describe() is an extremely useful method in exploratory data analysis for summarizing numerical data. It rapidly provides an insightful overview of key statistics that support exploratory analysis; the output reveals details about quality, variance, and anomalies that warrant further examination. Here are some key reasons it provides value:

  • Calculates standard summary statistics like count, mean, std dev, min, max, and quartiles. Gives overview of distribution.
  • Statistics help identify outliers, skewness, and shape characteristics.
  • Count of non-null values highlights missing data if lower than total rows.
  • Summary stats can reveal insights like high standard deviations and long tails.
  • Provides a “first look” at numeric attributes to guide more advanced analysis.

Let’s explore the describe method. The syntax for using it is:

 DataFrame.describe()

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
The describe documentation

In [22]:

customer.describe()

Out[22]:

customer_id store_id address_id active
count 599.000000 599.000000 599.000000 599.000000
mean 300.000000 1.455760 304.724541 0.974958
std 173.060683 0.498455 173.698609 0.156382
min 1.000000 1.000000 5.000000 0.000000
25% 150.500000 1.000000 154.500000 1.000000
50% 300.000000 1.000000 305.000000 1.000000
75% 449.500000 2.000000 454.500000 1.000000
max 599.000000 2.000000 605.000000 1.000000

In summary, Pandas provides many useful functions to explore and summarize key details about an unfamiliar dataset. Methods like df.head(), df.tail(), df.shape, df.columns, df.dtypes, and df.info() quickly provide an overview of the structure, size, and data types.
df.describe() generates descriptive statistics to study numeric columns. Taken together, these methods help you efficiently inspect the contents, completeness, and integrity of the data.


5. Selecting and Analysing Columns

Filtering columns is an important technique in exploratory data analysis to focus on relevant subsets of data. If we only want some parts of a DataFrame, we need to be able to specify the columns we are interested in. To do so, we need to be able to choose a specific column from our DataFrame.
Note: a column selected on its own is returned as a Series (a pandas data type).
The syntax for choosing a column from the DataFrame and returning it as a Series is:

DataFrame['Column']

The syntax for choosing more than one column from the DataFrame, returning a DataFrame is:

DataFrame[['Column 1', 'Column 2']]

The syntax for choosing a Column from the DataFrame and returning a DataFrame is:

DataFrame[['Column']]

In these examples, DataFrame is the name of the DataFrame and Column is the name of the Column.


E.g. Single column:

In [11]:

customer['first_name']

Out[11]:

0           MARY
1       PATRICIA
2          LINDA
3        BARBARA
4      ELIZABETH
         ...    
594     TERRENCE
595      ENRIQUE
596      FREDDIE
597         WADE
598       AUSTIN
Name: first_name, Length: 599, dtype: object

E.g. Multiple columns:

In [29]:

customer[['first_name', 'email']]

Out[29]:

first_name email
0 MARY MARY.SMITH@sakilacustomer.org
1 PATRICIA PATRICIA.JOHNSON@sakilacustomer.org
2 LINDA LINDA.WILLIAMS@sakilacustomer.org
3 BARBARA BARBARA.JONES@sakilacustomer.org
4 ELIZABETH ELIZABETH.BROWN@sakilacustomer.org
… … …
594 TERRENCE TERRENCE.GUNDERSON@sakilacustomer.org
595 ENRIQUE ENRIQUE.FORSYTHE@sakilacustomer.org
596 FREDDIE FREDDIE.DUGGAN@sakilacustomer.org
597 WADE WADE.DELVALLE@sakilacustomer.org
598 AUSTIN AUSTIN.CINTRON@sakilacustomer.org

599 rows × 2 columns


We can use the methods we saw before for analysis on a subset of data, e.g.:

In [30]:

customer[['first_name', 'last_name']].describe()

Out[30]:

first_name last_name
count 599 599
unique 591 599
top JESSIE SMITH
freq 2 1

6. Unique

The unique() method in Pandas is very useful in exploratory data analysis for understanding the distinct values present in a column. nunique() gives a quick sense of the variability/cardinality of a column, and both can be combined with value_counts() to obtain the frequency of each unique value.


Let’s explore unique, nunique and value_counts. The syntax for using them is:

DataFrame['Column'].unique()
DataFrame['Column'].nunique()
DataFrame['Column'].value_counts()

In this example, DataFrame is the name of the DataFrame; in our case that is customer.
Here is the unique documentation.
E.g.: Let’s analyse how many stores are in the dataset and their frequency:

In [34]:

customer['store_id'].unique()

Out[34]:

array([1, 2], dtype=int64)

In [35]:

customer['store_id'].nunique()

Out[35]:

2

In [37]:

customer['store_id'].value_counts()

Out[37]:

1    326
2    273
Name: store_id, dtype: int64

We can see that there are 2 stores, with 326 customers registered in store 1 and 273 in store 2.
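
value_counts() can also report proportions instead of raw counts by passing normalize=True; a small sketch of this standard pandas option:

# Share of customers per store rather than raw counts
customer['store_id'].value_counts(normalize=True)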


7. Sorting

By reordering the rows of a DataFrame based on one or more columns, sort_values() reveals patterns, relationships, trends, and anomalies that may be obscured in unsorted data. Sorting numeric data into ascending or descending order highlights outliers and distributions, and chronologically sorting date columns provides temporal context. Next, we will sort the customers by their first name. The syntax for sorting the data is:

# Sort the data from lowest to highest
DataFrame.sort_values(by='Column')

In this example, DataFrame is the name of the DataFrame and Column is the name of the column we want to sort by.
The sort_values documentation

In [38]:

customer.sort_values(by='first_name').head()

Out[38]:

customer_id store_id first_name last_name email address_id activebool create_date last_update active
374 375 2 AARON SELBY AARON.SELBY@sakilacustomer.org 380 True 2006-02-14 2006-02-15 09:57:20 1
366 367 1 ADAM GOOCH ADAM.GOOCH@sakilacustomer.org 372 True 2006-02-14 2006-02-15 09:57:20 1
524 525 2 ADRIAN CLARY ADRIAN.CLARY@sakilacustomer.org 531 True 2006-02-14 2006-02-15 09:57:20 1
216 217 2 AGNES BISHOP AGNES.BISHOP@sakilacustomer.org 221 True 2006-02-14 2006-02-15 09:57:20 1
388 389 1 ALAN KAHN ALAN.KAHN@sakilacustomer.org 394 True 2006-02-14 2006-02-15 09:57:20 1
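
sort_values() also supports descending order and sorting by several columns at once; a short sketch using the same customer data (not part of the original walkthrough):

# Sort from highest to lowest customer_id
customer.sort_values(by='customer_id', ascending=False).head()

# Sort by store first, then by first name within each store
customer.sort_values(by=['store_id', 'first_name']).head()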

8. Filtering Rows

By selectively removing rows based on logical criteria, you can focus on specific observations and patterns. For example, filtering numeric columns by thresholds reveals outliers and anomalies, filtering categorical data by class lets you study each class’s distribution separately, and extracting rows with missing values helps quantify and understand NaNs. Row filtering also enables segmenting data, such as customers by region or users by age group.
If we wish to look at subsets of the data, we will need to filter or group it. Let’s start by learning to filter it. To do so, we need to be able to choose specific rows from our table.
The syntax for filtering a DataFrame is:

# Filter the DataFrame
DataFrame[DataFrame['Column'] == 'value']

In this example, DataFrame is the name of the DataFrame, Column is the name of the column we want to filter on, and value is the value we’re interested in.
Let’s make a subset of store 2:

In [40]:

customer[customer['store_id'] == 2].head()

Out[40]:

customer_id store_id first_name last_name email address_id activebool create_date last_update active
3 4 2 BARBARA JONES BARBARA.JONES@sakilacustomer.org 8 True 2006-02-14 2006-02-15 09:57:20 1
5 6 2 JENNIFER DAVIS JENNIFER.DAVIS@sakilacustomer.org 10 True 2006-02-14 2006-02-15 09:57:20 1
7 8 2 SUSAN WILSON SUSAN.WILSON@sakilacustomer.org 12 True 2006-02-14 2006-02-15 09:57:20 1
8 9 2 MARGARET MOORE MARGARET.MOORE@sakilacustomer.org 13 True 2006-02-14 2006-02-15 09:57:20 1
10 11 2 LISA ANDERSON LISA.ANDERSON@sakilacustomer.org 15 True 2006-02-14 2006-02-15 09:57:20 1
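
Conditions can be combined with & (and) and | (or), each wrapped in its own parentheses; a small sketch of two common filters on this dataset (not part of the original walkthrough):

# Customers registered in store 2 who are also marked as active
customer[(customer['store_id'] == 2) & (customer['active'] == 1)].head()

# Rows with a missing email address (none are expected in this dataset)
customer[customer['email'].isna()]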

9. Aggregating the data

Fundamentally, groupby() enables dividing up heterogeneous data into homogeneous chunks to ask focused, comparative questions. It facilitates studying inter-group variability and intra-group consistency. Any aggregates, transformations, and visualizations applied within groups can reveal insights not visible in the full data. If we want to look at a customer’s aggregate transactions, we can use the .groupby() method to aggregate the data.
The syntax for aggregating the DataFrame is:

# Aggregate the DataFrame by group
DataFrame.groupby('group')

We can choose what kind of aggregate output we want for the columns by appending the appropriate method, e.g. sum():

# Aggregate the DataFrame by group
DataFrame.groupby('group').sum()

If we want to apply a different aggregation method to each column, we can use a different syntax:

# Aggregate the DataFrame by group
DataFrame.groupby('group').agg({
    'Column1': 'sum',   # take the sum of Column1
    'Column2': 'mean',  # take the mean of Column2
    'Column3': 'max'    # take the max of Column3
})

In this example, group is the name of the Column we want to group by and sum/mean/max are the aggregation methods.
Here is the groupby documentation


For the next example, we would like to replicate the distribution of customers per store (we previously obtained this with value_counts()):

In [8]:

customer.groupby('store_id').count()

Out[8]:

customer_id first_name last_name email address_id activebool create_date last_update active
store_id
1 326 326 326 326 326 326 326 326 326
2 273 273 273 273 273 273 273 273 273

We can see that the output returns all the columns of the dataset. To tidy it up, we can select a single column (or multiple columns if needed) with the following syntax:

In [55]:

customer[['store_id', 'customer_id']].groupby('store_id').count()

Out[55]:

customer_id
store_id
1 326
2 273

Let’s use a different dataset to better illustrate the groupby method. For this, we are going to load the payment dataset:

In [14]:

payment = pd.read_csv('payment.csv')
payment.head()

Out[14]:

payment_id customer_id staff_id rental_id amount payment_date
0 16050 269 2 7 1.99 2007-01-24 21:40:19.996577
1 16051 269 1 98 0.99 2007-01-25 15:16:50.996577
2 16052 269 2 678 6.99 2007-01-28 21:44:14.996577
3 16053 269 2 703 0.99 2007-01-29 00:58:02.996577
4 16054 269 1 750 4.99 2007-01-29 08:10:06.996577

We can use groupby with different aggregations (such as count, sum, min, and max) on the amount feature. In this case, we analyse the sales made by each staff member:

In [63]:

payment[['staff_id', 'amount']].groupby('staff_id').sum()

Out[63]:

amount
staff_id
1 33489.47
2 33927.04

The mean can also be analysed:

In [64]:

payment[['staff_id', 'amount']].groupby('staff_id').mean()

Out[64]:

amount
staff_id
1 4.156568
2 4.245125
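
The agg() template shown earlier can combine several aggregations in one call; a sketch applied to the payment data (the column choices here are illustrative, not from the original post):

# Total sales, average sale, and number of payments handled per staff member
payment.groupby('staff_id').agg({
    'amount': ['sum', 'mean'],
    'payment_id': 'count'
})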

Here is the list of aggregation functions available in Pandas:
[Table: Pandas aggregation functions]
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html

10. Plotting

Visual representation of the data enables detection of patterns, relationships, trends, and anomalies that may not be observable in the raw data alone. Fundamentally, graphical exploration leads to new hypotheses, questions, and insights that drive the analysis forward. Generating effective visualizations during EDA is both an art and a science. Master the key plot types, know their strengths and weaknesses, and enrich your analysis with impactful, intuitive graphics. Visual data exploration should be a core part of any data scientist’s workflow. Here are some examples of visualisations and their benefit:

  • Plots such as histograms, scatter plots, and box plots provide summaries of distributions that reveal insights into the shape, skew, outliers, and clusters within the data.
  • Visualizing data categorized by color or facet makes comparisons across groups explicit.
  • Time series plots surface seasonalities and temporal patterns.
  • Correlation plots highlight relationships between variables.

Pandas allows us to plot data directly from the DataFrame and covers most of the standard plots you would want. Let’s analyse the distribution of payment amounts in the dataset with a histogram.
The syntax to make a histogram is:

DataFrame.hist(column='Column')

In this example, DataFrame is the name of the DataFrame and Column is the name of the numerical column you would like to make the histogram from.

In [19]:

payment.hist(column='amount');

We can identify three peaks in the distribution of payments: the first around a value of 1 with about 3,000 occurrences, the second around 3 with about 3,500, and the highest around 5 with more than 5,000 occurrences; few payments are made above a value of 6.
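
Pandas can produce other quick plots directly from a Series as well; for example, a bar chart of the customers-per-store counts from section 6 (a sketch using the standard plot() method, not part of the original walkthrough):

# Bar chart of the number of customers registered at each store
customer['store_id'].value_counts().plot(kind='bar', title='Customers per store');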

Conclusion

Exploratory data analysis is a crucial first step in any data science undertaking, and Python’s Pandas library provides the perfect tools to perform effective EDA. By dedicating time to activities like data importing, cleaning, visualization, segmentation, and hypothesis generation using Pandas’ versatile built-in capabilities, you gain intimate understanding and derive insights that enable building better models and analytics systems downstream.
Thorough EDA leveraging Python and Pandas leads to more informed conclusions and impactful data-driven decision-making by uncovering key findings, relationships, and anomalies that would otherwise go unseen. The power of these flexible data analysis tools makes EDA an indispensable part of the data science process. I hope you enjoyed this quick introduction to this interesting subject. Make EDA a priority step in your data science projects!
