Introduction:
In the dynamic realm of data science, the synergy between Python and Snowflake emerges as a dynamic force, empowering data professionals to transcend traditional boundaries. In this blog post, we’ll delve deeper into the role of Snowflake as a cloud-based data warehousing platform and explore how it seamlessly integrates with Python, enhancing the capabilities of data analysis and data processing.
Python – The Swiss Army Knife of Data Science:
Python’s ascendancy in the data science domain is not only due to its versatility but also its adaptability to seamlessly integrate with powerful platforms like Snowflake. Python’s readability and extensive libraries find a perfect ally in Snowflake, transforming it into a wieldy tool for data exploration and analysis.
Snowflake – The Cloud-Driven Data Warehouse:
Snowflake’s architecture, built for the cloud, revolutionizes the way data is stored and processed. Its elasticity, scalability, and ability to handle semi-structured data make it an ideal choice for modern data analytics. As a data coach, understanding Snowflake’s key features and how can be integrated with other powerful tools like Python is essential for harnessing its full potential.
Prerequisites:
Libraries
The libraries that we are going to be using in this exercise are:
- snowflake-connector-python
- You can install it in your environment by: pip install snowflake-connector-python pandas
- You can install it in your environment by: pip install pandas
Connection to Snowflake details
For this exercise a Snowflake instance is needed together with the following account details:
- Account
- User
- Password
- Warehouse
- Database
Objectives:
- Stablish and validate a connection to Snowflake.
- Create a Schema.
- Create a table (SQL-DDL), populate it (SQL-DML), and query the data (SQL-Select) with WHERE, and GROUP BY clauses.
- Extract data from Snowflake to a Pandas Dataframe.
Importing Libraries
import snowflake.connector
import pandas as pd
Establish and validate a connection to Snowflake.
Defining connection credentials:
##Setting up env and connection variables:
ACCOUNT = "put here your credentials value"
USER = "put here your credentials value"
PASSWORD = "put here your credentials value"
WAREHOUSE = "put here your credentials value"
DATABASE = "put here your credentials value"
##Establishing the connection and cursor to be used to execute the API requests:
ctx = snowflake.connector.connect(
user=USER,
password=PASSWORD,
account=ACCOUNT,
warehouse=WAREHOUSE,
database=DATABASE
)
cs = ctx.cursor()
A cursor acts as a pointer or an iterator that allows you to traverse through the results of a database query one at a time. It provides a mechanism for the Python program to interact with the results returned from the database.
Create a Schema.
In databases, a schema is a logical container or organizational structure that holds and defines objects, such as tables, views, indexes, procedures, and other elements. It provides a way to group related database objects together, helping to organize and manage the database’s structure and data. We are going to create the PYTHON_TEST_SCHEMA schema to host the data we are going to use for this exercise.
To execute quesries on Snowflake we are going to use execute(“quey”) on the cursor.
## Creating a Database Schema
cs.execute("CREATE SCHEMA IF NOT EXISTS PYTHON_TEST_SCHEMA")
<snowflake.connector.cursor.SnowflakeCursor at 0x1a7f89e91d0>
## Change the connection to the created schema:
cs.execute("USE STINGRAY_DB.PYTHON_TEST_SCHEMA")
<snowflake.connector.cursor.SnowflakeCursor at 0x1a7f89e91d0>
##Validate that that we are in the right warehouse, database and schema
cs.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();")
cs.fetchall()
[('STINGRAY_WH', 'STINGRAY_DB', 'PYTHON_TEST_SCHEMA')]
fetchall() fetches all or remaining rows of a query result set and returns a list of sequences.
_Note: The name of my database and schema in Snowflake instance are STINGRAY_WH and STINGRAY_DB so you should have a different result. Change accordingly these values when needed in the following stages of the exercise._
Create a table (SQL-DDL), populate it (SQL-DML), and query the data (SQL-Select).
We are going to create a dummy table that will contain information about dogs like their names, breed, age, weight, owner and registration date. The table is going to be later populated with 10 rows.
##Remove the table in case it exist:
cs.execute("DROP TABLE IF EXISTS STINGRAY_DB.PYTHON_TEST_SCHEMA.DOGS;")
cs.fetchall()
[('DOGS successfully dropped.',)]
##Creating a table:
###Defining the query to be executed:
query = "CREATE TABLE DOGS (\
DogID NUMBER, \
Name VARCHAR, \
Breed VARCHAR, \
Age NUMBER, \
Weight FLOAT, \
OwnerName VARCHAR, \
RegistrationDate DATE)"
query
'CREATE TABLE DOGS ( DogID NUMBER, Name VARCHAR, Breed VARCHAR, Age NUMBER, Weight FLOAT, OwnerName VARCHAR, RegistrationDate DATE)'
##Executing the query and printing the output:
cs.execute(query)
cs.fetchall()
[('Table DOGS successfully created.',)]
##Exploring the first 2 columns that are int he dogs table:
cs.execute("SHOW COLUMNS IN TABLE DOGS;")
cs.fetchall()[0:2]
[('DOGS', 'PYTHON_TEST_SCHEMA', 'DOGID', '{"type":"FIXED","precision":38,"scale":0,"nullable":true}', 'true', '', 'COLUMN', '', '', 'STINGRAY_DB', ''), ('DOGS', 'PYTHON_TEST_SCHEMA', 'NAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'true', '', 'COLUMN', '', '', 'STINGRAY_DB', '')]
##Creating some dummy data to be added to the table:
queries = ["INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (1, 'Buddy', 'Labrador Retriever', 3, 25.5, 'John Smith', '2022-01-01');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (2, 'Max', 'Beagle', 2, 30.0, 'Alice Johnson', '2022-02-15');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (3, 'Charlie', 'Golden Retriever', 4, 28.0, 'Bob Williams', '2022-03-20');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (4, 'Lucy', 'Beagle', 2, 18.5, 'Emily Davis', '2022-04-10');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (5, 'Rocky', 'Siberian Husky', 5, 22.0, 'Michael Brown', '2022-05-05');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (6, 'Molly', 'Dachshund', 3, 12.5, 'Alice Johnson', '2022-06-15');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (7, 'Coco', 'Poodle', 4, 15.8, 'Daniel Lee', '2022-07-20');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (8, 'Bailey', 'Siberian Husky', 2, 28.5, 'Olivia Garcia', '2022-08-25');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (9, 'Zoe', 'Golden Retriever', 1, 10.0, 'John Smith', '2022-09-30');",
"INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (10, 'Tucker', 'Golden Retriever', 6, 35.2, 'Alice Johnson', '2022-10-10');"]
##Testing:
##Inserting a single row:
cs.execute(queries[0])
cs.fetchall()
[(1,)]
##Testing if the insertion of data worked:
cs.execute("SELECT * FROM DOGS;")
cs.fetchall()
[(1, 'Buddy', 'Labrador Retriever', 3, 25.5, 'John Smith', datetime.date(2022, 1, 1))]
##Populating the table with the remaining rows:
for query in queries[1:]:
try:
cs.execute(query)
except:
print('Error while inserting:', query)
##Querying the recently created table:
cs.execute("SELECT * FROM DOGS;")
cs.fetchall()
[(1, 'Buddy', 'Labrador Retriever', 3, 25.5, 'John Smith', datetime.date(2022, 1, 1)), (2, 'Max', 'Beagle', 2, 30.0, 'Alice Johnson', datetime.date(2022, 2, 15)), (3, 'Charlie', 'Golden Retriever', 4, 28.0, 'Bob Williams', datetime.date(2022, 3, 20)), (4, 'Lucy', 'Beagle', 2, 18.5, 'Emily Davis', datetime.date(2022, 4, 10)), (5, 'Rocky', 'Siberian Husky', 5, 22.0, 'Michael Brown', datetime.date(2022, 5, 5)), (6, 'Molly', 'Dachshund', 3, 12.5, 'Alice Johnson', datetime.date(2022, 6, 15)), (7, 'Coco', 'Poodle', 4, 15.8, 'Daniel Lee', datetime.date(2022, 7, 20)), (8, 'Bailey', 'Siberian Husky', 2, 28.5, 'Olivia Garcia', datetime.date(2022, 8, 25)), (9, 'Zoe', 'Golden Retriever', 1, 10.0, 'John Smith', datetime.date(2022, 9, 30)), (10, 'Tucker', 'Golden Retriever', 6, 35.2, 'Alice Johnson', datetime.date(2022, 10, 10))]
## Doing some analytics of the table:
###Defining the query:
query = "SELECT COUNT(*), MIN(AGE), MAX(AGE), AVG(AGE), COUNT(DISTINCT BREED), COUNT(DISTINCT OWNERNAME)\
FROM DOGS;"
query
'SELECT COUNT(*), MIN(AGE), MAX(AGE), AVG(AGE), COUNT(DISTINCT BREED), COUNT(DISTINCT OWNERNAME) FROM DOGS;'
##Executing the query:
cs.execute(query)
cs.fetchall()
[(10, 1, 6, Decimal('3.200000'), 6, 7)]
Querying the data, WHERE:
##Applying WHERE clause on numeric values
##Dogs with ages above 3 years
##Defining query:
query = "SELECT NAME, BREED, AGE\
FROM DOGS\
WHERE AGE > 3;"
query
'SELECT NAME, BREED, AGE FROM DOGS WHERE AGE > 3;'
##Executing the query:
cs.execute(query)
cs.fetchall()
[('Charlie', 'Golden Retriever', 4), ('Rocky', 'Siberian Husky', 5), ('Coco', 'Poodle', 4), ('Tucker', 'Golden Retriever', 6)]
##Applying WHERE clause on string values
##Filter only Golden Retrievers
##Defining query:
query = "SELECT NAME, BREED, AGE\
FROM DOGS\
WHERE BREED = " + "'"+"Golden Retriever"+"'" + ";"
query
"SELECT NAME, BREED, AGE FROM DOGS WHERE BREED = 'Golden Retriever';"
##Executing the query:
cs.execute(query)
cs.fetchall()
[('Charlie', 'Golden Retriever', 4), ('Zoe', 'Golden Retriever', 1), ('Tucker', 'Golden Retriever', 6)]
Querying the data, GROUP BY:
##Testing GROUPBY
##Know how many dogs each Owner have
##Defining query:
query = "SELECT OWNERNAME, COUNT(*) DOGS\
FROM DOGS\
GROUP BY OWNERNAME\
ORDER BY DOGS DESC;"
print(query)
##Executing the query:
cs.execute(query)
cs.fetchall()
SELECT OWNERNAME, COUNT(*) DOGS FROM DOGS GROUP BY OWNERNAME ORDER BY DOGS DESC;
[('Alice Johnson', 3), ('John Smith', 2), ('Bob Williams', 1), ('Emily Davis', 1), ('Michael Brown', 1), ('Daniel Lee', 1), ('Olivia Garcia', 1)]
Extract data from Snowflake to a Pandas Dataframe:
To extract a table from Snowflake into a Pandas dataframe, we can use fetch_pandas_all()
##Adding some Pandas to the workflow
##Defining query:
query = "SELECT * FROM DOGS;"
print(query)
##Executing the query:
cs.execute(query)
##Saving te output as Pandas DataFrame:
dogs_df = cs.fetch_pandas_all()
dogs_df
SELECT * FROM DOGS;
DOGID | NAME | BREED | AGE | WEIGHT | OWNERNAME | REGISTRATIONDATE | |
---|---|---|---|---|---|---|---|
0 | 1 | Buddy | Labrador Retriever | 3 | 25.5 | John Smith | 2022-01-01 |
1 | 2 | Max | Beagle | 2 | 30.0 | Alice Johnson | 2022-02-15 |
2 | 3 | Charlie | Golden Retriever | 4 | 28.0 | Bob Williams | 2022-03-20 |
3 | 4 | Lucy | Beagle | 2 | 18.5 | Emily Davis | 2022-04-10 |
4 | 5 | Rocky | Siberian Husky | 5 | 22.0 | Michael Brown | 2022-05-05 |
5 | 6 | Molly | Dachshund | 3 | 12.5 | Alice Johnson | 2022-06-15 |
6 | 7 | Coco | Poodle | 4 | 15.8 | Daniel Lee | 2022-07-20 |
7 | 8 | Bailey | Siberian Husky | 2 | 28.5 | Olivia Garcia | 2022-08-25 |
8 | 9 | Zoe | Golden Retriever | 1 | 10.0 | John Smith | 2022-09-30 |
9 | 10 | Tucker | Golden Retriever | 6 | 35.2 | Alice Johnson | 2022-10-10 |
Comparison of filtering and grouping data with Pandas and SQL call to Snowflake:
We can do the same filtering and group by but with the dataframe that has been extracted.
WHERE:
###Filtering dogs with ages above 3 years
dogs_df[dogs_df['AGE'] > 3][['NAME', 'BREED', 'AGE']]
NAME | BREED | AGE | |
---|---|---|---|
2 | Charlie | Golden Retriever | 4 |
4 | Rocky | Siberian Husky | 5 |
6 | Coco | Poodle | 4 |
9 | Tucker | Golden Retriever | 6 |
Note: This is executed in the local system or where Python instance is located.
##Same but with a query to snowflake
###Note: This is executed in Snowflake avoiding local resources.
query = "SELECT NAME, BREED, AGE\
FROM DOGS\
WHERE AGE > 3;"
print(query)
cs.execute(query)
cs.fetch_pandas_all()
SELECT NAME, BREED, AGE FROM DOGS WHERE AGE > 3;
NAME | BREED | AGE | |
---|---|---|---|
0 | Charlie | Golden Retriever | 4 |
1 | Rocky | Siberian Husky | 5 |
2 | Coco | Poodle | 4 |
3 | Tucker | Golden Retriever | 6 |
Note: This is executed in Snowflake avoiding local resources.
GROUP BY:
##Doing the same Group By
###Lets start with the query to Snowflake:
query = "SELECT OWNERNAME, COUNT(*) DOGS\
FROM DOGS\
GROUP BY OWNERNAME\
ORDER BY DOGS DESC;"
print(query)
cs.execute(query)
cs.fetch_pandas_all()
SELECT OWNERNAME, COUNT(*) DOGS FROM DOGS GROUP BY OWNERNAME ORDER BY DOGS DESC;
OWNERNAME | DOGS | |
---|---|---|
0 | Alice Johnson | 3 |
1 | John Smith | 2 |
2 | Bob Williams | 1 |
3 | Emily Davis | 1 |
4 | Michael Brown | 1 |
5 | Daniel Lee | 1 |
6 | Olivia Garcia | 1 |
##Now with pandas
dogs_df.groupby(['OWNERNAME'])[['DOGID']].count().reset_index().rename(columns = {'DOGID' : 'DOGS'}).sort_values(by='DOGS', ascending=False)
OWNERNAME | DOGS | |
---|---|---|
0 | Alice Johnson | 3 |
4 | John Smith | 2 |
1 | Bob Williams | 1 |
2 | Daniel Lee | 1 |
3 | Emily Davis | 1 |
5 | Michael Brown | 1 |
6 | Olivia Garcia | 1 |
Closing cursor and connection.
Once you have retrieved the results and performed any necessary operations, it’s good practice to close the cursor and connection to free up resources. This step is essential, especially in scenarios with large datasets, to avoid potential resource leaks.