Introduction:

In the dynamic realm of data science, the synergy between Python and Snowflake emerges as a dynamic force, empowering data professionals to transcend traditional boundaries. In this blog post, we’ll delve deeper into the role of Snowflake as a cloud-based data warehousing platform and explore how it seamlessly integrates with Python, enhancing the capabilities of data analysis and data processing.

Python – The Swiss Army Knife of Data Science:

Python’s ascendancy in the data science domain is not only due to its versatility but also its adaptability to seamlessly integrate with powerful platforms like Snowflake. Python’s readability and extensive libraries find a perfect ally in Snowflake, transforming it into a wieldy tool for data exploration and analysis.

Snowflake – The Cloud-Driven Data Warehouse:

Snowflake’s architecture, built for the cloud, revolutionizes the way data is stored and processed. Its elasticity, scalability, and ability to handle semi-structured data make it an ideal choice for modern data analytics. As a data coach, understanding Snowflake’s key features and how can be integrated with other powerful tools like Python is essential for harnessing its full potential.

Prerequisites:

Libraries

The libraries that we are going to be using in this exercise are:

snowflake-connector-python
You can install it in your environment by: pip install snowflake-connector-python pandas
You can install it in your environment by: pip install pandas

Connection to Snowflake details

For this exercise a Snowflake instance is needed together with the following account details:

Account
User
Password
Warehouse
Database

Objectives:

Stablish and validate a connection to Snowflake.
Create a Schema.
Create a table (SQL-DDL), populate it (SQL-DML), and query the data (SQL-Select) with WHERE, and GROUP BY clauses.
Extract data from Snowflake to a Pandas Dataframe.

Importing Libraries

In [219]:

import snowflake.connector
import pandas as pd

Establish and validate a connection to Snowflake.

Defining connection credentials:

In [184]:

##Setting up env and connection variables:
ACCOUNT = "put here your credentials value"
USER = "put here your credentials value" 
PASSWORD =  "put here your credentials value" 
WAREHOUSE =   "put here your credentials value" 
DATABASE = "put here your credentials value"

In [185]:

##Establishing the connection and cursor to be used to execute the API requests:
ctx = snowflake.connector.connect(
        user=USER,
        password=PASSWORD,
        account=ACCOUNT,
        warehouse=WAREHOUSE,
        database=DATABASE
    )
cs = ctx.cursor()

A cursor acts as a pointer or an iterator that allows you to traverse through the results of a database query one at a time. It provides a mechanism for the Python program to interact with the results returned from the database.

Create a Schema.

In databases, a schema is a logical container or organizational structure that holds and defines objects, such as tables, views, indexes, procedures, and other elements. It provides a way to group related database objects together, helping to organize and manage the database’s structure and data. We are going to create the PYTHON_TEST_SCHEMA schema to host the data we are going to use for this exercise.
To execute quesries on Snowflake we are going to use execute(“quey”) on the cursor.

In [186]:

## Creating a Database Schema  
cs.execute("CREATE SCHEMA IF NOT EXISTS PYTHON_TEST_SCHEMA")

Out[186]:

<snowflake.connector.cursor.SnowflakeCursor at 0x1a7f89e91d0>

In [187]:

## Change the connection to the created schema:
cs.execute("USE STINGRAY_DB.PYTHON_TEST_SCHEMA")

Out[187]:

<snowflake.connector.cursor.SnowflakeCursor at 0x1a7f89e91d0>

In [188]:

##Validate that that we are in the right warehouse, database and schema
cs.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();")
cs.fetchall()

Out[188]:

[('STINGRAY_WH', 'STINGRAY_DB', 'PYTHON_TEST_SCHEMA')]

fetchall() fetches all or remaining rows of a query result set and returns a list of sequences._Note: The name of my database and schema in Snowflake instance are STINGRAY_WH and STINGRAY_DB so you should have a different result. Change accordingly these values when needed in the following stages of the exercise._

Create a table (SQL-DDL), populate it (SQL-DML), and query the data (SQL-Select).

We are going to create a dummy table that will contain information about dogs like their names, breed, age, weight, owner and registration date. The table is going to be later populated with 10 rows.

In [203]:

##Remove the table in case it exist:
cs.execute("DROP TABLE IF EXISTS STINGRAY_DB.PYTHON_TEST_SCHEMA.DOGS;")
cs.fetchall()

Out[203]:

[('DOGS successfully dropped.',)]

In [204]:

##Creating a table:
###Defining the query to be executed:query = "CREATE TABLE DOGS (\
    DogID NUMBER, \
    Name VARCHAR, \
    Breed VARCHAR, \
    Age NUMBER, \
    Weight FLOAT, \
    OwnerName VARCHAR, \
    RegistrationDate DATE)"query

Out[204]:

'CREATE TABLE DOGS (    DogID NUMBER,     Name VARCHAR,     Breed VARCHAR,     Age NUMBER,     Weight FLOAT,     OwnerName VARCHAR,     RegistrationDate DATE)'

In [205]:

##Executing the query and printing the output:
cs.execute(query)
cs.fetchall()

Out[205]:

[('Table DOGS successfully created.',)]

In [206]:

##Exploring the first 2 columns that are int he dogs table:
cs.execute("SHOW COLUMNS IN TABLE DOGS;")
cs.fetchall()[0:2]

Out[206]:

[('DOGS',
  'PYTHON_TEST_SCHEMA',
  'DOGID',
  '{"type":"FIXED","precision":38,"scale":0,"nullable":true}',
  'true',
  '',
  'COLUMN',
  '',
  '',
  'STINGRAY_DB',
  ''),
 ('DOGS',
  'PYTHON_TEST_SCHEMA',
  'NAME',
  '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}',
  'true',
  '',
  'COLUMN',
  '',
  '',
  'STINGRAY_DB',
  '')]

In [207]:

##Creating some dummy data to be added to the table:
queries = ["INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (1, 'Buddy', 'Labrador Retriever', 3, 25.5, 'John Smith', '2022-01-01');", "INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (2, 'Max', 'Beagle', 2, 30.0, 'Alice Johnson', '2022-02-15');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (3, 'Charlie', 'Golden Retriever', 4, 28.0, 'Bob Williams', '2022-03-20');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (4, 'Lucy', 'Beagle', 2, 18.5, 'Emily Davis', '2022-04-10');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (5, 'Rocky', 'Siberian Husky', 5, 22.0, 'Michael Brown', '2022-05-05');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (6, 'Molly', 'Dachshund', 3, 12.5, 'Alice Johnson', '2022-06-15');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (7, 'Coco', 'Poodle', 4, 15.8, 'Daniel Lee', '2022-07-20');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (8, 'Bailey', 'Siberian Husky', 2, 28.5, 'Olivia Garcia', '2022-08-25');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (9, 'Zoe', 'Golden Retriever', 1, 10.0, 'John Smith', '2022-09-30');","INSERT INTO Dogs (DogID, Name, Breed, Age, Weight, OwnerName, RegistrationDate) \
VALUES (10, 'Tucker', 'Golden Retriever', 6, 35.2, 'Alice Johnson', '2022-10-10');"]

In [208]:

##Testing: 
##Inserting a single row:
cs.execute(queries[0])
cs.fetchall()

Out[208]:

[(1,)]

In [209]:

##Testing if the insertion of data worked:
cs.execute("SELECT * FROM DOGS;")
cs.fetchall()

Out[209]:

[(1,
  'Buddy',
  'Labrador Retriever',
  3,
  25.5,
  'John Smith',
  datetime.date(2022, 1, 1))]

In [210]:

##Populating the table with the remaining rows:
for query in queries[1:]:
    try: 
        cs.execute(query)
    except:
        print('Error while inserting:', query)

In [211]:

##Querying the recently created table:
cs.execute("SELECT * FROM DOGS;")
cs.fetchall()

Out[211]:

[(1,
  'Buddy',
  'Labrador Retriever',
  3,
  25.5,
  'John Smith',
  datetime.date(2022, 1, 1)),
 (2, 'Max', 'Beagle', 2, 30.0, 'Alice Johnson', datetime.date(2022, 2, 15)),
 (3,
  'Charlie',
  'Golden Retriever',
  4,
  28.0,
  'Bob Williams',
  datetime.date(2022, 3, 20)),
 (4, 'Lucy', 'Beagle', 2, 18.5, 'Emily Davis', datetime.date(2022, 4, 10)),
 (5,
  'Rocky',
  'Siberian Husky',
  5,
  22.0,
  'Michael Brown',
  datetime.date(2022, 5, 5)),
 (6,
  'Molly',
  'Dachshund',
  3,
  12.5,
  'Alice Johnson',
  datetime.date(2022, 6, 15)),
 (7, 'Coco', 'Poodle', 4, 15.8, 'Daniel Lee', datetime.date(2022, 7, 20)),
 (8,
  'Bailey',
  'Siberian Husky',
  2,
  28.5,
  'Olivia Garcia',
  datetime.date(2022, 8, 25)),
 (9,
  'Zoe',
  'Golden Retriever',
  1,
  10.0,
  'John Smith',
  datetime.date(2022, 9, 30)),
 (10,
  'Tucker',
  'Golden Retriever',
  6,
  35.2,
  'Alice Johnson',
  datetime.date(2022, 10, 10))]

In [212]:

## Doing some analytics of the table:
###Defining the query:
query = "SELECT COUNT(*), MIN(AGE), MAX(AGE), AVG(AGE), COUNT(DISTINCT BREED), COUNT(DISTINCT OWNERNAME)\
    FROM DOGS;"
query

Out[212]:

'SELECT COUNT(*), MIN(AGE), MAX(AGE), AVG(AGE), COUNT(DISTINCT BREED), COUNT(DISTINCT OWNERNAME)    FROM DOGS;'

In [213]:

##Executing the query:
cs.execute(query)
cs.fetchall()

Out[213]:

[(10, 1, 6, Decimal('3.200000'), 6, 7)]

Querying the data, WHERE:

In [214]:

##Applying WHERE clause on numeric values
##Dogs with ages above 3 years
##Defining query:
query = "SELECT NAME, BREED, AGE\
    FROM DOGS\
    WHERE AGE > 3;"query

Out[214]:

'SELECT NAME, BREED, AGE    FROM DOGS    WHERE AGE > 3;'

In [215]:

##Executing the query:
cs.execute(query)
cs.fetchall()

Out[215]:

[('Charlie', 'Golden Retriever', 4),
 ('Rocky', 'Siberian Husky', 5),
 ('Coco', 'Poodle', 4),
 ('Tucker', 'Golden Retriever', 6)]

In [216]:

##Applying WHERE clause on string values
##Filter only Golden Retrievers
##Defining query:
query = "SELECT NAME, BREED, AGE\
    FROM DOGS\
    WHERE BREED = " + "'"+"Golden Retriever"+"'" + ";"
query

Out[216]:

"SELECT NAME, BREED, AGE    FROM DOGS    WHERE BREED = 'Golden Retriever';"

In [217]:

##Executing the query:
cs.execute(query)
cs.fetchall()

Out[217]:

[('Charlie', 'Golden Retriever', 4),
 ('Zoe', 'Golden Retriever', 1),
 ('Tucker', 'Golden Retriever', 6)]

Querying the data, GROUP BY:

In [218]:

##Testing GROUPBY
##Know how many dogs each Owner have
##Defining query:
query = "SELECT OWNERNAME, COUNT(*) DOGS\
    FROM DOGS\
    GROUP BY OWNERNAME\
    ORDER BY DOGS DESC;"
print(query)
##Executing the query:
cs.execute(query)
cs.fetchall()

SELECT OWNERNAME, COUNT(*) DOGS    FROM DOGS    GROUP BY OWNERNAME    ORDER BY DOGS DESC;

Out[218]:

[('Alice Johnson', 3),
 ('John Smith', 2),
 ('Bob Williams', 1),
 ('Emily Davis', 1),
 ('Michael Brown', 1),
 ('Daniel Lee', 1),
 ('Olivia Garcia', 1)]

Extract data from Snowflake to a Pandas Dataframe:

To extract a table from Snowflake into a Pandas dataframe, we can use fetch_pandas_all()

In [221]:

##Adding some Pandas to the workflow
##Defining query:
query = "SELECT * FROM DOGS;"
print(query)
##Executing the query:
cs.execute(query)
##Saving te output as Pandas DataFrame:
dogs_df = cs.fetch_pandas_all()
dogs_df

SELECT * FROM DOGS;

Out[221]:

	DOGID	NAME	BREED	AGE	WEIGHT	OWNERNAME	REGISTRATIONDATE
0	1	Buddy	Labrador Retriever	3	25.5	John Smith	2022-01-01
1	2	Max	Beagle	2	30.0	Alice Johnson	2022-02-15
2	3	Charlie	Golden Retriever	4	28.0	Bob Williams	2022-03-20
3	4	Lucy	Beagle	2	18.5	Emily Davis	2022-04-10
4	5	Rocky	Siberian Husky	5	22.0	Michael Brown	2022-05-05
5	6	Molly	Dachshund	3	12.5	Alice Johnson	2022-06-15
6	7	Coco	Poodle	4	15.8	Daniel Lee	2022-07-20
7	8	Bailey	Siberian Husky	2	28.5	Olivia Garcia	2022-08-25
8	9	Zoe	Golden Retriever	1	10.0	John Smith	2022-09-30
9	10	Tucker	Golden Retriever	6	35.2	Alice Johnson	2022-10-10

Comparison of filtering and grouping data with Pandas and SQL call to Snowflake:

We can do the same filtering and group by but with the dataframe that has been extracted.

WHERE:

In [229]:

###Filtering dogs with ages above 3 years
dogs_df[dogs_df['AGE'] > 3][['NAME', 'BREED', 'AGE']]

Out[229]:

	NAME	BREED	AGE
2	Charlie	Golden Retriever	4
4	Rocky	Siberian Husky	5
6	Coco	Poodle	4
9	Tucker	Golden Retriever	6

Note: This is executed in the local system or where Python instance is located.

In [231]:

##Same but with a query to snowflake
###Note: This is executed in Snowflake avoiding local resources.
query = "SELECT NAME, BREED, AGE\
    FROM DOGS\
    WHERE AGE > 3;"print(query)
cs.execute(query)
cs.fetch_pandas_all()

SELECT NAME, BREED, AGE    FROM DOGS    WHERE AGE > 3;

Out[231]:

	NAME	BREED	AGE
0	Charlie	Golden Retriever	4
1	Rocky	Siberian Husky	5
2	Coco	Poodle	4
3	Tucker	Golden Retriever	6

Note: This is executed in Snowflake avoiding local resources.

GROUP BY:

In [232]:

##Doing the same Group By
###Lets start with the query to Snowflake:query = "SELECT OWNERNAME, COUNT(*) DOGS\
    FROM DOGS\
    GROUP BY OWNERNAME\
    ORDER BY DOGS DESC;"
print(query)cs.execute(query)
cs.fetch_pandas_all()

SELECT OWNERNAME, COUNT(*) DOGS    FROM DOGS    GROUP BY OWNERNAME    ORDER BY DOGS DESC;

Out[232]:

	OWNERNAME	DOGS
0	Alice Johnson	3
1	John Smith	2
2	Bob Williams	1
3	Emily Davis	1
4	Michael Brown	1
5	Daniel Lee	1
6	Olivia Garcia	1

In [248]:

##Now with pandas
dogs_df.groupby(['OWNERNAME'])[['DOGID']].count().reset_index().rename(columns = {'DOGID' : 'DOGS'}).sort_values(by='DOGS', ascending=False)

Out[248]:

	OWNERNAME	DOGS
0	Alice Johnson	3
4	John Smith	2
1	Bob Williams	1
2	Daniel Lee	1
3	Emily Davis	1
5	Michael Brown	1
6	Olivia Garcia	1

Closing cursor and connection.

Once you have retrieved the results and performed any necessary operations, it’s good practice to close the cursor and connection to free up resources. This step is essential, especially in scenarios with large datasets, to avoid potential resource leaks.

In [182]:

##Closing cursor:
cs.close()##CLosing connection
ctx.close()

Conclusion:

The collaboration between Python and Snowflake embodies the evolution of data analytics. Embracing both Python and Snowflake opens up unparalleled opportunities for managing, analyzing, and visualizing data. This dynamic duo not only enhances your analytical capabilities but also positions you at the forefront of the data-driven revolution.Continuous exploration and experimentation with the integration of Python and Snowflake will undoubtedly propel you towards innovative solutions. Keep honing your skills, stay curious, and let the symbiosis of Python and Snowflake be your guiding light in navigating the ever-expanding landscape of data science. Here you can find some interesting blogs for SQL and Exploratory Data Analysis with Python that would boost your knowledge on data.Happy coding!

Connecting Python & Snowflake for Data Analysis | Rockborne