Creating a RAG-Based Chatbot for Autism Using Snowflake Cortex and Snowflake Arctic: A Step-by-Step Guide

Usha Rengaraju
9 min read · Jul 6, 2024


In this guide, we will explore how to harness the power of Snowflake Cortex AI to create a Retrieval-Augmented Generation (RAG) based chatbot focused on autism support. By combining Snowflake’s robust data processing capabilities with advanced AI techniques, you’ll learn to build a sophisticated chatbot that can provide valuable information and assistance on autism-related topics. This step-by-step tutorial will walk you through the process, from data preparation to deploying a fully functional chatbot, making it easier than ever to leverage cutting-edge technology for meaningful applications.

Snowflake Cortex AI

Snowflake Cortex AI is an advanced suite of AI and machine learning (ML) tools integrated into the Snowflake Data Cloud platform. It is designed to enable organizations to build, train, and deploy machine learning models directly within the Snowflake environment. Here are some key aspects of Snowflake Cortex AI:

  1. Seamless Integration: Cortex AI integrates with the Snowflake Data Cloud, allowing users to leverage Snowflake’s powerful data warehousing capabilities. This integration helps streamline the data pipeline, making it easier to prepare, train, and deploy ML models without needing to move data to external systems.
  2. Data Preparation and Feature Engineering: With Cortex AI, users can perform data preparation and feature engineering directly within Snowflake. This helps maintain data consistency and security, as all operations are conducted within a single platform.
  3. Model Training and Deployment: Cortex AI supports the entire ML lifecycle, from training models using data stored in Snowflake to deploying these models for real-time predictions. Users can utilize various ML frameworks and libraries supported by Snowflake to build and train their models.
  4. AutoML Capabilities: Cortex AI includes AutoML features that automate the process of selecting, training, and tuning machine learning models. This helps users, especially those with limited ML expertise, to build high-quality models efficiently.
  5. Scalability and Performance: Leveraging Snowflake’s scalable architecture, Cortex AI can handle large datasets and complex ML workloads efficiently. This scalability ensures that organizations can train and deploy models at scale, meeting their business requirements.
  6. Collaboration and Governance: Cortex AI provides features for collaboration and governance, allowing data scientists, analysts, and business stakeholders to work together seamlessly. Governance features ensure that data access and usage comply with organizational policies and regulations.
  7. Integration with Third-Party Tools: Cortex AI can integrate with various third-party tools and services, such as popular ML libraries (e.g., TensorFlow, PyTorch), Jupyter notebooks, and other data science tools. This flexibility allows users to leverage their preferred tools within the Snowflake ecosystem.
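
As a quick taste of how tightly these capabilities are integrated, Cortex exposes its LLM functions directly in SQL. A minimal sketch, assuming your account has Cortex enabled and access to the snowflake-arctic model we use later in this guide:

-- Minimal sketch (assumes Cortex is enabled and snowflake-arctic is available in your region)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    'Summarize what Retrieval-Augmented Generation is in two sentences.'
) AS response;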

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that combines the strengths of retrieval-based and generation-based models. In RAG, relevant information is first retrieved from a large corpus of documents or a database, and then this information is used to generate more accurate and contextually appropriate responses.

Here’s a brief overview:

  1. Retrieval: A retrieval model, such as a search engine or a dense passage retriever, identifies and extracts relevant documents or passages from a large dataset based on the input query.
  2. Generation: A generation model, often a transformer-based language model, then uses the retrieved information to generate a coherent and contextually relevant response.

By leveraging both retrieval and generation, RAG systems can provide more informed, accurate, and context-rich answers, making them particularly useful for applications like chatbots, question-answering systems, and virtual assistants.

Flow of RAG — Ref: https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2

Autism Dataset

NeuroClastic is a resource-rich platform dedicated to providing support and information for the autism community. It offers a wide array of articles, directories, and guides on topics such as autistic traits, communication methods, education, therapies, mental health, and justice. This site serves as a valuable tool for autistic individuals, parents, educators, and professionals seeking to understand and support neurodivergent experiences — https://neuroclastic.com/resources/

We have chosen some of these articles and converted them into PDFs.

Data Ingestion

We will write a Snowflake worksheet for the following steps: split the documents into chunks, create a vector embedding for each chunk, and store everything in a table. Before that, you need to log in to https://app.snowflake.com/

CREATE DATABASE AUTISM;
CREATE SCHEMA DATA;

The SQL command `CREATE DATABASE AUTISM;` initiates the creation of a new database named “AUTISM”. This command sets up a dedicated storage area where tables, views, and other database objects can be organized and managed. Subsequently, the statement `CREATE SCHEMA DATA;` establishes a schema named “DATA” within the database structure. Schemas provide a way to logically group and organize database objects, facilitating efficient management and maintenance of data within the specified database context.
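
To make sure the objects created next land in the right place, we set the session context (a small addition to the worksheet, assuming the names above):

USE DATABASE AUTISM;
USE SCHEMA DATA;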

create or replace function pdf_text_chunker(file_url string)
returns table (chunk varchar)
language python
runtime_version = '3.9'
handler = 'pdf_text_chunker'
packages = ('snowflake-snowpark-python','PyPDF2', 'langchain')
as

$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import PyPDF2, io
import logging
import pandas as pd

class pdf_text_chunker:

    def read_pdf(self, file_url: str) -> str:
        logger = logging.getLogger("udf_logger")
        logger.info(f"Opening file {file_url}")

        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())

        reader = PyPDF2.PdfReader(buffer)
        text = ""
        for page in reader.pages:
            try:
                text += page.extract_text().replace('\n', ' ').replace('\0', ' ')
            except Exception:
                text = "Unable to Extract"
                logger.warning(f"Unable to extract from file {file_url}, page {page}")

        return text

    def process(self, file_url: str):
        text = self.read_pdf(file_url)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 512,    # Adjust this as you see fit
            chunk_overlap = 15,  # Overlap keeps neighbouring chunks contextual
            length_function = len
        )

        chunks = text_splitter.split_text(text)
        df = pd.DataFrame(chunks, columns=['chunks'])

        yield from df.itertuples(index=False, name=None)
$$;

We now define a user-defined function (UDF) named `pdf_text_chunker` in Python within a Snowflake database. The function takes a `file_url` parameter representing the URL of a PDF file. It utilizes the `PyPDF2` library to read the PDF content, concatenates text from each page into a single string, and handles exceptions during extraction. The text is then split into chunks using the `RecursiveCharacterTextSplitter` from the `langchain` package, with parameters for chunk size and overlap. The resulting chunks are returned as rows in a table with a single column named `chunk`.

create or replace stage docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );

This command creates or replaces a stage named docs with server-side encryption (SSE) enabled and a directory table enabled, so the files uploaded to it are encrypted and can be listed and referenced by relative path.
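
The PDFs then need to be uploaded into the docs stage. This can be done through the Snowsight UI, or, as a sketch, with a PUT command from SnowSQL (the local path below is hypothetical); refreshing the stage afterwards makes the files visible in its directory table:

PUT file:///local/path/to/autism_article.pdf @docs AUTO_COMPRESS = FALSE;

ALTER STAGE docs REFRESH;
SELECT relative_path, size FROM DIRECTORY(@docs);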

create or replace TABLE DOCS_CHUNKS_TABLE ( 
RELATIVE_PATH VARCHAR(16777216), -- Relative path to the PDF file
SIZE NUMBER(38,0), -- Size of the PDF
FILE_URL VARCHAR(16777216), -- URL for the PDF
SCOPED_FILE_URL VARCHAR(16777216), -- Scoped url (you can choose which one to keep depending on your use case)
CHUNK VARCHAR(16777216), -- Piece of text
CHUNK_VEC VECTOR(FLOAT, 768) ); -- Embedding using the VECTOR data type

insert into docs_chunks_table (relative_path, size, file_url,
scoped_file_url, chunk, chunk_vec)
select relative_path,
size,
file_url,
build_scoped_file_url(@docs, relative_path) as scoped_file_url,
func.chunk as chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2',chunk) as chunk_vec
from
directory(@docs),
TABLE(pdf_text_chunker(build_scoped_file_url(@docs, relative_path))) as func;

Now, I’m setting up a table named DOCS_CHUNKS_TABLE to store data extracted from PDF files. This table has several columns:

  • RELATIVE_PATH: Stores the relative path to each PDF file.
  • SIZE: Represents the size of each PDF file.
  • FILE_URL: Stores the URL of each PDF file.
  • SCOPED_FILE_URL: Optionally stores a scoped URL depending on my use case.
  • CHUNK: Stores individual pieces of text extracted from the PDF files.
  • CHUNK_VEC: Uses the VECTOR data type to store embeddings of the text chunks, specifically using 768 dimensions.

Next, I’m inserting data into this table using a SELECT statement. Here’s how it works:

  • I’m querying the directory(@docs) function to get a list of files in my Snowflake stage named docs.
  • For each file, I call the pdf_text_chunker function, passing in a scoped file URL (build_scoped_file_url(@docs, relative_path) builds it from the file's relative path).
  • This function extracts text chunks (func.chunk) from each PDF file.
  • Additionally, I’m using SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk) to generate 768-dimensional embeddings for each text chunk.
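
If you want to verify the chunker on its own before running the full insert, the table function can be called for a single staged file (the file name below is a hypothetical example; use one returned by DIRECTORY(@docs)):

SELECT func.chunk
FROM TABLE(pdf_text_chunker(build_scoped_file_url(@docs, 'sample_article.pdf'))) as func
LIMIT 3;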

To view a sample of the populated table:

select relative_path, size, chunk, chunk_vec from docs_chunks_table limit 5;

Creation of the Chatbot

Now we create a Streamlit application out of this.

Let’s start with the code

import streamlit as st # Import python packages
from snowflake.snowpark.context import get_active_session
session = get_active_session() # Get the current credentials

import pandas as pd

pd.set_option("max_colwidth",None)
num_chunks = 10 # Num-chunks provided as context. Play with this to check how it affects your accuracy

We use Streamlit for the interactive web app and Snowflake's Snowpark API to grab the active session for data operations. We also configure Pandas to show full column contents and initialize num_chunks, which controls how many retrieved chunks are passed to the model as context.

def create_prompt(myquestion):

    cmd = """
    with results as
    (SELECT RELATIVE_PATH,
        VECTOR_COSINE_SIMILARITY(docs_chunks_table.chunk_vec,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) as similarity,
        chunk
    from docs_chunks_table
    order by similarity desc
    limit ?)
    select chunk, relative_path from results
    """

    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()

    # Concatenate the retrieved chunks into a single context string
    prompt_context = ""
    for i in range(len(df_context)):
        prompt_context += df_context._get_value(i, 'CHUNK')

    # Strip single quotes so the context does not break the prompt string
    prompt_context = prompt_context.replace("'", "")
    relative_path = df_context._get_value(0, 'RELATIVE_PATH')

    prompt = f"""
    'You are an expert assistant extracting information from the context provided.
    Answer the question based on the context. Be concise and do not hallucinate.
    Context: {prompt_context}
    Question:
    {myquestion}
    Answer: '
    """

    return prompt

The Python function create_prompt generates a prompt for the language model from a given question (myquestion). It runs a SQL query that computes the cosine similarity between each chunk embedding (chunk_vec) in docs_chunks_table and the question's embedding, orders the results by similarity, and keeps the top num_chunks rows. The function loads the results into a Pandas DataFrame, concatenates the text chunks into prompt_context, removes single quotes so they do not break the prompt string, and records the relative_path of the top match. Finally, it returns a formatted prompt containing the context, the question, and instructions for the model to answer based only on the context provided.
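
To get a feel for what this retrieval step surfaces, the same similarity query can be run standalone in a worksheet (the question text below is just a hypothetical example):

SELECT relative_path, chunk,
       VECTOR_COSINE_SIMILARITY(chunk_vec,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'What are common autistic traits?')) AS similarity
FROM docs_chunks_table
ORDER BY similarity DESC
LIMIT 5;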

def complete(myquestion, model_name):
    prompt = create_prompt(myquestion)
    cmd = """
        select SNOWFLAKE.CORTEX.COMPLETE(?, ?) as response
    """
    df_response = session.sql(cmd, params=[model_name, prompt]).collect()
    return df_response


def display_response(question, model):
    response = complete(question, model)
    res_text = response[0].RESPONSE
    st.markdown(res_text)

In complete(myquestion, model_name), I first generate a prompt using create_prompt(myquestion), which creates a context-aware prompt based on the input question. Then, I construct a SQL command that calls Snowflake's SNOWFLAKE.CORTEX.COMPLETE function, passing in the model name and the generated prompt as parameters. After executing the command and collecting the result into df_response (a list of Row objects), I return it.

In display_response(question, model), I use complete(question, model) to fetch the response DataFrame. From the response, I extract the text (RESPONSE) and display it using Streamlit's st.markdown() function, presenting the model's answer in a formatted manner within the Streamlit web application interface.
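
A quick way to sanity-check these helpers before adding the UI is to call them directly (the question below is a hypothetical example; this assumes the active Snowpark session obtained earlier):

rows = complete("What communication supports can help autistic children?", "snowflake-arctic")
st.write(rows[0].RESPONSE)  # collect() returns Row objects; the answer is in the RESPONSE column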

st.title("Asking Questions related to Autism")

model = 'snowflake-arctic'
question = st.text_input("Enter question", placeholder="Tell me about Autism?", label_visibility="collapsed")

if question:
    display_response(question, model)

This wires up the Streamlit UI: once a question is entered, the model's response is displayed.

Final UI

Creating a Retrieval-Augmented Generation (RAG)-based chatbot focused on Autism using Snowflake involves leveraging advanced capabilities in data management and natural language processing. By utilizing Snowflake’s Snowpark SDK and Cortex functions like EMBED_TEXT_768 and COMPLETE, developers can build a robust system that not only retrieves relevant information from structured data stored in Snowflake but also generates contextually accurate responses to user queries. This step-by-step guide has demonstrated how to integrate these technologies effectively, empowering developers to create intelligent chatbots that provide tailored information and support for Autism-related inquiries. With Snowflake's scalability and security, this approach ensures reliable performance and data integrity, making it suitable for both developmental and operational phases of AI-powered applications in healthcare and beyond.

References:

  1. https://quickstarts.snowflake.com/guide/asking_questions_to_your_own_documents_with_snowflake_cortex/#0
  2. https://www.snowflake.com/en/data-cloud/cortex/
  3. https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2

Written by Usha Rengaraju

Chief of Research at Exa Protocol | Autism Advocate | Corporate Trainer | Adjunct Faculty