RAG-based Contextual Search Engine for Molecular Drug Discovery using Snowflake Arctic LLM, Snowpark, and Langchain
Molecular drug discovery is a data-intensive field where researchers must sift through vast amounts of scientific literature, experimental data, and molecular structures. Traditional keyword-based search methods often fail to capture the nuanced context required to extract meaningful insights. This is where contextual search, which understands and interprets the meaning behind queries, becomes essential.
In this blog, we explore how to build a contextual search engine tailored for molecular drug discovery using Retrieval-Augmented Generation (RAG) and Langchain. We’ll leverage the power of the “snowflake-arctic-embed-m” embedding model and Snowflake’s Arctic LLM to create a system that provides precise, context-rich search results, enhancing the efficiency of drug research and discovery.
The Power of Contextual Search
Contextual search goes beyond simple keyword matching to understand the intent and context behind a query, delivering more relevant and accurate results. In molecular drug discovery, where the meaning of scientific terms and relationships between molecules, diseases, and treatments are complex, context is crucial.
By leveraging contextual search, researchers can navigate large datasets and scientific literature more effectively, retrieving not just related keywords but the most meaningful insights tailored to their specific queries. This makes the discovery process faster, more precise, and ultimately more productive in advancing drug research.
Understanding the Technologies
Retrieval-Augmented Generation (RAG)
RAG combines the strengths of information retrieval and language models. It retrieves relevant documents from a database and uses a language model to generate contextual answers based on both the query and retrieved data. This method is ideal for molecular drug discovery, where precise, context-aware responses are needed.
Langchain
Langchain simplifies the development of language model applications by integrating multiple components like retrieval, embeddings, and LLMs into a cohesive pipeline. It allows developers to build robust contextual search systems, streamlining complex workflows like document retrieval and response generation.
Together, RAG and Langchain form the backbone of an efficient, domain-specific search engine.
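To make the flow concrete before diving into Snowflake, here is a minimal, self-contained toy sketch of the retrieve-then-generate idea: stored chunks are ranked by similarity to the query, and the best matches are placed into the prompt that the LLM would complete. The corpus, embeddings, and question below are made-up illustrations; the rest of this post implements the same pattern with Snowflake Cortex and real 768-dimensional embeddings.
from math import sqrt

# Toy corpus with made-up 3-dimensional "embeddings" (real embeddings are 768-dimensional).
corpus = {
    "Aspirin irreversibly inhibits COX-1 and COX-2.": [0.9, 0.1, 0.2],
    "Imatinib is a tyrosine kinase inhibitor used in CML.": [0.1, 0.8, 0.3],
    "PDF parsing is handled elsewhere in this post.": [0.2, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def answer(question, query_embedding, top_k=2):
    # 1. Retrieve: rank stored chunks by similarity to the query embedding.
    ranked = sorted(corpus, key=lambda doc: cosine(corpus[doc], query_embedding), reverse=True)
    context = " ".join(ranked[:top_k])
    # 2. Augment: place the retrieved chunks into the prompt.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    # 3. Generate: in the real system this prompt is sent to the Arctic LLM.
    return prompt

print(answer("Which drug inhibits kinases?", [0.15, 0.85, 0.25]))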
Snowflake’s Arctic
The “snowflake-arctic-embed-m” embedding model encodes scientific research data into high-dimensional vectors, capturing complex relationships between terms, molecules, and concepts in molecular drug discovery. This allows for more nuanced data representation, enabling the system to understand and retrieve contextually relevant information from vast datasets.
Working alongside this, Snowflake’s Arctic LLM processes these embeddings to generate meaningful, context-rich responses. By interpreting the query in conjunction with retrieved data, it ensures precise answers that align with the complexities of drug discovery, making it an ideal fit for this domain. This combination enhances both retrieval accuracy and the depth of insights delivered.
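As a quick illustration of the retrieval primitive used throughout this post, the sketch below embeds two short texts with the "snowflake-arctic-embed-m" model and compares them with cosine similarity, entirely inside Snowflake. It assumes an existing Snowpark session object (set up later in this post); the two texts are arbitrary examples.
# Sketch: embed two texts with Cortex and compare them with cosine similarity.
score = session.sql("""
    select VECTOR_COSINE_SIMILARITY(
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'EGFR kinase inhibitor'),
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'small molecule blocking EGFR signalling')
    ) as similarity
""").collect()[0]['SIMILARITY']

print(score)  # values closer to 1.0 mean the texts are semantically closer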
Data Preparation
To build a contextual search engine, data preparation is critical, especially in the domain of molecular drug discovery, where documents like research papers, reports, and datasets are often in PDF format. This section walks through the process of preparing and chunking PDFs for embedding and search using Snowflake’s Snowpark, Langchain, and embedding models.
Step 1: Chunking the PDF Data
The following Python UDF (User-Defined Function) extracts text from PDFs, chunks the content into manageable pieces, and returns them for embedding. Here’s how it works:
- Reading the PDF: The read_pdf function reads the PDF file using PyPDF2, extracting text from each page.
- Chunking the Text: The RecursiveCharacterTextSplitter from Langchain is used to split the text into smaller chunks, ensuring some overlap between chunks to preserve contextual continuity. Parameters like chunk_size (256 characters) and chunk_overlap (50 characters) help maintain context within chunks.
create or replace function pdf_text_chunker(file_url string)
returns table (chunk varchar)
language python
runtime_version = '3.9'
handler = 'pdf_text_chunker'
packages = ('snowflake-snowpark-python','PyPDF2', 'langchain')
as
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import PyPDF2, io
import logging
import pandas as pd
class pdf_text_chunker:

    def read_pdf(self, file_url: str) -> str:
        logger = logging.getLogger("udf_logger")
        logger.info(f"Opening file {file_url}")

        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())

        reader = PyPDF2.PdfReader(buffer)
        text = ""
        for page in reader.pages:
            try:
                text += page.extract_text().replace('\n', ' ').replace('\0', ' ')
            except Exception:
                text = "Unable to Extract"
                logger.warning(f"Unable to extract from file {file_url}, page {page}")
        return text

    def process(self, file_url: str):
        text = self.read_pdf(file_url)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=256,     # Adjust this as you see fit
            chunk_overlap=50,   # Overlap keeps neighboring chunks contextually connected
            length_function=len
        )

        chunks = text_splitter.split_text(text)
        df = pd.DataFrame(chunks, columns=['chunks'])
        yield from df.itertuples(index=False, name=None)
$$;
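Once the function exists, and after the stage and uploads from Step 2 are in place, a quick sanity check can confirm that chunking works. The snippet below is a sketch that assumes an existing Snowpark session and at least one PDF already staged in @docs.
# Preview a few chunks produced by the UDTF (sketch; assumes PDFs are staged in @docs).
# The directory table may need a refresh after new uploads:
#   session.sql("alter stage docs refresh").collect()
preview = session.sql("""
    select func.chunk
    from directory(@docs) d,
         table(pdf_text_chunker(build_scoped_file_url(@docs, d.relative_path))) as func
    limit 5
""").collect()

for row in preview:
    print(row['CHUNK'][:120])  # first 120 characters of each chunk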
Step 2: Creating the Stage
The data is staged for processing using Snowflake. A stage is created to securely store and organize the PDF files.
create or replace stage docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );
Upload the PDFs to the stage once it is created.
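Uploading can be done through the Snowsight UI, SnowSQL's PUT command, or programmatically with Snowpark. The snippet below is a sketch of the Snowpark route; it assumes an existing session object and PDFs in a local ./papers/ folder (both placeholders).
# Upload local PDFs to the @docs stage (sketch; ./papers/ is a placeholder path).
# auto_compress is disabled so the staged files remain readable as plain PDFs.
session.file.put(
    "./papers/*.pdf",
    "@docs",
    auto_compress=False,
    overwrite=True,
)

# Refresh the stage's directory table so directory(@docs) sees the new files.
session.sql("alter stage docs refresh").collect()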
Step 3: Storing Chunked Data in a Table
Once the PDF files are chunked, the data is inserted into the DOCS_CHUNKS_TABLE:
- Chunking Text: Text chunks are stored in the CHUNK column.
- Embedding the Chunks: The SNOWFLAKE.CORTEX.EMBED_TEXT_768 function is used to generate embeddings for each chunk using the "snowflake-arctic-embed-m" model.
- Metadata: Additional columns such as file_url, relative_path, and scoped_file_url store metadata for each PDF file.
- The document chunks are uploaded and inserted into the table with their embeddings:
create or replace TABLE DOCS_CHUNKS_TABLE (
RELATIVE_PATH VARCHAR(16777216), -- Relative path to the PDF file
SIZE NUMBER(38,0), -- Size of the PDF
FILE_URL VARCHAR(16777216), -- URL for the PDF
SCOPED_FILE_URL VARCHAR(16777216), -- Scoped url (you can choose which one to keep depending on your use case)
CHUNK VARCHAR(16777216), -- Piece of text
CHUNK_VEC VECTOR(FLOAT, 768) ); -- Embedding using the VECTOR data type
insert into docs_chunks_table (relative_path, size, file_url,
scoped_file_url, chunk, chunk_vec)
select relative_path,
size,
file_url,
build_scoped_file_url(@docs, relative_path) as scoped_file_url,
func.chunk as chunk,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m',chunk) as chunk_vec
from
directory(@docs),
TABLE(pdf_text_chunker(build_scoped_file_url(@docs, relative_path))) as func;
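Before moving on to retrieval, it is worth confirming that every PDF actually produced chunks and embeddings. The check below is a sketch that assumes the same Snowpark session used elsewhere in this post.
# Count how many chunks (and therefore embeddings) were produced per PDF.
session.sql("""
    select relative_path, count(*) as num_chunks
    from docs_chunks_table
    group by relative_path
    order by num_chunks desc
""").show()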
Retrieval and Augmentation
In the search engine, documents are retrieved by calculating the cosine similarity between the query’s embedding and pre-stored document chunks in the docs_chunks_table. This ranks the chunks based on relevance, and the top results are selected.
The RAG (Retrieval-Augmented Generation) model uses these relevant chunks as context to generate a more informed, accurate response. By using the “snowflake-arctic-embed-m” model for embeddings and Snowflake’s Arctic LLM for generating answers, the system enhances the relevance of retrieved information.
num_chunks = 3  # Number of most similar chunks retrieved as context; adjust as needed

def create_prompt(myquestion):
    cmd = """
    with results as
    (SELECT RELATIVE_PATH,
        VECTOR_COSINE_SIMILARITY(docs_chunks_table.chunk_vec,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ?)) as similarity,
        chunk
    from docs_chunks_table
    order by similarity desc
    limit ?)
    select chunk, relative_path from results
    """
    df_context = session.sql(cmd, params=[myquestion, num_chunks]).to_pandas()

    context_length = len(df_context)
    prompt_context = ""
    for i in range(0, context_length):
        prompt_context += df_context._get_value(i, 'CHUNK')
    prompt_context = prompt_context.replace("'", "")
    relative_path = df_context._get_value(0, 'RELATIVE_PATH')

    prompt = f"""
    You are an expert assistant extracting information from the context provided.
    Answer the question based on the context. Be concise and do not hallucinate.
    Context: {prompt_context}
    Question:
    {myquestion}
    Answer:
    """
    return prompt
Code Breakdown:
- The create_prompt function retrieves the most relevant chunks using cosine similarity.
- The top chunks are concatenated into a prompt, which is fed into the Snowflake Arctic LLM.
- The LLM generates a concise response based on the retrieved context.
from snowflake.snowpark import Session
from snowflake.cortex import Complete
import streamlit as st

def getResponse(myquestion):
    prompt = create_prompt(myquestion)
    response = Complete(
        "snowflake-arctic",
        prompt,
        session=session)
    return response

def display_response(question):
    response = getResponse(question)
    st.markdown(response)
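getResponse can also be called on its own, for example from a notebook, before wiring up the UI; the question below is just an illustrative example.
# Ad-hoc check outside Streamlit (the question is an arbitrary example).
print(getResponse("Which compounds in the indexed papers are reported as kinase inhibitors?"))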
Streamlit App for Contextual Search Results
The final step is to create a user interface that allows users to input their queries and receive responses in real time. The Streamlit app provides a simple yet effective interface to showcase the contextual search engine for molecular drug discovery.
st.title("Contextual Search Engine for Drug Discovery")

question = st.text_input("Enter question", placeholder="What is the molecular weight?", label_visibility="collapsed")

if question:
    display_response(question)
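The snippets throughout this post assume a Snowpark session object. Below is a minimal sketch of how it could be obtained: when the app runs as Streamlit in Snowflake, the already-active session can be reused; when running locally (for example with streamlit run app.py), a session can be created from connection parameters. All values shown are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.context import get_active_session

try:
    # Inside Snowflake (Streamlit in Snowflake), reuse the already-active session.
    session = get_active_session()
except Exception:
    # Running locally: build a session from connection parameters (placeholders).
    session = Session.builder.configs({
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }).create()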