Harnessing the Power of Snowflake Cortex AI for Training Large Language Models
In the rapidly evolving landscape of artificial intelligence, training large language models (LLMs) has become a cornerstone for advancing natural language processing capabilities. Snowflake Cortex AI offers a robust and scalable platform to facilitate this complex process. In this blog, we’ll explore how to leverage Snowflake Cortex AI to train LLMs efficiently and effectively. From data ingestion and preprocessing to model training and deployment, we’ll walk you through each step, highlighting best practices and innovative techniques that can help you harness the full potential of this powerful tool. Whether you’re an AI enthusiast, a data scientist, or a business looking to integrate cutting-edge AI solutions, this guide will provide valuable insights into the transformative capabilities of Snowflake Cortex AI.
Fine-Tuning Large Language Models
Fine-tuning large language models (LLMs) is essential for several reasons:
- Specialization: Pre-trained LLMs are trained on vast and diverse datasets, making them generalized. Fine-tuning allows these models to specialize in specific tasks or domains, enhancing their performance in particular applications such as medical diagnosis, legal advice, customer support, or any other specialized field.
- Improved Accuracy: By fine-tuning on task-specific data, the model can learn the nuances and context unique to that domain. This improves the accuracy and relevance of the model’s responses, making it more reliable for real-world applications.
- Efficiency: Fine-tuning can be more efficient than training a model from scratch. It leverages the pre-trained knowledge and adjusts it for specific tasks, saving computational resources and time.
- Adaptability: Language and usage can vary widely between different fields and over time. Fine-tuning allows models to stay up-to-date with the latest terminology, trends, and language patterns, ensuring they remain useful and effective.
- Customization: Fine-tuning enables customization of the model to align with a company’s brand voice, tone, and style. This is particularly important for customer-facing applications where maintaining a consistent brand identity is crucial.
- Enhanced Performance: By focusing on relevant data, fine-tuning can reduce the noise from irrelevant information, leading to better performance and faster convergence during training.
In summary, fine-tuning large language models tailors them to meet specific needs, ensuring they deliver more accurate, efficient, and relevant results. This customization maximizes the utility of LLMs across various applications and industries.
Snowflake Capabilities
Snowflake Cortex AI offers a streamlined and efficient platform for fine-tuning large language models. Here’s how you can leverage it:
Data Preparation
We'll use the Question-Answer dataset from Kaggle.
Follow these steps:
- Create a new database under the Data section and add a schema named DATA.
- Upload the external files from Kaggle to create two tables, TRAIN and VAL. You can discard the distractor columns.
- Name the columns Question and Answer.
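Before uploading, you can optionally reshape the Kaggle CSV files locally so that only the question and answer columns survive. A minimal sketch; the column headers are assumptions based on the SciQ-style dataset layout, and the helper is my own:

```python
import csv
import io

def to_prompt_completion(rows):
    """Keep only the question/answer pair from each row,
    discarding the distractor columns."""
    out = []
    for row in rows:
        out.append({"QUESTION": row["question"].strip(),
                    "ANSWER": row["correct_answer"].strip()})
    return out

# Example with an in-memory CSV standing in for a file from Kaggle
raw = io.StringIO(
    "question,distractor1,distractor2,distractor3,correct_answer\n"
    "What gas do plants absorb?,oxygen,nitrogen,helium,carbon dioxide\n"
)
rows = to_prompt_completion(csv.DictReader(raw))
print(rows[0]["ANSWER"])  # carbon dioxide
```

The same two-column shape is what the fine-tuning queries below expect from the TRAIN and VAL tables.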
Configuring the Fine-Tuning Job
Once your data is ready, you can configure a fine-tuning job in Snowflake Cortex AI. This involves selecting the pre-trained model you want to fine-tune, specifying the dataset, and defining training parameters such as learning rate, batch size, and number of epochs. Snowflake Cortex AI provides a user-friendly interface for these configurations, making setup straightforward even for those who are not machine-learning experts. You can choose from the base models supported by Cortex Fine-tuning.
We'll be using llama3-8b.
Open a new notebook in Snowflake and run the following SQL query to fine-tune the model:
SELECT SNOWFLAKE.CORTEX.FINETUNE(
'CREATE',
'SciQ_model',
'llama3-8b',
'SELECT Question AS prompt, Answer AS completion FROM TRAIN',
'SELECT Question AS prompt, Answer AS completion FROM VAL'
);
The model will start training after this.
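If you prefer to launch the job from Python via Snowpark rather than a SQL cell, the same statement can be built and executed with `session.sql`. A minimal sketch; the helper function and its argument names are my own, not part of the Cortex API:

```python
def finetune_create_sql(model_name, base_model, train_table, val_table):
    """Build the SNOWFLAKE.CORTEX.FINETUNE('CREATE', ...) statement
    shown above, selecting prompt/completion pairs from two tables."""
    train_q = f"SELECT Question AS prompt, Answer AS completion FROM {train_table}"
    val_q = f"SELECT Question AS prompt, Answer AS completion FROM {val_table}"
    return (
        "SELECT SNOWFLAKE.CORTEX.FINETUNE("
        f"'CREATE', '{model_name}', '{base_model}', "
        f"'{train_q}', '{val_q}')"
    )

sql = finetune_create_sql("SciQ_model", "llama3-8b", "TRAIN", "VAL")
# session.sql(sql).collect()  # run inside an active Snowflake session
```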
Monitoring and Managing Fine-Tuning Jobs
Snowflake Cortex AI offers comprehensive tools for monitoring the progress of your fine-tuning jobs. You can view metrics such as accuracy, loss, and training time, and access logs to troubleshoot any issues. Real-time monitoring allows you to make adjustments on the fly and ensures that the fine-tuning process is proceeding as expected.
SELECT SNOWFLAKE.CORTEX.FINETUNE(
'DESCRIBE',
'Your_JOB_ID'
);
Evaluating and Validating the Fine-Tuned Model
After fine-tuning, it’s essential to evaluate the performance of your model. Snowflake Cortex AI provides various metrics and tools to assess how well the model performs on the specific task. Techniques such as cross-validation and A/B testing can be used to ensure the model meets the desired accuracy and performance criteria.
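One quick way to sanity-check the fine-tuned model is to score its completions against the held-out answers from VAL, for example with exact match and token-level F1. A minimal sketch; the scoring choices are my own, not a built-in Cortex feature:

```python
def exact_match(pred, ref):
    """Case- and whitespace-insensitive string equality."""
    return pred.strip().lower() == ref.strip().lower()

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall between
    a predicted answer and the reference answer."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if not common:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

preds = ["Carbon dioxide", "mitochondria"]
refs = ["carbon dioxide", "the mitochondria"]
em = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
f1 = sum(token_f1(p, r) for p, r in zip(preds, refs)) / len(refs)
print(round(em, 2), round(f1, 2))  # 0.5 0.83
```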
Here are the results I got:
{
  "base_model": "llama3-8b",
  "created_on": 1720357072007,
  "finished_on": 1720358521300,
  "id": "CortexFineTuningWorkflow_183f512d-d661-4f61-b7e4-11150ab2b651",
  "model": "SciQ_model",
  "progress": 1.0,
  "status": "SUCCESS",
  "training_data": "SELECT Question AS prompt, Answer AS completion FROM TRAIN",
  "trained_tokens": 1035650,
  "training_result": {
    "validation_loss": 1.0134716033935547,
    "training_loss": 0.49410628375064775
  },
  "validation_data": ""
}
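The DESCRIBE result comes back as a JSON string like the one above, so a small helper can pull out the fields you care about when polling a job. A sketch; the field names are taken from the output shown, and the helper itself is my own:

```python
import json

def summarize_job(raw):
    """Extract status, progress, and losses from the JSON returned
    by SNOWFLAKE.CORTEX.FINETUNE('DESCRIBE', ...)."""
    job = json.loads(raw)
    result = job.get("training_result", {})
    return {
        "status": job.get("status"),
        "progress": job.get("progress"),
        "validation_loss": result.get("validation_loss"),
        "training_loss": result.get("training_loss"),
    }

raw = ('{"status":"SUCCESS","progress":1.0,'
       '"training_result":{"validation_loss":1.01,"training_loss":0.49}}')
print(summarize_job(raw)["status"])  # SUCCESS
```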
Deployment of Fine-Tuned Models
We'll deploy the model in a Streamlit app with the following code:
# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()

def complete(myquestion):
    cmd = """
        SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS response
    """
    # Pass the fine-tuned model name and the question as bind parameters
    df_response = session.sql(cmd, params=['SciQ_model', myquestion]).collect()
    return df_response

def display_response(question):
    response = complete(question)
    res_text = response[0].RESPONSE
    st.markdown(res_text)

# Main code
st.title("You can ask me a scientific question")
question = st.text_input(
    "Enter question",
    placeholder="Vertebrata are characterized by the presence of what?",
    label_visibility="collapsed",
)
if question:
    display_response(question)
Conclusion
Fine-tuning large language models with Snowflake Cortex AI allows you to tailor these powerful tools to your specific needs, enhancing their accuracy, efficiency, and relevance. By following the steps outlined in this blog — from data preparation to deployment — you can leverage the full potential of Snowflake Cortex AI to achieve superior results in your AI projects. Whether you’re enhancing customer service, automating content generation, or delving into new research areas, Snowflake Cortex AI provides the tools you need to succeed.
References:
- https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-finetuning#label-cortex-finetuning-monitor
- https://docs.snowflake.com/en/sql-reference/functions/finetune-describe
- https://quickstarts.snowflake.com/guide/finetuning_llm_using_snowflake_cortex_ai/index.html?index=..%2F..index#2
- https://www.youtube.com/watch?v=4RItxPcq4vk&t=4s