Building a Coding Assistant using Snowflake Notebooks and Gemma 2B
In this tutorial, we’ll guide you through the process of fine-tuning the Gemma 2B model for coding tasks using Snowflake Notebooks. Snowflake’s platform provides a powerful, scalable environment for machine learning, and by leveraging their integrated notebooks, you can easily customize models to suit your needs. Whether you’re enhancing the model for specific programming languages or optimizing its understanding of code structure, this step-by-step guide will help you unlock the potential of Gemma 2B for coding-related tasks. Let’s dive into setting up your environment and getting started!
Snowflake Notebooks are part of Snowflake’s Data Cloud platform, providing an integrated development environment (IDE) for data engineers, data scientists, and analysts to develop and run code directly within the Snowflake environment. They are designed for interactive computing, supporting SQL, Python, and other languages using Snowpark, Snowflake’s framework for writing complex data pipelines and processing logic.
Here are key features of Snowflake Notebooks:
- Seamless Integration with Snowflake: Since the notebooks are part of the Snowflake ecosystem, they allow easy access to Snowflake’s data warehouse and its features like the query engine, data storage, and security.
- Multi-language Support: Snowflake Notebooks support a variety of languages, most notably SQL and Python. With Snowpark, users can write scalable data transformation and machine learning pipelines in Python while leveraging Snowflake’s compute resources.
- Collaborative Development: Multiple users can collaborate within the same notebook, sharing code, comments, and results in real-time, which makes it useful for teams working on data projects.
- Access to Data Science Libraries: With support for Python, Snowflake Notebooks integrate with popular data science libraries like Pandas, Numpy, and Scikit-learn, allowing data scientists to analyze and manipulate data directly in the notebook environment.
- Integrated with Snowflake’s Compute: Notebooks leverage Snowflake’s virtual warehouses for executing code, ensuring scalable, distributed computing power. This makes them efficient for running large-scale data queries and machine learning models without worrying about resource management.
- Version Control: Snowflake Notebooks also offer version control capabilities, enabling users to track changes, roll back to previous versions, and maintain the integrity of their work.
- Interactive Visualization: The environment supports interactive visualizations, enabling users to quickly generate and explore charts, graphs, and dashboards within the notebook itself.
- Security and Governance: As with the broader Snowflake ecosystem, notebooks benefit from Snowflake’s built-in security and governance features, ensuring that data and code execution follow the required access controls and compliance standards.
Snowflake Notebooks are particularly useful for data teams who want to stay within the Snowflake ecosystem while taking advantage of an interactive and collaborative coding environment.
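To make this concrete, here is a minimal sketch of what a Snowpark cell in a notebook might look like; the table name is a placeholder, and the snippet assumes the notebook's built-in authenticated session.
# Minimal Snowpark sketch: read a table from the warehouse inside a Snowflake Notebook.
# MY_DB.MY_SCHEMA.CODE_SAMPLES is a placeholder table name.
from snowflake.snowpark.context import get_active_session

session = get_active_session()                        # the notebook already has an authenticated session
df = session.table("MY_DB.MY_SCHEMA.CODE_SAMPLES")    # lazy Snowpark DataFrame
df.limit(5).show()                                    # run the query on a Snowflake warehouse and preview rows

# Convert to pandas if you want to hand the data to Python ML libraries
pdf = df.to_pandas()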
Fine-tuning a model like Gemma-2 might sound daunting, but the universe has already given you a magical tool: LoRA (Low-Rank Adaptation). Let me explain how LoRA simplifies this quest, transforming what could have been a difficult journey into a smooth adventure.
The Problem: Traditional Finetuning
In the past, fine-tuning a massive model like Gemma-2 would require colossal compute resources — think multiple GPUs and heaps of memory. You might spend days or even weeks on the process. In short, it was like trying to move a mountain with a spoon.
LoRA: The Magical Shortcut
Then came LoRA, a method that made fine-tuning as easy as waving a wand. Instead of retraining the whole model, LoRA lets you update only a small fraction of the parameters. It does this by freezing the original weights and injecting small, trainable low-rank adapters into the layers of the Gemma-2 model. Imagine being tasked with editing a massive book: instead of rewriting whole pages, you just leave sticky notes with corrections. That's what LoRA does for model parameters.
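To put a number on the sticky-note intuition, here is a small back-of-the-envelope sketch; the layer dimensions are illustrative, not Gemma-2's actual sizes.
# Rough parameter-count comparison for one weight matrix W of shape (d_out, d_in).
# The sizes below are illustrative, not Gemma-2b's actual dimensions.
d_out, d_in, r = 2048, 2048, 64

full_update_params = d_out * d_in        # updating W directly
lora_params = d_out * r + r * d_in       # LoRA trains B (d_out x r) and A (r x d_in)

print(f"full update: {full_update_params:,} parameters")   # 4,194,304
print(f"LoRA update: {lora_params:,} parameters")           # 262,144
print(f"ratio: {full_update_params / lora_params:.0f}x fewer trainable parameters")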
Here’s how your journey with LoRA unfolded:
- Preparation: Setting Up the Environment You opened Snowflake’s Notebook, where all the magic happens. With your data neatly stored in Snowflake’s data warehouse, you pulled in your dataset of coding problems and solutions. You weren’t starting from scratch, though — you had access to Snowpark to manipulate the data and shape it for training.
- Loading Gemma-2: Bringing Gemma-2 into your notebook felt like calling upon a great beast of knowledge. But don’t worry; thanks to Hugging Face’s Transformers library, you loaded Gemma-2 in just a few lines of code. With Snowflake’s integrated compute, it took seconds.
- Activating LoRA: Here's where the magic of LoRA came into play. Instead of painstakingly updating all 2 billion parameters of Gemma-2, you employed parameter-efficient fine-tuning. LoRA froze the original weights and trained only the small adapter matrices attached to the targeted layers, a tiny fraction of the total parameter count, yet enough to turn Gemma-2 into your perfect coding assistant. You specified the low-rank adapters and initiated the training.
- Training Without Breaking a Sweat: In the old days, you’d need special hardware — GPUs galore — to finetune such a large model. But with LoRA, you were able to run this right in Snowflake’s environment. Since the adapters reduced the compute complexity, it was efficient and fast. The model fine-tuned itself with your specific dataset — adapting to your style of coding, understanding your team’s idiosyncrasies, and learning to spot your most common bugs.
- Minimal Memory, Maximum Efficiency: Paired with 4-bit quantization (the QLoRA recipe), the frozen base model was loaded in a compact quantized form while the small adapters trained in higher precision, making the whole process lighter and faster. All of this came with almost no loss in performance, impressive given that you didn't need a whole data center to get Gemma-2 ready.
- Deployment: Your Own Custom Gemma-2: After a smooth training session, you didn’t just have Gemma-2 anymore. You had a customized, specialized version that was attuned to your needs. You were able to deploy it straight from Snowflake’s notebook, testing it right there and seeing how your finetuning improved its performance on your tasks.
Let’s start coding!
We will leverage GPU_NV_M for our fine-tuning process.
We will start by installing the required libraries.
!pip install datasets torch peft accelerate bitsandbytes trl
This command installs several key Python libraries necessary for fine-tuning the Gemma 2B model in a Snowflake Notebook:
- datasets: Provides access to a variety of datasets for machine learning.
- torch: The PyTorch library, crucial for deep learning tasks.
- peft: A library for parameter-efficient fine-tuning, allowing us to fine-tune large models with fewer resources.
- accelerate: Helps distribute model training across multiple GPUs or environments.
- bitsandbytes: Enables memory-efficient training through 8-bit optimizers and 4-bit quantization.
- trl: Provides tools for training transformer models, including the SFTTrainer we use for supervised fine-tuning.
This setup ensures that we have all the dependencies required for the fine-tuning process.
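As an optional sanity check, you can confirm the libraries import cleanly and that the notebook sees a GPU (version numbers will vary with what pip resolved):
# Optional sanity check: confirm the freshly installed libraries import cleanly.
import datasets, torch, peft, accelerate, bitsandbytes, trl

for lib in (datasets, torch, peft, accelerate, bitsandbytes, trl):
    print(lib.__name__, lib.__version__)

# Also confirm that the notebook's GPU is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
With the environment verified, the next step is to configure 4-bit quantization for loading the model.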
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization settings for loading the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
This configuration enables 4-bit quantization for memory-efficient model loading, using the nf4 quantization type with bfloat16 compute precision, further optimized with double quantization for better performance.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the entire model on GPU 0
device_map = {"": 0}
model_id = "google/gemma-2b"

print('loading tokenizer')
# ChatML-formatted tokenizer for Gemma
tokenizer_id = "philschmid/gemma-tokenizer-chatml"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

print('loading model')
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map=device_map)
model.config.use_cache = False
model.config.pretraining_tp = 1
This code loads the Gemma 2B model and tokenizer for causal language modeling. The tokenizer is fetched from a ChatML-formatted tokenizer repository, and the model is loaded onto GPU 0 using the 4-bit quantization configuration (bnb_config) for memory efficiency. Caching is disabled (use_cache = False) because it conflicts with gradient checkpointing during training, and pretraining_tp = 1 keeps the linear layers running as single, unsliced matrix multiplications.
from datasets import load_dataset
dataset_name = "lucasmccabe-lmi/CodeAlpaca-20k"
dataset = load_dataset(dataset_name, split="train")
This snippet loads the “CodeAlpaca-20k” dataset from Hugging Face’s dataset repository. It retrieves only the training split (split="train"), which will be used to fine-tune the model on coding-related tasks.
import torch
import bitsandbytes as bnb

def find_all_linear_names(model):
    """Collect the names of all 4-bit linear layers to target with LoRA."""
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            # keep only the final attribute name (model-specific)
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

target = find_all_linear_names(model)
print(target)
In this code, I define a function called find_all_linear_names to identify and gather the names of all 4-bit linear layers (bnb.nn.Linear4bit) within the model. It iterates through the model’s modules and adds the relevant layer names to a set. I exclude the "lm_head" layer, since attaching adapters to it can cause problems when working in 16-bit precision. The target variable stores the list of identified layers, which I print to verify. This list determines which model layers LoRA will adapt during fine-tuning.
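For a Gemma-style decoder, this typically returns the attention and MLP projection layers, something like ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']; the exact list can vary with the model revision and how it was loaded, so trust the printed output over this example.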
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from transformers import TrainingArguments
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=64,
    bias="none",
    target_modules=target,
    task_type="CAUSAL_LM",
)
args = TrainingArguments(
    output_dir="gemma-2b-coder",      # directory to save model and repository id
    num_train_epochs=1,               # number of training epochs
    per_device_train_batch_size=1,    # batch size per device during training
    gradient_accumulation_steps=1,    # number of steps before performing a backward/update pass
    gradient_checkpointing=True,      # use gradient checkpointing to save memory
    optim="adamw_torch_fused",        # use fused AdamW optimizer
    logging_steps=100,                # log every 100 steps
    save_strategy="epoch",            # save checkpoint every epoch
    bf16=True,                        # use bfloat16 precision
    tf32=True,                        # use tf32 precision
    learning_rate=2e-4,               # learning rate, based on LoRA paper
    max_grad_norm=0.3,                # max gradient norm based on LoRA paper
    warmup_ratio=0.03,                # warmup ratio based on LoRA paper
    lr_scheduler_type="constant",     # use constant learning rate scheduler
    push_to_hub=False,                # do not push the model to the Hugging Face Hub
    report_to="tensorboard",          # report metrics to TensorBoard
)
Here, I configure the fine-tuning process using LoRA (Low-Rank Adaptation) and set up the training parameters.
First, I load the LoRA configuration (peft_config), specifying key parameters like lora_alpha, r, and target_modules (the layers identified earlier), which control how the model is adapted for efficient fine-tuning.
Then, I define TrainingArguments to control the training setup, including the number of epochs, batch size, gradient accumulation, and optimizations like bf16 precision and the fused AdamW optimizer. This configuration ensures efficient, resource-friendly training tailored to the Gemma 2B model's needs.
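The trainer below also expects a formatting_prompts_func and a collator, which are not shown above. Here is one reasonable sketch of what they could look like, assuming the ChatML template used by the philschmid tokenizer, the usual CodeAlpaca field names (instruction, input, output), and TRL's DataCollatorForCompletionOnlyLM so that the loss is computed only on the assistant's answer.
from trl import DataCollatorForCompletionOnlyLM

def formatting_prompts_func(examples):
    # Turn each CodeAlpaca row into a ChatML conversation string.
    # Field names (instruction/input/output) are assumptions based on the dataset's usual schema.
    texts = []
    for instruction, inp, output in zip(examples["instruction"], examples["input"], examples["output"]):
        user_msg = instruction if not inp else f"{instruction}\n{inp}"
        messages = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": output},
        ]
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return texts

# Mask everything before the assistant turn so only the solution tokens contribute to the loss
response_template = "<|im_start|>assistant"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
Because the completion-only collator needs each example tokenized individually, packing is left disabled in the trainer below.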
from trl import SFTTrainer

max_seq_length = 1512  # max sequence length for the model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    args=args,
    packing=False,
)
with torch.no_grad():
    torch.cuda.empty_cache()

# start training; checkpoints and the final model are saved to the output directory
trainer.train()

# save the trained LoRA adapters to the output directory
trainer.save_model()
In this section, I initialize the SFTTrainer to start the fine-tuning process. I set max_seq_length to 1512 so longer coding prompts and solutions fit in a single sequence. The SFTTrainer is provided with the model, training dataset, LoRA configuration, tokenizer, and the helpers defined above: formatting_prompts_func (for prompt formatting) and collator (for data batching).
Finally, I clear the GPU cache to free up memory, start the training, and save the fine-tuned adapters to the output directory (with push_to_hub=True they could also be pushed to the Hugging Face Hub). This leaves the model optimized for coding tasks and stored for future use.
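Note that trainer.save_model() stores only the LoRA adapters. If you want a single standalone checkpoint for serving, one common approach (sketched here with peft's merge utilities, under the assumption that gemma-2b-coder is the adapter directory saved above) is to merge the adapters back into the base weights:
import torch
from peft import AutoPeftModelForCausalLM

# Load the base model with the trained adapters attached, then fold the adapters into the weights
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "gemma-2b-coder",              # output_dir used above
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
merged_model = merged_model.merge_and_unload()

# Save a standalone model + tokenizer that can be loaded without peft
merged_model.save_pretrained("gemma-2b-coder-merged", safe_serialization=True)
tokenizer.save_pretrained("gemma-2b-coder-merged")
The test below instead loads the adapter directory directly into a pipeline, which transformers can do when peft is installed; the merged checkpoint is handy if you want to serve the model without peft.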
And now, finally, I will test the fine-tuned model.
from transformers import pipeline

# Load the fine-tuned model (saved to the output_dir above) into a text-generation pipeline
pipe = pipeline(task="text-generation", model="gemma-2b-coder", tokenizer=tokenizer, max_length=200)

# id of the ChatML end-of-turn token, used to stop generation
eos_token = tokenizer("<|im_end|>", add_special_tokens=False)["input_ids"][0]

def test_inference(prompt):
    # wrap the user prompt in the chat template the tokenizer expects
    prompt = pipe.tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=100, do_sample=True, temperature=0.1, top_k=50, top_p=0.95, eos_token_id=eos_token)
    # strip the prompt so only the generated completion is returned
    return outputs[0]['generated_text'][len(prompt):].strip()

prompt = "Write a hello world program in Python"
test_inference(prompt)
And we get the output: print("Hello World")
We can also build a Streamlit app around the fine-tuned model.
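As an illustration of what that could look like, here is a minimal sketch of a Streamlit script; it reuses the same pipeline setup as above and is a starting point rather than a production app.
# streamlit_app.py - minimal sketch of a chat-style front end for the fine-tuned model
import streamlit as st
from transformers import pipeline, AutoTokenizer

st.title("Gemma 2B Coding Assistant")

@st.cache_resource  # load the model once and reuse it across reruns
def load_pipe():
    tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
    return pipeline("text-generation", model="gemma-2b-coder", tokenizer=tokenizer, max_length=512)

pipe = load_pipe()

prompt = st.text_area("Describe the code you need:", "Write a hello world program in Python")
if st.button("Generate"):
    chat = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
    )
    output = pipe(chat, max_new_tokens=200, do_sample=True, temperature=0.1)[0]["generated_text"]
    st.code(output[len(chat):].strip(), language="python")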
Conclusion
In this tutorial, I demonstrated how to fine-tune the Gemma 2B model for coding tasks using Snowflake Notebooks. By leveraging libraries like Hugging Face’s Transformers, PEFT, and TRL, I set up an efficient training environment with 4-bit quantization and LoRA for parameter-efficient fine-tuning. This approach optimizes resource utilization while maintaining model performance. The final step involved training the model and saving it for future applications, making it ready for coding-related tasks.