Multilingual Audio based Farmer’s helper bot with Intel’s OpenVINO and Gemma
Open-sourcing large language models (LLMs) broadens global access to AI technology: even if it is unlikely, the next big AI innovation could come from someone without access to large-scale computing resources. Open models also give AI application developers far more flexibility in choosing their infrastructure. Together, readily available CPUs, scalability, and the open-source licensing of LLMs go a long way toward supporting AI progress.
This article walks through fine-tuning Gemma on a farmers' query dataset, adding speech recognition and translation for multilingual audio input, and leveraging OpenVINO quantization to speed up inference.
Gemma
Gemma is designed with responsible AI development in mind and builds on the same research and technology as the Gemini models. Developed by Google DeepMind and other Google teams, Gemma is a lightweight, open model family inspired by Gemini and named after the Latin word for “precious stone.” Google released not only the model weights but also tooling to promote innovation, collaboration, and responsible use of Gemma. Because they share technical foundations with Gemini, Gemma 2B and 7B deliver top performance for their sizes compared with other open models, and they can run on a developer’s laptop or desktop. Remarkably, Gemma outperforms much larger models on key benchmarks while adhering to Google’s standards for safe and responsible outputs.
Quantization
Quantization reduces memory and computational needs by decreasing the bit width of model weights and activations, such as from 16-bit floating point (fp16) to 8-bit integers (int8). This reduction means the model uses less memory during inference, speeding up tasks like text generation decoding. Furthermore, both weight and activation quantization allow faster matrix multiplication through integer arithmetic.
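To make the idea concrete, here is a toy NumPy sketch of symmetric per-tensor int8 weight quantization. It only illustrates the arithmetic behind the memory savings; it is not what OpenVINO does internally (later we use OpenVINO's group-wise 4-bit weight quantization instead).

import numpy as np

# Toy example: map fp32 weights onto 256 integer levels with a single scale.
weights = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(weights).max() / 127.0                      # per-tensor scale
w_int8 = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale              # values seen at inference

print("max abs quantization error:", np.abs(weights - w_dequant).max())
print("fp32 bytes:", weights.nbytes, "| int8 bytes:", w_int8.nbytes)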
In this post, we’ll put these ideas into practice. Starting from the fine-tuned Gemma model, we’ll apply 4-bit quantization to the model weights using the Intel OpenVINO integration available in the Optimum Intel library.
Let’s start first by installing all the necessary packages.
!pip install torch==2.0.1
!pip install transformers
!pip install bitsandbytes
!pip install peft
!pip install accelerate
!pip install datasets
!pip install trl
!pip install einops
!pip install scipy
!pip install --upgrade openvino-nightly
Gemma is a gated model, so you must accept its terms and conditions before fine-tuning. The following lines log you in to the Hugging Face Hub using your HF_TOKEN.
from huggingface_hub import login
login("HF_TOKEN")  # replace with your Hugging Face access token
Dataset
We’ll use the Farmers Call Query Data from Kaggle, which is sourced from data.gov.in, an open data platform by the Government of India. This dataset relates to the Kisan Call Centre, where farmers submitted queries and received expert responses.
import time
from datasets import load_dataset
from trl import SFTTrainer
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments)
file_path = "questionsv4.csv"
dataset = load_dataset("csv", data_files=file_path)
model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
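Before formatting, it helps to take a quick look at the loaded data; the column names below ("questions" and "answers") are the ones used throughout this post.

# Quick sanity check of the splits and columns.
print(dataset)
print(dataset["train"].column_names)
print(dataset["train"][0])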
Before beginning the fine-tuning process, we’ll convert the dataset to an Instruction, Input, and Response format.
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
def formatting_prompts_func(examples):
    EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN, otherwise generation will go on forever!
    instructions = examples["questions"]
    outputs = examples["answers"]
    # This dataset has no separate context, so the Input field stays empty.
    inputs = [" " for _ in range(len(instructions))]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
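As a quick sanity check, you can print one formatted training example (the text column was added by the map call above):

# Inspect the first formatted example (truncated for readability).
print(dataset["train"][0]["text"][:500])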
Finetuning
The SFTTrainer from Hugging Face’s TRL library is tailored for efficient fine-tuning of large models on smaller datasets. It simplifies training with a user-friendly interface and memory-efficient methods. To use it, you provide your model, training parameters such as the learning rate and number of epochs, and your training data.
training_arguments = TrainingArguments(
    output_dir="./results",
    bf16=True,            # train in bfloat16
    use_ipex=True,        # use Intel Extension for PyTorch optimizations
    no_cuda=True,         # run training on CPU
    fp16_full_eval=False,
)
Note the packing argument below. In the SFT Trainer, packing combines multiple short examples into a single longer input sequence for training, which improves training efficiency by reducing the number of forward passes required.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True,
)
trainer.train()
Once training is done, we save the model and upload it to the Hugging Face Hub.
new_model = "finetuned-gemma"
trainer.model.save_pretrained(new_model)
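To push the fine-tuned model and tokenizer to your Hugging Face account, a minimal sketch (the repository name is whatever you want it to be under your namespace):

# Upload the fine-tuned model and tokenizer to the Hub.
trainer.model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)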
Multilingual Capabilities
The model we trained above only understands English, but we want the bot to handle local languages with audio input. We first run ASR (automatic speech recognition) in the local language, translate the transcript into English, and feed it to the fine-tuned model. The model’s output is then translated from English back into the local language.
We cover Hindi, Telugu, Tamil and Kannada. The inference code is as follows:
import os
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from transformers import AutoModelForSeq2SeqLM, pipeline
from huggingface_hub import login
import numpy as np
new_model = "tensorgirl/finetuned-gemma"
model = AutoModelForCausalLM.from_pretrained(new_model, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(new_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
generator = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Load the NLLB translation model under separate names so it does not overwrite the Gemma model and tokenizer.
trans_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
trans_tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
device = 0 if torch.cuda.is_available() else -1
def translate(text, src_lang, tgt_lang):
    translation_pipeline = pipeline("translation", model=trans_model, tokenizer=trans_tokenizer, src_lang=src_lang, tgt_lang=tgt_lang, max_length=400, device=device)
    result = translation_pipeline(text)
    return result[0]["translation_text"]
def English(audio):
    transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    return transcriber({"sampling_rate": sr, "raw": y})["text"]

def Hindi(audio):
    transcriber = pipeline("automatic-speech-recognition", model="theainerd/Wav2Vec2-large-xlsr-hindi")
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    text = transcriber({"sampling_rate": sr, "raw": y})["text"]
    return translate(text, "hin_Deva", "eng_Latn")

def Telegu(audio):
    transcriber = pipeline("automatic-speech-recognition", model="anuragshas/wav2vec2-large-xlsr-53-telugu")
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    text = transcriber({"sampling_rate": sr, "raw": y})["text"]
    return translate(text, "tel_Telu", "eng_Latn")

def Tamil(audio):
    transcriber = pipeline("automatic-speech-recognition", model="Harveenchadha/vakyansh-wav2vec2-tamil-tam-250")
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    text = transcriber({"sampling_rate": sr, "raw": y})["text"]
    return translate(text, "tam_Taml", "eng_Latn")

def Kannada(audio):
    transcriber = pipeline("automatic-speech-recognition", model="vasista22/whisper-kannada-medium")
    sr, y = audio
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    text = transcriber({"sampling_rate": sr, "raw": y})["text"]
    return translate(text, "kan_Knda", "eng_Latn")
def predict(audio, language):
    if language == "English":
        message = English(audio)
    if language == "Hindi":
        message = Hindi(audio)
    if language == "Telegu":
        message = Telegu(audio)
    if language == "Tamil":
        message = Tamil(audio)
    if language == "Kannada":
        message = Kannada(audio)
    print(message)

    sequences = generator(
        message,
        max_length=200,
        do_sample=False,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    answer = ""
    for seq in sequences:
        answer = answer + seq["generated_text"] + " "
    print(answer)

    if language == "English":
        return answer
    if language == "Hindi":
        return translate(answer, "eng_Latn", "hin_Deva")
    if language == "Telegu":
        return translate(answer, "eng_Latn", "tel_Telu")
    if language == "Tamil":
        return translate(answer, "eng_Latn", "tam_Taml")
    if language == "Kannada":
        return translate(answer, "eng_Latn", "kan_Knda")
    return answer
nllb-200-distilled-600M is the main model responsible for the translation. Here are its features.
- Machine Translation Model: Its primary function is to translate text from one language to another.
- NLLB-200 Variant: NLLB-200 is the base model, and this particular version (facebook/nllb-200-distilled-600M) is a smaller, “distilled” variant. Distillation refers to a technique that creates a more efficient model by compressing the knowledge from a larger model.
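As a quick usage example of the translate helper defined above (the English sentence is just an illustrative query):

# Translate an English query into Hindi with the NLLB helper.
print(translate("Which fertilizer should I use for paddy?", "eng_Latn", "hin_Deva"))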
The current pipeline takes almost 2–3 minutes for a single inference. We will speed it up using quantization.
Intel OpenVINO
OpenVINO is an open-source toolkit designed to optimize and deploy deep learning models. Here’s a breakdown of its key features:
- Focus on Inference: OpenVINO excels at optimizing the process of using pre-trained deep learning models to make predictions on new data, also known as inference.
- Cross-Platform Compatibility: This toolkit allows you to develop AI solutions that can run on various hardware platforms, not just Intel’s.
- Write Once, Deploy Anywhere: With OpenVINO, you can develop your AI application once and then deploy it on different devices without needing to rewrite the code for each platform.
- Performance Boost: OpenVINO optimizes models for faster inference, especially on Intel hardware, making your AI applications run more efficiently.
- Open-Source and Supported Versions: Two versions are available:
  1. The open-source toolkit, free to use and backed by the OpenVINO community.
  2. The Intel Distribution of OpenVINO toolkit, which includes additional features and support from Intel.
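As a quick check of which devices OpenVINO can target on your machine, a small sketch using the OpenVINO runtime API:

from openvino.runtime import Core

# List the inference devices (e.g. CPU, GPU) OpenVINO detects on this machine.
core = Core()
print(core.available_devices)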
Inferencing with OpenVINO
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
model_id = "tensorgirl/finetuned-gemma"
# Create the quantization configuration with desired quantization parameters
q_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.8)
# Create OpenVINO configuration with optimal settings for this model
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "model_cache", "INFERENCE_PRECISION_HINT": "f32"}
tokenizer = AutoTokenizer.from_pretrained(model_id)

device = "CPU"  # OpenVINO device name; use "GPU" if an Intel GPU is available
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,                   # export the PyTorch checkpoint to OpenVINO IR
    quantization_config=q_config,  # 4-bit weight quantization
    device=device,
    ov_config=ov_config,
)
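Once loaded, the quantized model works with the usual transformers generation API. A small usage sketch (the prompt below is just an illustrative query in the same Instruction/Input/Response format used for fine-tuning), which also saves the converted model so the export does not have to be repeated:

# Generate an answer with the quantized OpenVINO model.
query = "### Instruction:\nHow do I control stem borer in paddy?\n### Input:\n \n### Response:\n"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Save the quantized model locally to avoid re-exporting next time.
model.save_pretrained("finetuned-gemma-int4-ov")
tokenizer.save_pretrained("finetuned-gemma-int4-ov")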
Similarly, all the models used for ASR and translation can be quantized.
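For example, the NLLB translation model can be exported to OpenVINO with OVModelForSeq2SeqLM. A minimal sketch, assuming your optimum-intel version accepts a weight-quantization config for sequence-to-sequence models (if not, drop the quantization_config argument and you still get the OpenVINO export):

from optimum.intel import OVModelForSeq2SeqLM, OVWeightQuantizationConfig

# Export NLLB to OpenVINO with 8-bit weight compression (support assumed, see note above).
trans_model_ov = OVModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
# The resulting model is a drop-in replacement in the translation pipeline above.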
You can try the bot on Hugging Face at this link.
References:
- https://blog.google/technology/developers/gemma-open-models/
- Farmers helper bot using Intel oneAPI, Gemma-2B and Gemini Pro | by Usha Rengaraju | Mar 2024 | Medium
- https://huggingface.co/blog/phi2-intel-meteor-lake
- https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html
- https://huggingface.co/facebook/nllb-200-distilled-600M