Banking-ASR: Enhancing Speech Recognition with Intel oneAPI

Usha Rengaraju
11 min read · Mar 12, 2024


In today’s digital age, automatic speech recognition (ASR) has become a pivotal element in voice-driven applications. This project explores the use of pretrained ASR models, focusing on the Whisper-tiny model, for speech recognition on the MINDS14 dataset. Whisper-tiny is the smallest Whisper checkpoint, trading some accuracy for a small footprint and fast inference, which makes it well suited to devices with constrained resources. The MINDS14 dataset, with its diverse languages and accents, presents a formidable challenge to ASR models in comprehending the intricacies of everyday speech. By combining these two components, this study aims to improve speech recognition accuracy and to gain insight into both the limitations and the potential of pretrained ASR models in practical application scenarios.

This project can provide a valuable starting point for banks to enhance their voice and text-based customer interactions by enabling them to build more accurate and user-friendly conversational AI systems. Improved intent recognition leads to smoother voice interactions for customers. The model can correctly understand what the customer wants, allowing the bank to route them to the appropriate service or provide the information they need.

Whisper

Whisper-ASR is an automatic speech recognition (ASR) system created by OpenAI. Here’s a breakdown of its key features:

  • Large-scale training: Whisper is trained on a massive dataset of 680,000 hours of multilingual and multitask-labeled data collected from the web. This vast amount of training data allows it to handle various speech characteristics like accents, background noise, and technical jargon.
  • Multilingual capabilities: Whisper can transcribe speech in English and several other languages. It can also translate speech from various languages into English.
  • Simple architecture: The underlying architecture of Whisper is a transformer-based encoder-decoder, which is a common approach for many machine learning tasks. This makes it efficient and potentially easier to understand and modify.
  • Open-source availability: OpenAI has released Whisper as an open-source project, allowing researchers and developers to access the code and build upon it.
  • Focus on research and development: While Whisper has potential for various ASR applications, OpenAI positions it more as a research tool for studying speech recognition and its limitations. However, it can also be a valuable solution for developers, particularly for English speech recognition tasks.
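
As a quick, hedged illustration of that last point, the sketch below transcribes a local audio file with Whisper-tiny through the Hugging Face pipeline API (sample.wav is a placeholder path, and ffmpeg must be available for decoding):

from transformers import pipeline

# Minimal out-of-the-box transcription with the tiny checkpoint
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr("sample.wav")  # placeholder path to any speech recording
print(result["text"])
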
Whisper architecture

The Whisper architecture embodies a straightforward end-to-end methodology, realized through an encoder-decoder Transformer setup. The input audio undergoes segmentation into 30-second segments, followed by conversion into a log-Mel spectrogram and subsequent processing by an encoder. Concurrently, a decoder is conditioned to forecast the corresponding textual caption, interspersed with distinct tokens guiding the unified model to execute various tasks. These tasks encompass language identification, timestamping at the phrase level, multilingual speech transcription, and translation of speech into English.
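
To make the input side concrete, here is a small sketch (assuming the openai/whisper-tiny checkpoint) that converts a 30-second audio array into the log-Mel spectrogram the encoder consumes:

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# 30 seconds of silence at Whisper's expected 16 kHz sampling rate, purely for illustration
audio = np.zeros(30 * 16000, dtype=np.float32)

features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
# The encoder input is a log-Mel spectrogram of shape (batch, n_mels, frames) = (1, 80, 3000)
print(features.input_features.shape)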

MINDS14 dataset

The MINDS14 dataset is a resource designed for training and evaluating models that perform intent detection on spoken language. Here’s a closer look at what it offers:

Focus: Intent Detection in E-banking Domain

  • MINDS14 concentrates on spoken interactions related to electronic banking.
  • It provides recordings of people asking questions or issuing commands to an e-banking system.

Multilingual Support:

  • A key feature of MINDS14 is its diversity in languages.
  • It includes recordings with 14 different language varieties, making it valuable for training models that can handle speech across various languages.

Data Structure:

  • The dataset consists of audio recordings along with corresponding text information.
  • Each recording is accompanied by:
      • Transcription: textual representation of what’s being spoken in the recording.
      • English transcription (if applicable): for languages other than English, an additional English translation of the spoken content is provided.
      • Intent label: a category indicating the overall purpose or intent behind the spoken query in the recording (e.g., check balance, transfer money).

Accessibility:

  • MINDS14 is publicly available on the Hugging Face Hub, a platform for sharing datasets for machine learning tasks.
  • This allows researchers and developers to easily access and use the data for training their intent detection models.
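
As a quick look at the fields described above, the sketch below loads the en-US configuration straight from the Hub (the article itself downloads the Parquet copy via the CLI later on) and prints one example:

from datasets import load_dataset

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

example = minds[0]
print(example["transcription"])           # what was said
print(example["english_transcription"])   # English translation (identical for en-US)
print(example["intent_class"])            # integer intent label
print(example["audio"]["sampling_rate"])  # sampling rate of the recording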

Intel Extension for PyTorch

The Intel® Extension for PyTorch enhances PyTorch by integrating the latest features and optimizations tailored for enhanced performance specifically on Intel hardware. These optimizations leverage Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs, along with the Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Additionally, the Intel® Extension for PyTorch facilitates straightforward GPU acceleration for Intel discrete GPUs through the PyTorch xpu device.
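
As a hedged sketch of how the extension is typically applied (the training code below does not depend on it), ipex.optimize prepares a model for faster CPU inference; the bfloat16 dtype assumes a CPU with AMX support:

import numpy as np
import torch
import intel_extension_for_pytorch as ipex
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# Apply Intel-specific operator and graph optimizations for CPU inference;
# bfloat16 lets oneDNN dispatch to AMX on recent Xeon processors
model = ipex.optimize(model, dtype=torch.bfloat16)

dummy_audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
inputs = feature_extractor(dummy_audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    generated_ids = model.generate(inputs.input_features)
print(generated_ids.shape)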

Along with that, we will also leverage quantization using the Intel Neural Compressor:

Intel Neural Compressor

  • Focuses on model compression techniques like quantization and pruning.
  • Reduces model size and improves inference speed for deployment on CPUs or GPUs.
  • Works with PyTorch, TensorFlow, and other frameworks.

Let’s start by installing the necessary libraries. Make sure that torch and intel_extension_for_pytorch have compatible versions (here, 2.0.1 and 2.0.100).

!pip install torch==2.0.1
!pip install transformers
!pip install accelerate
!pip install rouge
!pip install datasets[audio]
!pip install evaluate
!pip install jiwer
!pip install intel_extension_for_pytorch==2.0.100

Prior to running the script, it is crucial to set the following environment variable so that oneDNN selects the Intel AMX ISA. In a notebook, !export runs in a subshell and does not persist, so it is safer to set the variable from Python:

import os
os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX512_CORE_AMX"

Now, let’s do all the necessary imports:

from datasets import load_dataset, load_metric, ClassLabel, Audio
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, Trainer, TrainingArguments
from transformers import pipeline
from transformers import WhisperConfig, WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding, DataCollator
from transformers.utils import send_example_telemetry

from dataclasses import dataclass
from typing import Any, Dict, List, Union

import numpy as np
import torch
import transformers
import datasets
import evaluate                    # needed later for the WER metric
import librosa
import librosa.display
import matplotlib.pyplot as plt    # needed for the label-distribution plot below
import re
import nltk

nltk.download('punkt')

Some of the important libraries are:

  • NLTK (Natural Language Toolkit): Focuses on the core tasks of NLP like text classification, tokenization (breaking text into units), and stemming/lemmatization (reducing words to their base form).
  • librosa: Deals with audio analysis. It provides tools for music information retrieval, audio processing like feature extraction, and working with audio files.
  • Transformers: This library is on the cutting edge of NLP, allowing you to build and use powerful deep learning models for tasks like text summarization, machine translation, and question answering. It’s built on the Transformer architecture, a powerful neural network design for NLP.
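
A tiny, hedged illustration of the first two libraries (sample.wav is a placeholder audio path):

import nltk
import librosa

# Word-level tokenization of a banking query with NLTK
print(nltk.word_tokenize("I would like to check my account balance"))

# Load and resample an audio file to 16 kHz with librosa
waveform, sr = librosa.load("sample.wav", sr=16000)
print(waveform.shape, sr)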

Let’s download the dataset using HuggingFace CLI.

!huggingface-cli download PolyAI/minds14 --repo-type dataset --revision refs/convert/parquet --local-dir . --local-dir-use-symlinks False  --include 'en-US/*'
minds_enUS = load_dataset('./en-US', split="train")

We will use only the English-US (en-US) subset for this article. Next, let’s look at the distribution of intent classes in the dataset.

intent_classes = [
    {'index': 0, 'name': 'abroad'},
    {'index': 1, 'name': 'address'},
    {'index': 2, 'name': 'app_error'},
    {'index': 3, 'name': 'atm_limit'},
    {'index': 4, 'name': 'balance'},
    {'index': 5, 'name': 'business_loan'},
    {'index': 6, 'name': 'card_issues'},
    {'index': 7, 'name': 'cash_deposit'},
    {'index': 8, 'name': 'direct_debit'},
    {'index': 9, 'name': 'freeze'},
    {'index': 10, 'name': 'latest_transactions'},
    {'index': 11, 'name': 'joint_account'},
    {'index': 12, 'name': 'high_value_payment'},
    {'index': 13, 'name': 'pay_bill'}
]

def plot_label_distribution(dataset):
    # Extract intent_class labels
    labels = dataset['intent_class']

    # Count the frequency of each label
    label_counts = {}
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

    # Sort labels by their counts in descending order
    sorted_labels = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    sorted_labels = [
        {
            'name': intent_classes[label[0]]['name'],
            'index': intent_classes[label[0]]['index'],
            'count': label[1],
        }
        for label in sorted_labels
    ]

    # Create x-axis labels combining name and index
    x_labels = [f"{label['name']} ({label['index']})" for label in sorted_labels]
    counts = [label['count'] for label in sorted_labels]

    # Plot distribution
    plt.figure(figsize=(12, 6))
    plt.bar(x_labels, counts, color='skyblue')
    plt.xlabel('Intent Class')
    plt.ylabel('Frequency')
    plt.title('Distribution of Intent Classes')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

plot_label_distribution(minds_enUS)

The dataset is fairly evenly distributed across intent classes.

One can also plot waveforms and spectrograms using the librosa library, as in the sketch below.
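
A minimal sketch of such a plot for the first recording, using the matplotlib and librosa.display imports from above:

sample = minds_enUS[0]["audio"]
y = sample["array"].astype(np.float32)
sr = sample["sampling_rate"]

fig, ax = plt.subplots(2, 1, figsize=(12, 6))

# Time-domain waveform
librosa.display.waveshow(y, sr=sr, ax=ax[0])
ax[0].set_title("Waveform")

# Log-amplitude spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax[1])
ax[1].set_title("Spectrogram (dB)")

plt.tight_layout()
plt.show()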

Let’s split the data into training and test sets (80% for training) and remove the columns we won’t need. Since Whisper’s feature extractor expects 16 kHz audio, we also resample the audio column.

final_dataset = minds_enUS.train_test_split(test_size=0.2)
print("split: ", final_dataset)

final_dataset = final_dataset.remove_columns(
    ["english_transcription",
     "intent_class",
     "lang_id"]
)

# Resample the audio column to the 16 kHz rate Whisper's feature extractor expects
final_dataset = final_dataset.cast_column("audio", Audio(sampling_rate=16000))

After that, let’s do some preprocessing and create the vocabulary.

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["transcription"]).lower() + " "
    batch["transcription"] = batch['transcription'].rstrip()
    return batch

final_dataset = final_dataset.map(remove_special_characters)

def extract_all_chars(batch):
    all_text = " ".join(batch["transcription"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = final_dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,  # process each whole split at once so the vocabulary covers every example
    keep_in_memory=True,
    remove_columns=final_dataset.column_names['train']
)

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
print(vocab_dict)

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))

It’s time now to load the model.

whisper = 'openai/whisper-tiny'

tokenizer = WhisperTokenizer.from_pretrained(whisper, language="english")
feature_extractor = WhisperFeatureExtractor.from_pretrained(whisper)
processor = WhisperProcessor(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
)

model = WhisperForConditionalGeneration.from_pretrained(whisper)
model.config.forced_decoder_ids = None

The processor here bundles the feature extractor, which converts raw audio into log-Mel spectrogram features, and the tokenizer, which maps text to token IDs and back. It gives us a single interface for carrying an utterance from raw audio input to the final text output.

Let’s try the model on one of the samples and check the results.

audio_array = final_dataset['train']['audio'][0]['array']
audio_sr = final_dataset['train']['audio'][0]['sampling_rate']
audio_text = final_dataset['train']['transcription'][0]

encoded_input = processor(
    audio_array,
    sampling_rate=audio_sr,
    return_tensors='pt',
).input_features

encoded_label = processor(
    text=audio_text,
    return_tensors='pt',
).input_ids

# Generate a transcription from the (not yet fine-tuned) model
predicted_ids = model.generate(encoded_input)

decoded_predict = processor.batch_decode(predicted_ids, skip_special_tokens=True)
decoded_label = processor.batch_decode(encoded_label, skip_special_tokens=True)

print("decoded predict: ", decoded_predict)
print("decoded label: ", decoded_label)

We can clearly see the difference in the predicted and actual labels; thus, we would need to do some fine-tuning. Let’s start by preparing the dataset.

def prepare_datasets(batch):
    audio = batch['audio']

    batch = processor(
        audio=audio['array'],
        sampling_rate=audio['sampling_rate'],
        text=batch['transcription'],
    )

    batch['input_length'] = len(audio['array']) / audio['sampling_rate']

    return batch

encoded_datasets = final_dataset.map(
    prepare_datasets,
    remove_columns=final_dataset.column_names['train'],
    num_proc=4,
)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel input features into a batch tensor
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the tokenized transcriptions and mask padding with -100 so it is ignored by the loss
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # If the BOS token was prepended during tokenization, drop it here; it is added back at generation time
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

Word error rate (WER) will be our primary metric, with accuracy, F1, BLEU, and ROUGE computed alongside it.
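
For intuition, WER counts the word-level substitutions, insertions, and deletions needed to turn a prediction into its reference, divided by the number of reference words. A toy check with the evaluate library (installed above):

import evaluate

wer_metric = evaluate.load("wer")

# One substitution out of six reference words gives a WER of about 0.167
wer = wer_metric.compute(
    predictions=["i want to check my balanced"],
    references=["i want to check my balance"],
)
print(wer)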

from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
from rouge import Rouge
from sklearn.metrics import f1_score, r2_score

normalizer = BasicTextNormalizer()
smoother = SmoothingFunction()
rouge_score = Rouge()

def compute_f1(preds, labels):
    # Treat each prediction as correct (1) or not (0) and compare against a
    # binary "has a label" vector; a rough proxy for exact-match F1
    preds_binary = [1 if p == l else 0 for p, l in zip(preds, labels)]
    labels_binary = [1 if l != processor.tokenizer.pad_token_id else 0 for l in labels]
    return f1_score(labels_binary, preds_binary)

def compute_r2(preds, labels):
    return r2_score(labels, preds)

def compute_metrics(pred):
    metric_wer = evaluate.load("wer")

    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace the -100 padding used by the collator with the real pad token before decoding
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # Orthographic WER (case and punctuation preserved)
    wer_ortho = 100 * metric_wer.compute(predictions=pred_str, references=label_str)

    # Normalized WER (lowercased, punctuation stripped)
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    wer = metric_wer.compute(predictions=pred_str_norm, references=label_str_norm)

    # Exact-match accuracy over whole utterances
    correct_preds = sum(1 for p, l in zip(pred_str_norm, label_str_norm) if p == l)
    accuracy = correct_preds / len(pred_str_norm)

    rouge_scores = rouge_score.get_scores(pred_str_norm, label_str_norm, avg=True)

    f1 = compute_f1(pred_str_norm, label_str_norm)
    # Note: this assumes pred_ids and label_ids have been padded to the same shape
    r2 = compute_r2(pred_ids, label_ids)

    references = [[ref.split()] for ref in label_str_norm]
    predictions = [pred.split() for pred in pred_str_norm]
    bleu_score = corpus_bleu(references, predictions, smoothing_function=smoother.method1)

    return {
        "wer_ortho": wer_ortho,
        "wer": wer,
        "accuracy": accuracy,
        "f1_score": f1,
        "r2_score": r2,
        "bleu_score": bleu_score,
        "Rouge 1 r": rouge_scores['rouge-1']['r'],
        "Rouge 1 p": rouge_scores['rouge-1']['p'],
        "Rouge 1 f": rouge_scores['rouge-1']['f'],
    }

Let’s create a wrapped version of the model.generate function that always uses English as the language and "transcribe" as the task, and enables caching for generation regardless of the global use_cache setting.

from functools import partial

model.config.use_cache = False
model.generate = partial(
    model.generate,
    language="english",
    task="transcribe",
    use_cache=True,
)

We will use Seq2Seq Trainer from HuggingFace. Seq2SeqTrainer is a class from the Hugging Face Transformers library designed specifically for training sequence-to-sequence models. It inherits from the more general Trainer class but adds functionalities tailored for tasks like machine translation or text summarization.

Here’s the gist of what Seq2SeqTrainer offers:

  • Generative Metrics: It can automatically compute common evaluation metrics used for generative tasks, such as BLEU score for translation or ROUGE score for summarization. This simplifies the evaluation process.
  • Prediction with Generation: It provides methods specifically designed for tasks involving text generation, allowing you to use the model’s generate function during evaluation.

In essence, Seq2SeqTrainer streamlines the training and evaluation of sequence-to-sequence models in Hugging Face Transformers.

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-minds14-english",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=3e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=100,
    max_steps=1000,
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=100,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

trainer.train()

saved_model_pt = "whisper-tiny-minds14-en"
trainer.save_model(saved_model_pt)

We train it for 1000 steps.

Intel Neural Compressor: Shrinking and Speeding Up Deep Learning Models

Intel Neural Compressor is a free and open-source Python library that helps you make your deep learning models smaller and faster. It works on both CPUs and GPUs!

Here’s how it achieves this:

  • Model Quantization: This clever technique reduces the number of bits needed to store the model’s information. Think of it like compressing an image file. Intel Neural Compressor can convert your model from using fancy floating-point numbers (FP32) to more compact integer values (INT8), making it smaller and faster to run.
  • Automates Optimization Techniques: Neural Compressor doesn’t just do quantization. It automates several powerful methods to optimize your model, including:
      • Quantization (as explained above)
      • Compression: this might involve removing unnecessary parts of the model.
      • Pruning: similar to compression, pruning removes unimportant connections within the model.
      • Knowledge Distillation: this technique involves transferring knowledge from a large, powerful model to a smaller, faster one.
  • Works with Many Frameworks: The best part? Neural Compressor is compatible with popular deep learning frameworks like TensorFlow, PyTorch, and ONNX Runtime.

The snippet below applies post-training static quantization with the Neural Compressor. The calibration dataloader and the evaluation function are not shown in the original script; they must be defined separately (for example, over encoded_datasets['test']) before running the quantizer.

from neural_compressor.conf.config import Quantization_Conf
from neural_compressor.experimental import Quantization, common

QUANTIZED_MODEL_PATH = "quantized_model"

quant_config = Quantization_Conf()
quant_config.usr_cfg.model.framework = "pytorch_fx"
quant_config.usr_cfg.quantization.approach = "post_training_static_quant"

quantizer = Quantization(quant_config)

# trainer.save_model() wrote a directory, so reload with from_pretrained rather than torch.load
model = WhisperForConditionalGeneration.from_pretrained(saved_model_pt)

quantizer.model = common.Model(model)
quantizer.calib_dataloader = dataloader   # calibration dataloader, defined elsewhere
quantizer.eval_func = eval_func           # evaluation function, defined elsewhere
q_model = quantizer.fit()
q_model.save(QUANTIZED_MODEL_PATH)

Finally, let’s run inference with the quantized model and do some benchmark testing.

from neural_compressor.utils.pytorch import load

int8_model = load(QUANTIZED_MODEL_PATH, model)

audio_file = final_dataset['test']['audio'][0]
target = final_dataset['test']['transcription'][0]
device = "cuda:0" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model=int8_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    device=device,
)

prediction = transcriber(
    audio_file,
    batch_size=8,
    return_timestamps=True,
)['chunks']

print("predict: ", prediction[0]['text'])
print("target: ", target)

By merging the Whisper-tiny model with the MINDS14 dataset, we shine a light on the possibilities and hurdles involved in implementing pretrained ASR models for the detection of multilingual intents. While the model demonstrates adeptness in particular scenarios, particularly with shorter audio inputs, the intricacies stemming from accent diversities and varied linguistic subtleties demand deeper investigation and enhancement. Future initiatives could leverage targeted fine-tuning approaches directed at distinct linguistic categories, fostering the development of more resilient and adaptable AI applications centered around voice interactions. This iterative process of refinement holds the key to unlocking the full potential of ASR technology across diverse linguistic landscapes.

Additional optimization possibilities include adjusting hyperparameters, extending training durations, incorporating the Intel® Extension for Transformers, further exploiting the Intel® Neural Compressor, exploring Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA), and training on the Habana Gaudi*-1 and Gaudi-2 accelerators to improve training efficiency.

For the ASR demo hosted on Hugging Face, please refer to: Demo

Reference:

  1. https://www.gradio.app/docs/audio
  2. https://huggingface.co/datasets/PolyAI/minds14
  3. Introducing Whisper (openai.com)
  4. https://github.com/zanuura/Whisper-ASR-Minds14-English/
