Audio Emotion Classification using Intel oneAPI and TFWav2Vec2Model [AI for Autism]

Usha Rengaraju
7 min read · Mar 3, 2024


Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder that affects social interaction, communication, and behavior. Individuals with ASD often have difficulty recognizing and expressing emotions, which can impact their social interactions and relationships. Recent advancements in Artificial Intelligence (AI) have opened up new possibilities for supporting individuals with ASD, including the use of AI for audio emotion classification. This technology has the potential to improve our understanding of how individuals with ASD experience and express emotions, leading to more personalized and effective interventions.


Understanding Audio Emotion Classification

Audio emotion classification involves the use of AI algorithms to analyze audio recordings and identify the emotions expressed in the speech. These algorithms can detect subtle cues in the speech, such as pitch, intensity, and rhythm, to determine the emotional content. For individuals with ASD, who may have difficulty recognizing emotions in others and expressing their own emotions, audio emotion classification can provide valuable insights into their emotional state. This technology holds immense potential to assist individuals with autism, and here’s how:

1. Decoding the Nuances of Speech:

Human emotions are often expressed not just through facial expressions, but also through subtle vocal cues like pitch, volume, and speech rhythm. AI models trained on vast audio datasets can learn to recognize these subtle variations and classify them into different emotional categories like happiness, sadness, anger, and frustration.
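
As a concrete illustration, here is a minimal sketch of extracting such low-level cues with librosa; the file path is a placeholder, and a real classifier would learn from these features rather than from hand-picked rules:

import librosa
import numpy as np

# Placeholder path; any mono speech clip (resampled to 16 kHz here) works.
signal, sr = librosa.load("speech.wav", sr=16000)

# Pitch contour (fundamental frequency) estimated with the pYIN tracker.
f0, voiced_flag, voiced_prob = librosa.pyin(
    signal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness approximated by short-time root-mean-square energy.
rms = librosa.feature.rms(y=signal)[0]

# Speech rhythm roughly reflected in the rate of onsets (syllable-like events).
onsets = librosa.onset.onset_detect(y=signal, sr=sr, units="time")

print("mean pitch (Hz):", np.nanmean(f0))
print("mean RMS energy:", rms.mean())
print("onsets per second:", len(onsets) / (len(signal) / sr))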

2. Bridging the Emotional Gap:

Imagine an AI-powered tool that, in real-time, helps individuals with autism understand the emotional undertones of a conversation. This could be through subtle visual cues on a screen, discreet notifications, or even customized feedback mechanisms. By providing insights into the emotional state of others, such tools can empower individuals with autism to participate more confidently in social interactions.

3. Supporting Social Learning:

Beyond immediate communication, audio emotion classification can also be used to develop training programs and therapeutic interventions. By analyzing recordings of past interactions and providing feedback on emotional cues, these programs can help individuals with autism learn and practice recognizing emotions in others, ultimately leading to improved social skills and self-awareness.

4. Ethical Considerations and the Road Ahead:

It’s crucial to acknowledge the ethical considerations surrounding AI-powered emotion recognition. Over-reliance on such tools and potential biases in the training data are concerns that need to be addressed responsibly. Additionally, it’s important to remember that AI is here to assist, not replace, human connection.


Benefits of Audio Emotion Classification for Individuals with ASD

  1. Enhanced Communication: By accurately identifying the emotions expressed in speech, AI can help individuals with ASD better understand the emotional content of conversations, improving their communication skills and social interactions.
  2. Emotion Regulation: AI can also help individuals with ASD learn to regulate their own emotions by providing feedback on the emotional content of their speech. This can help them develop coping strategies for managing difficult emotions.
  3. Personalized Interventions: By analyzing patterns in emotional expression over time, AI can help identify individual differences in emotional processing among individuals with ASD. This information can be used to tailor interventions to meet the specific needs of each individual.
  4. Early Detection of Emotional Distress: AI can help detect early signs of emotional distress in individuals with ASD, allowing for timely intervention and support.

Intel oneAPI

Intel oneAPI is a unified programming model that simplifies the development of applications for heterogeneous architectures. It provides a single set of tools and libraries that developers can use to target a wide range of hardware platforms. Intel has been actively optimizing popular machine learning frameworks such as TensorFlow and XGBoost to exploit the performance of Intel hardware, including Intel Xeon CPUs and accelerators such as Intel FPGAs. These optimizations aim to improve the speed and efficiency of machine learning workloads, enabling developers to train and deploy models faster and more cost-effectively.

In this blog, we use Intel oneAPI to show how efficiently models can be trained and used for inference.
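
Concretely, "using the Intel optimizations" amounts to a couple of switches. The sketch below assumes a TensorFlow build with oneDNN support and the scikit-learn-intelex package; the same patch_sklearn call appears again in the benchmarking section further down.

import os

# Toggle Intel oneDNN optimizations inside TensorFlow; must be set before import.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
import tensorflow as tf

# Patch scikit-learn so supported estimators dispatch to the oneDAL-accelerated
# implementations from the Intel Extension for Scikit-learn.
from sklearnex import patch_sklearn
patch_sklearn()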

Dataset

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a dataset of audiovisual recordings of actors performing various vocalizations, including speech and song, with different emotional expressions. The dataset was developed at Ryerson University and the University of Toronto and is designed for research in emotion recognition and affective computing.
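
RAVDESS stores its labels directly in the file names, which is what the preprocessing loop further below relies on. A small sketch of decoding a file name (the mapping follows the official RAVDESS emotion codes):

# RAVDESS file names encode their metadata, e.g. "03-01-06-01-02-01-12.wav":
# modality-vocalchannel-emotion-intensity-statement-repetition-actor.
# The third field is the emotion code, which the loop below reads via file[7:8].
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(filename: str) -> str:
    emotion_code = filename.split("-")[2]
    return EMOTIONS[emotion_code]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))  # fearful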

import os
import numpy as np
import librosa

def get_feature(file_path: str, mfcc_len: int = 39, mean_signal_length: int = 110000):
    """Load a clip, pad or crop it to a fixed length, and return its MFCC matrix."""
    signal, fs = librosa.load(file_path)
    s_len = len(signal)

    if s_len < mean_signal_length:
        # Zero-pad short clips symmetrically up to mean_signal_length samples.
        pad_len = mean_signal_length - s_len
        pad_rem = pad_len % 2
        pad_len //= 2
        signal = np.pad(signal, (pad_len, pad_len + pad_rem), 'constant', constant_values=0)
    else:
        # Centre-crop long clips down to mean_signal_length samples.
        pad_len = s_len - mean_signal_length
        pad_len //= 2
        signal = signal[pad_len:pad_len + mean_signal_length]
    mfcc = librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=mfcc_len)
    feature = mfcc.T  # shape: (frames, mfcc_len)
    return feature
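
As a quick sanity check (the path below is just a placeholder for any RAVDESS clip), get_feature turns a recording of arbitrary length into a fixed-size MFCC matrix:

# 110000 samples at librosa's default 22050 Hz is roughly 5 seconds of audio,
# which yields on the order of (215, 39) MFCC frames.
mfcc = get_feature("/ravdess/Actor_01/03-01-01-01-01-01-01.wav")
print(mfcc.shape)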

from tqdm import tqdm

lst = []
path = '/ravdess'

for subdir, dirs, files in tqdm(os.walk(path)):
    for file in files:
        try:
            X, sample_rate = librosa.load(os.path.join(subdir, file),
                                          res_type='kaiser_fast')
            # The third field of a RAVDESS file name is the emotion code (01-08);
            # file[7:8] is its second digit, shifted to a 0-based label.
            label = int(file[7:8]) - 1
            lst.append((X[:64000], label))
        except ValueError as err:
            print(err)
            continue

X, y = zip(*lst)
X, y = np.asarray(X), np.asarray(y)

Model Building

For this demo, we use the Wav2Vec 2.0 model for feature extraction and an XGBoost model for classification.

Developed by Facebook AI, Wav2Vec 2.0 is a deep learning model designed to analyze and understand speech audio. Unlike traditional speech recognition models that focus on converting speech to text, Wav2Vec 2.0 extracts meaningful representations of the spoken content: rather than transcribing words, it captures underlying characteristics of the speech, such as pitch, rhythm, and intonation, and encodes them into a numerical format.
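
To make that "numerical format" concrete, the minimal sketch below feeds a dummy 2-second, 16 kHz clip through the base checkpoint; from_pt=True converts the PyTorch weights, just as we do later in this post.

import numpy as np
from transformers import AutoFeatureExtractor, TFWav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
backbone = TFWav2Vec2Model.from_pretrained("facebook/wav2vec2-base", from_pt=True)

# Two seconds of silence stand in for a real recording.
dummy_audio = np.zeros(32000, dtype=np.float32)
features = extractor(dummy_audio, sampling_rate=16000, return_tensors="tf")

hidden_states = backbone(features["input_values"]).last_hidden_state
print(hidden_states.shape)  # (1, 99, 768): one 768-dim vector per ~20 ms frame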

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from transformers import AutoFeatureExtractor, TFWav2Vec2Model

MAX_DURATION = 2
# Sampling rate is the number of samples of audio recorded every second
SAMPLING_RATE = 16000
BATCH_SIZE = 2 # Batch-size for training and evaluating our model.
NUM_CLASSES = 8 # Number of emotion classes in RAVDESS.
HIDDEN_DIM = 768 # Dimension of our model output (768 in case of Wav2Vec 2.0 - Base).
MAX_SEQ_LENGTH = MAX_DURATION * SAMPLING_RATE # Maximum length of the input audio file.
# Wav2Vec 2.0 results in an output frequency with a stride of about 20ms.
MAX_FRAMES = 99
MAX_EPOCHS = 5 # Maximum number of training epochs.

MODEL_CHECKPOINT = "facebook/wav2vec2-base"

RAVDESS_CLASS_LABELS = ("angry", "calm", "disgust", "fear", "happy", "neutral","sad","surprise")

labels = RAVDESS_CLASS_LABELS
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label


feature_extractor = AutoFeatureExtractor.from_pretrained(
    MODEL_CHECKPOINT, return_attention_mask=True
)


audio_arrays = X
inputs = feature_extractor(
    audio_arrays,
    sampling_rate=feature_extractor.sampling_rate,
    max_length=MAX_SEQ_LENGTH,
    truncation=True,
    padding=True,
)



def mean_pool(hidden_states, feature_lengths):
    # Mask marking the valid (non-padded) frames of each clip.
    attention_mask = tf.sequence_mask(
        feature_lengths, maxlen=MAX_FRAMES, dtype=tf.dtypes.int64
    )
    padding_mask = tf.cast(
        tf.reverse(tf.cumsum(tf.reverse(attention_mask, [-1]), -1), [-1]),
        dtype=tf.dtypes.bool,
    )
    # Zero out the padded frames, then average the remaining frames per clip.
    hidden_states = tf.where(
        tf.broadcast_to(
            tf.expand_dims(~padding_mask, -1), (BATCH_SIZE, MAX_FRAMES, HIDDEN_DIM)
        ),
        0.0,
        hidden_states,
    )
    pooled_state = tf.math.reduce_sum(hidden_states, axis=1) / tf.reshape(
        tf.math.reduce_sum(tf.cast(padding_mask, dtype=tf.dtypes.float32), axis=1),
        [-1, 1],
    )
    return pooled_state


class TFWav2Vec2ForAudioClassification(keras.Model):
    """Wav2Vec 2.0 backbone that returns a pooled utterance-level embedding."""

    def __init__(self, model_checkpoint):
        super().__init__()
        # Instantiate the Wav2Vec 2.0 model without the classification head.
        self.wav2vec2 = TFWav2Vec2Model.from_pretrained(
            model_checkpoint, apply_spec_augment=False, from_pt=True
        )
        self.pooling = layers.GlobalAveragePooling1D()
        self.flat = layers.Flatten()
        self.intermediate_layer_dropout = layers.Dropout(0.5)

    def call(self, inputs):
        hidden_states = self.wav2vec2(inputs[0])[0]
        if tf.is_tensor(inputs[1]):
            # When an attention mask is supplied, average only over valid frames.
            audio_lengths = tf.cumsum(inputs[1], -1)[:, -1]
            feature_lengths = self.wav2vec2.wav2vec2._get_feat_extract_output_lengths(
                audio_lengths
            )
            pooled_state = mean_pool(hidden_states, feature_lengths)
        else:
            pooled_state = self.pooling(hidden_states)

        intermediate_state = self.flat(self.intermediate_layer_dropout(pooled_state))

        return intermediate_state

model_inputs = [
    keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="float32"),
    keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="int32"),
]

wav2vec2_model = TFWav2Vec2ForAudioClassification(MODEL_CHECKPOINT)
wav = wav2vec2_model(model_inputs)
output = layers.Dense(NUM_CLASSES, activation="softmax")(wav)
# Model
model = keras.Model(model_inputs, output)
# Loss (labels are integer class ids, so the sparse variant is used)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
# Optimizer
optimizer = keras.optimizers.Adam(learning_rate=1e-5)
# Compile and return
model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])

# Stack the padded waveforms and their attention masks for training.
tx = np.array([inputs["input_values"], inputs["attention_mask"]])

model.fit(
    [tx[0], tx[1]],
    y,
    batch_size=BATCH_SIZE,
    epochs=1,
)

After fine-tuning the Wav2Vec 2.0 model on our dataset, we use it as a feature extractor and train an XGBoost classifier on the resulting embeddings.

First, we train and run inference on the XGBoost model with the Intel optimizations enabled (via the Intel Extension for Scikit-learn).

import warnings
from time import perf_counter

import xgboost as xgb
from sklearnex import patch_sklearn, unpatch_sklearn

# Enable the Intel Extension for Scikit-learn optimizations.
patch_sklearn()

# Extract utterance-level embeddings from the fine-tuned Wav2Vec 2.0 backbone.
feats = wav2vec2_model.predict([tx[0], tx[1]], batch_size=2)

xgb_params = {
    'objective': 'multi:softprob',  # multi-class objective for the 8 RAVDESS emotions
    'predictor': 'cpu_predictor',
    'disable_default_eval_metric': 'true',
}

# Train the model
warnings.simplefilter(action='ignore', category=UserWarning)
t1_start = perf_counter()  # Time the fit call
model_xgb = xgb.XGBClassifier(**xgb_params)
model_xgb.fit(feats, y)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to fit.")

t1_start = perf_counter()  # Time the predict call
model_xgb.predict(feats)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to predict.")

Next, we repeat the same training and inference without the Intel optimizations to compare the timings.

# Revert to stock scikit-learn for comparison.
unpatch_sklearn()

xgb_params = {
    'objective': 'multi:softprob',
    'predictor': 'cpu_predictor',
    'disable_default_eval_metric': 'true',
}

# Train the model
warnings.simplefilter(action='ignore', category=UserWarning)
t1_start = perf_counter()  # Time the fit call
model_xgb = xgb.XGBClassifier(**xgb_params)
model_xgb.fit(feats, y)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to fit.")

t1_start = perf_counter()  # Time the predict call
model_xgb.predict(feats)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to predict.")

Demo

A Hugging Face demo is also available.

Challenges and Future Directions

While audio emotion classification shows promise for supporting individuals with ASD, several challenges remain. These include the need for more diverse and representative datasets, as well as the development of algorithms that can accurately detect subtle emotional cues in speech. Future research in this area will focus on addressing these challenges and further enhancing the effectiveness of AI-based approaches for supporting individuals with ASD.

Conclusion

Audio emotion classification has the potential to significantly impact the lives of individuals with ASD by improving their communication skills, emotion regulation, and overall quality of life. By leveraging the power of AI, we can gain valuable insights into the emotional experiences of individuals with ASD and develop more personalized and effective interventions to support them.
