Audio Sentiment Analysis is the process of extracting sentiment information from spoken audio using Computational Linguistics (speech processing) and Machine Learning. In this process, we analyze audio attributes such as pitch, intonation, prosody, and volume to identify whether the tone conveys sarcasm, frustration, anger, joy, or sadness.
The following steps are involved in analyzing sentiment in an audio recording.
First, pre-processing is performed on the audio signal, where the audio is transformed into a digital signal. These signals are analyzed for their frequency content over time, and very low-amplitude segments are removed as silence. During this step, background noise is also reduced: spectral subtraction estimates and removes the noise spectrum, and filtering techniques such as a high-pass filter suppress low-frequency noise.
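A minimal sketch of this pre-processing step, assuming the librosa and scipy libraries, a hypothetical file name input.wav, and illustrative values for the silence threshold, filter cutoff, and noise-estimation window:

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfilt

# Load the recording as a digital signal (file name and sample rate are assumptions).
signal, sr = librosa.load("input.wav", sr=16000)

# Trim very low-amplitude leading/trailing regions as silence (30 dB threshold assumed).
trimmed, _ = librosa.effects.trim(signal, top_db=30)

# High-pass filter to suppress low-frequency background noise (80 Hz cutoff assumed).
sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
filtered = sosfilt(sos, trimmed)

# Simple spectral subtraction: estimate the noise spectrum from the first ~0.5 s
# and subtract it from the magnitude spectrogram before reconstructing the signal.
stft = librosa.stft(filtered)
noise_frames = stft[:, : int(0.5 * sr / 512)]
noise_profile = np.mean(np.abs(noise_frames), axis=1, keepdims=True)
clean_mag = np.maximum(np.abs(stft) - noise_profile, 0.0)
clean = librosa.istft(clean_mag * np.exp(1j * np.angle(stft)))
```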
Second, feature extraction is done to convert the speech waveform into a parametric representation. Feature extraction is necessary because the raw acoustic (audio) signal is cumbersome to analyze directly, and much of its information is irrelevant to detecting emotions. The feature extraction step produces a multidimensional feature vector for a given speech signal. Per the latest research trends, the most widely used and generally considered “best” feature extraction technique for audio sentiment detection is Mel-frequency cepstral coefficients (MFCC), as it effectively captures the spectral characteristics of speech signals. These features become the input for generating model predictions.
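A short sketch of MFCC extraction with librosa, continuing from the pre-processed signal in the previous sketch; the 13 coefficients and the mean/standard-deviation pooling over time are common choices assumed here, not requirements:

```python
import librosa
import numpy as np

# `clean` and `sr` come from the pre-processing sketch above.
# Extract 13 MFCCs per frame; result has shape (13, n_frames).
mfcc = librosa.feature.mfcc(y=clean, sr=sr, n_mfcc=13)

# Summarize the frame-level coefficients into a fixed-length feature vector
# by pooling over time (mean and standard deviation), giving a 26-dim vector.
feature_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```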
Third, Feature Engineering is performed, where we normalize the extracted features in cases where the audio samples come from varying recording conditions. Some features can also be discarded at this stage to reduce the dimensionality of the feature space.
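One possible feature-engineering pass using scikit-learn, assuming a hypothetical list of per-clip feature vectors like the one built above; standardization compensates for varying recording conditions, and PCA reduces the dimensionality:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# all_feature_vectors: hypothetical list of per-clip feature vectors from the previous step.
X = np.vstack(all_feature_vectors)  # shape: (n_clips, n_features)

# Normalize features so clips recorded under different conditions are comparable.
X_scaled = StandardScaler().fit_transform(X)

# Discard low-variance directions to reduce the dimensionality of the feature space,
# keeping enough components to explain 95% of the variance (an illustrative choice).
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
```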
Lastly, Sentiment Classification is performed. Both classical and Deep Learning approaches can be used to build models that classify the sentiment associated with audio. Classical approaches include algorithms such as Hidden Markov Models (HMMs) and Random Forests. Deep Learning techniques use Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based models. Per the latest research trends, Deep Learning techniques are considered a better choice than classical ML techniques because subtle nuances in tone and inflection affect sentiment, and deep learning models can automatically learn such intricate features from the audio without manual feature engineering.
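As an illustration of the classical route, here is a minimal Random Forest baseline with scikit-learn, assuming the reduced feature matrix X_reduced from the previous sketch and a hypothetical label array y (one sentiment label per clip):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# y: hypothetical array of sentiment labels, e.g. "anger", "joy", "sadness".
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

# Train a Random Forest on the engineered features and report per-class metrics.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

A deep-learning counterpart (for example, a CNN or RNN) would instead consume the frame-level MFCC sequence or the raw waveform and learn its own feature representation end to end.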
Audio Sentiment Analysis is essential in human-computer interaction as we progress toward creating emotionally intelligent systems. My research on Audio Sentiment Detection did not surface applications aimed at bridging communication gaps for the Neurodivergent community as a popular area of study, which suggests this is an area where we should invest our efforts to make lives better.