While researching neurodiverse communication challenges, I learned that many neurodiverse individuals find high-tempo speech difficult to process and understand. Some people with autism have difficulty processing rapid speech because of the increased cognitive load of interpreting words, tone, and intent. Similarly, fast-paced speech can be overwhelming for individuals with ADHD, who often experience challenges with sustained attention. Slower speech can help them focus and follow along more effectively.
Today, I will share what I learned about which technologies are best suited for our LinguaLink app to reduce the tempo, or pace, of speech. Understanding these techniques is fundamental to my goal of integrating tempo reduction into LinguaLink, thereby easing communication for neurodiverse individuals.
Currently, two primary approaches are used for this purpose. The first is Time-Scale Modification (TSM), the most common family of digital signal processing techniques for changing speech tempo while keeping the pitch intact, ensuring minimal tonal distortion. TSM slows speech by dividing the audio signal into short overlapping frames, re-spacing those frames further apart at the output, and overlap-adding them to produce the final, slower signal. There are three standard TSM algorithms (short code sketches for each follow the list):
- Phase Vocoder: This is a common algorithm used in TSM. It analyzes the signal in short frames using the Fast Fourier Transform (FFT), modifies the phase information to achieve the time scaling (i.e., the reduction of speech tempo), and then reconstructs the signal with an inverse FFT.
- PSOLA (Pitch Synchronous Overlap and Add): preserves the audio pitch while reducing tempo by aligning frame boundaries with pitch periods. It is typically used when high-quality speech modification is needed and audio recordings are already available.
- WSOLA (Waveform Similarity Overlap and Add): selects each frame using a waveform similarity metric before overlapping and adding, which preserves the original signal characteristics during time scaling.
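
To get a feel for these without writing the FFT plumbing by hand, here are a few Python sketches. First, the phase vocoder: librosa's time_stretch effect is built on its phase vocoder, so a couple of lines are enough to try it (the file name speech.wav is just a placeholder):

```python
import librosa
import soundfile as sf

# Load a mono recording at its native sample rate.
y, sr = librosa.load("speech.wav", sr=None, mono=True)

# time_stretch is phase-vocoder based: STFT -> rescale the per-bin
# phase advances -> inverse STFT. rate < 1.0 slows the speech down
# while leaving the pitch unchanged.
y_slow = librosa.effects.time_stretch(y, rate=0.75)  # 75% of original tempo

sf.write("speech_slow.wav", y_slow, sr)
```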
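For PSOLA, the Praat program implements a pitch-synchronous overlap-add behind its "Lengthen (overlap-add)" command, which can be scripted from Python through the Parselmouth library. Treat this as a sketch based on my reading of the docs; the pitch floor/ceiling values (75/600 Hz) and the file names are illustrative assumptions:

```python
import parselmouth
from parselmouth.praat import call

# Load the recording into a Praat Sound object ("speech.wav" is a placeholder).
sound = parselmouth.Sound("speech.wav")

# Praat's "Lengthen (overlap-add)" performs PSOLA-style time stretching.
# Arguments: pitch floor (Hz), pitch ceiling (Hz), duration factor.
# A factor of 1.33 makes the speech roughly 33% slower at the same pitch.
slowed = call(sound, "Lengthen (overlap-add)", 75, 600, 1.33)

slowed.save("speech_slow_psola.wav", "WAV")
```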
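Finally, to make the frame-splitting and overlap-add mechanics concrete, here is a simplified, self-contained WSOLA-style stretch in NumPy. It is an illustrative sketch rather than production code: the frame size, the coarse search step, and the dot-product similarity measure are all simplifying assumptions.

```python
import numpy as np

def wsola_stretch(x, alpha, frame_len=1024, hop=256, search=256):
    """Stretch a mono signal x by factor alpha (alpha > 1 slows it down)
    using a simplified WSOLA: each frame is taken near its nominal
    analysis position but shifted to best match the expected
    continuation of the previous frame, then overlap-added."""
    window = np.hanning(frame_len)
    n_out = int(len(x) * alpha)
    y = np.zeros(n_out + frame_len)
    wsum = np.zeros(n_out + frame_len)  # window sum, for normalization

    prev_tail = None  # samples that should naturally follow the last frame
    for out_pos in range(0, n_out - frame_len, hop):
        # Nominal analysis position corresponding to this output position.
        nominal = min(int(out_pos / alpha), len(x) - frame_len)
        pos = nominal
        if prev_tail is not None:
            # Search near the nominal position for the candidate frame
            # whose opening samples best match the expected continuation.
            lo = max(0, nominal - search)
            hi = min(len(x) - frame_len, nominal + search)
            scores = [(np.dot(prev_tail, x[c:c + hop]), c)
                      for c in range(lo, hi + 1, 16)]  # coarse step, for speed
            pos = max(scores)[1]
        y[out_pos:out_pos + frame_len] += x[pos:pos + frame_len] * window
        wsum[out_pos:out_pos + frame_len] += window
        prev_tail = x[pos + hop:pos + 2 * hop]
    wsum[wsum < 1e-8] = 1.0  # avoid division by zero at the edges
    return (y / wsum)[:n_out]
```

With a 16 kHz mono array x, wsola_stretch(x, alpha=1.33) gives speech roughly 33% slower at the same pitch; dedicated TSM libraries refine the windowing and similarity search considerably.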
The second approach is more recent: with advances in deep learning, machine learning models are used to analyze and recreate slowed speech without sacrificing tonal characteristics. Modern text-to-speech (TTS) applications incorporate deep learning models to adjust tempo while preserving emotional intonation and fluency. The two deep learning techniques that stood out during my research were Sequence-to-Sequence (Seq2Seq) models and Generative Adversarial Networks (GANs); a small TTS example follows below.
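I have not settled on a model here, but as one concrete illustration of rate control in a neural TTS stack: Google Cloud Text-to-Speech exposes a speaking_rate parameter on its WaveNet voices. The voice name and rate below are illustrative assumptions, and the call requires Google Cloud credentials:

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(
    text="Welcome to LinguaLink. Let's take this at a comfortable pace."
)
# A WaveNet (deep learning) voice; the specific name is illustrative.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US", name="en-US-Wavenet-D"
)
# speaking_rate < 1.0 slows the synthesized speech; intonation is
# produced by the neural model rather than by post-hoc signal stretching.
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    speaking_rate=0.8,
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
with open("lingualink_slow.wav", "wb") as out:
    out.write(response.audio_content)
```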
Android has a Speech Recognizer API, and several open-source libraries can also be used to change the speech rate on that platform. I have yet to test their quality, and I won't really need to work directly on the algorithms listed above to get this functionality up and running in the LinguaLink app, but understanding the basics, as always, was fun!!
Until next time!!
References:
https://kennethrobersonphd.com/what-are-the-common-challenges-of-neurodiversity/
https://drroseann.com/neurotypical-vs-neurodivergent-communication-embracing-diversity-in-dialogue/
http://blogs.zynaptiq.com/bernsee/time-pitch-overview/
https://en.wikipedia.org/wiki/Audio_time_stretching_and_pitch_scaling
https://developex.com/blog/overview-of-speech-recognition-apis-for-android-platform/