Title: Study Reveals ASR Models Struggle with Minority Dialects in Transcriptions

Introduction:
The effectiveness of Automatic Speech Recognition (ASR) models in transcribing English speakers with minority dialects has been called into question. A recent study by Georgia Tech and Stanford researchers compared the performance of leading ASR models for Standard American English (SAE) and three minority dialects – African American Vernacular English (AAVE), Spanglish, and Chicano English.

Lead Author:
Interactive Computing Ph.D. student Camille Harris is the lead author of a paper accepted to the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Miami. Harris recruited speakers of each dialect to read passages from a Spotify podcast dataset, then used ASR models including wav2vec 2.0, HuBERT, and Whisper to transcribe the audio.

Findings:
The study revealed that SAE was transcribed significantly more accurately than the minority dialects, and that among SAE speakers, men were transcribed more accurately than women. Minority men, particularly Black and Latino men, received the least accurate transcriptions. Harris noted that the underrepresentation of minority dialects in the training data for ASR models played a significant role in these disparities.
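Transcription accuracy in ASR evaluations is typically quantified with word error rate (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference length. The paper's exact scoring pipeline is not detailed here, so the following is only a minimal illustrative sketch of how such a metric is computed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein edit distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, scoring the hypothesis "she going to da store" against the reference "she is going to the store" involves one deletion ("is") and one substitution ("the" → "da"), giving a WER of 2/6. A dialect whose features the model has rarely seen in training accumulates more such edits, which is the kind of gap the study measured.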

Addressing Underrepresentation:
Harris highlighted the importance of inclusive training data in improving ASR model performance for minority dialects. AAVE fared best under the Whisper model, which was trained on a more inclusive dataset. She also explored the connection between the underrepresentation of minority men in technology spaces and the transcription errors these models produce.

Variables and Considerations:
Harris accounted for variables such as code-switching and regional subdialects among AAVE speakers. She found that speakers who code-switched to SAE were transcribed more accurately. She also recruited speakers from different regions to capture linguistic variation and generational differences within the data.

TikTok Barriers:
Drawing on her previous research into the user-design barriers Black content creators face on TikTok, Harris discussed how ASR tools can disadvantage minority users. These creators often had to enter captions manually because the ASR transcriptions were inaccurate, while SAE speakers benefited from the built-in captioning feature.

Future Implications:
Harris emphasized the need for ASR tool designers to be more inclusive of minority dialects, considering cultural challenges and the importance of community engagement. She suggested that collecting more minority speech data and seeking community input could help improve the accuracy and relevance of ASR models for diverse user groups.

Conclusion:
The study sheds light on the disparities in ASR model performance across minority dialects and underscores the importance of inclusive training data and community engagement in improving transcription accuracy. Efforts to address these challenges could yield more effective and equitable ASR tools for all users, and further research and collaboration with diverse communities will be essential to promoting inclusivity and accuracy in voice recognition technologies.