Bangla TTS Performance Evaluation: A Benchmark Study on Synthesized Speech Quality and Intelligibility
DOI: https://doi.org/10.3329/dujs.v74i1.84122

Keywords: Text-to-Speech (TTS), Speech Synthesis, Benchmarking, Objective Evaluation Metrics, Subjective Evaluation Metrics

Abstract
Bangla Text-to-Speech (TTS) systems have seen significant advancements in recent years, yet comprehensive benchmarking of their performance remains limited. This study establishes a robust evaluation framework to compare different Bangla TTS models, including Tacotron2, FastSpeech2, VITS, and Grad-TTS. The benchmarking approach integrates both objective and subjective assessment methodologies. Objective evaluation employs signal-processing metrics such as Mel Cepstral Distortion (MCD), Mel-Spectrogram Mean Squared Error (Mel-MSE), Phoneme Error Rate (PER), Word Error Rate (WER), Signal-to-Noise Ratio (SNR), and Real-Time Factor (RTF). Subjective evaluation involves human perceptual tests, such as a Mean Opinion Score (MOS) test in which native Bangla speakers rate speech quality and intelligibility. The experimental setup ensures a fair comparison by using a standardized dataset, uniform computational conditions, and diverse sentence structures. Results demonstrate the relative strengths and weaknesses of the models, highlighting the need for improved phonetic accuracy and naturalness in Bangla TTS synthesis. This research provides critical insights for advancing Bangla TTS systems and aligning them with state-of-the-art English TTS models.
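As a concrete illustration of one of the objective metrics named above, the sketch below computes an MCD-style score between a reference recording and a synthesized utterance. It is not drawn from the paper: the choice of librosa MFCCs as a stand-in for true mel-cepstral coefficients, the DTW alignment, and all function and file names are assumptions for illustration, so the resulting values are not directly comparable to published SPTK/WORLD-based MCD figures.

```python
import numpy as np
import librosa


def mel_cepstral_distortion(ref_path, syn_path, sr=22050, n_mfcc=13):
    """Approximate MCD (dB) between a reference and a synthesized utterance.

    librosa MFCCs stand in for true mel-cepstral coefficients here, so this
    illustrates the formula rather than reproducing a published pipeline.
    """
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)

    # Frame-level cepstra; drop c0 (energy), as is conventional for MCD.
    ref_c = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    syn_c = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]

    # Align the two frame sequences with dynamic time warping.
    _, wp = librosa.sequence.dtw(X=ref_c, Y=syn_c)
    diff = ref_c[:, wp[:, 0]] - syn_c[:, wp[:, 1]]

    # Standard scaling: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    # averaged over the aligned frame pairs.
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * (diff ** 2).sum(axis=0))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    # Hypothetical file paths for a ground-truth recording and TTS output.
    print(mel_cepstral_distortion("reference.wav", "synthesized.wav"))
```

A lower score indicates synthesized cepstra closer to the reference; in practice such scores are averaged over a held-out test set for each TTS model being benchmarked.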
Dhaka Univ. J. Sci. 74(1): 10-16, 2026 (January)