Research | Arthur Morais

This paper presents GARAGEM, a domain-specific automatic speech recognition (ASR) dataset for Brazilian Portuguese focused on automotive repair. The dataset combines real speech collected from online sources with synthetic speech generated from curated technical terminology. Experiments with Whisper, Wav2vec 2.0, and Conformer models show that synthetic data improves performance when combined with real recordings: both Word Error Rate (WER) and Character Error Rate (CER) decrease, and recognition of specialized terminology improves. The work proposes a reproducible methodology for using synthetic speech as an effective data augmentation approach in specialized, low-resource language scenarios.
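
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Word Error Rate and Character Error Rate are conventionally computed: the Levenshtein edit distance between a reference transcript and a model hypothesis, divided by the reference length in words or characters. The function names and the example sentence are illustrative assumptions, not taken from the paper, and the sketch applies no text normalization (casing, punctuation), which an actual evaluation pipeline may well do.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example (not from the GARAGEM dataset): one word misrecognized.
reference = "trocar a pastilha de freio"
hypothesis = "trocar a pastilha do freio"
print(wer(reference, hypothesis))  # 1 substitution over 5 words -> 0.2
print(cer(reference, hypothesis))  # 1 substitution over 26 characters
```

A single wrong character in a technical term counts as a full word error under WER but only a small fraction under CER, which is one reason the two metrics are usually reported together for terminology-heavy domains.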