This paper presents GARAGEM, a domain-specific ASR dataset for Brazilian Portuguese focused on automotive repair, combining real speech collected from online sources with synthetic speech generated from curated technical terminology. Testing with Whisper, Wav2vec 2.0, and Conformer models demonstrated that synthetic data improves performance when combined with real recordings, reducing Word Error Rate and Character Error Rate while enhancing recognition of specialized terminology. The work proposes a reproducible methodology for leveraging synthetic speech as an effective data augmentation approach in specialized, low-resource language scenarios.
Research
- Daniel R. da Silva, Maria Eduarda S. Borba, Állan C. P. Silva, Maria Carolina S. Barreto, Arthur F. de Morais, Paulo V. dos Santos, Guilherme C. Dutra, Sávio S. T. de Oliveira, Anderson da S. Soares · PROPOR 2026 – 17th International Conference on Computational Processing of Portuguese · 2026