This paper presents parallel and monolingual corpora for scientific machine translation in three language pairs (Spanish-English, French-English, and Portuguese-English) and four domains: Cancer Research, Energy Research, Neuroscience, and Transportation. The corpora are used to fine-tune neural machine translation systems so that they better handle the specialized vocabulary and syntax of scientific text.
MUSCAT is a new multilingual benchmark dataset of scientific conversations for evaluating automatic speech recognition (ASR) systems on challenging scenarios, including code-switching, domain-specific vocabulary, and mixed-language input. The dataset consists of bilingual discussions of scientific papers in which the speakers use different languages. Results show that current state-of-the-art systems struggle with these multilingual challenges.