Diagnostic Interview Corpus - Translation
The Diagnostic Interview Corpus is a multilingual dataset of 12,754 French medical consultation sentences (questions and instructions) with translations into 12 languages and associated UMLS-based semantic glosses. It supports research on low-resource medical machine translation, semantic representation, and pictograph generation.

Languages
- Source: French
- Targets (in translations.csv): Albanian, Modern Standard Arabic, Tunisian Arabic, Moroccan Arabic, Algerian Arabic, Dari (Afghan Persian), Farsi (Iranian Persian), Russian, English, Spanish, Tigrinya, Ukrainian
- Semantic gloss (in translations.csv): French sentences aligned with UMLS glosses (concept sequences + functional tokens)
- Paraphrases (in paraphrases.csv): French paraphrases aligned with the corresponding French source sentences, generated through a grammar-based approach to ensure controlled syntactic variation

Domains and registers
- Medical consultations
- Questions and instructions (e.g., symptom checks, treatment directives)
- Categories by body region (e.g., head, chest, abdomen)

Features
- Parallel multilingual translations created and adapted with clinical experts
- Semantic gloss layer (UMLS CUIs + functional tokens) for pictograph generation
- Patient-centered simplifications and cultural adaptations to improve comprehension

Example
French: Avez-vous des nausées ou des vomissements ?
English: Do you have nausea or vomiting?
UMLS gloss: You | Nausea | or – article | Vomiting | Question

Intended Use
- Low-resource multilingual MT research
- Semantic representation learning (UMLS-based)
- Pictograph translation systems for patients with limited health literacy
- Evaluation of medical-domain MT beyond surface-level accuracy

Acknowledgements
This corpus was developed in the context of the BabelDr and PictoDr projects at the University of Geneva in collaboration with Geneva University Hospitals. This work is part of the PROPICTO project, funded by the Swiss National Science Foundation (N°197864) and the French National Research Agency (ANR-20-CE93-0005). This project also received funding from the "Fondation Privée des Hôpitaux Universitaires de Genève".
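Usage sketch
The UMLS gloss in the example above separates its units with vertical bars. The snippet below is a minimal sketch of splitting such a string into segments, for instance before mapping each segment to a pictograph. It assumes the bar-delimited layout seen in that single example; the actual column layout of translations.csv is not documented on this page, and the function name `split_gloss` is hypothetical.

```python
def split_gloss(gloss: str) -> list[str]:
    """Split a bar-delimited gloss string into trimmed, non-empty segments.

    Assumption: segments are separated by "|" as in the example on this
    page. Each segment (a concept label or a functional token) is kept
    as-is and is not interpreted further.
    """
    return [segment.strip() for segment in gloss.split("|") if segment.strip()]


# Gloss string copied from the example on this page.
example = "You | Nausea | or – article | Vomiting | Question"
segments = split_gloss(example)
print(segments)
```

Keeping the segments as plain strings leaves any interpretation of concepts versus functional tokens to downstream code, since that distinction is not specified here.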
- Organizational unit: Propicto Project
- Type: Dataset
- DOI
- License: Creative Commons Attribution 4.0 International
- Keywords: low resource machine translation, medical dialogues, medical domain, medical questionnaires, semantic gloss, UMLS, Modern Standard Arabic, Albanian, Moroccan Arabic, Tunisian Arabic, Algerian Arabic, Dari, Farsi, Russian, English, Spanish, Tigrinya, Ukrainian, French
Contributors
- Bouillon, Pierrette
- Gerlach, Johanna
- Mutal, Jonathan David
- Spechbach, Hervé

