Diagnostic Interview Corpus - Translation

The Diagnostic Interview Corpus is a multilingual dataset of 12,754 French medical consultation sentences (questions and instructions) with translations into 12 languages and associated UMLS-based semantic glosses. It supports research on low-resource medical machine translation, semantic representation, and pictograph generation.

Languages
- Source: French
- Targets (in translations.csv): Albanian, Modern Standard Arabic, Tunisian Arabic, Moroccan Arabic, Algerian Arabic, Dari (Afghan Persian), Farsi (Iranian Persian), Russian, English, Spanish, Tigrinya, Ukrainian
- Semantic gloss (in translations.csv): French sentences aligned with UMLS glosses (concept sequences + functional tokens)
- Paraphrases (in paraphrases.csv): French paraphrases aligned with the corresponding French source sentences, generated through a grammar-based approach to ensure controlled syntactic variation

Domains and registers
- Medical consultations
- Questions and instructions (e.g., symptom checks, treatment directives)
- Categories by body region (e.g., head, chest, abdomen)

Features
- Parallel multilingual translations created and adapted with clinical experts
- Semantic gloss layer (UMLS CUIs + functional tokens) for pictograph generation
- Patient-centered simplifications and cultural adaptations to improve comprehension

Example
French: Avez-vous des nausées ou des vomissements ?
English: Do you have nausea or vomiting?
UMLS gloss: You | Nausea | or – article | Vomiting | Question

Intended Use
- Low-resource multilingual MT research
- Semantic representation learning (UMLS-based)
- Pictograph translation systems for patients with limited health literacy
- Evaluation of medical-domain MT beyond surface-level accuracy

Acknowledgements
This corpus was developed in the context of the BabelDr and PictoDr projects at the University of Geneva in collaboration with Geneva University Hospitals. This work is part of the PROPICTO project, funded by the Swiss National Science Foundation (N°197864) and the French National Research Agency (ANR-20-CE93-0005). This project also received funding from the "Fondation Privée des Hôpitaux Universitaires de Genève".
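To illustrate how the semantic gloss layer might be consumed, the following minimal sketch splits a gloss string such as the one in the Example into its individual tokens. It assumes that glosses are serialized as pipe-separated tokens exactly as printed in the Example; the actual column layout and serialization in translations.csv are not specified here, and the function name parse_gloss is hypothetical.

```python
def parse_gloss(gloss: str) -> list[str]:
    """Split a pipe-delimited gloss string into trimmed, non-empty tokens.

    Assumption: tokens are separated by "|" as in the Example in this
    description; the real file format may differ.
    """
    return [token.strip() for token in gloss.split("|") if token.strip()]


# Gloss string copied from the Example in this description.
example = "You | Nausea | or – article | Vomiting | Question"
print(parse_gloss(example))
# ['You', 'Nausea', 'or – article', 'Vomiting', 'Question']
```

Each resulting token corresponds to either a concept or a functional token, which is the granularity a pictograph generation step would presumably operate on.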

    Organizational unit
    Propicto Project
    Type
    Dataset
    DOI
    License
    Creative Commons Attribution 4.0 International
    Keywords
    low resource machine translation, medical dialogues, medical domain, medical questionnaires, semantic gloss, UMLS, Modern Standard Arabic, Albanian, Moroccan Arabic, Tunisian Arabic, Algerian Arabic, Dari, Farsi, Russian, English, Spanish, Tigrinya, Ukrainian, French
    Publication date
    25/09/2025
    Retention date
    23/09/2035
    Access level
    Public
    Sensitivity
    Blue
    Contract on the use of data
    License
Contributors
  • Bouillon, Pierrette
  • Gerlach, Johanna
  • Mutal, Jonathan David
  • Spechbach, Hervé

Similar archives

  • MeDiCo | A Medical Discourse Corpus in French (Propicto Project, 2021, Public, 56.2 MB)
  • UMLS Concepts to Pictographs (Propicto Project, 2025, Public, 33.4 MB)
  • Diagnostic Interview Transcripts (French) (Propicto Project, 2025, Public, 315.7 KB)
All rights reserved by DLCM and the University of Geneva