Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases
BACKGROUND: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases has been lacking.
BACKGROUND: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases has been lacking.
METHODS: We created 4917 clinical vignettes from structured data captured as Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o (version gpt-4o-2024-08-06) and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis from a zero-shot prompt. An ontology-based approach using the Mondo disease ontology was used to map synonyms and disease subtypes to clinical diagnoses in order to automate the evaluation of LLM responses.
FINDINGS: For English, GPT-4o placed the correct diagnosis at the first rank in 19.9% of cases and within the top three ranks in 27.0%. For the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in 16.9% to 20.6% of cases and within the top three ranks in 25.4% to 28.6%. The Meditron3 model placed the correct diagnosis within the first three ranks in 20.9% of cases in English and in 19.9% to 24.0% of cases for the other nine languages.
INTERPRETATION: The differential diagnostic performance of LLMs across a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested, suggesting that the diagnostic utility of LLMs may extend to non-English clinical settings.
FUNDING: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805, and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R., and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC02-05CH11231).
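To make the evaluation pipeline described in METHODS concrete, the sketch below illustrates, under stated assumptions, how observed HPO terms from a GA4GH phenopacket might be rendered into a zero-shot prompt and how a ranked differential could be scored against the known diagnosis. It is a minimal illustration, not the authors' published code: the template text, the `hpo_labels` lookup, and the `equivalents` map (standing in for the Mondo-based mapping of synonyms and disease subtypes) are all hypothetical.

```python
# Minimal sketch (assumed names, not the study's actual pipeline): build a
# zero-shot prompt from a phenopacket's observed HPO terms and score a ranked
# differential diagnosis against the correct diagnosis using a synonym map.
import json
from typing import Dict, List, Set

# Hypothetical English template; the study used analogous language-specific
# templates for all ten languages.
TEMPLATE_EN = (
    "I am running an experiment on differential diagnosis. "
    "The patient presents with the following findings: {findings}. "
    "Please provide a ranked list of candidate diagnoses."
)

def phenopacket_to_prompt(path: str, hpo_labels: Dict[str, str]) -> str:
    """Render the observed HPO terms of a phenopacket into a zero-shot prompt.

    `hpo_labels` maps HPO IDs to labels in the target language (here English);
    excluded (negated) features are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        pkt = json.load(fh)
    labels = [
        hpo_labels.get(pf["type"]["id"], pf["type"]["label"])
        for pf in pkt.get("phenotypicFeatures", [])
        if not pf.get("excluded", False)
    ]
    return TEMPLATE_EN.format(findings="; ".join(labels))

def top_k_hit(ranked: List[str], correct: str,
              equivalents: Dict[str, Set[str]], k: int = 3) -> bool:
    """True if any of the first k predictions maps to the correct diagnosis.

    `equivalents` stands in for the ontology-based grouping (e.g., via Mondo)
    of synonyms and disease subtypes under the clinical diagnosis.
    """
    accepted = equivalents.get(correct, {correct})
    return any(pred in accepted for pred in ranked[:k])

if __name__ == "__main__":
    # Toy example with made-up labels: the correct diagnosis appears at rank 2
    # under a synonym, so the top-3 check succeeds.
    eq = {"Marfan syndrome": {"Marfan syndrome", "MFS"}}
    preds = ["Loeys-Dietz syndrome", "MFS", "Ehlers-Danlos syndrome"]
    print(top_k_hit(preds, "Marfan syndrome", eq, k=3))  # True
```

In this framing, reported metrics such as "correct diagnosis within the top three ranks" correspond to the fraction of cases for which `top_k_hit(..., k=3)` returns True.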
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- US Department of Energy; USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 3014511
- Journal Information:
- EBioMedicine, Vol. 121
- Country of Publication:
- United States
- Language:
- English