Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation
Abstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain.
Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers.
Design: This expert-validated study evaluated six configurations of three top LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer.
Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations.
Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder (56.7% optimal, 93% optimal/acceptable), penile (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously given recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.

Language: English
DOI: 10.1177/17562872261436905
Year: 2026
Published in: Therapeutic Advances in Urology 18 (2026)
ISSN: 1756-2872

Type and form: Article (Published version)
Area (Department): Área Tecnología Electrónica (Dpto. Ingeniería Electrón.Com.)
Area (Department): Área Mec.Med.Cont. y Teor.Est. (Dpto. Ingeniería Mecánica)
Area (Department): Área Urología (Dpto. Cirugía)


Creative Commons You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.


Exported from SIDERAL (2026-04-30-13:58:43)





 Record created 2026-04-30, last modified 2026-04-30


Published version:
 PDF