Accuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation

Borque-Fernando, Ángel; Fernández-Pello, Sergio; Guerrero Ramos, Félix; Navarro, Denis; Rubio-Briones, José; Esteban, Luis M.; Doblare, Manuel; Perez-Fentes, Daniel; Fernández Aparicio, Tomás; Izquierdo, Laura; Rodríguez Faba, Oscar; Álvarez-Ossorio Fernández, José Luis; Álvarez-Maestro, Mario; Fernández-Gómez, Jesús María; Medina-López, Rafael A.

doi:10.1177/17562872261436905

000170998 001__ 170998
000170998 005__ 20260430151736.0
000170998 0247_ $$2doi$$a10.1177/17562872261436905
000170998 0248_ $$2sideral$$a148950
000170998 037__ $$aART-2026-148950
000170998 041__ $$aeng
000170998 100__ $$0(orcid)0000-0003-0178-4567$$aBorque-Fernando, Ángel$$uUniversidad de Zaragoza
000170998 245__ $$aAccuracy of large language models in interpreting urological clinical guidelines: a comparative study with expert evaluation
000170998 260__ $$c2026
000170998 5060_ $$aAccess copy available to the general public$$fUnrestricted
000170998 5203_ $$aAbstract
Background: Large language models (LLMs) are increasingly being explored to support evidence-based decision-making in urology, but their accuracy in interpreting and applying clinical guidelines remains uncertain.
Objectives: We aimed to evaluate the ability of LLMs to interpret and apply clinical guidelines across the full spectrum of major urological cancers.
Design: This expert-validated study evaluated six configurations of three top LLMs (Claude, Gemini, and ChatGPT) using 25 structured questions for each of the seven major urological cancers: prostate cancer, upper tract urothelial carcinoma, muscle-invasive and non-muscle-invasive bladder cancer, renal cell carcinoma, penile cancer, and testicular cancer.
Methods: Both simple and rephrased prompts were used to assess the impact of prompt engineering on response quality. All figures and tables from the English-language EAU guidelines were systematically converted into plain, structured text and peer reviewed by multidisciplinary experts before evaluating the LLM responses. Each response was independently rated by 9–11 uro-oncology specialists using a five-point Likert scale (1: incorrect/unacceptable, 5: optimal), resulting in 10,500 evaluations.
Results: Claude achieved the highest overall accuracy, with 45.9% of responses rated as optimal (Likert 5) and 87% as optimal/acceptable (Likert 4–5). Tumor-specific performance peaked in muscle-invasive bladder (56.7% optimal, 93% optimal/acceptable), penile (49.5%, 95%), and testicular cancer (60.9%, 94%). Gemini and ChatGPT showed lower optimal rates but acceptable performance (68%–70% optimal/acceptable). Rephrased prompts did not consistently outperform simple versions. All models showed acceptable accuracy, but the results should be interpreted cautiously due to recency bias and the rapid evolution of LLM technology.
Conclusion: This study demonstrates the value of rigorous plain language adaptation and expert validation in benchmarking LLMs, supporting their potential as decision-support tools in uro-oncology.
000170998 540__ $$9info:eu-repo/semantics/openAccess$$aby$$uhttps://creativecommons.org/licenses/by/4.0/deed.es
000170998 655_4 $$ainfo:eu-repo/semantics/article$$vinfo:eu-repo/semantics/publishedVersion
000170998 700__ $$0(orcid)0000-0002-0795-8743$$aNavarro, Denis$$uUniversidad de Zaragoza
000170998 700__ $$0(orcid)0000-0001-8741-6452$$aDoblare, Manuel$$uUniversidad de Zaragoza
000170998 700__ $$0(orcid)0000-0002-3007-302X$$aEsteban, Luis M.
000170998 700__ $$aPerez-Fentes, Daniel
000170998 700__ $$aÁlvarez-Maestro, Mario
000170998 700__ $$aMedina-López, Rafael A.
000170998 700__ $$aRodríguez Faba, Oscar
000170998 700__ $$aRubio-Briones, José
000170998 700__ $$aFernández-Pello, Sergio
000170998 700__ $$aFernández-Gómez, Jesús María
000170998 700__ $$aFernández Aparicio, Tomás
000170998 700__ $$aGuerrero Ramos, Félix
000170998 700__ $$aIzquierdo, Laura
000170998 700__ $$aÁlvarez-Ossorio Fernández, José Luis
000170998 7102_ $$15008$$2785$$aUniversidad de Zaragoza$$bDpto. Ingeniería Electrón.Com.$$cÁrea Tecnología Electrónica
000170998 7102_ $$15004$$2605$$aUniversidad de Zaragoza$$bDpto. Ingeniería Mecánica$$cÁrea Mec.Med.Cont. y Teor.Est.
000170998 7102_ $$11013$$2817$$aUniversidad de Zaragoza$$bDpto. Cirugía$$cÁrea Urología
000170998 773__ $$g18 (2026)$$tTherapeutic Advances in Urology$$x1756-2872
000170998 8564_ $$s3603222$$uhttps://zaguan.unizar.es/record/170998/files/texto_completo.pdf$$yPublished version
000170998 8564_ $$s2729312$$uhttps://zaguan.unizar.es/record/170998/files/texto_completo.jpg?subformat=icon$$xicon$$yPublished version
000170998 909CO $$ooai:zaguan.unizar.es:170998$$particulos$$pdriver
000170998 951__ $$a2026-04-30-13:58:43
000170998 980__ $$aARTICLE

Repositorio Institucional de Documentos