ECEESPE2025 Poster Presentations Thyroid (141 abstracts)
1Singapore General Hospital, Department of Endocrinology, Singapore, Singapore; 2Singapore General Hospital, Data Science and Artificial Intelligence Laboratory, Singapore, Singapore; 3National University of Singapore, Institute of Systems Science, Singapore, Singapore
JOINT281
Introduction: Thyroid nodules are a prevalent problem in the general population. To date, commercial applications of artificial intelligence (AI) solutions for nodule risk classification have used traditional machine-learning models. Large Language Models (LLMs), especially those equipped for multimodal tasks combining text and image data, have shown promise in various applications, including medical diagnostics. Importantly, they potentially offer flexibility for application across different imaging classification tasks. This study investigates the effectiveness of a multimodal vision-language model in the ultrasound-based risk stratification of thyroid nodules using the ACR TI-RADS risk stratification system, exploring the model's accuracy, consistency, and the influence of prompt engineering.
Methods: We utilized Microsoft's open-source LLaVA model and its instruction-tuned variant, LLaVA-Med, to assess 192 thyroid nodules from ultrasound cine-clip images with ACR TI-RADS descriptors. The study analyzed model outputs and the effects of basic versus modified prompts, and of images with and without radiologist-annotated regions of interest. Accuracy of the LLM outputs was measured against manual assessments, along with the consistency of repeated outputs.
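To illustrate the workflow described above, the sketch below shows how a LLaVA-style vision-language model could be queried with a single ultrasound frame and a TI-RADS component prompt via the Hugging Face transformers API. This is a minimal sketch under stated assumptions: the checkpoint identifier, prompt wording, frame filename, and generation settings are illustrative and are not the study's actual configuration.

```python
# Minimal sketch of querying a LLaVA-style vision-language model with an
# ultrasound frame and an ACR TI-RADS prompt. Checkpoint name, prompt text,
# and generation settings are illustrative assumptions, not the study's setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; LLaVA-Med would be loaded analogously
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Example "modified" prompt asking for one TI-RADS component at a time.
prompt = (
    "USER: <image>\n"
    "You are assessing a thyroid nodule on ultrasound using ACR TI-RADS. "
    "Classify the nodule's composition as one of: cystic, spongiform, "
    "mixed cystic and solid, or solid. Answer with the category only.\n"
    "ASSISTANT:"
)

image = Image.open("nodule_frame.png")  # single frame extracted from a cine clip
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)

response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response.split("ASSISTANT:")[-1].strip())
```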
Results: Out of 4,608 responses, 83.3% were deemed valid, with prompt engineering improving the frequency of valid responses. The LLaVA-Med model demonstrated higher accuracy in classifying individual TI-RADS components, including composition (42.1% vs 20.3%, P < 0.001) and echogenicity (57.3% vs 49.9%, P = 0.004), compared to the base model, but overall TI-RADS classification accuracy remained low for both models (31.9% vs 38.9%, P = 0.004). The use of labelled images improved accuracy in classifying nodule margins (58.2% vs 53.0%, P = 0.040). Prompt engineering improved the consistency of the overall TI-RADS classification (52.1% vs 26.6%, P < 0.001), but its effect on accuracy varied across different components.
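The abstract does not specify how response validity, accuracy, and consistency were operationalized; the sketch below shows one plausible scoring scheme over repeated model outputs per nodule. The metric definitions, function names, and data layout are assumptions for illustration only.

```python
# Hypothetical sketch of scoring validity, accuracy, and consistency of
# repeated model responses per nodule. Metric definitions are assumptions;
# the abstract does not give the exact formulas used in the study.
from collections import Counter

VALID_TR = {"TR1", "TR2", "TR3", "TR4", "TR5"}  # ACR TI-RADS levels

def score(responses_per_nodule, reference_labels):
    """responses_per_nodule: dict nodule_id -> list of repeated model outputs.
    reference_labels: dict nodule_id -> radiologist-assigned TI-RADS level."""
    valid = accurate = consistent = total = 0
    for nodule_id, responses in responses_per_nodule.items():
        parsed = [r.strip().upper() for r in responses]
        valid += sum(p in VALID_TR for p in parsed)
        total += len(parsed)
        counts = Counter(p for p in parsed if p in VALID_TR)
        if not counts:
            continue
        mode, mode_n = counts.most_common(1)[0]
        # Consistency: all valid repeats for a nodule give the same level.
        consistent += int(mode_n == sum(counts.values()))
        # Accuracy: modal response matches the manual assessment.
        accurate += int(mode == reference_labels[nodule_id].upper())
    n = len(responses_per_nodule)
    return {
        "valid_rate": valid / total,
        "consistency_rate": consistent / n,
        "accuracy_rate": accurate / n,
    }
```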
Conclusion: This study explores the use of open-source, multimodal LLMs as a resource-efficient method of end-to-end thyroid nodule risk stratification, including commonly employed methods of performance optimization. However, the mixed results highlight the challenges in achieving clinically meaningful performance in their current form. The results suggest that while instruction-tuning and prompt engineering can enhance model output, inherent technical limitations in image interpretation and model stochasticity restrict clinical utility. Future developments should build on these findings and explore efficient prompting techniques to improve accuracy and consistency in clinical applications.