ECEESPE2025 Poster Presentations Thyroid (141 abstracts)
1Singapore General Hospital, Department of Endocrinology, Singapore, Singapore; 2Singapore General Hospital, Data Science and Artificial Intelligence Laboratory, Singapore, Singapore; 3National University of Singapore, Institute of Systems Science, Singapore, Singapore
JOINT281
Introduction: Thyroid nodules are a prevalent problem in the general population. To date, commercial applications of artificial intelligence (AI) solutions for nodule risk classification have used traditional machine-learning models. Large Language Models (LLMs), especially those equipped for multimodal tasks combining text and image data, have shown promise in various applications, including medical diagnostics. Importantly, they potentially offer flexibility for application across different imaging classification tasks. This study investigates the effectiveness of a multimodal vision-language model in the ultrasound-based risk stratification of thyroid nodules using the ACR TI-RADS risk stratification system, exploring the model's accuracy, consistency, and the influence of prompt engineering.
Methods: We utilized Microsoft's open-source LLaVA model and its instruction-tuned variant, LLaVA-Med, to assess 192 thyroid nodules from ultrasound cine-clip images with ACR TI-RADS descriptors. The study analyzed model outputs and the effects of basic versus modified prompts, and of images with and without radiologist-annotated regions of interest. Accuracy of the LLM outputs was measured against manual assessments, along with the consistency of repeated outputs.
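To illustrate the workflow described above, the sketch below shows how a LLaVA-style vision-language model could be queried with a single ultrasound frame and a TI-RADS component prompt via the Hugging Face transformers API. This is a minimal sketch under stated assumptions: the checkpoint identifier, prompt wording, frame filename, and generation settings are illustrative and are not the study's actual configuration.

```python
# Minimal sketch of querying a LLaVA-style vision-language model with an
# ultrasound frame and an ACR TI-RADS prompt. Checkpoint name, prompt text,
# and generation settings are illustrative assumptions, not the study's setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; LLaVA-Med would be loaded analogously
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Example "modified" prompt asking for one TI-RADS component at a time.
prompt = (
    "USER: <image>\n"
    "You are assessing a thyroid nodule on ultrasound using ACR TI-RADS. "
    "Classify the nodule's composition as one of: cystic, spongiform, "
    "mixed cystic and solid, or solid. Answer with the category only.\n"
    "ASSISTANT:"
)

image = Image.open("nodule_frame.png")  # single frame extracted from a cine clip
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)

response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response.split("ASSISTANT:")[-1].strip())
```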
Results: Out of 4,608 responses, 83.3% were deemed valid, with prompt engineering improving the frequency of valid responses. The LLaVA-Med model demonstrated higher accuracy in classifying individual TI-RADS components, including composition (42.1% vs 20.3%, P < 0.001) and echogenicity (57.3% vs 49.9%, P = 0.004), compared to the base model, but overall TI-RADS classification accuracy remained low for both models (31.9% vs 38.9%, P = 0.004). The use of labelled images improved accuracy in classifying nodule margins (58.2% vs 53.0%, P = 0.040). Prompt engineering improved the consistency of the overall TI-RADS classification (52.1% vs 26.6%, P < 0.001), but its effect on accuracy varied across different components.
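The abstract does not specify how response validity, accuracy, and consistency were operationalized; the sketch below shows one plausible scoring scheme over repeated model outputs per nodule. The metric definitions, function names, and data layout are assumptions for illustration only.

```python
# Hypothetical sketch of scoring validity, accuracy, and consistency of
# repeated model responses per nodule. Metric definitions are assumptions;
# the abstract does not give the exact formulas used in the study.
from collections import Counter

VALID_TR = {"TR1", "TR2", "TR3", "TR4", "TR5"}  # ACR TI-RADS levels

def score(responses_per_nodule, reference_labels):
    """responses_per_nodule: dict nodule_id -> list of repeated model outputs.
    reference_labels: dict nodule_id -> radiologist-assigned TI-RADS level."""
    valid = accurate = consistent = total = 0
    for nodule_id, responses in responses_per_nodule.items():
        parsed = [r.strip().upper() for r in responses]
        valid += sum(p in VALID_TR for p in parsed)
        total += len(parsed)
        counts = Counter(p for p in parsed if p in VALID_TR)
        if not counts:
            continue
        mode, mode_n = counts.most_common(1)[0]
        # Consistency: all valid repeats for a nodule give the same level.
        consistent += int(mode_n == sum(counts.values()))
        # Accuracy: modal response matches the manual assessment.
        accurate += int(mode == reference_labels[nodule_id].upper())
    n = len(responses_per_nodule)
    return {
        "valid_rate": valid / total,
        "consistency_rate": consistent / n,
        "accuracy_rate": accurate / n,
    }
```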
Conclusion: This study explores the use of open-source, multimodal LLMs as a resource-efficient method of end-to-end thyroid nodule risk stratification, including commonly employed methods of performance optimization. However, the mixed results highlight the challenges in achieving clinically meaningful performance in their current form. The results suggest that while instruction-tuning and prompt engineering can enhance model output, inherent technical limitations in image interpretation and model stochasticity restrict clinical utility. Future developments should build on these findings and explore efficient prompting techniques to improve accuracy and consistency in clinical applications.