Predicting mortality in both diabetes and open-source clinical datasets from free text entries using machine learning (natural language processing)

Christopher Sainsbury; Andrew Conkie; Mark Buchner; Ann Wales; Gregory Jones

doi:10.1530/endoabs.65.P219

P219

Endocrine Abstracts (2019) 65 P219 | DOI: 10.1530/endoabs.65.P219

SFEBES2019 POSTER PRESENTATIONS Metabolism and Obesity (104 abstracts)

Christopher Sainsbury ¹ , Andrew Conkie ² , Mark Buchner ³ , Ann Wales ⁴ & Gregory Jones ¹

Author affiliations

¹Diabetes Centre, Gartnavel General Hospital, Glasgow, UK; ²Redstar Consulting, Glasgow, UK; ³Tactuum Ltd, Glasgow, UK; ⁴Scottish Government Digital Health and Care Division, Edinburgh, UK

Objective: We aimed to test the utility of semantic analysis to predict all-cause mortality from free-text entries in both a national diabetes database and an open-source clinical dataset (MIMIC-III). We analysed text entries alone in order to understand the potential of language analysis to predict outcome.

Method: Diabetes dataset: An analysis period of 3 years was defined, during which clinical text data were extracted. Mortality status at 1 year after the analysis period was then identified. Data were preprocessed and divided randomly into training/validation and test sets in a 0.8:0.2 ratio. The training/validation set was further divided randomly in a 0.8:0.2 ratio. Dimensionality reduction was performed using an embedding, and a combined convolutional and recurrent (LSTM) neural network was trained on the training subset for 20 epochs. Class imbalance was managed by applying class weights. The trained model was used to predict outcome on the withheld test set, and the area under the receiver operating characteristic curve (AUROC) was calculated. MIMIC-III dataset: A similar methodology was applied, with some further development of the neural network architecture.
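The nested 0.8:0.2 split and the class weighting described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code: the function names, the fixed random seed, the toy data, and the inverse-frequency weighting formula are all assumptions, since the abstract does not state how the class weights were derived.

```python
import random
from collections import Counter


def nested_split(records, test_frac=0.2, val_frac=0.2, seed=0):
    """Randomly split records into train, validation and test lists.

    First holds out `test_frac` of all records as the test set, then
    holds out `val_frac` of the remainder as the validation set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


def balanced_class_weights(labels):
    """Inverse-frequency weights (assumed scheme):
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * c) for cls, c in counts.items()}


# Hypothetical toy data: 1000 records, 4% labelled 1 (died).
labels = [1 if i % 25 == 0 else 0 for i in range(1000)]
train, val, test = nested_split(list(enumerate(labels)))
weights = balanced_class_weights([y for _, y in train])
```

With weights of this form the rarer class (death) contributes more to the training loss per example, so that each class carries equal total weight.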

Result: Diabetes dataset: 53 954 individuals with data were identified. 2292 deaths were recorded at 1 year post analysis. The AUROC of model predictions on the withheld test set was 0.62.

MIMIC-III dataset: 11 518 individuals were identified, with 2045 deaths at 1 year. The AUROC of model predictions on the withheld test set was 0.86.
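For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen individual who died receives a higher predicted score than a randomly chosen survivor. The sketch below computes it directly from that definition and also derives the 1-year death rate in each dataset from the counts reported above. The function name and the toy scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: four individuals, two of whom died (label 1).
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# 1-year death rates from the counts reported in the Result section.
diabetes_death_rate = 2292 / 53954
mimic_death_rate = 2045 / 11518
```

The two death rates differ considerably, which is one reason class weighting matters and why the two AUROC values are not directly comparable as measures of the same task.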

Conclusion: By learning from clinicians' summaries, NLP has the potential to leverage the clinical understanding of multiple clinicians and integrate information from multiple data sources. These models may be trained on outcomes such as mortality, aiding risk stratification, or on outcomes such as response to particular therapeutic agents, aiding clinical decision making.