Objective: We aimed to test the utility of semantic analysis for predicting all-cause mortality from free-text entries in both a national diabetes database and an open-source clinical dataset (MIMIC-III). We analysed text entries alone, in order to fully understand the potential of language analysis to predict outcome.
Method: Diabetes dataset: An analysis period of 3 years was defined, during which clinical text data were extracted. Mortality status at 1 year was identified. Data were preprocessed and divided randomly into training/validation and test sets in a ratio of 0.8:0.2. The training/validation set was further randomly divided 0.8:0.2. Dimensionality reduction was performed using embedding, and a combined convolutional and recurrent (LSTM) neural network was trained on the training subset for 20 epochs. Class imbalance was managed by applying class weights. Outcome was predicted on the withheld test set using the trained model, and the area under the receiver operating characteristic curve (AUROC) was calculated. MIMIC-III dataset: A similar methodology was applied, with a more sophisticated neural network architecture.
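The data handling described in the Method can be illustrated with a minimal sketch, offered as an assumption-laden example and not as the authors' implementation. It shows a nested random 0.8:0.2 split, one common "balanced" class-weighting scheme (the abstract does not state which scheme was used), and a rank-based AUROC. The function names `split`, `balanced_class_weights` and `auroc`, the seed, and the toy data are all hypothetical.

```python
import random


def split(items, held_fraction, rng):
    """Randomly split items into (kept part, held-out part)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held = round(len(shuffled) * held_fraction)
    return shuffled[n_held:], shuffled[:n_held]


def balanced_class_weights(labels):
    """Assumed scheme: weight each class by n / (n_classes * n_in_class)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def auroc(labels, scores):
    """AUROC via the rank-sum identity; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy usage with hypothetical record identifiers, not study data.
rng = random.Random(0)
record_ids = list(range(100))
train_val, test = split(record_ids, 0.2, rng)   # 0.8:0.2
train, val = split(train_val, 0.2, rng)         # further 0.8:0.2
print(len(train), len(val), len(test))          # 64 16 20

# One death among ten records: the minority class gets the larger weight.
print(balanced_class_weights([0] * 9 + [1]))

# Labels are 1 for death, 0 for survival; scores are predicted risks.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice the class weights would be computed on the training subset only and passed to the training routine, while the AUROC would be computed once on the withheld test set.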
Result: Diabetes dataset: 53 954 individuals with data were identified, and 2292 deaths were recorded at 1 year after the analysis period. The AUROC of model predictions on the withheld test set was 0.62.
MIMIC-III: 11 518 individuals were identified, with 2045 deaths at 1 year. The AUROC of model predictions on the withheld test set was 0.86.
Conclusion: By learning from clinicians' summaries, NLP has the potential to leverage the clinical understanding of multiple clinicians and integrate information from multiple data sources. These models may be trained on outcomes such as mortality, aiding risk stratification, or on outcomes such as response to particular therapeutic agents, aiding clinical decision-making.