2020 Rbme Fs
Abstract: Clinical decision-making in healthcare is already being influenced by predictions or recommendations made by data-driven machines. Numerous machine learning applications have appeared in the latest clinical literature, especially for outcome prediction models, with outcomes ranging from mortality and cardiac arrest to acute kidney injury and arrhythmia. In this review article, we summarize the state-of-the-art in related works covering data processing, inference, and model evaluation, in the context of outcome prediction models developed using data extracted from electronic health records. We also discuss limitations of prominent modeling assumptions and highlight opportunities for future research.
Recent artificial intelligence (AI) developments seek to positively impact medicine and clinical practice [1]. Machine learning (ML), an application of AI, recognizes patterns within large quantities of medical data to make predictions, with applications ranging from natural language processing to computer vision [2], [3]. Several ML frameworks have been proposed to predict clinical outcomes within a certain time period in the future, such as cardiac arrest, mortality, or intensive care unit (ICU) admission [4], [5], [6], [7].
In general, designing an ML system involves a multidisciplinary effort that extends from data engineering to training and evaluating a predictive model. We consider the general model as a mapping of an input to an output:
\[ f : X \rightarrow y \qquad (1) \]
where f(·) is a function with parameters Θ, X is the input and y is the output. For example, X can consist of vital-sign measurements of the patient, such as heart rate, blood pressure, and respiratory rate, and y can represent a binary label indicating the occurrence of ICU transfer or cardiac arrest during the patient’s hospital stay [7].
Fig. 1 depicts the typical pipeline of an ML application, starting from the input X, and ending with the corresponding output represented by y. The first task learns to extract intermediary features (Section III) while the second task learns from patterns in the data to produce the predicted label (Section IV). Such models are usually assessed based on clinical utility and interpretability (Section V).
As we discuss related works throughout this review, we also provide an intuitive explanation of the ML techniques used for feature extraction or predictive inference. In general, ‘learning’ how to map the input to the output involves approximating the parameters of the model f(·) using a loss function L(y, ŷ|Θ) and an optimization algorithm. The loss function L(y, ŷ|Θ), also known as the cost function, measures the dissimilarity between the true labels y and the values ŷ predicted by the approximated model (e.g., mean square error, binary cross-entropy, etc.). An optimization algorithm, such as gradient descent [8], minimizes L(y, ŷ|Θ) in an iterative manner based on the examples present in the dataset.
I. Clinical Context & Frameworks of Outcome Prediction Models
Care pathways within hospitals vary largely due to the diversity of admitted patients. Thus, an understanding of the clinical context is key for developing machine learning models that can be incorporated within existing medical processes. As shown in Fig. 2, a patient may be hospitalized as an emergency or elective admission, where the latter constitutes a routine procedure. During hospitalization, different types of data are routinely collected from the patient for monitoring purposes.
Patient monitoring tools, such as early warning systems [9], are widely used across hospital wards to continuously assess patients for deterioration. The definition of what exactly constitutes clinical deterioration has evolved over time based on the data collection and processing techniques. Early attempts to define clinical deterioration focused on medical neglect and its end result of clinical complications [10]. Subsequent studies focused on more discrete clinical events, such as severe sepsis, unexpected cardiac arrest, ICU admission or mortality [11], [12], and tended to select one or more end-point measures of clinical deterioration. Such events incur high costs of prolonged hospital stays, litigation, staff time, impact on patients and staff, and broader economic consequences [13]. The latter definition is the most popular one, as it enables researchers to group patients into discrete classes, such as deteriorating (i.e., those who experience an outcome) and non-deteriorating (i.e., those who are discharged without experiencing any outcomes), and as such infer the y labels.
The framework of outcome prediction models also varies across the literature. Some studies predict the risk of an outcome only once using the patient’s first N hours of data after admission, such as 24 or 48 hours [14]. Others choose to predict the risk of an outcome, such as ICU readmission, using the patient’s last N hours of data prior to discharge. Another common methodology is to develop a real-time alerting score, which computes the risk of deterioration every time a set of clinical
Fig. 1. General ML pipeline that maps an input to a label. The two main steps of the pipeline are (i) extraction of an intermediary
feature space and (ii) label prediction using a classification or clustering algorithm.
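The learning setup described in the introduction (a model f with parameters Θ, a loss L(y, ŷ|Θ), and gradient descent) can be illustrated with a minimal sketch. It assumes a logistic model and the binary cross-entropy loss; the toy data (two standardized vital-sign features), the learning rate, the iteration count, and all function names are hypothetical choices made for illustration and are not taken from any reviewed study.

```python
import math

def predict_proba(x, theta):
    """f(x; theta): probability of the positive label. theta[0] is the bias."""
    z = theta[0] + sum(w * v for w, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(X, y, theta):
    """Mean binary cross-entropy L(y, y_hat | theta)."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, theta)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(y)

def fit(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the mean binary cross-entropy."""
    theta = [0.0] * (len(X[0]) + 1)
    n = len(y)
    for _ in range(n_iter):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(xi, theta) - yi  # dL/dz for one example
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: standardized heart rate and respiratory rate.
X = [[-1.0, -0.8], [-0.5, -1.2], [0.2, -0.3], [0.9, 1.1], [1.3, 0.7], [0.4, 1.5]]
y = [0, 0, 0, 1, 1, 1]

theta0 = [0.0, 0.0, 0.0]
theta = fit(X, y)
loss_before = bce_loss(X, y, theta0)
loss_after = bce_loss(X, y, theta)
```

Each iteration moves Θ against the gradient of the loss, so `loss_after` ends up below `loss_before` on this toy set. In practice the loss would be monitored on held-out data rather than on the training examples.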
observations is collected [15], as in clinical early warning systems [16].
II. Electronic Health Records
Various types of data can be used to develop outcome prediction models, such as imaging, speech, or claims data [17]. Here, we focus on data extracted from electronic health records (EHRs), which are being increasingly deployed in hospitals worldwide. EHRs are used in hospitals to store longitudinal information of patients collected in a care delivery setting. Such information includes patient demographics, vital signs, medications, laboratory data, and a description of any outcomes that may have occurred to the patient during hospitalization, or shortly after discharge.
Data extracted from an EHR database can be used to develop and evaluate ML models. The dataset is typically split into a training set and a testing set (footnote 1), either by a random split or by a nonrandom split based on location or time. According to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, the nonrandom split by time is the strongest evaluation technique as it avoids random variations between the training and testing sets [18]. During model learning, the training set is used to optimize the parameters Θ of the model. The trained model is then evaluated on the held-out test set using various performance metrics.
Footnote 1: In clinical studies, the test set is usually termed the validation set, not to be confused with the portion of the training set used for ML-oriented tasks, such as hyperparameter selection.
Fig. 3 shows the overall dataset sizes, in terms of number of patient admissions, reported in studies published in the last decade (arranged in chronological order from left to right), extracted from EHRs. There is an increase of six orders of magnitude between 2008 and 2019, which highlights the increased accessibility to EHR data for research purposes. Most datasets are reported to be private, and there have only been a few notable efforts to release open access datasets, such as the MIMIC-III database [19]. Data and resource sharing is important for the advancement of the field.
It is also commonly agreed that data in EHRs may reflect the recording process present in the hospital rather than being a direct reflection of patient physiology [20]. First, EHRs are complex as they may include structured and unstructured data; an example of the latter is textual information, which could require natural language processing techniques to process [21]. Additionally, categorical data, such as diagnostic coding, may adopt different coding systems across different institutions.
Another important dimension is data completeness, which may be defined as “the proportion of observations that are actually recorded in the system” [22]. Incompleteness of EHRs can be a result of health service fragmentation due to inefficient communication following patient transfer among institutions, the recording of data taking place only during healthcare episodes that correspond to illness, or the increased personalisation of attributes per patient [20], [23]. Completeness may also vary across institutions based on adopted protocols.
The third challenge is the accuracy of the data, or “the proportion of recorded observations in the system that are correct” [22]. Errors can occur while clinical staff observe a patient or record data. These errors may be random or systematic, and their occurrence may be influenced by factors such as billing requirements or avoidance of liability [20]. The accuracy of EHRs can be assessed by checking agreement between different elements within the EHR (such as assigned diagnosis and supplied medications), or by verifying whether values are within expected ranges [24].
Finally, it is important to verify whether the data was recorded within a reasonable period of time [24]. For example, the recorded collection time of vital signs may precede the time of admission. Although this aspect of
Fig. 2. Visualization of the patient flow in a hospital: Patient is either admitted as an elective or emergency admission, monitored in ward
stay(s) during consultant episode(s). Patient may transfer from one ward to another, or may change the consultant during the in-hospital
stay. * Accident & Emergency patients may be admitted as inpatients or just discharged.
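Two ideas from Section II lend themselves to a short sketch: the nonrandom split by time, and completeness measured as the proportion of observations that are recorded. The record layout, the field names, the cutoff date, and the helper names below are hypothetical, and completeness is approximated here as the fraction of admissions with a non-missing value for a given field.

```python
from datetime import date

def temporal_split(admissions, cutoff):
    """Nonrandom split by time: train on earlier admissions, test on later ones."""
    train = [a for a in admissions if a["admitted"] < cutoff]
    test = [a for a in admissions if a["admitted"] >= cutoff]
    return train, test

def completeness(records, field):
    """Fraction of records in which `field` was actually recorded."""
    recorded = sum(1 for r in records if r.get(field) is not None)
    return recorded / len(records)

# Hypothetical admissions; None marks a value that was never recorded.
admissions = [
    {"id": 1, "admitted": date(2015, 3, 1), "heart_rate": 82, "lactate": None},
    {"id": 2, "admitted": date(2016, 7, 9), "heart_rate": 95, "lactate": 1.4},
    {"id": 3, "admitted": date(2017, 1, 20), "heart_rate": None, "lactate": None},
    {"id": 4, "admitted": date(2018, 5, 2), "heart_rate": 110, "lactate": 3.1},
]

train, test = temporal_split(admissions, cutoff=date(2017, 1, 1))
hr_completeness = completeness(admissions, "heart_rate")
lactate_completeness = completeness(admissions, "lactate")
```

Splitting on admission date keeps every test admission later than every training admission, which mimics deploying a model on future patients.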
data quality is highly dependent on the efficiency of the clinical staff, it also depends on the workflow protocols adopted at different institutions. Timeliness of data must be assessed to evaluate the chronology of data elements in relation to admission or discharge decisions; for example, laboratory results prior to admission may be considered as part of the subsequent admission, or death within 24 hours of discharge can be considered as in-hospital mortality.
This imposes challenges on the usability of the data, which usually requires preliminary data pre-processing as shown in Fig. 4. The first step is to define inclusion and exclusion criteria to extract the patient cohort of interest. The second step involves setting assumptions to aid the analysis of the heterogeneous data, such as defining a minimum length of stay. Finally, meaningful features as input variables to the ML model can be extracted using a variety of techniques.
III. Feature Extraction
The performance of clinical predictive models relies on the feature representation of the data, as in other domains [44]. As reported in related works, feature extraction generally involves at least one of the following: domain expertise for hand-crafted features (Section III-A), data standardization (Section III-B), or representation learning (Section III-C).
A. Hand-crafted Features
Domain expertise is commonly used to provide guidance on the design of the data pre-processing pipeline. This involves (i) preliminary feature selection from the input space, (ii) designing hand-crafted features, and (iii) incorporating prior knowledge of the structure of the data in the model design.
Examples of hand-crafted features in related works are pulse pressure [38], [26], shock index [25], [34], [38], mean arterial pressure [27], [38], oxygen delivery index [34], absolute successive difference of heart rate, estimated cardiac output, slope of fitted regression lines, or slope projections [25]. Statistical measures can be obtained from the distributions of the raw data, such as minimum and maximum extremes, moments (mean, standard deviation, and skewness), percentiles or the difference between two percentiles [25], [45], [32].
Previous research also computed time series features from waveform data [28], [46], [26], [5]. Those features can be categorized into four types: data adaptive, non-data adaptive, model-based and data-dictated approaches [47]. Fourier and wavelet transforms, for instance, decompose raw signals into frequency components and wavelets, respectively. Time domain, Poincaré nonlinear, cross-correlation analysis and geometric measures have also been used to investigate variability of vital signs [5], [26].
Deriving hand-crafted features is a powerful tool in the design of ML models and has been used extensively over the years. However, it is a time-consuming and labor-intensive process, requires expert knowledge, and may not scale well to new problems.
B. Data Standardization
ML algorithms require further data preparation steps to ensure stability of learning. Here, related works reduce the noise, sparsity and irregularity of the clinical data, as well as align the scales of the various predictor variables.
1) Time-series Modeling: Time-series modeling is widely used in studies pertaining to early warning models [29], [40]. It is often used either (i) to infer a pattern of the physiological trajectory or (ii) as an interpolation technique to overcome the sparsity and irregularity of physiological data.
Linear dynamic systems have been previously used to model physiological variables for ICU monitoring [48] and detection of sepsis [49]. Hidden Markov Models (HMMs) were also used to model health trajectories of patients [31], [50]. However, such models cannot easily adapt to irregularly sampled vital-sign data. Additionally, each hidden state in an HMM only depends on the previous state [51]. Another approach for modeling similar data is the kernel-based support vector regression [29].
Fig. 3. Dataset sizes reported in the literature in ascending order from left (2008) to right (2019). The vertical axis represents the dataset size, in terms of the number of patient admissions (axis scale ×10^7), and the horizontal axis represents the reference number.
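Several of the hand-crafted features named in Section III-A have simple closed forms, which the following sketch illustrates. It uses the standard bedside definitions (pulse pressure as systolic minus diastolic pressure, shock index as heart rate divided by systolic pressure, and mean arterial pressure approximated as diastolic pressure plus one third of the pulse pressure). The reviewed studies may define these quantities differently, and the function names and sample values are hypothetical.

```python
import statistics

def pulse_pressure(sbp, dbp):
    """Systolic minus diastolic blood pressure (mmHg)."""
    return sbp - dbp

def shock_index(heart_rate, sbp):
    """Heart rate divided by systolic blood pressure."""
    return heart_rate / sbp

def mean_arterial_pressure(sbp, dbp):
    """Common approximation: diastolic plus one third of the pulse pressure."""
    return dbp + (sbp - dbp) / 3.0

def mean_abs_successive_diff(values):
    """Mean absolute difference between consecutive measurements."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)

def fitted_slope(times, values):
    """Slope of the least-squares regression line of values against times."""
    mt = sum(times) / len(times)
    mv = sum(values) / len(values)
    num = sum((t - mt) * (v - mv) for t, v in zip(times, values))
    den = sum((t - mt) ** 2 for t in times)
    return num / den

def summary_statistics(values):
    """Extremes, first two moments, and an inter-percentile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical hourly heart-rate measurements (beats per minute).
hours = [0, 1, 2, 3, 4]
heart_rate = [80, 84, 90, 97, 105]

features = summary_statistics(heart_rate)
features["hr_slope"] = fitted_slope(hours, heart_rate)
features["hr_masd"] = mean_abs_successive_diff(heart_rate)
features["shock_index"] = shock_index(heart_rate[-1], 120)
```

The resulting dictionary is one possible fixed-length feature vector for a single observation window, which is the form most classical classifiers expect.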
One of the most popular techniques for time series modeling within the clinical domain is Gaussian Process Regression (GPR). GPR is based on a non-parametric stochastic process that offers a probabilistic approach for time-series modeling by providing confidence intervals for estimated values at unobserved time instances. A comprehensive introduction to GPR can be found in [52]. Previous studies illustrate the robustness of the single-task GPR [29], [53], [54] in modeling a single physiological time-series variable. Others focus on multi-task GPR [55], [40], [35], which learns similarities across several time-series variables and models them simultaneously. The use of GPR relies heavily on the choice of the kernel that encodes prior knowledge of any nonlinear time-series dynamics that might be hypothesized to exist in the data. Most recently, neural processes, a class of neural latent variable models, were also introduced as a probabilistic regression approach [56], which generalizes GPR through the use of generative models from deep learning.
Modeling the physiological trajectory of patients has
C. Representation Learning
Learning a suitable lower-dimensional embedding or representation of a high-dimensional input space is a fundamental component of ML research [44]. The embedding can represent a medical concept [61] or summarize a patient’s hospital visit [62]. It often performs better than the raw input for learning subsequent tasks [63], [64], [65]. We now provide an overview of the techniques for obtaining embeddings in related medical applications: (i) standard dimensionality reduction techniques, (ii) distributed representations used in language modelling, (iii) using embedding layers as part of a larger model, or (iv) through the latent space of autoencoders and their variants. Such compact representations are then further used as inputs for classification or clustering purposes (covered in Section IV).
1) Standard Dimensionality Reduction Techniques: One of the most popular statistical dimensionality reduction techniques is principal component analysis (PCA) [66]. PCA transforms a set of possibly correlated variables to a set of linearly uncorrelated components. It
has been used to extract features for various clinical applications [67], [46], [68], such as for the detection of hypotensive episodes [26], mortality prediction across stroke patients [69], or prediction of hospital readmission [70]. The main limitation of PCA is that it extracts linear features that may not represent well the non-linear relationships present in complex clinical data [44]. Another popular technique is independent component analysis (ICA) [71], [37], which transforms the variables to a set of independent components.
2) Distributed Concept Representations: Patient records may contain discrete categorical codes, such as diagnosis, medication, or treatment codes. Several studies [41], [39], [72] propose learning from such variables using embedding techniques derived from the distributional hypothesis in semantic modeling. The distributional hypothesis states that words that appear in similar contexts in large samples of language data are semantically similar [73]. The skip-gram algorithm learns the co-occurrence of information inside a context window of a fixed size [74]. It has been used to convert medical codes to dense representations in [33], [61], [41]. Similar to skip-gram, the Global Vectors (GloVe) algorithm was also used to learn the global co-occurrence matrix of medical codes [75].
3) Embedding Layers: Embedding layers can also be integrated as part of a larger model to transform high-dimensional features into a lower-dimensional space. The embedding can consist of a simple linear transformation [76], [77] or a fully connected (deep) network [4], [76], [72]. One study projected the input into a higher-dimensional space using a convolutional layer [39].
4) Autoencoders and their variants: An autoencoder is a neural network architecture that is often used for dimensionality reduction or feature extraction [78]. It first transforms the input space to a (typically noise-free) lower-dimensional representation using an encoder, and then reconstructs the input from this compact representation. The sparse autoencoder (SAE) enforces a sparsity constraint on the learned representation, and it has been used to learn latent representations of clinical data [30], [62]. The denoising autoencoder (DAE) reconstructs the input from a partially corrupted version of the input. The stacked DAE, which consists of several autoencoders that are initially pre-trained independently and then connected in one network, has also been used for clinical applications [79], [37], [58], [80]. Another popular variant of autoencoders is the variational autoencoder [81], which is a generative model that learns a probabilistic latent space, unlike the previously mentioned discriminative autoencoders.
In Table I, we summarize the feature extraction techniques in related outcome prediction studies. In terms of variable selection, we observe that free clinical text is the least-used input. That may be due to the limited availability of datasets. We also note that representation learning has gained popularity from approximately 2013 onwards, and we expect it to continue to be an active area of research in the near future. The consistent use of hand-crafted features over the years indicates their effectiveness in training ML models. Additionally, time-series modeling may not be widely used as it requires hyperparameter tuning and high computational resources. It also limits end-to-end training of the pipeline, since some operations cannot be differentiated for gradient descent.
IV. Predictive Inference
The extracted features can then be used to train an outcome prediction model. The task can be posed either as a classification (Section IV-A) or clustering (Section IV-B) problem.
A. Outcome Classification Framework
Table II summarizes the different classification models that have been used to predict various clinical outcomes, as presented in recent papers. Most papers compare the performance of their models to those of simple ML techniques, such as regression [42], [77], which were useful statistical techniques long before the rise of ML. We also observe that predictions are often defined within a particular future time-frame, ranging from short-term 48-hour prediction windows [4] to 6 months. The varying definitions in the literature of what exactly constitutes an outcome make it challenging to compare methods directly. Additionally, some studies tend to focus on specific patient subgroups, such as pediatrics [38].
1) Regression Models: Logistic regression is one of the simplest linear classifiers [83] and is often considered as a standard benchmark for sophisticated clinical models [84]. Previous studies used logistic regression to predict hemodynamic instability [25], imminent mortality [85], or the composite outcome of cardiac arrest, unplanned ICU admission, and mortality [12]. However, logistic regression cannot learn non-linear relationships and assumes independence across the input variables.
Decision tree learning involves the stratification of the feature space based on a criterion defined by information theory, such as entropy. One study developed an early warning score based on decision trees, using seven routinely-collected laboratory tests [86], while another constructed an ensemble model with gradient tree boosting and adaptive boosting to predict the likelihood of transfer to pediatric ICU [38]. Despite the high interpretability of the aforementioned studies, they heavily rely on task-specific hand-engineered features and do not learn complex patterns in the data.
2) Kernel Methods: Kernel methods rely on a user-defined kernel function that estimates the ‘similarity’ between pairs of data [87]. The support vector machine is a popular example of kernel methods. It projects data into a higher-dimensional space and finds the optimal discriminatory hyper-planes between classes [88]. The
TABLE I
Overview of feature representation techniques adopted in related works using a variety of predictor variables: vital
signs (VS), laboratory tests (LT), demographic information (DI), diagnostic codes (DC), interventions (INT) such as
procedures and medications, and free text (TEX).
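The kernel ‘similarity’ discussed in Section IV-A2 can be made concrete with a small sketch. It assumes the radial basis function (RBF) kernel, which is one common non-linear choice and is not named in the text; the `gamma` value, the function names, and the sample feature vectors are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2); equals 1 for identical inputs."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def kernel_matrix(X, gamma=0.5):
    """Pairwise similarities for all n samples: n * n kernel evaluations."""
    return [[rbf_kernel(a, b, gamma) for b in X] for a in X]

# Hypothetical standardized feature vectors for three patients.
X = [[0.0, 0.0], [1.0, 1.0], [3.0, -1.0]]
K = kernel_matrix(X)
```

The matrix has one entry per pair of samples, so its size grows quadratically with the number of patients. This is the cost that motivates approximation techniques for large clinical datasets.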
use of support vector machines heavily relies on the choice of the kernel and regularization, and they have shown strong performance in recent clinical applications [28], [89], [90], [34]. Computing the kernel matrix for all pairs of data may be computationally expensive for large clinical datasets, especially when a non-linear kernel is used. Further work must investigate approximation techniques for applications involving large-scale medical data.
3) Deep Learning: Deep learning models are also becoming increasingly popular for outcome prediction tasks [91], [7], [5], [27], [40]. The simplest form of neural networks is the multi-layer perceptron (MLP), which consists of fully-connected perceptrons. The main limitation of the MLP is its inability to account for temporal dependencies. Recurrent neural networks and their variants seek to model temporal behaviour through feedback connections. Both Long Short Term Memory (LSTM) networks [92], [93], [40] and Gated Recurrent Units (GRU) [76], [41] were constructed to predict (and alert in advance of) clinical outcomes. There is also a growing interest in developing ‘end-to-end’ architectures that can jointly extract features and perform classification [77], [82], [94]. Although deep learning techniques are typically characterized by strong performance, their decision-making process lacks interpretability.
B. Clustering for Abnormality Detection
Clustering algorithms are unsupervised learning techniques that group data based on similarity measures. Within the context of predicting adverse clinical outcomes, this can involve creating a ‘dictionary’ or cluster of healthy patients and computing a similarity metric for a new patient [45], [53], [95]. Popular similarity metrics are the Kullback-Leibler (KL) divergence [96] and the Mahalanobis distance [45]. Clustering analysis has also been useful for patient phenotyping [30]. The concept of creating patient dictionaries is a subset of novelty detection. An example of such approaches is ‘one-class classification’ [97], [48].
V. Performance Evaluation
The performance of supervised outcome prediction models on the testing set is evaluated using various statistical methods. Those statistical methods mainly assess the performance of the model in terms of accuracy metrics. In recent years, model interpretability has also become an area of interest as it directly reflects how we translate technologies into clinical practice [98].
A. Performance Metrics
Model discrimination refers to the model’s ability to separate classes of interest. In the context of outcome prediction models, we will here refer to patients who experience an adverse outcome as the positive class, and those who do not as the negative class. Many ML models are trained to compute the probability of the positive class, which is then converted to a binary value by fixing a decision threshold. The predictions are then compared to the true labels and can be classified into one of four categories: (1) True Positives (TP): model correctly
TABLE II
Overview of classifiers used for outcome prediction in related works.
predicts the positive class, (2) True Negatives (TN): model correctly predicts the negative class, (3) False Positives (FP): model incorrectly predicts the positive class, and (4) False Negatives (FN): model incorrectly predicts the negative class.
Accuracy, which summarizes the proportion of correctly classified samples across all samples, is highly biased when using highly imbalanced datasets. Therefore, other metrics are usually considered. Sensitivity, or the True Positive Rate (TPR), assesses the model’s ability to correctly predict the positive class.
\[ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad (2) \]
Specificity, also known as the True Negative Rate (TNR), assesses the model’s ability to correctly predict the negative class.
\[ \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \qquad (3) \]
The receiver operating characteristic (ROC) curve plots the TPR on the vertical axis and (1 − TNR), also known as the False Positive Rate (FPR), on the horizontal axis. The integral under the curve is the Area Under the Receiver Operating Characteristic Curve (AUROC) [99] (footnote 2). The AUROC assesses the model’s overall diagnostic ability as the decision threshold is varied. An AUROC of 0.5 means that the model is making predictions at random in a two-class setting. An AUROC higher than 0.8 implies that the model has good diagnostic ability. An AUROC higher than 0.9 means that the model has excellent diagnostic ability [100].
Footnote 2: Some studies refer to the AUROC as the ‘concordance-statistic’ (C-statistic).
Precision, also known as the Positive Predictive Value (PPV), assesses the proportion of positive predictions that are correct, i.e., the true positives among all samples predicted as positive.
\[ \mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad (4) \]
The Precision-Recall curve, where recall is essentially sensitivity, plots the TPR on the horizontal axis and the Precision on the vertical axis. The integral under the curve is the Area under Precision-Recall Curve (AUPRC). Unlike the AUROC, the AUPRC and PPV are highly sensitive to class imbalance. Outcome prediction models are generally characterized by low AUPRC and PPV values [101]. Due to low PPV values, such systems should be considered as risk stratifiers rather than predictors [26].
There are other commonly assessed metrics, such as the F1-score [102], [91] and the likelihood ratio [103]. Some studies also report the ratio of false positives to true positives [4] and the inverse of the PPV, known as the work-up-to-detection ratio [104], [42]. The efficiency curve [105], [86] is a qualitative summary that plots the number of positives generated at different decision thresholds against the sensitivity of the model. This tool is essential to evaluate the trade-off between the total number of positives and the number of false positives.
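The metrics of Section V-A can be computed directly from predicted probabilities, as in the following minimal sketch. It implements Equations (2) to (4) at a fixed decision threshold and computes the AUROC through its rank interpretation, i.e., the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names, the threshold, and the toy labels and scores are hypothetical.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding the predicted probabilities."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_ratio(numerator, denominator):
    """Return NaN instead of raising when a class or prediction set is empty."""
    return numerator / denominator if denominator else float("nan")

def threshold_metrics(y_true, scores, threshold=0.5):
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return {
        "TPR": safe_ratio(tp, tp + fn),  # sensitivity, Eq. (2)
        "TNR": safe_ratio(tn, tn + fp),  # specificity, Eq. (3)
        "PPV": safe_ratio(tp, tp + fp),  # precision, Eq. (4)
    }

def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

metrics = threshold_metrics(y_true, scores, threshold=0.5)
area = auroc(y_true, scores)
```

The pairwise AUROC costs one comparison per positive-negative pair, which is fine for a sketch but slow for large test sets, where a sort-based computation is preferable.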
B. Interpretability
Despite the good performance of recently introduced ML models, interpretability remains a challenge for their clinical utility [98]. There are various definitions of interpretability in existing literature and they refer to several distinct ideas [106], [107]. Most of these ideas pertaining to the clinical domain revolve around trustworthiness of the results and transparency of the model. In the context of this review, we summarize the efforts of outcome prediction models that considered interpretability as a key component of model assessment.
Mimic learning assumes that shallow models, such as linear models, are interpretable. It aims to identify the features that are potentially relevant to the prediction. It involves first training a deep learning model for a specific clinical task. It then trains a shallow model, such as gradient boosting trees, to mimic the behaviour of the deep learning model [80], [108]. The local interpretable model-agnostic explanation (LIME) [109] generates a local explanation of the model behaviour using a shallow model. It has even been used to explain ML models for the prediction of in-hospital mortality [110]. However, it has also been argued that linear models, rule-based models, and decision trees are not intrinsically interpretable [106]. Other post-hoc interpretability techniques such as saliency maps rely on qualitative visual interpretations commonly used in computer vision applications.
It is often argued that deep learning models compromise interpretability for high accuracy [111]. Thus, there have been recent breakthroughs in developing inherently interpretable deep learning models instead of performing post-hoc interpretation [112]. For instance, attention mechanisms are incorporated within deep learning models and assign normalised weights to a set of features. The weights indicate the feature importance for the prediction of a future diagnosis [94], [39], [75] or high-risk vascular diseases [102]. Other works impose non-negativity [62] or sparsity [30] constraints on the learned embedding space of medical data.
VI. Moving Forward
The prediction of clinical outcomes is essential to detect deterioration in a timely manner and to ease the burden on clinical staff. The development of the ML pipelines and their subsequent performance can also be improved by accounting for a few considerations.
A. Noisy Outcome Labels
To train outcome prediction models, outcome labels are currently being defined based on the occurrence of discrete clinical events. However, such labels may be noisy or inaccurate since EHRs only reflect parts of the hospital experience. For example, while a patient may experience cardiac arrest, the patient may be on terminal care pathways with ‘do not resuscitate’ orders, and such information may not be present in the available dataset.
Additionally, outcome labels are defined based on a specific time-window, where the features are associated with a positive outcome label only if they are within N hours of an outcome. This creates a strict cut-off where data collected prior to this N-hour window is not associated with a future outcome. Realistically speaking, deterioration is likely to develop gradually over time, yet this is the state-of-the-art approach in developing outcome prediction models within clinical practice. Future work should consider time-to-event analysis, which focuses on predicting the time until the occurrence of an outcome, rather than predicting a binary label.
B. Personalized Predictive Models
Most outcome prediction models are developed and evaluated population-wide, and recent advances yield only marginal improvements. As more data is collected per patient, we hypothesize that the predictive power of such models could improve by developing patient-specific models that account for individual-, disease-, and organizational-based factors [113]. At the individual level, factors may include demographics, lifestyle, coexisting medical conditions, or genetic information. Disease-related factors may include degree of severity, medications and therapy, rate of progression, interventions, surgeries, and procedures. Organizational factors may include type of hospital, time of the day, staff ratio, or staff training. This also motivates the advancement of the internet of things in healthcare to enhance the collection of integrated data, which would allow us to move forward towards ‘precision medicine’.
Additionally, in the development of machine learning and deep learning models, it is assumed that the data samples are independent and identically distributed (i.i.d.). However, this may not be the case in practice, since some data samples may belong to the same patient and spatio-temporal patterns may be indicative of deterioration prior to an outcome.
C. General Learning Models
Deep neural networks are powerful processing techniques. However, most of the state-of-the-art models seek to learn how to predict a specific outcome or a particular task, which can generally be referred to as ‘narrow AI’. While some of the motivation behind using representation learning has been to learn general patient representations in order to perform a variety of predictive tasks, more work is needed to develop generalized models that can automatically learn from heterogeneous EHR data to perform diverse tasks.
While recently developed ML models perform well within retrospective studies, validating their success in practice requires prospective analysis. The progress of the field relies on increased multidisciplinary collaborations between ML research scientists and clinicians. While it will take time for both parties to speak the same language, we hope that this review would demystify the
overall ML pipeline and summarize the assumptions and techniques of the state-of-the-art.

References

[1] Kun Hsing Yu, Andrew L. Beam, and Isaac S. Kohane. Artificial intelligence in healthcare, 2018.
[2] Naveed Afzal, Vishnu Priya, Sunghwan Sohn, Hongfang Liu, Rajeev Chaudhry, Christopher G Scott, Iftikhar J Kullo, and Adelaide M Arruda-Olson. Natural language processing of clinical notes for identification of critical limb ischemia. International Journal of Medical Informatics, 111(September 2017):83–89, 2018.
[3] Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C Nelson, Jessica L Mega, and Dale R Webster. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA: the journal of the American Medical Association, 316(22):2402–2410, 2016.
[4] Nenad Tomašev, Xavier Glorot, Jack W. Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, Alistair Connell, Cían O. Hughes, Alan Karthikesalingam, Julien Cornebise, Hugh Montgomery, Geraint Rees, Chris Laing, Clifton R. Baker, Kelly Peterson, Ruth Reeves, Demis Hassabis, Dominic King, Mustafa Suleyman, Trevor Back, Christopher Nielson, Joseph R. Ledsam, and Shakir Mohamed. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767):116–119, 2019.
[5] Hyojeong Lee, Soo-Yong Shin, Myeongsook Seo, Gi-Byoung Nam, and Segyeong Joo. Prediction of Ventricular Tachycardia One Hour before Occurrence Using Artificial Neural Networks. Scientific Reports, 6(August):32390, 2016.
[6] M Aczon, D Ledbetter, L Ho, A Gunny, A Flynn, J Williams, and R Wetzel. Dynamic Mortality Risk Predictions in Pediatric Critical Care Using Recurrent Neural Networks. arXiv, pages 1–18, 2017.
[7] Scott B. Hu, Deborah J L Wong, Aditi Correa, Ning Li, and Jane C. Deng. Prediction of clinical deterioration in hospitalized adult patients with hematologic malignancies using a neural network model. PLoS ONE, 11(8):1–12, 2016.
[8] Sebastian Ruder. An overview of gradient descent optimization algorithms. 2016.
[9] M. E. Beth Smith, Joseph C. Chiovaro, Maya O’Neil, Devan Kansagara, Ana R. Quiñones, Michele Freeman, Makalapua L. Motu’apuaka, and Christopher G. Slatore. Early warning system scores for clinical deterioration in hospitalized patients: A systematic review. Annals of the American Thoracic Society, 11(9):1454–1465, 2014.
[10] Lucian L Leape, Troyen A Brennan, Nan Laird, Ann G Lawthers, Russel Localio, Benjamin A Barnes, Leisi Herbert, Joseph P Newhouse, Paul C Weiler, and Howard Hiatt. The Nature of Adverse Events in Hospitalized Patients: Results of the Harvard Medical Practice Study II. The New England Journal of Medicine, 324(6):377–384, 1991.
[11] Daryl Jones, Imogen Mitchell, Ken Hillman, and David Story. Defining clinical deterioration. Resuscitation, 84(8):1029–1034, 2013.
[12] Matthew M. Churpek, Trevor C. Yuen, and Dana P. Edelson. Predicting clinical deterioration in the hospital: The impact of outcome selection. Resuscitation, 84(5):564–568, 2013.
[13] G Neale, M Woloshynowych, and C Vincent. Exploring the causes of adverse events in NHS hospital practice. Journal of the Royal Society of Medicine, 94(7):322–30, 2001.
[14] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Benchmark of Deep Learning Models on Large Healthcare MIMIC Datasets. 2017.
[15] Farah E. Shamout, Tingting Zhu, Pulkit Sharma, Peter J. Watkinson, and David A. Clifton. Deep Interpretable Early Warning System for the Detection of Clinical Deterioration. IEEE Journal of Biomedical and Health Informatics, 2019.
[16] Royal College of Physicians. National Early Warning Score (NEWS) 2: Standardising the assessment of acute-illness severity in the NHS. Technical report, 2017.
[17] Maggie Makar, Marzyeh Ghassemi, David M. Cutler, and Ziad Obermeyer. Short-Term Mortality Prediction for Elderly Patients Using Medicare Claims Data. International Journal of Machine Learning and Computing, 2015.
[18] Karel G.M. Moons, Douglas G. Altman, Johannes B. Reitsma, John P.A. Ioannidis, Petra Macaskill, Ewout W. Steyerberg, Andrew J. Vickers, David Ransohoff, and Gary S. Collins. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration. Annals of Internal Medicine, 162(1):W1–W74, 2015.
[19] Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li Wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016.
[20] George Hripcsak and David J Albers. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association: JAMIA, 20(1):117–21, 2013.
[21] Jon D Patrick, Dung H M Nguyen, Yefeng Wang, and Min Li. A knowledge discovery and reuse pipeline for information extraction in clinical notes. Journal of the American Medical Informatics Association: JAMIA, 18(5):574–579, 2011.
[22] William R Hogan and Michael M Wagner. Accuracy of data in computer-based patient records. Journal of the American Medical Informatics Association, 4(5):342–355, 1997.
[23] E. M. Mirkes, T. J. Coats, J. Levesley, and A. N. Gorban. Handling missing data in large healthcare dataset: A case study of unknown trauma outcomes. Computers in Biology and Medicine, 75:203–216, 2016.
[24] Nicole Gray Weiskopf and Chunhua Weng. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. Journal of the American Medical Informatics Association: JAMIA, 20:144–151, 2012.
[25] Hanqing Cao, Larry Eshelman, Nicolas Chbat, Larry Nielsen, Brian Gross, and Mohammed Saeed. Predicting ICU hemodynamic instability using continuous multiparameter trends. In Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference, volume 2008, pages 3803–6, 2008.
[26] Joon Lee and Roger G Mark. An investigation of patterns in hemodynamic data indicative of impending hypotension in intensive care. BioMedical Engineering OnLine, 9(1):62, 2010.
[27] Rob Donald, Tim Howells, Ian Piper, I. Chambers, G. Citerio, P. Enblad, B. Gregson, K. Kiening, J. Mattern, P. Nilsson, A. Ragauskas, Juan Sahuquillo, R. Sinnot, and A. Stell. Early Warning of EUSIG-Defined Hypotensive Events Using a Bayesian Artificial Neural Network Article. Acta Neurochirurgica Supplementum, 114(January 2012):87–91, 2012.
[28] Marcus Eng Hock Ong, Christina Hui Lee Ng, Ken Goh, Nan Liu, Zhi Xiong Koh, Nur Shahidah, Tong Tong Zhang, Stephanie Fook-Chong, and Zhiping Lin. Prediction of cardiac arrest in critically ill patients presenting to the emergency department using a machine learning score incorporating heart rate variability compared with the modified early warning score. Critical care (London, England), 16(3):R108, 2012.
[29] David A Clifton and Marco Pimentel. Gaussian Processes for Personalized e-Health Monitoring With Wearable Sensors. IEEE Transactions on Biomedical Engineering, 60(March 2013):193–197, 2013.
[30] Thomas A. Lasko, Joshua C. Denny, and Mia A. Levy. Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data. PLoS ONE, 8(6), 2013.
[31] Shima Ghassempour, Federico Girosi, and Anthony Maeder. Clustering multivariate time series using Hidden Markov
Models. International Journal of Environmental Research and Public Health, 11(3):2741–2763, 2014.
[32] Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, and Peter Szolovits. Unfolding Physiological State: Mortality Modelling in Intensive Care Units. Bone, 23(1):1–7, 2014.
[33] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. In Machine Learning for Healthcare Conference, pages 301–318, 2015.
[34] Curtis E Kennedy, Noriaki Aoki, Michele Mariscalco, and James P Turley. Using Time Series Analysis to Predict Cardiac Arrest in a PICU. Pediatric critical care medicine: a journal of the Society of Critical Care Medicine and the World Federation of Pediatric Intensive and Critical Care Societies, 16(9):332–9, 2015.
[35] Marzyeh Ghassemi, Tristan Naumann, Thomas Brennan, David A Clifton, and Peter Szolovits. A Multivariate Time-series Modeling Approach to Severity of Illness Assessment and Forecasting in ICU with Sparse, Heterogeneous Clinical Data. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 446–453, 2015.
[36] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk Prediction with Electronic Health Records: A Deep Learning Approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. Society for Industrial and Applied Mathematics, 2016.
[37] Riccardo Miotto, Li Li, Brian A. Kidd, and Joel T. Dudley. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports, 6(April):26094, 2016.
[38] Jonathan Rubin, Cristhian Potes, Minnan Xu-Wilson, Junzi Dong, Asif Rahman, Hiep Nguyen, and David Moromisato. An ensemble boosting model for predicting transfer to the pediatric intensive care unit. International Journal of Medical Informatics, 112(January):15–20, 2018.
[39] Huan Song, Deepta Rajan, Jayaraman J. Thiagarajan, and Andreas Spanias. Attend and Diagnose: Clinical Time Series Analysis using Attention Models. arXiv, 2017.
[40] Joseph Futoma, Sanjay Hariharan, and Katherine Heller. Learning to Detect Sepsis with a Multitask Gaussian Process RNN Classifier. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[41] Edward Choi, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association, 24(2):361–370, 2017.
[42] Alvin Rajkomar and Others. Scalable and accurate deep learning for electronic health records. Nature Digital Medicine, 1(1):1–10, 2018.
[43] Joon myoung Kwon, Youngnam Lee, Yeha Lee, Seungwoo Lee, Hyunho Park, and Jinsik Park. Validation of deep-learning-based triage and acuity score using a large national dataset. PLoS ONE, 2018.
[44] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[45] Jimeng Sun, Fei Wang, Jianying Hu, and Shahram Edabollahi. Clustering Overly-Specific Features in Electronic Medical Records. ACM SIGKDD Explorations Newsletter, 14(1):16, 2012.
[46] G. Skolidis, R. H. Clayton, and G. Sanguinetti. Automatic Classification of Arrhythmic Beats Using Gaussian Processes. Computers in Cardiology, 35:921–924, 2008.
[47] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering - A decade review. Information Systems, 53(October 2016):16–38, 2015.
[48] John A. Quinn, Christopher K.I. Williams, and Neil McIntosh. Factorial switching linear dynamical systems applied to physiological condition monitoring. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1537–1551, 2009.
[49] Ioan Stanculescu, Christopher K I Williams, and Yvonne Freer. A Hierarchical Switching Linear Dynamical System Applied to the Detection of Sepsis in Neonatal Condition Monitoring. UAI’14 Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 752–761, 2014.
[50] Li Wei H. Lehman, Shamim Nemati, Ryan P. Adams, and Roger G. Mark. Discovering shared dynamics in physiological signals: Application to patient monitoring in ICU. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, pages 5939–5942, 2012.
[51] Zachary C. Lipton, John Berkowitz, and Charles Elkan. A Critical Review of Recurrent Neural Networks for Sequence Learning. 2015.
[52] Rasmussen and Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[53] Marco A.F. Pimentel, David A. Clifton, and Lionel Tarassenko. Gaussian process clustering for the functional characterisation of vital-sign trajectories. In IEEE International Workshop on Machine Learning for Signal Processing, MLSP, 2013.
[54] Glen Wright Colopy, Stephen J. Roberts, and David A. Clifton. Gaussian Processes for Personalized Interpretable Volatility Metrics in the Step-Down Ward. IEEE Journal of Biomedical and Health Informatics, 2019.
[55] Robert Dürichen, Marco A F Pimentel, Lei Clifton, Achim Schweikard, and David A. Clifton. Multitask Gaussian processes for multivariate physiological time-series analysis. IEEE Transactions on Biomedical Engineering, 62(1):314–322, 2015.
[56] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural Processes. 2018.
[57] T Jayalakshmi and A. Santhakumaran. Statistical Normalization and Back Propagation for Classification. International Journal of Computer Theory and Engineering, 3(1):89–93, 2011.
[58] Patrick Schwab, Gaetano Scebba, Jia Zhang, Marco Delai, and Walter Karlen. Beat by Beat: Classifying Cardiac Arrhythmias with Recurrent Neural Networks. arXiv, 2017.
[59] Harini Suresh, Nathan Hunt, Alistair Johnson, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Clinical Intervention Prediction and Understanding using Deep Networks. arXiv, pages 1–16, 2017.
[60] Narges Razavian, Jake Marcus, and David Sontag. Multi-task Prediction of Disease Onsets from Longitudinal Lab Tests. In Proceedings of the 1st Machine Learning for Healthcare Conference, pages 1–27, 2016.
[61] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning Low-Dimensional Representations of Medical Concepts. AMIA Joint Summits on Translational Science proceedings, pages 41–50, 2016.
[62] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer Representation Learning for Medical Concepts. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pages 1495–1504, 2016.
[63] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 2009.
[64] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 2012.
[65] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Computing Surveys, 2019.
[66] Lindsay I Smith. A tutorial on Principal Components Analysis Introduction. Statistics, 2002.
[67] Paul Sajda. Machine Learning for Detection and Diagnosis of Disease. Annual Review of Biomedical Engineering, 8:537–65, 2006.
[68] Hayden Wimmer and Loreen Powell. Principle Component Analysis for Feature Reduction and Data Preprocessing in
Data Science. In Proceedings of the Conference on Information Systems Applied Research, pages 1–6, 2016.
[69] Songhee Cheon, Jungyoon Kim, and Jihye Lim. The Use of Deep Learning to Predict Stroke Patient Mortality. International Journal of Environmental Research and Public Health, 16(11), 2019.
[70] Denis Krompaß, Cristóbal Esteban, Volker Tresp, Martin Sedlmayr, and Thomas Ganslandt. Exploiting Latent Embeddings of Nominal Clinical Data for Predicting Hospital Readmission. KI - Künstliche Intelligenz, 29(2):153–159, 2015.
[71] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 2000.
[72] Cristóbal Esteban, Oliver Staeck, Yinchong Yang, and Volker Tresp. Predicting Clinical Events by Combining Static and Dynamic Information Using Recurrent Neural Networks. In IEEE International Conference on Healthcare Informatics (ICHI), pages 93–101, 2016.
[73] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–53, 2008.
[74] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. arXiv preprint arXiv:1310.4546, 2013.
[75] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F. Stewart, and Jimeng Sun. GRAM: Graph-based Attention Model for Healthcare Representation Learning. arXiv, pages 1–15, 2016.
[76] Cristóbal Esteban, Danilo Schmidt, Denis Krompaß, and Volker Tresp. Predicting sequences of clinical events by using a personalized temporal latent embedding model. Proceedings - 2015 IEEE International Conference on Healthcare Informatics, ICHI 2015, pages 130–139, 2015.
[77] Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. In NIPS Proceedings, 2016.
[78] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning Book. Deep Learning, 2015.
[79] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre Antoine Manzagol. Stacked denoising autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 2010.
[80] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain. 2015.
[81] Carl Doersch. Tutorial on Variational Autoencoders. 2016.
[82] Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A Convolutional Net for Medical Records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, 2017.
[83] A. J. Scott, D. W. Hosmer, and S. Lemeshow. Applied Logistic Regression. Biometrics, 2006.
[84] Evangelia Christodoulou, Jie Ma, Gary S. Collins, Ewout W. Steyerberg, Jan Y. Verbakel, and Ben Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110:12–22, 2019.
[85] Elsa Loekito, James Bailey, Rinaldo Bellomo, Graeme K. Hart, Colin Hegarty, Peter Davey, Christopher Bain, David Pilcher, and Hans Schneider. Common laboratory tests predict imminent death in ward patients. Resuscitation, 84(3):280–285, 2013.
[86] Stuart W. Jarvis, Caroline Kovacs, Tessy Badriyah, Jim Briggs, Mohammed A. Mohammed, Paul Meredith, Paul E. Schmidt, Peter I. Featherstone, David R. Prytherch, and Gary B. Smith. Development and validation of a decision tree early warning score based on routine laboratory test results for the discrimination of hospital mortality in emergency medical admissions. Resuscitation, 84(11):1494–1499, 2013.
[87] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.
[88] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
[89] Anneleen Daemen, Dirk Timmerman, Thierry Van den Bosch, Cecilia Bottomley, Emma Kirk, Caroline Van Holsbeke, Lil Valentin, Tom Bourne, and Bart De Moor. Improved modeling of clinical data with kernel methods. Artificial Intelligence in Medicine, 54(2):103–114, 2012.
[90] Yukun Chen, Robert J Carroll, Eugenia R McPeek Hinz, Anushi Shah, Anne E Eyler, Joshua C Denny, and Hua Xu. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association: JAMIA, 20(e2):253–9, 2013.
[91] Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, pages 1–16, 2017.
[92] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[93] Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzel. Learning to Diagnose with LSTM Recurrent Neural Networks. In Proceedings of ICLR 2016, pages 1–18, 2015.
[94] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[95] Tingting Zhu, Glen Wright Colopy, Clare MacEwen, Katherine Niehaus, Yang Yang, Chris W. Pugh, and David A. Clifton. Patient-Specific Physiological Monitoring and Prediction Using Structured Gaussian Processes. IEEE Access, 7:58094–58103, 2019.
[96] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 1951.
[97] Marco A.F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[98] Muhammad Aurangzeb Ahmad, Ankur Teredesai, and Carly Eckert. Interpretable machine learning in healthcare. In Proceedings - 2018 IEEE International Conference on Healthcare Informatics, ICHI 2018, 2018.
[99] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[100] Gary B. Smith, David R. Prytherch, Paul E. Schmidt, and Peter I. Featherstone. Review and performance evaluation of aggregate weighted ‘track and trigger’ systems. Resuscitation, 77(2):170–179, 2008.
[101] Peter J. Watkinson, Marco A.F. Pimentel, David A. Clifton, and Lionel Tarassenko. Manual centile-based early warning scores derived from statistical distributions of observational vital-sign data. Resuscitation, 129(June):55–60, 2018.
[102] You Jin Kim, Yun-Geun Lee, Jeong Whun Kim, Jin Joo Park, Borim Ryu, and Jung-Woo Ha. High Risk Prediction from Electronic Medical Records via Deep Attention Networks. In NIPS Proceedings, 2017.
[103] Marko Hoikka, Tom Silfvast, and Tero I. Ala-Kokko. Does the prehospital National Early Warning Score predict the short-term mortality of unselected emergency patients? Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine, 2018.
[104] Santiago Romero-Brufau, Jeanne M. Huddleston, Gabriel J. Escobar, and Mark Liebow. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Critical Care, 19(1):19–24, 2015.
[105] David R. Prytherch, Gary B. Smith, Paul E. Schmidt, and Peter I. Featherstone. ViEWS-Towards a national early warning score for detecting adult inpatient deterioration. Resuscitation, 81(8):932–937, 2010.
[106] Zachary C. Lipton. The mythos of model interpretability. Communications of the ACM, 61(10):35–43, 2018.
[107] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. 2017.
[108] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable Deep Models for ICU Outcome Prediction. AMIA ... Annual Symposium proceedings. AMIA Symposium, 2016:371–380, 2016.
[109] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[110] Shane Nanayakkara, Sam Fogarty, Michael Tremeer, Kelvin Ross, Brent Richards, Christoph Bergmeir, Sheng Xu, Dion Stub, Karen Smith, Mark Tacey, Danny Liew, David Pilcher, and David M. Kaye. Characterising risk of in-hospital mortality following cardiac arrest using machine learning: A retrospective international registry study. PLoS Medicine, 15(11):1–16, 2018.
[111] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–158, 2012.
[112] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
[113] Daryl Jones, Imogen Mitchell, Ken Hillman, and David Story. Defining clinical deterioration. Resuscitation, 84(8):1029–1034, 2013.