A core friction in applying data science methods to healthcare is the tension between the desire to make a positive impact via implementation and the need to be sufficiently skeptical of the rigor of your own work. This isn’t some abstract thought experiment: despite a very large amount of investment¹ and energy in publication², there are very few ML/AI systems in place in healthcare institutions. As a data scientist working in healthcare myself, I know how difficult it is to get stakeholders meaningfully invested in implementation, to say nothing of the bevy of complications involved in the field itself³ ⁴ ⁵. We must balance enthusiasm for our work against the real-world risks of bad modeling.
Enter COVID-19 modeling: seemingly the perfect foothold for making a positive impact through ML/AI. Others have had a similar thought, leading to repositories like the Reichlab GitHub, which compiles submitted models and whose best ‘model’ is an ensemble of the submissions⁶. The CDC, rather than picking a single ‘best’ model, likewise uses an ensemble of models⁷. Ensembling is a practical response to a complex problem, but it skirts the core issue: these ensembles are being used because the published models are, individually, pretty awful.
Using the Reichlab COVID-19 forecast hub, I’ve pulled in predictions from every group on a single category of data (new cases) and calculated the absolute scaled error (ASE) for each model posted on 9/6/2020. ASE is calculated as:
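In the standard scaled-error definition (the one underlying MASE, from Hyndman & Koehler), each absolute forecast error is divided by the in-sample mean absolute error of a one-step naive forecast. A minimal sketch of that calculation follows; the function and variable names are illustrative, not the hub’s actual scoring code, which may differ in details:

```python
import numpy as np

def absolute_scaled_error(actual, predicted, history):
    """Absolute scaled error: |forecast error| divided by the mean
    absolute one-step naive error over the historical series.
    (Standard Hyndman & Koehler scaling; a sketch, not the hub's
    exact implementation.)"""
    history = np.asarray(history, dtype=float)
    # denominator: how badly a naive "tomorrow = today" forecast did in-sample
    naive_mae = np.mean(np.abs(np.diff(history)))
    errors = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    return errors / naive_mae

# toy example: weekly new-case history, then one forecast vs. the truth
history = [100, 120, 90, 110, 130]
ase = absolute_scaled_error(actual=[140], predicted=[70], history=history)
# ase[0] ≈ 3.11: the forecast missed by ~3x the naive in-sample error
```

An ASE of 1 means the forecast erred exactly as much as the naive one-step forecast did on average; values well above 1 mean the model underperforms even that trivial benchmark.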
Before we investigate the entire set, let’s review a single example: MIT’s CovAlliance SIR model.
That is some error distribution. Given how ASE works, errors are penalized more heavily when the underlying counts are near 0, though an ASE of 1750 is pretty outrageous regardless. For readability’s sake, let’s lop off the worst of the long tail (at 100) and look at a slightly truncated version.
OK, now it’s legible, but what does that tell us? A quick call to .median() and .mean() will get us to the bottom of this:
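The truncation and summary steps above amount to something like the following, assuming the per-prediction errors live in a pandas Series called `errors` (a hypothetical name, stocked here with made-up values for illustration):

```python
import pandas as pd

# hypothetical ASE values with a long tail, like the distributions above
errors = pd.Series([0.5, 1.2, 3.0, 7.5, 42.0, 250.0, 1750.0])

# lop off the worst of the long tail at 100 for a readable histogram
truncated = errors[errors <= 100]

# the headline summary statistics, computed on the full (untruncated) series
median_ase = errors.median()
mean_ase = errors.mean()
```

Note that the truncation is purely cosmetic, for plotting: the median and mean should be taken over the full series, and a long tail like this is exactly why the two statistics diverge so sharply.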
Interpreting this for a minute before broadening our scope in search of context: a mean ASE of 7.03 means that the model’s predictions are, on average, off by roughly ±700% of the true value. While I haven’t yet given evidence to support this claim, that is very, very poor. Let’s now broaden our horizon to a model from JHU:
Ok, we get it: long tails. What’s the median/mean?
That is… a lot worse. Let’s look at all of them combined, then. Shown below are the summarized errors for all models on the Reichlab GitHub that made new case predictions between 8/30–9/3.
The models without extremely long tails are only predicting state-level values. We could pull those out, but honestly, let’s just cut to the median/mean chase here.
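Summarizing every model at once is a short operation if the forecast errors sit in a tidy DataFrame; the column names and values below are assumptions for illustration, not the hub’s actual schema:

```python
import pandas as pd

# hypothetical tidy frame: one row per (model, prediction) error
df = pd.DataFrame({
    "model": ["MIT", "MIT", "JHU", "JHU", "Baseline", "Baseline"],
    "ase":   [2.0,   12.0,  5.0,   95.0,  1.0,        3.0],
})

# per-model median/mean ASE, best (lowest median) first
summary = (df.groupby("model")["ase"]
             .agg(["median", "mean"])
             .sort_values("median"))
```

The median/mean split matters: with tails this long, a model’s mean ASE can be dominated by a handful of catastrophic forecasts, which is why the comparisons that follow lean on the median.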
For median new case predictions, it looks like the best case is around 1–2; that is, the best published models are off by roughly ±100–200%. OK, one last thing to note before we really dig into this is that last item in Fig. 7: the “Baseline”. One difficulty with forecasting time series data is understanding how good your model’s fit is. With a highly interpretable model like a linear regression, we have a few useful metrics: the coefficients and their associated p-values tell us how likely the observed relationship is, how strong it is, and in what direction it runs, and we can also look at the residuals and calculate an r² value. With a time series, there isn’t a clear ‘baseline’ fit metric. Instead, statisticians who work with this sort of data often use ‘unskilled’ forecasts or models as a reference. Common examples include literal straight lines. For example:
You can see that at 7/27 a simple straight line is extended two weeks into the future. We can use this as a baseline to assess how much better or worse a proposed model is by comparison. Now we can return to Fig. 7’s baseline: the Reichlab baseline is, in fact, such a line.
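A minimal sketch of such an ‘unskilled’ forecast is just carrying the last observed value forward as a flat line (Reichlab’s actual baseline model also builds uncertainty intervals around a similar point forecast, so this is a simplification):

```python
import numpy as np

def flat_baseline(history, horizon):
    """'Unskilled' forecast: extend the last observed value as a
    straight horizontal line `horizon` steps into the future."""
    return np.full(horizon, history[-1], dtype=float)

# e.g. standing at 7/27, project two weeks of daily new cases
observed = [900, 950, 870, 920]
baseline_forecast = flat_baseline(observed, horizon=14)
# every one of the 14 projected days equals the last observed value, 920
```

Despite requiring zero modeling skill, this line is exactly the benchmark that, as we’re about to see, many published models fail to beat.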
Now we get to the heart of the matter: of the 28 models producing new COVID-19 case forecasts publicly on Reichlab, half of the one- and two-week-ahead forecasts are worse than the baseline forecast, and 40% are worse at three weeks ahead. Let’s take a look at Fig. 8 again, this time looking at the difference between the median ASE of the published models and the baseline, i.e. 1 - (baseline_mase / model_mase).
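That comparison metric is simple to compute; a positive value means the model’s error is worse (larger) than the baseline’s, a negative value means it is better:

```python
def vs_baseline(model_mase, baseline_mase):
    """1 - (baseline_mase / model_mase): positive when the model is
    worse than the unskilled baseline, negative when it is better."""
    return 1 - (baseline_mase / model_mase)

# a model with twice the baseline's error scores +0.5;
# one with half the baseline's error scores -1.0
worse = vs_baseline(model_mase=2.0, baseline_mase=1.0)
better = vs_baseline(model_mase=0.5, baseline_mase=1.0)
```
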
Not only are some amateur-submitted models worse than the baseline; models from respected institutions (JHU, CMU) are on that list too. The models that are performing better, such as CU’s and MIT’s, still aren’t producing terribly impressive results: again, around ±200–300%.
Imagine you are a hospital administrator or a state official, and you use one of these public models to help anticipate healthcare needs for the populations you are responsible for. You see a name like MIT or UCLA or UMass, and you will likely assume that these models are imperfect, surely, but helpful tools for determining where your caseload is heading. In reality, these models could mislead you badly if taken at face value.
Taking a step back, it’s worth mentioning that most of these models, which are SIR⁸ models, do have a use: not in their predictive power, but in their capacity to use current knowledge of disease characteristics to estimate the impact of various scenarios⁹. Their usefulness lies in mixing our best knowledge with the estimated impacts of various interventions so that we can react accordingly. This is likely the first and most constant thought that anyone with a background in biostatistics has been yelling at their screen throughout this post, assuming they haven’t left in disgust. The issue at hand is that users of these and other models don’t usually understand this deeply and won’t necessarily be keyed in to the difference between an SIR model and an ‘AI-driven deep learning approach’; they may even be tempted to prefer the latter.
But critically, it isn’t sufficient to wave our hands at misunderstandings of our models. More than half of these models are notably worse than an unskilled forecast, meaning it would be better for planning purposes to not look at any model at all. That is a concern I have rarely heard voiced, and it certainly doesn’t match the fervor and velocity with which these models are being produced. This isn’t to say we shouldn’t be exploring models of the ongoing pandemic, nor that we should cease publishing our results. Rather: if you want to try your hand at machine learning for healthcare, maybe consider starting with MIMIC data¹⁰ rather than putting up another model of dubious usefulness. And if you aren’t a beginner and your model is performing notably worse than an unskilled forecast, perhaps your model needs some kind of disclaimer. I’m personally a fan of the approach taken by the Duke Institute for Health Innovation (DIHI)¹¹ and in particular their model use labels.
I doubt anyone working in this space has malicious intent, but when we position our institutions’ work, or our own, as predicting the future, it’s vitally important that we actually do so with some meaningful degree of accuracy. People are rightfully scared and are looking for comfort and confidence in the models being published. Whether or not they are right to do so (they are not), they are part of the audience for our work. It therefore behooves us not to produce models lightly, not to publish them carelessly, and to give them adequate context when it is needed.