
ENSO forecast mash-ups: What’s the best way to combine human expertise with models?

As meteorologists and climate scientists, we talk about, think about, and commiserate about forecasts a lot. One enhancement that NOAA’s ENSO forecasting team has been working toward is predicting the strength of El Niño or La Niña. And judging by the comments left on social media or under our articles, that’s something you want from us too.

It was with that in mind that the ENSO blog’s own Michelle L’Heureux and colleagues set up an experiment for the ENSO forecasting team. She documents the results in a recent journal article, on which I am seventh author (1). The experiment demonstrated that skillful strength forecasts are feasible. It also revealed how we human forecasters can be too conservative, and how wrong certain computer models can be at times. Luckily, the experiment identifies a new, more skillful way to mash up model intel with human expertise and intuition. How nice is that?!

Image of sea surface temperature in May 2019 compared to the 1981–2010 average.

Sea surface temperature departures from the 1981–2010 average for May 2019 across the Pacific Ocean. The Niño3.4 region, which scientists monitor for ENSO conditions, is also labeled. During May 2019, El Niño conditions remained present across the equatorial central/eastern Pacific Ocean. Climate.gov image using data from NCEI.

You don’t forecast ENSO strength?

Not exactly. Each month, NOAA’s ENSO team, organized by the Climate Prediction Center, forecasts the probability that the sea surface temperature (SST) in the heart of El Niño territory (the Niño3.4 region) will fall into one of three categories: more than 0.5°C below average (La Niña), more than 0.5°C above average (El Niño), or in between (neutral). This is done for nine consecutive, overlapping three-month seasons: for example, July–September, then August–October, and so on. The final ENSO outlook averages all the forecasters’ probabilities.

However, we do not provide probabilities for the potential strength of an El Niño or La Niña event. Forecasting probabilities for three categories over nine seasons (27 probability estimates in all) is tough enough. Imagine if we had to do it for nine categories (say, the odds of a very strong, strong, moderate, or weak El Niño, plus the same four for La Niña, plus neutral). For nine upcoming seasons, that’d be 81 probability estimates. Triple the homework? Forecasters would mutiny! I’d cry.

Image of model forecasts for the Niño3.4 index.

Climate model forecasts for the Niño3.4 Index. Dynamical model data (purple line) from the North American Multi-Model Ensemble (NMME): darker purple envelope shows the range of 68% of all model forecasts; lighter purple shows the range of 95% of all model forecasts. Statistical model data (dashed line) from CPC’s Consolidated SST Forecasts. NOAA Climate.gov image from CPC data.

Instead, Michelle asked all of the ENSO forecasters to make a single addition to their forecasts: predict the monthly SST anomaly in the Niño3.4 region. Instead of having the forecasters come up with uncertainty estimates, she derived them from statistical information provided by observations and models (2): how much a given month’s SST has historically varied, and how well model forecasts have historically performed for a given time of year.  

The beauty of this method is that it allows us to forecast the probability of SST anomalies being greater or less than any number, not just the 0.5°C and -0.5°C thresholds used for determining El Niño and La Niña. This means we can issue probabilities for any strength category of La Niña or El Niño we like; for example, an El Niño strength between 1.0°C and 1.5°C.
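To make that concrete, here is a minimal sketch in Python of how probabilities for any strength range fall out of a bell curve centered on the forecasters’ consensus value. The numbers below (the `consensus` and `spread` values) are made up purely for illustration, not taken from any actual outlook:

```python
# Illustrative only: made-up numbers, not an operational CPC forecast.
from scipy.stats import norm

consensus = 0.8  # hypothetical forecaster-average Nino3.4 anomaly (deg C)
spread = 0.5     # hypothetical uncertainty (standard deviation, deg C), derived in
                 # practice from observed variability and historical forecast skill

# Chance the anomaly exceeds the +0.5 deg C El Nino threshold
p_el_nino = 1 - norm.cdf(0.5, loc=consensus, scale=spread)

# Chance of an El Nino strength between 1.0 and 1.5 deg C
p_moderate = norm.cdf(1.5, loc=consensus, scale=spread) - norm.cdf(1.0, loc=consensus, scale=spread)

print(f"P(El Nino)                  = {p_el_nino:.0%}")
print(f"P(1.0 to 1.5 deg C El Nino) = {p_moderate:.0%}")
```

The same two lines of arithmetic work for any threshold you care to pick, which is why the bell-curve approach makes extra strength categories essentially free.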

This also means that humans wouldn’t have to forecast multiple probabilities. Instead, we’d have to forecast only one number for the upcoming seasons. That makes life easier. As it turns out, it also makes the forecasts better.

Risk and reward

Coming up with a method for forecasting ENSO strength, though, would be pointless if it didn’t outperform some of the climate models we forecasters look at for guidance (3). After all, our machine overlords (er, climate models) already give us strength forecasts. We could always just use those. And it would be worse than pointless if the new method were less skillful than the current approach. So, how did the experimental method perform? Here’s where things get interesting.

Image of mean (top) and median (below) skill scores out to seven seasons for three methods of predicting ENSO.

The Ranked Probability Skill Scores (RPSS) for three methods of predicting El Niño, La Niña, and neutral ENSO conditions: model output, a new human/model hybrid approach, and the current Climate Prediction Center outlook. The RPSS is displayed as both the mean (top) and median (bottom) score. The verification period was June 2015 to September 2018. An RPSS of one is perfection, while scores less than zero indicate the forecast was less skillful than simply predicting the climatological seasonal averages for El Niño, La Niña, and neutral conditions. The RPSS is shown as a function of lead time (x-axis), out to seven seasons. In both the mean and median RPSS, it is clear that the new hybrid human/model approach performs better than the other methods. Meanwhile, the model output performs better by median RPSS than by mean RPSS. Climate.gov image adapted from L'Heureux et al. 2019.

Comparing the average skill (4) of the three methods (human synthesis, model output, and the new hybrid (5)) for forecasts made from June 2015 to September 2018 shows that the new method, which combines human forecasters’ SST predictions with model-derived uncertainty information, performed best at every lead out to seven seasons.

The analysis also hinted at a quirk of human nature: on average, human forecasters were too conservative, assigning lower probabilities to a category even when it was increasingly clear that outcome was likely. This pattern may be consistent with psychological research suggesting that people are more troubled by the possibility of a loss (a BIG forecast bust) than by missing out on a comparable gain (getting the forecast right).

Meanwhile, the model output was the worst in terms of mean skill score. Interestingly, though, if you instead look at the median (the exact middle) score, the model output outscores the original human synthesis. It’s like the models have a “go big or go home” mentality and don’t second-guess themselves. When they miss, they miss BIG. But when they nail it, they really NAIL IT. Human forecasters lowered their median skill score by not going all in, even in situations that called for high confidence. Either way, the new human-model hybrid approach had the best scores.

Image of mean (top) and median (below) skill scores for two methods of predicting the strength of El Niño and La Niña.

The Ranked Probability Skill Scores (RPSS) for two methods of predicting the strength of El Niño and La Niña: model output and a new human/model hybrid approach. The RPSS is displayed as both the mean (top) and median (bottom) score. The verification period was June 2015 to September 2018, and the strength categories used thresholds of >2°C, 1.5°C, 1°C, 0.5°C, -0.5°C, -1°C, -1.5°C, and <-2°C. An RPSS of one is perfection, while scores less than zero indicate the forecast was less skillful than simply predicting the climatological seasonal averages for each strength category. The RPSS is shown as a function of lead time (x-axis), out to seven seasons. In both the mean and median RPSS, it is clear that the new hybrid human/model approach performs better than simply using the model output to predict ENSO strength. Climate.gov image adapted from L'Heureux et al. 2019.

This holds true for the strength forecasts as well: the new hybrid method not only outperformed the model output (6), but it also had skill similar to the current strategy that only forecasts the three categories (El Niño, La Niña, or neutral) and does not say anything about strength. These results indicate that the addition of more strength categories won’t degrade existing skill.

What does that mean?

It is encouraging that this new approach already shows promise. The forecasting team will continue to test it out over a longer time period as part of CPC's plans to integrate strength outlooks into ENSO forecasts. Once the technique is operational, humans will need to forecast only a single number—the SST anomaly—per season, and models and observation statistics will handle the rest.  

Ahhhhh…human and computer peaceful coexistence. A win for all.  

Footnotes

(1) After the first three authors, the rest of us were listed alphabetically. I very easily could have been fourth or thirteenth. So I’ll happily take seventh. Fellow ENSO bloggers Emily and Nat also co-authored the paper.

(2) She took the average of the SST anomaly forecasts from all forecasters and applied a normal (bell-shaped, or Gaussian) statistical distribution, which is symmetrical on either side of, in this case, the average SST anomaly forecast. Think of the purpose of the distribution like this: the human average is the scientists’ best guess at what is most likely to occur, but that guess is nowhere near 100% certain. The distribution spells out the chances that other SST anomalies could occur instead. The shape of the bell (is it a narrow bell? a wide bell?) is determined by the SST anomaly observations for the month and the typical forecast skill we can expect for that month; there is more skill in forecasting one month out than eight months into the future, for instance. For a description of how statistical distributions are applied, you can head here to see how they apply to one-in-a-thousand-year rain events. And for a more in-depth discussion of how exactly these probabilities were determined, I recommend reading the paper for those details.
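As a rough illustration of that last point (the paper spells out the exact procedure), one common way to set the width of the bell is to start from the month’s historical variability and shrink it according to how well forecasts at that lead time have matched observations in the past. The numbers and the specific shrinkage formula below are assumptions for illustration, not necessarily what the paper did:

```python
# Rough illustration (hypothetical numbers; not necessarily the paper's exact formula).
import math

sigma_clim = 1.0  # hypothetical historical standard deviation of the month's Nino3.4 anomaly (deg C)

# Hypothetical correlation skill of past forecasts at three lead times
for lead_seasons, r in [(1, 0.95), (4, 0.85), (8, 0.60)]:
    sigma_fcst = sigma_clim * math.sqrt(1 - r**2)  # higher skill -> narrower bell
    print(f"lead {lead_seasons} season(s): correlation {r:.2f} -> bell width ~{sigma_fcst:.2f} deg C")
```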

(3) The climate models used here are the North American Multi-Model Ensemble, or NMME. This ensemble of climate models predicts a whole host of atmospheric and oceanic variables, allowing forecasters to get a sense of what the future holds across the ocean-atmosphere interface.

(4) The Ranked Probability Skill Score (RPSS) is based on the Ranked Probability Score, which sums the squared errors of the cumulative forecast probabilities. This means the score evaluates the entire probability distribution. So, for example, if the observation qualified as El Niño, a forecaster would be penalized more for putting probability in La Niña than in neutral, but would be penalized for any probability placed outside of El Niño. The paper also uses an additional method (LSS) to evaluate the forecasts, with similar results.
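For readers who want to see the arithmetic, here is a small sketch of the score for a single forecast, with made-up probabilities; the operational verification averages (or takes the median of) such scores over the whole June 2015 to September 2018 period:

```python
# Sketch of the Ranked Probability (Skill) Score with made-up numbers.
# Categories are ordered: [La Nina, Neutral, El Nino].
import numpy as np

def rps(forecast_probs, observed_category):
    """Sum of squared differences between cumulative forecast and cumulative observed probabilities."""
    obs = np.zeros(len(forecast_probs))
    obs[observed_category] = 1.0
    return np.sum((np.cumsum(forecast_probs) - np.cumsum(obs)) ** 2)

observed = 2                             # suppose El Nino actually occurred
forecast = np.array([0.10, 0.30, 0.60])  # hypothetical forecast probabilities
climatology = np.array([1/3, 1/3, 1/3])  # reference: equal-chances forecast

# Skill score: 1 is perfect, 0 matches climatology, below 0 is worse than climatology
rpss = 1 - rps(forecast, observed) / rps(climatology, observed)
print(f"RPS = {rps(forecast, observed):.3f}, RPSS = {rpss:.2f}")
```

Note how shifting probability from neutral into La Niña in this example would raise the RPS (and lower the RPSS) more than the same probability left in neutral, which is the "penalized more for being further away" behavior described above.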

(5) To be clear, “human synthesis” refers to the current subjective method of doing business, in which humans make probability estimates for ENSO status after considering output from many different model collections. “Model output” refers to objective numerical output from the North American Multi-Model Ensemble (NMME). (The human synthesis considers many models beyond the NMME, but we had to pick something to compare against.) The “new hybrid” is the strategy of having humans forecast only the SST anomaly and using model and observation statistics to generate the probabilities.

(6) For this part of the analysis, the new hybrid method is only compared with the model output because the current human synthesis method does not include probabilistic strength forecasts.

