**by Philippe Marchand**

In 2016, Joshua Cinner and colleagues published a highly discussed paper in *Nature*, “Bright spots among the world’s coral reefs.” [1] In that study, the authors looked at a global dataset of coral reefs (2500 sites) and modelled fish biomass as a function of 18 environmental and socioeconomic covariates. They identified bright and dark spots: individual sites with the largest positive or negative differences between the observed biomass and the value predicted by the fitted model. This month, a paper in press by Barbara Frei and colleagues in the *Journal of Applied Ecology* adopts a similar approach to identify agricultural landscapes that overperformed or underperformed expectations in terms of their biodiversity and multifunctionality. [2] Both studies look at management and site-specific characteristics shared by bright spots and dark spots, with the goal of identifying practices that support thriving ecosystems.

Broadly, the bright spots approach aims to assess the performance of managed ecosystems by comparing certain ecological outcomes, while controlling for other known drivers of the outcome via a statistical model; in effect, ranking sites based on their residuals from the fitted model. While the method has the potential to reduce *bias* in comparing different sites, the resulting assessment may come with high *variance*. Since statistical models work to capture the systematic variation explained by a range of covariates among samples, model residuals are always “enriched” in noise. Some portion of the residuals is due to true inter-site differences linked to unmeasured variables – the signal that the bright spots approach aims to detect – but they also contain measurement error, intra-site sampling variance, and model specification error (the relationship between the included covariates and outcome may not be captured perfectly by the functional form of the model).

In her book *Weapons of Math Destruction*, Cathy O’Neil discusses the pitfalls of value-added models (VAMs) in teacher evaluation, which offer a cautionary tale for any type of residuals-based assessment. O’Neil gives the example of Tim Clifford, a New York City middle school teacher who was ranked in the 6th percentile by the model one year and in the 96th percentile the following year. She then explains why these particular VAM scores were basically random noise [3]:

*The problem was that the administrators lost track of accuracy in their quest to be fair. They understood that it wasn’t right for teachers in rich schools to get too much credit when the sons and daughters of doctors and lawyers marched off towards elite universities. […]*

*So instead of measuring teachers on an absolute scale, they tried to adjust for social inequalities in the model. Instead of comparing Tim Clifford’s students to others in different neighborhoods, they would compare them with forecast models of themselves. The students each had a predicted score. If they surpassed this prediction, the teacher got the credit. If they came up short, the teacher got the blame. […]*

*Statistically speaking, in these attempts to free the tests from class and color, the administrators moved from a primary to a secondary model. Instead of basing scores on direct measurement of the students, they based them on the so-called error term – the gap between results and expectations. Mathematically, this is a much sketchier proposition. Since the expectations themselves are derived from statistics, these amount to guesses on top of guesses.* (p. 137)

More evidence of the lack of reproducibility in VAMs can be found on O’Neil’s blog. [4]

I do not intend to suggest that the “bright spots” studies should receive the same harsh criticism as the VAMs. The paper by Frei *et al.* clearly discusses some of the challenges of applying this method, and unlike VAMs, the two ecological studies are not aimed at rewarding or punishing sites based on their score. (That said, the “bright” and “dark” spot terminology may come across as a strong normative assessment of the sites.) Rather, I am hoping to start a discussion on ways to quantify and reduce uncertainty in that type of approach.

The high variance in VAM scores could be detected because schools had replicate measures (multiple classes in different grades, subjects or years) for the same assessed unit (i.e. the teacher). With enough replication, a less noisy way to evaluate the teachers would have been to fit a multi-level model (or mixed model) that separates the variance between teachers from the residual variance due to random differences in performance between classes assigned to the same teacher. (That method might still fail, for example because the replicates are not fully independent here, but it has a better chance of detecting real differences.)
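To make the idea of separating the two variance sources concrete, here is a minimal sketch on simulated data. It does not fit a full mixed model: it swaps in the simpler method-of-moments estimators for a balanced one-way random-effects design. The function names, sample sizes and standard deviations are all hypothetical choices made for illustration.

```python
import random
import statistics


def variance_components(groups):
    """Method-of-moments variance estimates for a balanced one-way design.

    groups: one list of replicate scores per assessed unit (e.g. teacher or
    site); every inner list must have the same length k >= 2.
    Returns (between_unit_variance, within_unit_variance).
    """
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need a balanced design with at least 2 replicates")
    unit_means = [statistics.fmean(g) for g in groups]
    # Within-unit variance: average of the per-unit sample variances.
    within = statistics.fmean(statistics.variance(g) for g in groups)
    # The variance of unit means estimates (between + within / k),
    # so subtract the noise share and truncate at zero.
    between = max(0.0, statistics.variance(unit_means) - within / k)
    return between, within


def simulate(n_units, k, sd_between, sd_within, seed):
    """Simulated scores: a true unit effect plus replicate-level noise."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_units):
        effect = rng.gauss(0.0, sd_between)
        groups.append([effect + rng.gauss(0.0, sd_within) for _ in range(k)])
    return groups


# Hypothetical scenario: real differences exist (SD = 1) but are small
# compared with the replicate-to-replicate noise (SD = 3).
groups = simulate(n_units=200, k=5, sd_between=1.0, sd_within=3.0, seed=1)
between, within = variance_components(groups)
share = between / (between + within)
print(f"between-unit variance: {between:.2f}")
print(f"within-unit variance:  {within:.2f}")
print(f"share of variance due to real unit differences: {share:.2f}")
```

When the share attributed to real unit differences is small, a ranking based on any single replicate is mostly a ranking of noise, which is the situation O’Neil describes.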

For large-scale ecological studies, especially those synthesizing different data sources, we may not have the luxury of replicated data allowing for separation of inter-site and intra-site (residual) variance. It is still possible to obtain some measure of the uncertainty in bright spot scoring, such as through bootstrapping: performing replicate fits of the model with bootstrap resamples of the original dataset, and extracting the residuals from each fit to construct a confidence interval for the bright spot score (number of standard deviations above/below prediction) at each site.
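One possible reading of this bootstrap procedure is sketched below, using simulated data and a single-covariate linear model so that it runs with the Python standard library alone. The data, function names, number of resamples and interval level are assumptions made for the example, not the setup of either study.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def scores(xs, ys, intercept, slope):
    """Residual of every site, in units of residual standard deviations."""
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = statistics.stdev(resid)
    return [r / sd for r in resid]


def bootstrap_score_intervals(xs, ys, n_boot=300, level=0.95, seed=0):
    """Percentile interval for the score of each original site."""
    rng = random.Random(seed)
    n = len(xs)
    draws = [[] for _ in range(n)]
    for _ in range(n_boot):
        # Resample sites with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        # Score every original site against the refitted model.
        for i, s in enumerate(scores(xs, ys, a, b)):
            draws[i].append(s)
    lo_k = int((1 - level) / 2 * n_boot)
    hi_k = n_boot - 1 - lo_k
    intervals = []
    for d in draws:
        d.sort()
        intervals.append((d[lo_k], d[hi_k]))
    return intervals


# Simulated sites: one covariate and a noisy outcome (hypothetical values).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

point = scores(xs, ys, *fit_line(xs, ys))
intervals = bootstrap_score_intervals(xs, ys)
for i in sorted(range(len(xs)), key=lambda i: point[i], reverse=True)[:3]:
    lo, hi = intervals[i]
    print(f"site {i}: score {point[i]:.2f}, interval [{lo:.2f}, {hi:.2f}]")
```

Note that resampling sites in this way only reflects uncertainty in the fitted model. It cannot separate measurement error or intra-site sampling variance from true site effects, which still requires replicated data.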

To evaluate the importance of certain management practices on ecological outcomes, while controlling for other socio-ecological variables, the most straightforward option remains to include all predictors of interest in a unified model, using a hierarchical/multi-level model if necessary. This assumes that the same covariates are available at all sites, or almost all sites if some data imputation is feasible.

In the paper by Cinner *et al.*, getting data on management practices required additional surveys, hence the need to select a subset of the 2500 original sites. In that scenario, fitting a first model with the covariates available everywhere, then fitting a secondary model to the residuals at a subset of sites is a reasonable strategy. However, instead of selecting only the outliers of the primary model, especially if the true outlier status is highly uncertain, an alternative would be to select a representative subset of the original sites (in terms of the distribution of relevant covariates, the spatial extent, the distribution of residuals, etc.).

[1] Cinner, J.E. et al. (2016) Bright spots among the world’s coral reefs. Nature 535, 416-419. https://doi.org/10.1038/nature18607

[2] Frei, B. et al. (2018) Bright spots in agricultural landscapes: Identifying areas exceeding expectations for multifunctionality and biodiversity. In Press. Journal of Applied Ecology. https://doi.org/10.1111/1365-2664.13191

[3] O’Neil, C. (2016) Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown, New York.

[4] https://mathbabe.org/2012/03/06/the-value-added-teacher-model-sucks/

*Author biography:* Philippe Marchand is an assistant professor in ecology and biostatistics at the Forest Research Institute of the Université du Québec en Abitibi-Témiscamingue (UQAT). He occasionally tweets at @philmrchnd.

One potential refinement of this approach is not to use residuals at all, but instead to employ quantile regression for the statistical modeling. Rather than relying on a parametric error distribution and treating residuals as estimates of the errors, quantile regression estimates multiple quantiles across the interval [0, 1], providing a distribution-free approach to modeling the cumulative distribution function (cdf) of the outcome variable (Y) with whatever statistical predictors one chooses. And yes, the conventional linear quantile regression approach is easily extended to GAM-like models to handle nonlinear responses. Then, instead of looking at residuals and their magnitude, we would look at where observed values fall in the cdf defined by the quantile regression. For example, observations that lie above, say, the 0.75 quantile (>= 75th percentile) might be classified as “bright” spots in Frei et al. (2018) terms, those below the 0.25 quantile (<= 25th percentile) as “dark” spots, and those between the 0.25 and 0.75 quantiles (central 50%) as average. Of course, the choice of quantile levels would need to be considered in each application. The quantile regression approach would handle heterogeneous variances and skewed outcomes much more easily than typical parametric regression approaches.

Brian S. Cade
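As a rough illustration of the classification suggested in the comment above, the sketch below fits the 0.25 and 0.75 quantile lines to simulated data with one covariate and labels each site accordingly. To stay within the Python standard library it uses a brute-force fit (an optimal linear quantile fit with one covariate can always be found among the lines through two data points), which is only practical for small datasets. The data, function names and quantile levels are hypothetical.

```python
import random
from itertools import combinations


def pinball_loss(xs, ys, intercept, slope, q):
    """Quantile-regression (check) loss of a line at quantile level q."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = y - (intercept + slope * x)
        total += q * r if r >= 0 else (q - 1) * r
    return total


def fit_quantile_line(xs, ys, q):
    """Exact single-covariate quantile fit by searching all two-point lines."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid candidate
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = pinball_loss(xs, ys, intercept, slope, q)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def classify(xs, ys, lower_q=0.25, upper_q=0.75):
    """Label each site by its position relative to the two quantile lines."""
    lo_a, lo_b = fit_quantile_line(xs, ys, lower_q)
    hi_a, hi_b = fit_quantile_line(xs, ys, upper_q)
    labels = []
    for x, y in zip(xs, ys):
        if y > hi_a + hi_b * x:
            labels.append("bright")
        elif y < lo_a + lo_b * x:
            labels.append("dark")
        else:
            labels.append("average")
    return labels


# Simulated sites whose noise grows with the covariate (heterogeneous
# variance), a case where a single residual SD would be misleading.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(80)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 0.2 + 0.3 * x) for x in xs]

labels = classify(xs, ys)
for name in ("bright", "average", "dark"):
    print(name, labels.count(name))
```

Because the two quantile lines are fitted separately, their spread can widen with the covariate, so a site in a noisy part of the covariate range needs a larger deviation to be labelled bright or dark.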