Environ Ecol Stat (2011) 18:709?733
DOI 10.1007/s10651-010-0158-4
Geoadditive regression modeling of stream biological
condition
Matthias Schmid ? Torsten Hothorn ?
Kelly O. Maloney ? Donald E. Weller ?
Sergej Potapov
Received: 19 December 2009 / Revised: 12 July 2010 / Published online: 16 November 2010
? Springer Science+Business Media, LLC 2010
Abstract Indices of biotic integrity have become an established tool to quantify the
condition of small non-tidal streams and their watersheds. To investigate the effects
of watershed characteristics on stream biological condition, we present a new tech-
nique for regressing IBIs on watershed-specific explanatory variables. Since IBIs are
typically evaluated on an ordinal scale, our method is based on the proportional odds
model for ordinal outcomes. To avoid overfitting, we do not use classical maximum
likelihood estimation but a component-wise functional gradient boosting approach.
Because component-wise gradient boosting has an intrinsic mechanism for variable
selection and model choice, determinants of biotic integrity can be identified. In addi-
tion, the method offers a relatively simple way to account for spatial correlation in
ecological data. An analysis of the Maryland Biological Streams Survey shows that
nonlinear effects of predictor variables on stream condition can be quantified while,
in addition, accurate predictions of biological condition at unsurveyed locations are
obtained.
Electronic supplementary material The online version of this article
(doi:10.1007/s10651-010-0158-4) contains supplementary material, which is available to authorized users.
M. Schmid (
B
) ? S. Potapov
Institut f?r Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universit?t
Erlangen-N?rnberg, Waldstra?e 6, 91054 Erlangen, Germany
e-mail: matthias.schmid@imbe.med.uni-erlangen.de
T. Hothorn
Institut f?r Statistik, Ludwig-Maximilians-Universit?t M?nchen, Ludwigstra?e 33, 80539 Munich,
Germany
K. O. Maloney ? D. E. Weller
Smithsonian Environmental Research Center, 647 Contees Wharf Rd., P.O. Box 28, Edgewater,
MD 21037-0028, USA
123
710 Environ Ecol Stat (2011) 18:709?733
Keywords Proportional odds model ? Gradient boosting ? Geoadditive regression ?
Stream biological condition ? Maryland Biological Streams Survey
1 Introduction
In view of the growing impact of humans on their natural environment, conserving and
managing small streams and their watersheds have become important. Policy makers
and land managers must assess the ecological effects of their decisions on streams, but
also have to investigate the impacts of stream degradation on human health and the
quality of life. For these reasons, a detailed understanding of the relationship between
anthropogenic stressors and stream ecosystems is essential (Cushing and Allan 2001;
USEPA 2006; Maloney et al. 2009). To develop that understanding, ecologists and
statisticians need to quantify how watershed characteristics affect stream biological
condition. Because small streams are numerous, assessing the biological condition of
all streams in a landscape would be logistically impractical and cost prohibitive. It
is therefore necessary to develop predictive models for site-specific stream condition
using data from a limited number of sample sites. Ideally, those models would both
quantify the effects of anthropogenic stressors on stream condition and accurately
predict biological condition at unsurveyed locations (USEPA 2006). In recent years,
a wide range of statistical tools to characterize and to model the condition of small
streams have been developed (see, e.g., Barker et al. 2006; Collier 2009; Maloney
et al. 2009, or Cooper 2009 for recent studies in this field).
The responses of streams to anthropogenic stress are often examined using biologi-
cal metrics that describe biological conditions from structural and functional measures
of the biological community (Karr 1991; Barbour et al. 1999). However, singlemetrics
only measure a single feature of the community (e.g., number of taxa or diversity) and
may not capture the effects of multiple stressors. Therefore, stream assessments usu-
ally compile several singlemetrics that are selected a priori to relate stream impairment
to anthropogenic stress and then combine those metrics into a single multimetric index
of biotic integrity (?IBI?, Karr et al. 1986; Schleiger 2000; Southerland et al. 2005).
These IBI indicators can then be statistically related to watershed-specific predictor
variables using modeling approaches such as linear or ordinal regression, principal
component analysis, or tree-based methods such as CART and random forests.
In this paper, we address the problem of developing predictive models for indices
of biotic integrity for fish (FIBI) and benthic macroinvertebrates (BIBI). Both indices
have become widely established tools for characterizing stream biological condition
(Karr et al. 1986; Barbour et al. 1999; Southerland et al. 2005). When modeling FIBI
and BIBI indicators, the following key issues need to be addressed:
1. IBI indicators are typically evaluated on an ordinal scale (e.g., using categories
ranging from ?poor condition? to ?very good condition?). Although it is possi-
ble to use linear regression methods to model ordinal indicators, ordinal regres-
sion models such as the proportional odds model are a more appropriate choice
(McCullagh 1980; Agresti 2002; Bigler et al. 2005). However, if maximum likeli-
hood estimation is used to fit the model, and a large number of predictor variables
are considered, proportional odds models tend to overfit the data. Usually, this
123
Environ Ecol Stat (2011) 18:709?733 711
leads to a decrease in prediction accuracy. On the other hand, heuristic strategies
to control the number of predictors are often biased and imprecise (Rawlings et al.
1998).
2. Stream condition is affected by many factors that are often highly correlated.
Moreover, spatial correlation is usually evident in ecological data (Peterson and
Urquhart 2006; Gelfand 2007). A statistical model must be able to identify the
most important factors and to account for spatial correlation in the data. In addi-
tion, prediction models should be able to represent nonlinear relationships that
often exist between predictors and indicators of stream biological condition.
3. It is well-known that maximizing prediction accuracy does not necessarily go
hand in hand with finding a statistical model that is easy to interpret. Common
examples are statistical learning techniques such as bagging or random forests
(Breiman 2001; Cutler et al. 2007), which yield ?black-box? predictions that are
typically accurate but lack interpretability. This is not desirable in situations where
effects of predictor variables need to be quantified.
The aim of this paper is to develop a statistical method for modeling IBI indicators
that simultaneously addresses all issues outlined above. Following Agresti (2002), we
use the proportional odds model framework to accomodate the ordinal structure of
IBI indicators (issue 1). To avoid overfitting the data, however, we do not use classi-
cal maximum likelihood estimation to obtain model estimates but a component-wise
gradient boosting approach (for an overview of boosting methods, see B?hlmann and
Hothorn 2007). Because component-wise gradient boosting has a built-in mechanism
for variable selection and shrinkage of estimates, the method can be used to obtain
regularized fits of many types of statistical models. Consequently, heuristic techniques
for variable selection and model choice are not needed.
In recent years, various authors have shown that gradient boosting can be modified
such that prediction accuracy is optimized while, in addition, a meaningful interpreta-
tion of the model estimates is possible (Friedman et al. 2000; B?hlmann and Yu 2003;
B?hlmann and Hothorn 2007; Kneib et al. 2009). Regarding issue 3, this is exactly
what one wants to achieve: The structure and the interpretability of the proportional
odds model is preserved while, in contrast to maximum likelihood estimation, predic-
tion accuracy is maximized by fitting the model in a regularized way. Most notably,
by using penalized regression splines to model effects of covariates, nonlinear rela-
tionships and spatial information can be easily incorporated into the prediction model.
This is important with respect to issue 2, cf. Kneib et al. (2008, 2009).
While component-wise gradient boosting has become an established tool to fit con-
tinuous and binary data, it has not been possible so far to use boosting methods for
fitting proportional odds models. The reason for this is that the boosting algorithms
considered by B?hlmann and Hothorn (2007) do not allow for the estimation of scale
parameters. The proportional odds model, however, involves the constrained estima-
tion of an ordered set of threshold parameters that have to be estimated simultaneously
with the othermodel parameters. To take this problem into account, we construct a new
boosting algorithm that combines the methods considered by B?hlmann and Hothorn
(2007) with an estimation approach suggested by Schmid et al. (2010). With the latter
approach, it is possible to adapt boosting algorithms to model families depending on
123
712 Environ Ecol Stat (2011) 18:709?733
a set of scale parameters. As will be shown, the method by Schmid et al. (2010) can
be re-formulated such that fitting a proportional odds model via boosting techniques
is feasible.
We will analyze IBI data from the Maryland Biological Streams Survey (MBSS)
to demonstrate that the new algorithm is an efficient modeling tool for the biological
assessment of small streams and their watersheds. Boosting predictions of FIBI and
BIBI indicators are similar to predictions obtained from other established statistical
techniques (see, e.g., Maloney et al. 2009), but spatial covariate patterns are detected
and model estimates can be interpreted in a more meaningful way. This is possible
because the structure of the proportional odds model allows for inspection and visu-
alization of marginal predictor effects. As a consequence, the model can be used both
for extrapolating estimates of stream biological condition to unsurveyed sites and
exploring the determinants of biotic integrity.
The rest of the paper is organized as follows: In Sect. 2, the new algorithm is
presented, along with a number of technical details involved in choosing appropriate
tuning parameters. The characteristics of the algorithm are demonstrated in Sect. 3.
Here, the new method is benchmarked against other regression techniques, and an
analysis of the MBSS data is carried out. A summary and discussion of the main
findings of the paper is given in Sect. 4. Additional figures and technical details are
presented in the Appendix and the Web Appendix of the paper.
2 Methods
Proportional odds model
Let Y be an IBI outcome with K ordered categories and denote the vector of predictor
variables by X = (X1, . . . , Xp). Let (X1, Y1), . . . , (Xn, Yn) be a set of independent
realizations of (X, Y ). Define X := (X1, . . . , Xn) and Y := (Y1, . . . , Yn).
The proportional odds model is given by
P(Y ? k) =
1
1 + exp( f (X) ? ?k) , k = 1, . . . , K , (1)
where f = f (X) is a prediction function depending on the predictor variables and
? ? < ?1 < ? ? ? < ?K?1 < ?K = ? (2)
is a set of threshold values that has to be estimated simultaneously with f . In many
applications, f is restricted to being a linear function of the covariates (see Agresti
2002). To take nonlinear predictor effects into account, we will use a more flexible
approach: f will be modeled as the sum of (possibly nonlinear) marginal prediction
functions f1(X1), . . . , f p(Xp), i.e.,
f (X) ? f (X1, . . . , Xp) =
p
?
j=1
f j (X j ) (3)
123
Environ Ecol Stat (2011) 18:709?733 713
(cf. Kneib et al. 2009).With this approach, themodel has essentially the same structure
as a generalized additive regression model (Hastie and Tibshirani 1990).
From (1) we obtain
P(Y = 1|X) =
1
1 + exp( f ? ?1) ,
P(Y = 2|X) =
1
1 + exp( f ? ?2) ?
1
1 + exp( f ? ?1) ,
.
.
.
P(Y = K ? 1|X) =
1
1 + exp( f ? ?K?1) ?
1
1 + exp( f ? ?K?2) ,
P(Y = K |X) = 1 ?
1
1 + exp( f ? ?K?1) , (4)
which allows for specifying the log-likelihood of the proportional odds model (see
Appendix A). By definition, the probability of observing a large outcome category
increases with the magnitude of the estimates of f j , j = 1, . . . , p. For two sites with
covariate vectors X1 and X2, (4) implies that the log ratio of cumulative odds does not
depend on the category k under consideration:
log
(
P(Y ? k|X1)/P(Y > k|X1)
P(Y ? k|X2)/P(Y > k|X2)
)
= f (X2) ? f (X1). (5)
Equation (5) is the well-known ?proportional odds assumption? which leads to esti-
mates that are interpretable in terms of cumulative odds ratios.
Component-wise gradient boosting
As stated in the introduction, overfitting of the data can be avoided if component-wise
boosting is used to fit the proportional odds model. In the following, we will adapt the
component-wise gradient boosting algorithm considered by B?hlmann and Hothorn
(2007) to the proportional odds model specified above.
In the boosting framework, the aim is to estimate the ?optimal? prediction func-
tion f ? and the ?optimal? set of threshold values ?? := (??1 , . . . , ??K?1) defined by
( f ?, ??) := argmin
f,?
E
Y ,X
[?(Y , f (X), ?)] , (6)
where the loss function ? is assumed to be differentiable with respect to f . In case
of the proportional odds model, it is a natural choice to set ? equal to the negative
log-likelihood derived from (4). The full log-likelihood function and its derivative are
given in Appendix A.
Instead of minimizing the theoretical mean given in (6), we consider the empir-
ical risk R :=
?n
i=1 ?(Yi , f (Xi ), ?) and use the following boosting algorithm to
minimize the R over f and ? :
123
714 Environ Ecol Stat (2011) 18:709?733
1. Initialize the n-dimensional vector ?f [0] and the threshold parameter estimates
?
?
[0]
1 , . . . ,
?
?
[0]
K?1 with offset values.
2. For each of the predictor variables specify a base-learner, i.e., a regression esti-
mator with one input variable and one output variable. Set m = 0.
3. Increase m by 1.
4. (a) Compute the negative gradient ? ??
? f and evaluate at ?f [m?1](Xi ), ?? [m?1] =
(
?
?
[m?1]
1 , . . . ,
?
?
[m?1]
K?1
)
, i = 1, . . . , n. This yields the negative gradient vector
U [m] =
(
U [m]i
)
i=1,...,n
: =
(
?
?
? f ?
(
Yi , ?f [m?1](Xi ), ?? [m?1]
)
)
i=1,...,n
.
(b) Fit the negative gradient vector U [m] to each of the p predictor variables sep-
arately by using the base-learners specified in step 2. This yields p vectors
of predicted values, where each vector is an estimate of the negative gradient
vector U [m].
(c) Select the base-learner that fits U [m] best according to the R2 goodness-of-fit
criterion. Set ?U [m] equal to the fitted values of the best model.
(d) Update ?f [m] ? ?f [m?1] + ? ?U [m], where 0 < ? ? 1 is a real-valued step
length factor.
5. Plug ?f [m] into the empirical risk function ?ni=1 ?(Yi , f, ?) and minimize the
empirical risk over ? . Set ?? [m] equal to the newly obtained estimate of ??.
6. Iterate Steps 3?5 until the stopping iteration mstop is reached (the choice of mstop
will be discussed below).
In the following, we will refer to the boosting algorithm introduced above as ?pro-
portional odds boosting? (P/O boosting). Steps 1?4 of the P/O boosting algorithm
correspond to the classical gradient boosting approach discussed in B?hlmann and
Hothorn (2007). From step 4 it is seen that the algorithm descends the gradient of the
empirical risk R, which is the main feature of all gradient boosting algorithms. In each
iteration, an estimate of the true negative gradient of R is added to the current estimate
of f ?. Consequently, R is minimized in a stagewise fashion, and a structural (regres-
sion) relationship between Y and X is established. Obviously, using P/O boosting
corresponds to replacing classical Fisher scoring algorithms for maximum likelihood
estimation of f ? (McCullagh 1980) by a gradient descent algorithm in function space.
As seen from steps 4(c) and 4(d), the P/O boosting algorithm additionally carries out
variable selection, as only one base-learner (i.e., one component of X) is selected for
updating ?f [m] in each iteration. Due to the additive update, the final boosting estimate
at iteration mstop can be interpreted as an additive prediction function, as defined in (3).
In step 5, the estimation approach of Schmid et al. (2010) is used to obtain updates
of ? . Here, the empirical risk is minimized over ? , using the current estimate of f ?
as offset value. Note that step 5 of the P/O algorithm involves the constrained estima-
tion of an ordered set of parameters, which has not been considered by Schmid et al.
(2010). As shown in Web Appendix A, however, the constrained estimation problem
123
Environ Ecol Stat (2011) 18:709?733 715
can be re-formulated as an unconstrained problem, so that the method by Schmid et al.
(2010) can be applied.
Specification of base-learners
It is clear from step 4 of the P/O boosting algorithm that the specification of the base-
learners is crucial for interpreting the model fit. Here it is important to keep in mind
that, due to the additive update in step 4(d), the estimate of a marginal function f j
at iteration mstop has the same structure as the base-learner used in each iteration.
For example, f j is linear in X j if the base-learner used to model f j in step 4(b) is
a simple linear model (cf. B?hlmann and Hothorn 2007, p. 484). Similarly, f j is a
smooth function of X j if the corresponding base-learner is smooth as well.
Concerning the choice of appropriate base-learners, we follow the approach used by
Kneib et al. (2009): Themarginal functions f j corresponding to continuous predictors
are either modeled as linear functions or as penalized regression splines (?P-splines?,
cf. Wood 2006; Schmid and Hothorn 2008; Kneib et al. 2009), where selection of the
best modeling alternative (smooth nonlinear vs. linear) is carried out automatically
by the P/O boosting algorithm. To do this, we modify step 2 of the P/O algorithm as
follows: For each covariate, we specify two competing base-learners, namely a linear
base-learner and a smooth P-spline deviation from the linear base-leaner (cf. Kneib
et al. 2009, p. 628). Consequently, due to the base-learner selection carried out in step
4(c), the marginal functions f j depending on continuous predictors become either
linear or smooth.
To account for spatial dependency between neighboring sample sites, we addition-
ally include a smooth function quantifying marginal spatial effects into the model.
This function depends on the coordinates of the site locations and is added to the other
functions specified in (3), see Kneib et al. (2008, 2009). As a base-learner for the mar-
ginal spatial effect we use a P-spline tensor product surface depending on the UTM
easting and northing coordinates of the site locations. Thus, denoting the coordinates
by XE and XN, the spatial effect becomes a smooth marginal surface fsp(XE, XN)
depending on the bivariate ?predictor? variable (XE, XN). As noted by Kneib et al.
(2008), fsp(XE, XN) can be interpreted as the realization of a spatially correlated sto-
chastic process, emphasizing the fact that one needs to account for spatial correlation
in the data.
Finally, we model categorical predictors using dummy coded binary variables as
base-learners. As a consequence, the functions f j correspond to linear category effects
in these cases. Detailed descriptions of P-splines and P-spline tensor product surfaces
have been given by Fahrmeir et al. (2004), Wood (2006) and Kneib et al. (2009).
Tuning parameters
In the literature, it has been argued that boosting algorithms should generally not be
run until convergence. Otherwise, overfits resulting in a suboptimal out-of-sample
prediction accuracy are likely (see B?hlmann and Hothorn 2007). As a consequence
123
716 Environ Ecol Stat (2011) 18:709?733
of this ?early stopping? strategy, the stopping iteration mstop becomes the main tuning
parameter of the P/O algorithm. In the following, we will use five-fold cross-valida-
tion to determine the value of mstop, i.e., mstop is the iteration with lowest predictive
empirical risk. Alternatively, information criteria such as AIC or BIC could be used to
determine the stopping iteration mstop. For example, in case of Gaussian regression, a
correctedAIC criterion could be calculated in each boosting iteration, and the stopping
iteration would be given by the iteration with smallest AIC (B?hlmann and Hothorn
2007, p. 495). In this paper, however, we will consider cross-validation instead of
information criteria because the latter have been criticized as being systematically
biased towards stopping boosting algorithms too late (see Hastie 2007). In contrast to
the choice of the optimal stopping iteration, the choice of the step length factor ? has
been shown to be of minor importance for the predictive performance of a boosting
algorithm. The only requirement is that the value of ? is ?small?, such that a stagewise
adaption of the prediction function is possible (see Schmid and Hothorn 2008). We
will set ? = 0.1.
Regularization
A major consequence of the early stopping strategy is that the estimates of f ? are
shrunken towards zero. In fact, using a small step length ? ensures that marginal
function estimates increase ?slowly? in the course of the P/O boosting algorithm but
stop increasing as soon as the optimal stopping iteration mstop is reached. As stated
above, the optimal stopping iteration is chosen such that out-of-sample prediction
accuracy is optimized within the proportional odds framework. In other words, stop-
ping the P/O boosting algorithm at the optimal iteration implies that the amount of
shrinkage is chosen such that the predictive power of the proportional odds model is
maximal. Shrinkage is a well-established strategy to regularize model estimates: Esti-
mates tend to have a slightly increased bias but a decreased variance, thereby improv-
ing prediction accuracy. On the other hand, the choice of the base-learners specified
above ensures that black-box predictions are avoided and marginal effect estimates
are obtained. Although, in contrast to maximum likelihood estimation, estimates are
shrunken towards zero, the main characteristics (and thus the interpretability) of the
proportional odds model are preserved.
Prediction
For given estimates of f ? and ??, the predicted outcome category (denoted by k?) is
the category with highest posterior probability, i.e.,
k? = max
k
?P(Y = k|X), (7)
which can be computed from (4). Thus, misclassification rates and weighted kappa
indices for ordinal data (Fleiss and Cohen 1973) can be used to evaluate the predictive
power of the P/O boosting fit.
123
Environ Ecol Stat (2011) 18:709?733 717
Confidence intervals
Since boosting estimates are shrunken towards zero, computation of confidence inter-
vals for marginal functions is infeasible. This problem can also be found with other
shrinkage methods such as ridge regression or the Lasso (Tibshirani 1996). With the
help of bootstrapmethods, however, it is possible to approximate the distribution of the
boosting estimates in a non-parametric fashion (see Sect. 3). Consequently, the boot-
strapped estimates can be used to assess whether a function estimate is systematically
different from zero.
As an alternative to approximating the distribution of effect estimates via bootstrap
sampling, Bayesian methods could be used to fit the proportional odds model and to
calculate posterior distributions of marginal predictor effects. This approach would
require Bayesian methods for shrinkage (such as the Bayesian Lasso, Park and Casella
2008) and variable selection (O?Hara and Sillanp?? 2009) to be adapted to geoadditive
proportional odds models. While being potentially useful, the Bayesian approach is
conceptually different from the proposed P/O boosting algorithm and will therefore
not be considered in this paper.
3 Analysis of the Maryland biological streams survey
Data sources and pre-processing
In this section, we apply the P/O boosting algorithm to develop a predictive model
for fish (FIBI) and benthic macroinvertebrates (BIBI) indicators of biological condi-
tion. Our study is focused on the 23,408 km2 part of Maryland, USA, lying in the
Chesapeake Bay watershed (Fig. 1). This area includes six Level III ecoregions: Cen-
tral Appalachians, Ridge and Valley, Blue Ridge, Northern Piedmont, Southeastern
Plains, and Middle Atlantic Coastal Plains (see Omernik 1987). Climate types range
from cold with hot summers in the mountainous western area to temperate with hot
summers toward the southeast; vegetation patterns range from northern hardwood for-
ests in the highlands to oak, hickory, pine, and southern mixed forests of the Coastal
Plains. The Appalachian, Ridge and Valley, and Blue Ridge ecoregions are under-
lain largely by folded and faulted sedimentary rocks. The Piedmont ecoregion is under-
lain by crystalline igneous and metamorphic rocks, and the Plains ecoregions are
underlain by unconsolidated sediments.
Our analysis is based on theMarylandBiological Streams Survey (MBSS), which is
an on-goingmonitoring program designed to describewater quality in 1st- to 4th-order
non-tidal streams within the state of Maryland, USA (USEPA 1999). MBSS scientists
used a probabilistic sampling design stratified by major watershed and stream order to
sample approximately 2,500 sites from 1994 to 2004 (cf. Southerland et al. 2005). An
MBSS site is a ?75 m stream segment where data were collected for stream physical
and hydrological attributes (e.g., flow, width, depth, and embeddedness), streamwater
chemistry, location (latitude and longitude), riparian conditions, and biological com-
munities (i.e., benthic macroinvertebrates and fish). For a detailed description of the
MBSS, see http://www.dnr.state.md.us/streams. We considered only the first record
123
718 Environ Ecol Stat (2011) 18:709?733
Fig. 1 The Maryland portion of the Chesapeake Bay watershed, its major ecoregions, and stream assess-
ment sites where data were collected. Inset shows the study area (dark gray) in relation to the Chesapeake
Bay watershed (light gray)
for sites that were sampled more than once. This resulted in a database containing
measurements at n =1,573 stream sites (see Fig. 1). There were 96 sites that had
no fish collected, and these sites were not used to examine FIBI, leaving n =1,477
sites for the FIBI models. Land cover data was obtained from the 2001 US National
Land Cover Database (Homer et al. 2004). Watershed predictors were calculated in
ARCGIS using watershed boundaries and relevant environmental coverages.
Individual IBIs were developed by MBSS scientists separately for each subregion
of the study area and included individual metrics specific to each subregion (Web
Appendix B, see Southerland et al. 2005 for IBI developments and for a complete
list of metrics in each IBI). Following Maloney et al. (2009), we used an ordinal
scale to quantify BIBI and FIBI indicators (1 = ?very poor site?, 2 = ?poor site?,
3 = ?fair site?, 4 = ?good site?). FIBI and BIBI indicators were regressed on site-
specific predictor variables using the P/O boosting algorithm introduced in Section
2. Predictors included UTM easting and northing coordinates, watershed land use,
dominant ecoregion (Omernik 1987), and the ?distance to mainstem? measured from
the MBSS sampling site to a mainstem tributary with >500 km2 in upslope drainage
area. A value of 0 was assigned to sites that drained into the Chesapeake Bay before
reaching a mainstem river. For a detailed description of the predictor variables, see
Appendix B. Predictors with a highly right-skewed distribution were log transformed
before statistical analysis.
Benchmark analysis
We first investigated the prediction accuracy of the P/O boosting algorithm, i.e., the
ability of P/O boosting to predict the FIBI and BIBI values at unsurveyed sites. To
123
Environ Ecol Stat (2011) 18:709?733 719
do this, we carried out a benchmark experiment using 100 bootstrap samples drawn
from the full data set. The bootstrap samples were used as training data sets, and the
P/O boosting algorithm was applied to these samples. Five-fold cross-validation was
carried out on the training data sets to determine the values of mstop. In a next step, we
applied the 100 prediction rules obtained from P/O boosting to the respective sets of
out-of-bootstrap observations (?bootstrap cross-validation?, see, e.g., Hothorn et al.
2005). The predictions and the true outcome values of the 100 out-of-bootstrap data
sets were used to compute classification rates and weighted kappa values.
To benchmark the P/O boosting algorithm against other established techniques,
we also considered the random forest method (Breiman 2001), a non-parametric
classification technique based on ensembles of decision trees. Random forests have
been shown to be one of the most accurate methods for predicting IBI indicators
(Maloney et al. 2009). Because the random forest method neither requires the propor-
tional odds assumption nor the additivity of the prediction function specified in (3),
it is less restrictive than P/O boosting. As above, we estimated prediction accuracy
using the same bootstrap samples to compute classification rates and weighted kappa
values.
Table 1 shows that boosting and the random forest method had similar classification
rates for both indices of biotic integrity. For the FIBI indicator,mean classification rates
were nearly equal (P/O boosting: 51.4%, random forest method: 51.2%) while in case
of the BIBI indicator, classification rates were slightly lower for P/O boosting (45.4%)
than for random forests (46.6%). Weighted kappa values obtained from P/O boosting
were larger on average than the corresponding weighted kappa values obtained from
random forests (Table 2). This result can be explained by the fact that P/O boosting
explicitly accounts for the ordinal structure of the BIBI and FIBI indicators: Due to the
structure of the proportional odds model, misclassification of observations into neigh-
boring categories tends to be more likely than misclassification into categories that are
?far away? from the true category. Note that all weighted kappa values obtained from
P/O boosting were larger than 0.5, i.e., P/O boosting predicted significantly better than
chance alone. Table 1 also suggests that classification rates of outcome categories are
considerably higher if site-specific covariate information is used to obtain predictions
than if the unconditional distribution of the FIBI and BIBI indicators is used. As seen
from Table 1, classification rates are largest for sites with good biological condition.
This result has previously been reported by Maloney et al. (2009).
Analysis of the full data set: FIBI indicator
After demonstrating that the prediction accuracy of P/O boosting is comparable to
that of the random forest method, we analyzed the full data set to examine functional
relationships between predictors and IBI indicators.We applied the P/O boosting algo-
rithm to the full data set and visualized marginal function estimates using partial plots
(Figs. 2, 3, 4, 5). Note that partial plots cannot be obtained from the random forest
method because, in contrast to P/O boosting, random forests yield black-box estimates
that are not easily interpreted. Although the random forest method can provide esti-
mates of variable importance (Cutler et al. 2007), functional relationships between
123
720 Environ Ecol Stat (2011) 18:709?733
Table 1 P/O boosting and random forest classification rates, as obtained from bootstrap cross-validation
Mean Min. 1st Qu. Median 3rd Qu. Max. uncond.
FIBI classification rates, P/O boosting
All sites 0.514 0.461 0.500 0.514 0.528 0.563
Very poor sites 0.372 0.223 0.317 0.357 0.403 0.610 0.167
Poor sites 0.221 0.056 0.183 0.219 0.259 0.361 0.172
Fair sites 0.453 0.320 0.416 0.455 0.485 0.557 0.277
Good sites 0.728 0.634 0.699 0.731 0.759 0.809 0.383
FIBI classification rates, random forests
All sites 0.512 0.466 0.500 0.513 0.526 0.561
Very poor sites 0.395 0.279 0.341 0.391 0.441 0.596 0.167
Poor sites 0.278 0.152 0.241 0.278 0.309 0.405 0.172
Fair sites 0.425 0.304 0.394 0.428 0.457 0.529 0.277
Good sites 0.711 0.609 0.686 0.708 0.739 0.781 0.383
BIBI classification rates, P/O boosting
All sites 0.454 0.414 0.442 0.454 0.468 0.500
Very poor sites 0.435 0.285 0.399 0.435 0.473 0.568 0.181
Poor sites 0.335 0.208 0.297 0.333 0.364 0.503 0.249
Fair sites 0.483 0.329 0.453 0.482 0.516 0.590 0.295
Good sites 0.546 0.440 0.508 0.552 0.578 0.653 0.275
BIBI classification rates, random forests
All sites 0.466 0.424 0.455 0.467 0.478 0.504
Very poor sites 0.474 0.350 0.443 0.480 0.513 0.579 0.181
Poor sites 0.318 0.212 0.283 0.313 0.349 0.441 0.249
Fair sites 0.436 0.279 0.408 0.431 0.466 0.519 0.295
Good sites 0.629 0.532 0.601 0.632 0.655 0.736 0.275
uncond. = unconditional distribution of categories in the full data set
Table 2 Weighted kappa values, as obtained from bootstrap cross-validation
Mean Min. 1st Qu. Median 3rd Qu. Max.
FIBI weighted kappa values, P/O boosting
0.591 0.502 0.577 0.593 0.609 0.657
FIBI weighted kappa values, random forests
0.553 0.473 0.533 0.552 0.576 0.637
BIBI weighted kappa values, P/O boosting
0.586 0.527 0.568 0.583 0.603 0.643
BIBI weighted kappa values, random forests
0.580 0.501 0.565 0.580 0.596 0.634
Fleiss?Cohen weights were used to account for the ordinal structure of FIBI and BIBI indicators (cf. Fleiss
and Cohen 1973)
123
Environ Ecol Stat (2011) 18:709?733 721
Fig. 2 FIBI model?marginal function estimates obtained from applying P/O boosting to the full data set
predictors and outcome variables cannot be quantified directly. We therefore did not
use this method to analyze the full data set.
Let us first consider the FIBI indicator of biological condition. Figure 2 shows
the most pronounced marginal effects obtained from P/O boosting. Light grey lines
correspond to the function estimates obtained from the 100 bootstrap samples used
in the benchmark experiment. Increases in watershed area and average watershed ele-
vation have positive effects on the FIBI indicator, supporting previous findings of
the importance of these factors on stream fishes (Angermeier and Schlosser 1989;
Oberdorff and Hughes 1992; Matthews and Robison 1998; Joy and Death 2004).
While the effect of watershed area is clearly nonlinear, a linear marginal predic-
tion function was obtained for average watershed elevation. This demonstrates the
ability of P/O boosting to incorporate both linear and smooth (nonlinear) predictor
effects into the proportional odds model. Regarding the magnitude of its marginal
function, watershed area is clearly the most important predictor for FIBI (Fig. 2).
For example, consider two stream sites with watershed areas WA1 = 1 km2 and
WA2 = 10 km2. We obtain log(WA1) = 0 and log(WA2) ? 2.3, which results
in marginal function estimates fWA(WA1) ? ?2 and fWA(WA1) ? 1 (see Fig. 2).
123
722 Environ Ecol Stat (2011) 18:709?733
Fig. 3 FIBI model?marginal spatial effect estimate obtained from applying P/O boosting to the full
data set
Assuming constant values for the other predictor variables, it follows from Eq. (5)
that the cumulative odds ratio of site 2 is approximately exp(1 ? (?2)) ? 20
times larger than the cumulative odds ratio of site 1. The strong positive effect of
watershed area might be due to low natural richness of fishes in headwater streams
(Angermeier and Schlosser 1989; Matthews and Robison 1998), which may affect
individual metrics composing the IBI (Schleiger 2000).
Increases in the percentage of upstream watershed under impervious surface cover
have a negative effect on FIBI scores, which is an often-reported pattern (e.g., Wang
and Lyons 2003; Helms et al. 2009) that results from the numerous negative effects
that impervious surfaces have on stream hydrologic and geomorphic factors (Paul and
Meyer 2001; Walsh et al. 2005). The marginal function estimate for the distance from
sampling location to the nearest main stem stream indicates that there is an inverted
U-shaped relationshipwith the FIBI indicator, i.e., FIBI increaseswith low but increas-
ing distance values but decreases again for large distance values. Sites with large dis-
tances from mainstems are likely headwaters having low FIBIs as discussed above.
The lower FIBI scores for short distances to mainstem might reflect sites that directly
drain into the Chesapeake Bay (which were given a distance value of 0). Marginal
function estimates corresponding to other continuous predictors are relatively small in
magnitude (relative to the estimates shown in Fig. 2), indicating theirminor importance
for modeling FIBI. The function estimates corresponding to these predictor variables
are shown in Web Appendix C.
Marginal effect estimates of the categorical covariate ?percentage of bedrock that is
calcareous in a watershed? were small and suggest that FIBI is lower when calcareous
rock is present (percentage of calcareous bedrock >0%, see Table 3), highlighting the
importance of geology in structuring fish assemblages (Montgomery 1999; Joy and
Death 2004). However, we caution over-interpretation of these findings because the
range of the bootstrapped marginal effect estimates contains zero and because this
covariate was analyzed at a coarse scale (presence/absence). The categorical effects of
dominant ecoregions are also relatively small (Table 3), indicating that the dominant
ecoregion is of minor importance for modeling FIBI.
123
Environ Ecol Stat (2011) 18:709?733 723
Fig. 4 BIBI model?marginal function estimates obtained from applying P/O boosting to the full data set
A marginal spatial effect was still evident for the FIBI after accounting for all other
covariates (Fig. 3). Sites in the Blue Ridge region, the Ridge and Valley region, and
the South-Eastern part of the Middle Atlantic Coastal plain tended to have lower FIBI
123
724 Environ Ecol Stat (2011) 18:709?733
Fig. 5 BIBI model?marginal spatial effect estimate obtained from applying P/O boosting to the full
data set
scores than other sites. In contrast, the middle region of the Northern Piedmont region
shows a very positive marginal effect on FIBI. It is important to remember that these
effects are marginal and therefore not caused by variations in other predictors (such
as the dominant ecoregion). They may reflect missing predictors, or, alternatively,
problems with the calculation of FIBI itself. For example, the FIBI for Blue Ridge and
Ridge and Valley ecoregions was stratified only into warmwater or coldwater streams
(see Web Appendix B, Southerland et al. 2005). A more refined FIBI, possibly strat-
ified by ecoregions or sub-ecoregions (Schleiger 2000), might reduce the marginal
spatial patterns in the FIBI.
Analysis of the full data set: BIBI indicator
We next consider the BIBI indicator of biological condition. Figure 4 shows the most
pronounced marginal effects obtained from P/O boosting. Obviously, increases in
watershed area, distance from sampling location to the nearest main stem stream, and
percentage of upstream watershed under tree cover have large positive effects on the
BIBI indicator. In contrast, increases in the percentage of upstream watershed under
impervious surface cover have a very strong negative effect on BIBI. Apart from a
few sites with a small upstream population density (showing large variations in their
effects on BIBI), the effect of population density on BIBI is also negative. The positive
effect of the percentage of upstreamwatershed under tree cover and the negative effect
of the percentage of upstream watershed under impervious surface cover on benthic
macroinvertebrates support previous reports (Roy et al. 2003; Walsh et al. 2005; King
et al. 2005; Maloney et al. 2009) and document the sensitivity of benthic macroin-
vertebrates to watershed conditions. The upstream population density was positively
correlated with the percentage of upstream watershed under impervious surface cover
(r = 0.78), so these covariates are likely to represent the same anthropogenic stressor
(population pressure). The effect of distance to main stem demonstrates the impor-
tance of position within a stream network to the benthic community (Vannote et al.
123
Environ Ecol Stat (2011) 18:709?733 725
Table 3 Effects of the categorical predictors ?percentage of bedrock that is calcareous in a watershed? and
?dominant ecoregion? on FIBI and BIBI indicators of stream condition
Predictor Category Full data set Mean Min. Max.
FIBI indicator, P/O boosting
% Of calc. bedrock =0% 0
>0% ?0.082 ?0.144 ?0.474 0.102
Ecoregion Blue ridge 0
Centr. Appalachian ?0.023 ?0.188 ?0.748 ?0.001
Mid. Atl. coastal plains 0.005 0.032 ?0.004 0.141
Northern Piedmont ?0.017 ?0.082 ?0.259 0.031
Ridge & Valley 0.007 0.083 ?0.017 0.337
South Eastern plains 0.004 0.019 ?0.055 0.165
BIBI indicator, P/O boosting
% Of calc. bedrock =0% 0
>0% ?0.294 ?0.308 ?0.760 ?0.013
Ecoregion Blue ridge 0
Centr. Appalachian ?0.605 ?0.545 ?1.017 ?0.078
Mid. Atl. coastal plains ?0.099 ?0.086 ?0.282 0.069
Northern piedmont ?0.238 ?0.220 ?0.459 0.031
Ridge & Valley 0.511 0.450 0.041 0.878
South Eastern plains 0.266 0.254 0.031 0.574
Values in columns 4?6 were obtained from applying P/O boosting to 100 bootstrap samples drawn from
the full data
1980). All functions shown in Fig. 4 are nonlinear, demonstrating the ability of P/O
boosting to account for nonlinear predictor effects. Marginal function estimates for
other continuous predictors (Web Appendix D) are smaller in magnitude than those
shown in Fig. 4, indicating their minor importance for modeling BIBI. Marginal effect
estimates of the categorical covariate ?percentage of bedrock that is calcareous in a
watershed? suggest that BIBI is lower when calcareous rock is present. This effect is
much stronger than the effect obtained for the FIBI indicator (Table 3), reinforcing
geology as an important structuring factor on local benthic macroinvertebrate assem-
blages (Montgomery 1999; Pyne et al. 2007). Again we caution over-interpretation
of these results because of the coarse scale of this predictor. The categorical effects
of three dominant ecoregions on BIBI are significantly different from zero (Table 3).
The Central Appalachian ecoregion has the lowest marginal biotic integrity while
the Ridge and Valley ecoregion has the largest positive effect on the BIBI indicator.
Maloney et al. (2009) reported similar effects of ecoregions on BIBI.
Themarginal spatial effect estimates show that sites in theBlueRidge region and the
eastern part of the Ridge andValley region tended to have lower BIBI scores than other
sites (Fig. 5). These spatial effects may be due to missing predictors or coarse stratifi-
cation during BIBI development. For example, the Ridge and Valley, Blue Ridge, and
Central Appalachians ecoregions were lumped into a single ?Combined Highlands?
stratum during the BIBI construction (See Web Appendix B, Southerland et al. 2005).
123
726 Environ Ecol Stat (2011) 18:709?733
4 Summary and conclusion
In recent years, much research has been undertaken to assess the degree of impairment
in ecosystem structure and function due to anthropogenic disturbance in watersheds.
As part of this research, biological assessments of stream condition have become an
important tool to identify impairments and to develop appropriate management and
conservation strategies. In this paper, we have considered indicators of biologic con-
dition (Karr et al. 1986; Karr 1991; Barbour et al. 1999), which are indispensable tools
for measuring and managing the health of streams and their watersheds.
We have developed a boosting algorithm for modeling indices of biotic integrity
and applied our method to data collected from small non-tidal streams in the state of
Maryland, USA. Because IBI indicators are often evaluated on an ordinal scale, the
P/O boosting algorithm developed in this paper is based on the well-established pro-
portional odds model introduced by McCullagh (1980). To obtain regularized model
fits, we combined classical gradient boosting techniques with two recent advances:We
used themodeling approach suggested byKneib et al. (2009) to obtain predictionmod-
els accounting for nonlinear effects and spatial correlation, and we re-formulated the
boostingmethod bySchmid et al. (2010) to obtain scale parameter estimates (which are
necessary for adapting classical boosting techniques to the proportional odds model).
In summary, the boosting algorithm presented in this paper combines the following
advantages:
1. P/O boosting allows for fully automatic variable selection and model choice. In
particular, it does not require scientists to select predictor variables using heuristic
approaches such as stepwise variable selection.
2. Although boosting estimates are typically different from classical maximum like-
lihood estimates, the P/O boosting algorithm preserves the structure of the propor-
tional odds model. Therefore, boosting estimates are accessible for interpretation,
which is a major advantage over estimation techniques that result in black-box
estimates. On the other hand, of course, black-box predictions are expected to have
a higher degree of flexibility than predictions that are linked to the pre-specified
structure of the additive proportional odds model.
3. P/O boosting accounts for spatial effects. Although there are numerous methods
to model spatial correlation in ecological data (cf. Bigler et al. 2005; Gelfand
2007), the interaction surfaces used in this paper have the advantage that marginal
spatial effects can be visualized and information contained in unobserved (latent)
predictor variables is quantified.
4. Spatial effects and nonlinear effects of predictor variables are estimated jointly
based on penalized spline functions. Both the joint estimation and the additive
structure of the prediction function facilitate a relatively simple interpretation
of the results. For example, the use of P/O boosting led to clear interpretations of
the relationships between watershed attributes and stream biological condition in
the MBSS data, where both linear and nonlinear marginal predictor effects were
present.
5. Due to the early stopping strategy, prediction accuracy of the P/O boosting fit is
maximized in the proportional odds model framework. Of course, it is well known
123
Environ Ecol Stat (2011) 18:709?733 727
that there is no ?uniformly best? prediction method for ordinal data. Therefore, we
do not claim that P/O boosting is generally superior to other methods for ordinal
data. For the MBSS data, however, predictions obtained from P/O boosting turned
out to be very similar to predictions obtained from the random forest method. This
is remarkable because the random forest method is a completely non-parametric
technique that is generally considered to be one of the most powerful statistical
prediction methods (see Hastie et al. 2009, Chap. 15).
Although, for the sake of interpretation, we restricted ourselves to considering main-
effects models in this paper, the gradient boosting framework can be extended to
include interaction terms between predictor variables in the model formula. As dem-
onstrated byKneib et al. (2009), this can be accomplished by specifying additional sets
of linear and smooth base-learners depending on the products of predictor variables.
When interpreting marginal function estimates, one should be aware of the fact that
P/O boosting is based on two important assumptions: First, we assumed the propor-
tional odds property to hold. Second, we assumed predictor effects to be additive. If
these assumptions are not met, estimates might show a bias caused by model mis-
specification. Usually, this bias cannot be fully compensated by the high flexibility of
the P/O boosting algorithm. Because the early stopping strategy results in regularized
boosting estimates that are shrunken towards zero, it is generally difficult to derive
tests on the appropriateness of model assumptions in the boosting framework. There-
fore, assessing the robustness of boosting estimates against model misspecification
constitutes an important issue of future research.
Instead of regularizing effect estimates via gradient boosting (in combination with
early stopping), it would alternatively be possible to optimize out-of-sample predic-
tion accuracy using penalized regression techniques. For example, the Lasso method
(Tibshirani 1996), which is based on L1-penalized likelihood estimation, would be a
natural approach to incorporate shrinkage and variable selection into a proportional
odds model. However, the original Lasso method has mainly been designed for regres-
sion models with a linear prediction function. Combining penalized estimation with
sparse nonlinear additive modeling has only recently been accomplished (Meier et al.
2009). To date, there is no extension of the method developed by Meier et al. (2009)
to geoadditive proportional odds models. On the other hand, gradient boosting and
the Lasso are closely related, as both algorithms can be embedded into the LARS
framework (Efron et al. 2004). Also, in case of Gaussian regression, there is evidence
that the properties of gradient boosting are similar to those of the Lasso (Hastie et al.
2009, Chap. 16). These results suggest that the role of the boosting stopping iteration
is similar to the role of the (inverse of the) shrinkage parameter used for the L1 penalty
of the Lasso method.
Finally, the boosting algorithm presented in this paper is not restricted to ecological
applications but can be used to analyze very general types of ordinal data. Although
developing a predictionmethod for streambiological conditionwas the primary goal of
this paper, both the proportional odds model and the boosting frameworks are essen-
tially independent of the application context in which they are used. It is therefore
possible to apply the P/O boosting algorithm in many fields, for example in clinical
or biomedical research.
123
728 Environ Ecol Stat (2011) 18:709?733
Software
All computations were carried out with the R System for Statistical Computing (ver-
sion 2.10.1, R Development Core Team 2009). The gamboost() function of R
package mboost (Hothorn et al. 2010) was used to calculate boosting estimates. Base-
learners were made comparable by centering predictors at the beginning of the algo-
rithm and by using the same degrees of freedom for each base-learner (see Kneib et al.
2009 for details). For example, in case of the FIBI model, the R code for specifying
the model formula was given by
> library(mboost)
> FIBI.formula <- FIBI ? bols(EASTING, intercept = FALSE) +
+ bols(NORTHING, intercept = FALSE) +
+ bspatial(EASTING, NORTHING, knots = 20, df = 1,
+ differences = 1) +
+ bols(DrainageDensity, intercept = FALSE) +
+ bbs(DrainageDensity, center = TRUE, df = 1) +
...
+ bols(PerWet, intercept = FALSE) +
+ bbs(PerWet, center = TRUE, df = 1) +
+ bols(Ecoregion, intercept = FALSE, df = 1) +
+ bols(INT, intercept = FALSE, df = 1)
where FIBI denotes the FIBI outcome, EASTING and NORTHING denote the UTM
easting and northing coordinates of the site locations, respectively, and Drain-
ageDensity, PerWet and Ecoregion are examples of predictor variables.
The bols() and bbs() functions in R package mboost (using the intercept
= FALSE and center = TRUE options) correspond to linear base-learners and
smooth P-spline deviations from the linear base-learners, respectively (see Kneib et al.
2009 for details). Specifying the base-learners as shown above ensures that selection of
the best modeling alternative (smooth nonlinear vs. linear) is carried out automatically
by the P/O boosting algorithm. Similarly, the bspatial() function in R package
mboost (using the center = TRUE option) corresponds to a smooth P-spline ten-
sor product deviation from a spatial linear surface. Note that specifying a bbs()
base-learner for Ecoregion was not necessary because this predictor variable is
categorical.
Using the model formula specified above, the proportional odds model was fitted
with the help of the PropOdds() family in R package mboost. The corresponding
R code was given by
> ctrl <- boost_control(mstop = 20000, nu = 0.1)
> FIBI.model <- gamboost(FIBI.formula, data = MBSS.training, family =
+ PropOdds(), control = ctrl)
where MBSS.training is the name of the training data set containing the variables
specified in FIBI.formula and where the step length ? and the initial number
of boosting iterations were specified using the boost_control() function of R
package mboost. Internal five-fold bootstrap cross-validation for determining the opti-
mal stopping iteration was carried out using the cvrisk() function of R package
mboost:
> ntrain <- nrow(MBSS.training)
> bs5 <- rmultinom(5, ntrain, rep(1, ntrain) / ntrain)
123
Environ Ecol Stat (2011) 18:709?733 729
> cvm <- cvrisk(FIBI.model, folds = bs5)
> st <- mstop(cvm)
After having determined the optimal stopping iteration (denoted by st), the ?optimal?
boosting fit at iteration st was calculated as follows:
> FIBI.optimal <- FIBI.model[st]
Afterwards, the predict() function of R package mboost was used to calculate
predictions:
> pred <- predict(FIBI.optimal, newdata = MBSS.test,
type = ?response?)
where MBSS.test denotes the test set of out-of-bootstrap observations (cf. Sect. 3).
The pred object is a matrix containing the posterior class probabilities corresponding
to the out-of-bootstrap observations. Using this object, the predicted outcome catego-
ries were calculated as described in Sect. 2. For a detailed description of the mboost
package we refer to Hothorn et al. (2010).
Random forest analysis was carried out using the R package randomForest (Liaw
andWiener 2002, 2009). The random forest algorithm of R packagerandomForest
has two main tuning parameters: (a) ntree, which is the number of trees used for
the forest, and (b) mtry, which is the number of variables randomly selected at each
node. To achieve sufficiently stable results, the number of trees was set to 2000 (see
Cutler et al. 2007). The hyper-parameter mtry was tuned using additional internal
ten-fold cross-validation.
Topographic surface plots were created using the R package sp (Pebesma and
Bivand 2009). The Kappa function of R package vcd (Meyer et al. 2009) was used
to compute weighted kappa indices.
Acknowledgments The work of Matthias Schmid and Torsten Hothorn was supported by Deutsche Fors-
chungsgemeinschaft (DFG), grant HO 3242/1-3, and by the Interdisciplinary Center for Clinical Research
(IZKF) at the University Hospital of the University of Erlangen-Nuremberg (Project J11). We thank the
Maryland Department of Natural Resources for providing data from their Maryland Biological Stream Sur-
vey (MBSS). Funding for this project was provided by a Smithsonian Institution Post-Doctoral Fellowship
awarded to KOM and by the US Environmental Protection Agency National Center for Environmental
Research (NCER) Science to Achieve Results (STAR), grant #R831369. We thank the editor and two
anonymous reviewers for helpful comments and suggestions.
Appendix
A Log-likelihood of the proportional odds model
The system of equations (4) implies that the log-likelihood of the proportional odds
model is given by
l( f, ?) = ? I(Y = 1) ? log(1 + exp( f ? ?1))
+
K?1
?
k=2
I(Y = k) ?
[
log
(
(1 + exp( f ? ?k))?1 ? (1 + exp( f ? ?k?1))?1
)]
+ I(Y = K ) ? log
(
1 ? (1 + exp( f ? ?K?1))?1
)
.
123
730 Environ Ecol Stat (2011) 18:709?733
Thus, the loss function used for the P/O boosting algorithm becomes ? = ?l. The
negative derivative of ? w.r.t. f is given by
?
??
? f =
?l
? f = ? I(Y = 1) ? (1 + exp(?1 ? f ))
?1
+
K?1
?
k=2
I(Y = k) ?
1 ? exp(2 f ? ?k?1 ? ?k)
1 + exp( f ? ?k?1) + exp( f ? ?k) + exp(2 f ? ?k?1 ? ?k)
+ I(Y = K ) ? (1 + exp( f ? ?K?1))?1.
B Predictor variables used for the analysis of the MBSS data
We included the following predictor variables in our analysis of the MBSS data:
? UTM easting and northing coordinates provided by MBSS (from Maryland State
Plane Coordinate System). These predictors were used to take the spatial depen-
dence structure of sample sites into account (predictor variables XE and XN, see
Sect. 2).
? Watershed Area, i.e., area of drainage upstream of sampling point (in km2).
? Population density (#/km2) of upstream watershed.
? Average upstream watershed elevation (in m).
? Average annual precipitation for upstream watershed elevation (in cm y?1).
? Percentage of upstream watershed under tree cover.
? Percentage of upstream watershed under impervious surface cover.
? Percentage of upstream watershed under pasture cover.
? Percentage of upstream watershed under row crop cover.
? Percentage of upstream watershed under wetland cover.
? Percentage of upstream watershed under barren cover.
? Drainage density, defined as total stream length (in km) / watershed area (in km2).
? Distance from sampling location to the nearest main stem stream (in km). Val-
ues of this predictor variable were set to zero for sites that drained directly into
Chesapeake Bay.
? Average percentage of sand content in soil.
? Percentage of bedrock that is calcareous in a watershed.
? Dominant ecoregion (categorical predictor with six categories, see Sect. 3 and
Omernik 1987).
A preliminary analysis of the data showed that the distributions of watershed area,
population density, drainage density, upstream watershed elevation, and the percent-
ages of upstreamwatershed under impervious surface, wetland, and barren cover were
highly right-skewed.We therefore applied a log transformation to these predictor vari-
ables before fitting the proportional odds models. Since we observed a large number
of zero percentages in calcareous bedrock, we transformed this predictor into a binary
variable with categories ?percentage of calcareous bedrock = 0%? and ?percentage of
calcareous bedrock > 0%?.
123
Environ Ecol Stat (2011) 18:709?733 731
References
Agresti A (2002) Categorical data analysis. 2 edn. Wiley, New York
Angermeier PL, Schlosser IJ (1989) Species?area relationship for stream fishes. Ecology 70:1450?1462
Barbour MT, Gerritsen J, Snyder BD, Stribling JB (1999) Rapid bioassessment protocols for use in streams
and wadeable rivers: Periphyton, benthic macroinvertebrates and fish. 2 edn. Office of Water, US
Environmental Protection Agency, Washington, DC
Barker LS, Felton GK, Russek-Cohen E (2006) Use of Maryland biological stream survey data to deter-
mine effects of agricultural riparian buffers on measures of biological stream health. Environ Monitor
Assess 117:1?19
Bigler C, Kulakowski D, Veblen TT (2005) Multiple disturbance interactions and drought influence fire
severity in Rocky Mountain subalpine forests. Ecology 86:3018?3029
Breiman L (2001) Random forests. Mach Learn 45:5?32
B?hlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting (with
discussion). Stat Sci 22:477?522
B?hlmann P, YuB (2003) Boostingwith the L2 loss: regression and classification. J AmStat Assoc 98:324?
338
Collier KJ (2009) Linking multimetric and multivariate approaches to assess the ecological condition of
streams. Environ Monitor Assess 157:113?124
Cooper C (2009) Assessing environmental impact on riparian benthic community vigor with unconditional
estimates of quantile differences. Environ Ecol Stat (to appear)
Cushing CE, Allan JD (2001) Streams: their ecology and life. Academic Press, New York
Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT (2007) Random forests for classification in ecology.
Ecology 88:2783?2792
Efron B, Johnston I, Hastie T, Tibshirani R. (2004) Least angle regression. Ann Stat 32:407?499
Fahrmeir L, Kneib T, Lang S (2004) Penalized structured additive regression: a Bayesian perspective. Stat
Sin 14:731?761
Fleiss JL, Cohen J (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as
measures of reliability. Educ Psychol Measure 33:613?619
Friedman JH, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with
discussion). Ann Stat 28:337?407
Gelfand AE (2007) Guest editorial: spatial and spatio-temporal modeling in environmental and ecological
statistics. Environ Ecol Stat 14:191?192
Hastie T (2007) Discussion of ?Boosting algorithms: Regularization, prediction and model fitting? by
P. B?hlmann and T. Hothorn. Stat Sci 22:513?515
Hastie T., Tibshirani R (1990) Generalized additive models. Chapman & Hall, London
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and
prediction. 2 edn. Springer, New York
Helms BS, Schoonover JE, Feminella JW (2009) Assessing influences of hydrology, physicochemistry, and
habitat on streamfish assemblages across a changing landscape. JAmWater ResourAssoc 45:157?169
Homer C, Huang CQ, Yang LM, Wylie B, Coan M (2004) Development of a 2001 national land-cover
database for the United States. Photogrammetr Eng Remote Sens 70:829?840
Hothorn T, B?hlmann P, Kneib T, Schmid M, Hofner B (2010) mboost: Model-Based Boosting. R package
version 2.0-6. http://cran.r-project.org/web/packages/mboost/index.html
Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments.
J Comput Graph Stat 14(3):675?699
JoyMK, Death RG (2004) Predictive modelling and spatial mapping of freshwater fish and decapod assem-
blages using GIS and neural networks. Freshw Biol 49:1036?1052
Karr JR (1991) Biological integrity: a long-neglected aspect of water resource management. Ecol Appl
1:66?84
Karr JR, Fausch KD, Angermeier PL, Yant PR, Schlosser IJ (1986) Assessing biological integrity in run-
ning waters: a method and its rationale, 2 edn. Illinois Natural History Survey Special Publication 5,
Champaign, IL
King RS, Baker ME, Whigham DF, Weller DE, Jordan TE, Kazyak PF, Hurd MK (2005) Spatial consider-
ations for linking watershed land cover to ecological indicators in streams. Ecol Appl 15:137?153
Kneib T, Hothorn T, Tutz G (2009) Variable selection and model choice in geoadditive regression models.
Biometrics 65:626?634
123
732 Environ Ecol Stat (2011) 18:709?733
Kneib T, M?ller J, Hothorn T (2008) Spatial smoothing techniques for the assessment of habitat suitability.
Environ Ecol Stat 15:343?364
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18?22
LiawA,WienerM (2009) randomForest: Breiman and Cutler?s random forests for classification and regres-
sion. R package version 4.5-33. http://cran.r-project.org/web/packages/randomForest/index.html
Maloney KO, Weller DE, Russell MJ, Hothorn T (2009) Classifying the biological condition of small
streams: an example using benthic macroinvertebrates. J North Am Benthol Soc 28:869?884
Matthews WJ, Robison HW (1998) Influence of drainage connectivity, drainage area and regional species
richness on fishes of the Interior Highlands in Arkansas. Am Midland Nat 139:1?19
McCullagh P (1980) Regression models for ordinal data (with discussion). J R Stat Soc Ser B 42:109?142
Meier L, van de Geer S, B?hlmann P (2009) High-dimensional additive modeling. Ann Stat 37:3779?3821
Meyer D, Zeileis A, Hornik K (2009) vcd: Visualizing Categorical Data. R package version 1.2-7. http://
cran.r-project.org/web/packages/vcd/index.html
Montgomery DR (1999) Process domains and the river continuum. J Am Water Resour Assoc 35:397?410
Oberdorff T, Hughes RM (1992) Modification of an index of biotic integrity based on fish assemblages to
characterize rivers of the Seine Basin, France. Hydrobiologia 228:117?130
O?Hara RB, Sillanp?? MJ (2009) A review of Bayesian variable selection methods: what, how and which.
Bayesian Anal 4:85?118
Omernik JM (1987) Ecoregions of the conterminous United States. Ann Assoc Am Geograph 77:118?125
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103:681?686
Paul MJ, Meyer JL (2001) Streams in the urban landscape. Annu Rev Ecol Syst 32:333?365
Pebesma EJ, Bivand R (2009) sp: Classes and methods for spatial data. R package version 0.9-47. http://
cran.r-project.org/web/packages/sp/index.html
Peterson EE, Urquhart NS (2006) Predicting water quality impaired stream segments using landscape-scale
data and a regional geostatistical model: a case study in Maryland. Environ Monitor Assess 121:615?
638
Pyne MI, Rader RB, Christensen WF (2007) Predicting local biological characteristics in streams: a com-
parison of landscape classifications. Freshw Biol 52:1302?1321
R Development Core Team (2009) R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. url:http://www.R-project.org
Rawlings JO, Pantula SG, Dickey DA (1998) Applied regression analysis: a research tool. 2 edn. Springer,
New York
Roy AH, Rosemond AD, Paul MJ, Leigh DS, Wallace JB (2003) Stream macroinvertebrate response to
catchment urbanisation (Georgia, U.S.A.). Freshw Biol 48:329?346
Schleiger SL (2000) Use of an index of biotic integrity to detect effects of land uses on stream fish com-
munities in west-central Georgia. Trans Am Fish Soc 129:1118?1133
SchmidM,Hothorn T (2008) Boosting additivemodels using component-wise P-splines. Comput Stat Data
Anal 53:298?311
Schmid M, Potapov S, Pfahlberg A, Hothorn T (2010) Estimation and regularization techniques for regres-
sion models with multidimensional prediction functions. Stat Comput 20:139?150
Southerland MT, Rogers GM, Kline MJ, Morgan RP, Boward DM, Kazyak PF, Klauda RJ, Stranko SA
(2005)MarylandBiological StreamSurvey 2000?2004, VolumeXVI: new biological indicators to bet-
ter assess the condition of Maryland streams. DNR-12-0305-0100, Maryland Department of Natural
Resources, Annapolis, MD
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58:267?288
USEPA (1999) From the mountains to the sea: the state of Maryland?s freshwater streams. EPA 903-R-
99-023. Office of Research and Development, US Environmental Protection Agency, Washington,
DC
USEPA (2006) Wadeable streams assessment: a collaborative survey of the Nation?s streams. EPA 841-B-
06-002. Office of Water, US Environmental Protection Agency, Washington, DC
Vannote RL, Minshall GW, Cummins KW, Sedell JR, Cushing CE (1980) The river continuum concept.
Can J Fish Aquatic Sci 37:130?137
Walsh CJ, Roy AH, Feminella JW, Cottingham PD, Groffman PM, Morgan RP (2005) The urban stream
syndrome: current knowledge and the search for a cure. J North Am Benthol Soc 24:706?723
Wang L, Lyons J (2003) Fish and benthic macroinvertebrate assemblages as indicators of stream degrada-
tion in urbanizing watersheds. In: Simon TP (ed) Biological response signatures: indicator patterns
using aquatic communities. CRC Press, New York, pp 227?249
Wood S (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, Boca Raton
123
Environ Ecol Stat (2011) 18:709?733 733
Author Biographies
Matthias Schmid is currently a researcher at the Department of Biometry and Epidemiology at the
Friedrich-Alexander University Erlangen-Nuremberg, Germany. He received a diploma (2004) and a Ph.D.
(2007) in Statistics from the University of Munich, Germany. His research interests include statistical learn-
ing, regression methods in biostatistics and statistical disclosure control.
Torsten Hothorn was born in Dresden, Germany in 1975. He received a diploma in statistics from
the Universit?t Dortmund in 2000 and was lecturer of statistics at the Friedrich-Alexander-Universit?t
Erlangen-N?rnberg later on. Since 2007 he is professor of Biostatistics at the Institut f?r Statistik, Ludwig-
Maximilians-Universit?t M?nchen. His research interests currently include statistical learning, nonpara-
metric statistics and statistical computing.
Kelly O. Maloney is currently a Research Ecologist with the US Geological Survey in Wellsboro,
Pennsylvania, USA. He graduated with a B.S. from The Pennsylvania State University in 1994 and a
Ph.D. in Biological Sciences from Auburn University, Alabama USA in 2004. His research focuses on
diversity patterns of aquatic communities and how anthropogenic stressors affect this diversity. He is cur-
rently evaluating the applicability of novel statistical techniques in addressing these issues.
Donald E. Weller is a Senior Scientist and Quantitative Ecologist at the Smithsonian Environmental
Research Center in Edgewater, Maryland, USA. He earned a B.A. in Biology from Wabash College, Craw-
fordsville, Indiana, USA (1974) and a Ph.D. in Ecology from the University of Tennessee, Knoxville, USA
(1985). Don has expertise in ecological modeling and landscape ecology. His recent research has focused
on the linkages of watersheds to wetland condition, to stream chemistry and biology, and to estuarine health.
Sergej Potapov is currently a Ph.D. researcher at the Department of Biometry and Epidemiology at the
Friedrich-Alexander University Erlangen-Nuremberg, Germany. He received his M.Sc. in Business Math-
ematics at the University of Augsburg, Germany (2006). His research interests include statistical learning,
classification methods and statistical computing.
123