Biol. Rev. (2002), 77, pp. 211–222. © Cambridge Philosophical Society. DOI: 10.1017/S1464793101005875. Printed in the United Kingdom.

Publication bias in ecology and evolution: an empirical assessment using the 'trim and fill' method

MICHAEL D. JENNIONS¹,²* and ANDERS P. MØLLER³

¹ School of Botany and Zoology, Australian National University, Canberra, A.C.T. 0200, Australia
² Smithsonian Tropical Research Institute, Unit 0948, APO AA 34002-0948, USA
³ Laboratoire d'Ecologie Evolutive Parasitaire, CNRS FRE 2365, Université Pierre et Marie Curie, 7 quai St. Bernard, Case 237, F-75252 Paris Cedex 5, France

(Received 9 July 2001; revised 29 October 2001; accepted 20 November 2001)

ABSTRACT

Recent reviews of specific topics, such as the relationship between male attractiveness to females and fluctuating asymmetry, or attractiveness and the expression of secondary sexual characters, suggest that publication bias might be a problem in ecology and evolution. In these cases, there is a significant negative correlation between the sample size of published studies and the magnitude or strength of the research findings (formally the 'effect size'). If all studies that are conducted are equally likely to be published, irrespective of their findings, there should not be a directional relationship between effect size and sample size; only a decrease in the variance in effect size as sample size increases due to a reduction in sampling error. One interpretation of these reports of negative correlations is that studies with small sample sizes and weaker findings (smaller effect sizes) are less likely to be published. If the biological literature is systematically biased this could undermine the attempts of reviewers to summarise actual biological relationships by inflating estimates of average effect sizes. But how common is this problem? And does it really affect the general conclusions of literature reviews? Here, we examine data sets of effect sizes extracted from 40 peer-reviewed, published meta-analyses. We estimate how many studies are missing using the newly developed 'trim and fill' method. This method uses asymmetry in plots of effect size against sample size ('funnel plots') to detect 'missing' studies. For random-effect models of meta-analysis, 38% (15/40) of data sets had a significant number of 'missing' studies. After correcting for potential publication bias, 21% (8/38) of weighted mean effects were no longer significantly greater than zero, and 15% (5/34) were no longer statistically robust when we used random-effects models in a weighted meta-analysis. The mean correlation between sample size and the magnitude of standardised effect size was also significantly negative (rs = −0.20, P < 0.0001). Individual correlations were significantly negative (P < 0.10) in 35% (14/40) of cases. Publication bias may therefore affect the main conclusions of at least 15–21% of meta-analyses. We suggest that future literature reviews assess the robustness of their main conclusions by correcting for potential publication bias using the 'trim and fill' method.

Key words: effect size, fail-safe number, fluctuating asymmetry, funnel plots, meta-analysis, publication bias, trim and fill.

* Author for correspondence at address 1 (e-mail: Michael.Jennions@anu.edu.au; Tel: +61 2 6125 3540; Fax: +61 2 6125 5573).

CONTENTS

I. Introduction
II. Methods
  (1) Data set
  (2) Calculating mean effect sizes
  (3) Testing for publication bias
III. Results
IV. Discussion
V. Conclusions
VI. Acknowledgements
VII. References
VIII. Appendix: the data sets

I. INTRODUCTION

The use of meta-analysis quantitatively to review a field of study is increasingly popular in biology (Arnqvist & Wooster, 1995; Møller & Jennions, 2001). Meta-analysis summarises the literature on a topic by transforming statistical tests of hypotheses into a common metric ('effect size'). 'Effect size' is 'the degree to which the phenomenon is present in a population' or 'the degree to which the null hypothesis is false' (Cohen, 1988, pp. 9–10). Meta-analysis allows for quantitative answers to questions about the average strength of an hypothesised relationship, or the extent and possible sources of heterogeneity in research findings. It has clear advantages over traditional narrative reviews but, as with any review process, it assumes that the scientific literature is unbiased. Ironically, the greater precision provided by meta-analysis has also prompted biologists to question whether the scientific literature really does accurately reflect the results of the many studies biologists initiate (Csada, James & Espie, 1996; Bauchau, 1997; Alatalo, Mappes & Elgar, 1997; Simmons et al., 1999; Palmer, 1999; Poulin, 2000; for the medical literature see Song & Gilbody, 1998).

There are many forms of bias in the scientific literature. Some are fairly innocuous, such as preferential citation of studies supporting the author's views, or of those by authors of the same nationality (comprehensively reviewed by Song et al., 2000), or even a tendency to cite more often authors with surnames beginning with letters near the start of the alphabet (Tregenza, 1997). Most troubling, however, is the situation where the magnitude and/or direction of research findings influences whether or not a completed study is submitted, positively reviewed and eventually accepted for publication. No malevolent intent to suppress findings is required to generate a 'publication bias', only a systematic prejudice at any stage of the publishing process (Palmer, 2000; Song et al., 2000; Møller & Jennions, 2001).
The most widely cited prejudice of researchers, reviewers and editors is towards statistically significant results (Palmer, 2000; Song et al., 2000; Møller & Jennions, 2001). In practice, there is probably an interaction between sample size and statistical significance. For a non-significant result to be published, sample sizes must be large. This is reasonable because the statistical power to detect a significant difference is low when samples are small (Cohen, 1988), so the null hypothesis of the absence of an effect of a given magnitude cannot be accepted with a reasonable degree of confidence. When a result is significant, however, reviewers and editors often ignore sample size. This is not a major problem if the true scientific relationship being examined is close to zero. It simply means that non-significant findings from studies with small samples will be selectively under-reported: when the average relationship is calculated it will still be close to zero. The real problem arises when the true relationship is moderate (Palmer, 1999). For studies with small samples, the only results published will tend to be those that are significant in the direction of the true effect (very few studies with an estimated effect opposite in direction to the 'true' effect will reach significance). This can lead to a systematic overestimation of the true effect size (Begg, 1994).

Although biologists are now aware of the problem, there has been no systematic attempt to determine its extent (Møller & Jennions, 2001). Is publication bias so severe that it grossly exaggerates the biological significance of certain phenomena, even generating 'collective illusions'? Palmer (2000) has argued that this could be the situation, but did not quantify the average effect of publication bias on general conclusions. How widespread is the problem? Ideally this question is resolved by obtaining information on completed but unpublished studies to see whether their inclusion alters the conclusions of meta-analyses. Unpublished studies are, however, extremely difficult to track down. There are several ways to try to model, and even to correct for, effects of publication bias (e.g. weighted distribution theory, general linear models and Bayesian modelling) (Begg, 1994; Gleser & Olkin, 1996). Unfortunately, though, the models developed to date are not implemented in readily available, user-friendly software; they require restrictive assumptions about the exact effects of probability values and sample sizes on publishability; and they are only accessible to those with advanced statistical modelling skills (DuMochel & Harris, 1997).

In a survey of 44 ecological and evolutionary meta-analyses, we identified a significant absence of studies with small sample sizes that present findings weaker than the weighted average effect size (Jennions & Møller, 2002). In other words, we found the mean correlation between sample size and the absolute value of the effect size to be significantly negative. Specific published examples of this phenomenon are given in the second paragraph of Section II(3). Publication bias is therefore a general phenomenon. Here, we use a new and simple method developed by Duval & Tweedie (2000a, b), called 'trim and fill', to estimate the number of unpublished or 'missing' studies. To our knowledge, the only other study to use this approach to estimate the potential impact of publication bias is an analysis by Sutton et al.
(2000a) of a set of meta-analyses of clinical medical trials. According to them, the only previous general statistical assessment of the prevalence of 'missing' studies in a collection of meta-analyses was that of Egger et al. (1997). We then test how robust general conclusions in biology are when 'corrected' for potential publication bias.

II. METHODS

(1) Data set

We made an extensive survey of the ecological and evolutionary literature for meta-analyses published up until the end of 2000. We examined the journals American Naturalist, Animal Behaviour, Behavioral Ecology, Behavioral Ecology and Sociobiology, Ecological Monographs, Ecology, Evolution, Evolutionary Biology, Journal of Evolutionary Biology, and Quarterly Review of Biology. We also entered the phrase 'meta-analy*' into the electronic database WebSPIRS to find papers where this term occurred in the title or abstract. We then examined the title and place of publication of each paper listed and directly inspected any that seemed related to evolutionary or ecological biology. Furthermore, we contacted a number of colleagues who have used meta-analyses in their research to locate meta-analyses currently in press. We excluded meta-analyses of genetic heritabilities because it is unclear whether h² itself, or an effect size based on the strength (rather than slope) of the phenotypic relationship between relatives, is the more appropriate effect size. Palmer (2000) has already shown that the problem of publication bias is especially severe for h² because negative values are biologically irrelevant and therefore under-reported.

We found 40 peer-reviewed meta-analyses for which we could also obtain effect sizes and variances for the original studies (either because they were included as appendices in the published paper or because the authors kindly responded to our request and sent us the data). The ability to detect unpublished studies relies on the asymmetric distribution of effect sizes (see below). Such asymmetry is unlikely to be detected with smaller samples. We therefore set a minimum sample size of eight studies per meta-analysis (Sutton et al. (2000a) used a minimum of 10). This only removed one otherwise usable meta-analysis (Fernandez-Duque & Valeggia, 1994). The meta-analyses we used are listed in Section VIII.

Most of the 40 original meta-analyses asked several different questions (i.e. examined several response variables). In such cases, different response variables were often taken from the same or a closely overlapping set of original empirical studies. To be statistically conservative, we limited our analysis to one response variable per original meta-analysis (that with the largest sample size). In addition, the original authors often found significantly more heterogeneity in effect size among studies than could be explained by sampling error. They therefore looked for underlying structure in the data by classifying studies into groups (e.g. birds versus insects) and testing for significant among-group variance in effect sizes for each categorical factor using Qb (Qb is a measure of the variance in effect size accounted for by differences among groups; Rosenberg, Adams & Gurevitch, 2000). For each of the original meta-analyses we therefore split the data using whichever categorical factor generated the greatest differences in effect sizes among groups (but only if P < 0.05 for Qb); a sketch of the Qb calculation is given below.
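To make the Qb statistic concrete, the following minimal sketch computes between-group heterogeneity under fixed-effect weighting and tests it against a chi-square distribution. This is our illustration of the standard formula, not code from MetaWin; the function name and data layout are ours.

```python
import numpy as np
from scipy.stats import chi2

def q_between(effects, variances, groups):
    """Between-group heterogeneity (Qb) under fixed-effect weighting.

    Qb = Q_total - Q_within, tested against a chi-square distribution
    with (number of groups - 1) degrees of freedom.
    """
    e = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    grand_mean = np.sum(w * e) / np.sum(w)
    q_total = np.sum(w * (e - grand_mean) ** 2)

    q_within = 0.0
    labels = sorted(set(groups))
    for lab in labels:
        idx = [i for i, g in enumerate(groups) if g == lab]
        wg, eg = w[idx], e[idx]
        group_mean = np.sum(wg * eg) / np.sum(wg)
        q_within += np.sum(wg * (eg - group_mean) ** 2)

    qb = q_total - q_within
    return qb, chi2.sf(qb, df=len(labels) - 1)

# Hypothetical example: studies split into 'bird' and 'insect' groups
qb, p = q_between([0.30, 0.50, 0.10, 0.05], [0.02, 0.04, 0.03, 0.02],
                  ['bird', 'bird', 'insect', 'insect'])
```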
Finally, using these criteria we selected the group with the largest number of empirical studies (mean ± s.e.m. = 45.1 ± 6.97, range = 8–246). If there were two or more groups with an equal number of studies we picked one at random. We used the same effect size type as the original authors. These were Pearson's r (N = 21), Hedges' d (N = 10), the natural log of the response ratio (lnRR) (N = 7) or a customised effect size (N = 2).

(2) Calculating mean effect sizes

We calculated mean effect sizes weighted for sample size using the software package MetaWin 2.0 (Rosenberg et al., 2000). We ran both fixed-effect (FE) and random-effect (RE) models for each of the 40 data sets. FE models assume a single true effect common to all the studies; variation in the observed effects is then attributed solely to sampling error. RE models allow for a true random component as a source of variation in effect size between studies, as well as sampling error (both calculations are sketched in code below). In general, RE models are preferred (N.R.C., 1992), especially in biology where there is almost certainly real variation in actual effect sizes among different taxa or ecosystems (Gurevitch & Hedges, 1999). Some earlier meta-analyses, however, only used FE models. Our estimates of weighted mean effect sizes may differ slightly from those reported in the original papers because of rounding errors and/or minor differences in the coding of original studies. We must stress that our main intent is not to criticise individual studies, but to highlight wider trends. We then used bootstrapping with 999 replications to calculate bias-corrected 95% confidence intervals. This approach does not require that effect sizes follow a parametric distribution. The weighted mean effect is significantly different from zero if the 95% confidence intervals do not overlap zero. Analyses based solely on parametric confidence intervals yielded the same conclusions in 139 out of 146 cases.

(3) Testing for publication bias

A funnel plot of effect size against log-transformed sample size should produce a funnel shape symmetric around the 'true' effect size (Light & Pillemer, 1984). Purely due to sampling error (the larger the sample, the more accurate the estimate), the variance in estimates of the 'true' effect size is higher for studies with smaller samples. The observed effect sizes should be normally distributed around the mean effect with no trend in relation to sample size (Light & Pillemer, 1984; Begg, 1994). If studies with statistically significant results are more likely to be published, however, and the true mean is close to zero, this will produce a 'hollowed-out' funnel (see Palmer, 2000). If the true effect is moderate and non-significant results tend not to be published, this will produce a skewed funnel in which the magnitude of the effect size decreases as sample size increases (Begg & Mazumdar, 1994; Palmer, 1999). This second publication bias will lead to an inaccurate estimate of the true effect size. Of course, a skewed funnel plot can be caused by factors other than publication bias, since prior knowledge of effect sizes from pilot studies, reduced sample sizes for certain species, choice of effect measures, chance and many other confounding variables may also create asymmetric plots (Thornhill, Møller & Gangestad, 1999; Gurevitch & Hedges, 1999). Even so, the robustness of meta-analytic conclusions can be tested by making the conservative assumption that skew is due to publication bias.
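Both the funnel plot's centre line and the models of Section II(2) rest on inverse-variance weighted means. The sketch below is ours rather than MetaWin's code; it shows the FE calculation, and an RE calculation that assumes the DerSimonian–Laird moment estimator of the between-study variance τ² (the usual choice, although we have not verified that MetaWin uses exactly this estimator).

```python
import numpy as np

def weighted_mean_effect(effects, variances, model="RE"):
    """Inverse-variance weighted mean effect size and its variance.

    model="FE": fixed-effect weights 1/v_i.
    model="RE": weights 1/(v_i + tau2), with tau2 estimated by the
    DerSimonian-Laird moment method.
    """
    e = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    mean_fe = np.sum(w * e) / np.sum(w)
    if model == "FE":
        return mean_fe, 1.0 / np.sum(w)

    # Moment estimate of the between-study variance (truncated at zero)
    k = len(e)
    q = np.sum(w * (e - mean_fe) ** 2)          # total heterogeneity
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * e) / np.sum(w_re), 1.0 / np.sum(w_re)
```

Because the RE weights are more uniform than the FE weights, large studies dominate the RE mean less, which is one reason the two models can disagree about both the mean and its significance.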
In the biological sciences, aside from the fail-safe number (see below), the only statistical approach widely used to test for publication bias has been to test for a significant relationship between sample size and effect size using rank correlation tests (Begg & Mazumdar, 1994; for related approaches see Macaskill, Walter & Irwig, 2001; Møller & Jennions, 2001). Palmer (1999) has called this correlation rbias. If studies with small samples are only published when they have significant results, and the 'true' effect is moderate, the funnel plot will be skewed. There will be a decline in the magnitude of the effect size with increasing sample size, because for studies with small sample sizes there is little likelihood that an effect opposite in direction to the 'true' effect will reach statistical significance. This decline has now been reported in a few specific fields of study (e.g. Palmer, 1999, 2000; Gontard-Danek & Møller, 1999; Jennions, Møller & Petrie, 2001). Recently, using 232 data sets from 44 evolutionary ecology meta-analyses that included most of the current database, we found that the average relationship is a highly significant, but small, decline in effect size with sample size (Jennions & Møller, 2002). Here, we estimate these correlations using Spearman's r, specifically to determine whether rbias and the estimated number of studies missing based on funnel-plot asymmetry (see below) are related. Although several authors, including ourselves, have previously presented the correlation between sample size and effect size (e.g. Palmer, 2000; Jennions et al., 2001), strictly speaking, effect size should first be standardised to conform to the assumptions of the test (see Begg & Mazumdar, 1994). We therefore standardised effect size here, although this makes little difference in most cases (M. D. Jennions & A. P. Møller, unpublished data).

More recently, the funnel plot has been used to derive a non-parametric method of testing and adjusting for possible publication bias in meta-analysis. The 'trim and fill' method of Duval & Tweedie (2000a, b) estimates the number of 'missing' studies due to publication bias. The method relies on the symmetric distribution of effect sizes around the 'true' effect size when there is no publication bias, and the simple assumption that the most extreme results have not been published. These will usually be studies with smaller sample sizes, because variance in effect size (hence the frequency of extreme values) increases as sample size decreases. Once the number of 'missing' studies is estimated, one recalculates the weighted mean effect size and its variance with these studies incorporated.

The statistical procedure involves an iterative process (Duval & Tweedie, 2000a, b). To start, one calculates the weighted mean effect size for the full data set and then 'trims off' the outlying part of the funnel plot that is asymmetrical with respect to the mean. Simple formulae are used to estimate the number of studies in the asymmetric part. These studies are then temporarily removed ('trimmed') and the remainder used to re-estimate the weighted mean effect. Then, again using the full set of studies, one 'trims off' those studies asymmetrical with respect to the new estimate of the mean. After just a few iterations the estimate of the number of studies that need to be trimmed reaches an asymptotic value. One can now 'fill in' the 'missing' studies. These are simply the mirror-image counterparts of the trimmed studies around the final weighted mean effect estimated using the symmetric portion of the data set. The 'missing' studies are given the same variance as their corresponding 'trimmed' counterparts. Finally, the full data set that includes the trimmed, missing and remaining studies is used to calculate the new mean effect size and its confidence intervals.
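The following sketch, ours and deliberately simplified, implements this iteration using the L₀ estimator introduced in the next paragraph. It assumes the 'missing' studies lie below the weighted mean (reverse the signs of the effects first if the asymmetry is on the other side) and it ignores tied ranks; Duval & Tweedie (2000a, b) should be consulted for the definitive algorithm.

```python
import numpy as np

def trim_and_fill(effects, variances, n_iter=20):
    """Sketch of Duval & Tweedie's trim-and-fill with the L0 estimator."""
    e = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    n = len(e)

    def wmean(eff, wt):
        return np.sum(wt * eff) / np.sum(wt)

    mu, k0 = wmean(e, w), 0
    for _ in range(n_iter):
        # L0 from the signed ranks of all studies centred on the current mean
        centred = e - mu
        ranks = np.argsort(np.argsort(np.abs(centred))) + 1   # ties ignored
        t_n = ranks[centred > 0].sum()        # rank sum of the 'heavy' side
        k0 = max(0, int(round((4 * t_n - n * (n + 1)) / (2 * n - 1))))

        # Trim the k0 largest effects and re-estimate the mean
        kept = np.argsort(e)[: n - k0]
        new_mu = wmean(e[kept], w[kept])
        if np.isclose(new_mu, mu):            # the estimate has stabilised
            break
        mu = new_mu

    # Fill: mirror the k0 extreme effects about the final mean, giving each
    # filled study the variance of its trimmed counterpart
    trimmed = np.argsort(e)[n - k0:] if k0 > 0 else np.array([], dtype=int)
    e_all = np.concatenate([e, 2.0 * mu - e[trimmed]])
    v_all = np.concatenate([v, v[trimmed]])
    return k0, wmean(e_all, 1.0 / v_all)
```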
Duval & Tweedie (2000a, b) present three different estimators (R₀, L₀, Q₀) for the number k of missing studies. Of these, L₀ is the best general-purpose estimator. Here, we follow Sutton et al. (2000a) in using L₀ to calculate k, using formulae entered onto an Excel spreadsheet (freely available on request). We did this for both FE and RE models in case the choice of model has an impact on the assessment of publication bias. However, because RE models are more appropriate, we place greater emphasis on the results for these models (N.R.C., 1992). Of course, chance asymmetry in a funnel plot will lead to positive values of L₀ (Sterne, 2000). We therefore also estimated how many meta-analyses had a value of L₀ that was significant at the 0.05 level (i.e. more studies missing than expected by chance). Critical values of L₀ were estimated by extrapolating from the simulations in Table 4 of Duval & Tweedie (2000b). These are therefore crude approximations.

Finally, we also calculated the fail-safe number of studies needed to nullify an effect at the 5% level, following Rosenthal (1991, p. 104). This number, X, estimates how many studies with a mean effect of zero are needed to change a significant effect to a non-significant one at the stated P level (here significance must be calculated using parametric 95% confidence intervals). A value of X greater than 5K + 10 is usually considered to indicate a robust conclusion, where K is the reported number of studies. Unless otherwise stated, data are presented as the mean ± s.e.m. For non-significant tests, we present the statistical power to detect a medium effect as defined by Cohen (1988) with P (two-tailed) = 0.05.
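Rosenthal's fail-safe number has a closed form: combining one-tailed z-scores by Stouffer's method, X is the number of added zero-effect studies needed to drag the combined test down to the critical value z_α, so X = (ΣZ/z_α)² − K. A minimal sketch, with an illustrative function name:

```python
import numpy as np
from scipy.stats import norm

def fail_safe_n(one_tailed_p, alpha=0.05):
    """Rosenthal's fail-safe number X and the 5K + 10 robustness rule.

    X solves sum(Z) / sqrt(K + X) = z_alpha: the number of unpublished
    studies averaging a null result needed to push the combined
    one-tailed P above alpha.
    """
    z = norm.isf(np.asarray(one_tailed_p, dtype=float))  # one-tailed z-scores
    k = len(z)
    z_alpha = norm.isf(alpha)                            # 1.645 at alpha = 0.05
    x = (z.sum() / z_alpha) ** 2 - k
    return x, x > 5 * k + 10

# Hypothetical example: ten studies, each with one-tailed P = 0.03
x, robust = fail_safe_n([0.03] * 10)                     # X ~ 121, robust
```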
III. RESULTS

Of the 40 meta-analyses, the initial estimate of weighted mean effect size differed significantly from zero (P < 0.05) in 38 RE models and 35 FE models. There were only three cases in which the conclusion differed depending on the choice of model. The weighted mean effects estimated using RE models were, on average, 29% greater than those from FE models (Wilcoxon's test, Z = 3.92, N = 40, P = 0.0001).

With RE models, 30 out of 40 meta-analyses were estimated to have one or more missing studies (L₀ > 0); with FE models this rose to 36 meta-analyses. We then noted whether the probability of each observed L₀ was less than 0.05. For RE models, 15 meta-analyses showed a significant publication bias; for FE models, 20 were significant. Thus, 38–50% of published meta-analyses show strong evidence for publication bias. The estimated number of studies missing was positively correlated for the 36 meta-analyses where both model types indicated that the studies were missing from the same side of the distribution (or where no study was missing for one or both models) (r = 0.429, P = 0.009, N = 36). There were, however, four cases where the side of the distribution from which the studies were missing differed between FE and RE models. On average, the estimate of the number of missing studies was higher for RE models than for FE models (sign test, P = 0.031; 7.1 ± 1.2 versus 5.8 ± 1.1).

Fig. 1. Original versus recalculated effect sizes for: (A) fixed-effects (FE) models for effect type Z-transformed r (Znew = −0.004 + 0.900Z; r²adj = 77.8%, F₁,₁₉ = 71.1, P < 0.0001, N = 21); (B) random-effects (RE) models for Z (Znew = −0.0261 + 1.120Z; r²adj = 85.3%, F₁,₁₉ = 116.7, P < 0.0001, N = 21); (C) FE models for effect type Hedges' d (dnew = 0.008 + 0.731d; r²adj = 72.1%, F₁,₈ = 24.2, P = 0.0012, N = 10); (D) RE models for Hedges' d (dnew = 0.125 + 0.763d; r²adj = 81.1%, F₁,₈ = 39.6, P = 0.0002, N = 10); (E) FE models for effect type natural log of the response ratio (lnRRnew = −0.110 + 1.463lnRR; r²adj = 76.9%, F₁,₅ = 21.0, P = 0.006, N = 7); (F) RE models for lnRR (lnRRnew = −0.095 + 1.468lnRR; r²adj = 80.7%, F₁,₅ = 26.1, P = 0.004, N = 7). In all graphs, the line of equality is shown.

For FE models, in 75% (27/36) of cases where we estimated that studies were missing, their addition moved the mean effect closer to zero. There was thus a significant trend for missing studies to make conclusions about the weighted mean effect less robust (binomial test, P < 0.005). For RE models, however, in only 57% (17/30) of cases where studies were missing did their addition move the mean effect closer to zero. Correction for missing studies was therefore equally likely to reduce or increase the robustness of conclusions about the weighted mean effect (binomial test, P = 0.58; power: 36%).

Fig. 2. Log₁₀ fail-safe number (X) before and after recalculation for: (A) fixed-effects models (Xnew = −0.0083 + 0.8778X; r²adj = 41.0%, F₁,₃₈ = 28.1, P < 0.0001); (B) mixed-effects models (Xnew = 0.0501 + 0.965X; r²adj = 63.8%, F₁,₃₈ = 69.7, P < 0.0001) (both N = 40).

Once we had added missing cases, we recalculated effect sizes to see if this changed the conclusions of the original meta-analyses based on 95% confidence intervals. For FE models, in three cases an initially significant result became non-significant. For RE models, in eight cases a significant result became non-significant. (These 11 cases were from 11 different meta-analyses.) No means that were originally judged non-significant became significant. In sum, 8–21% of published meta-analyses may have been interpreted incorrectly.

The potential for publication bias was also reflected in the correlation between effect size variance and the magnitude of the observed effect size (standardised). The mean Begg–Mazumdar correlation was rs = −0.201 (t = 5.34, d.f. = 39, P < 0.0001). Thus, the effect size was closer to zero as sample size increased. There were 14 significantly negative correlations (at P < 0.10), but no significantly positive correlations. [Begg & Mazumdar (1994) recommended the use of P = 0.10 because of the low power of the test.]
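The standardisation mentioned in Section II(3) can be sketched as follows. The code is ours; note that Begg & Mazumdar (1994) originally correlated the standardised effects with their variances using Kendall's tau, whereas the values reported above are Spearman correlations against sample size.

```python
import numpy as np
from scipy.stats import spearmanr

def rbias(effects, variances, sample_sizes):
    """Rank correlation test for funnel-plot asymmetry.

    Each effect's deviation from the fixed-effect weighted mean is
    divided by the standard error of that deviation (Begg & Mazumdar,
    1994) before being correlated with sample size.
    """
    e = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    mean_fe = np.sum(w * e) / np.sum(w)
    var_mean = 1.0 / np.sum(w)                  # variance of the weighted mean
    t = (e - mean_fe) / np.sqrt(v - var_mean)   # standardised effect sizes
    return spearmanr(t, sample_sizes)           # (rs, P); negative rs suggests bias
```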
We then tested if the Begg–Mazumdar correlation predicts the number of studies that are missing according to 'trim and fill' methods, and whether these studies are greater or smaller than the weighted mean effect. We coded L₀ as positive if the addition of missing studies decreased the mean effect. When the Begg–Mazumdar rs is negative, this implies that studies with effects smaller than the weighted mean are missing, so L₀ should be positive. We therefore predict that L₀ and Begg–Mazumdar rs are negatively correlated. Because the absolute value of L₀ will increase with sample size by chance, we first divided L₀ by the number of studies in the meta-analysis. As predicted, there was a significant negative correlation for both FE models (r = −0.423, P = 0.007, N = 40) and RE models (r = −0.397, P = 0.011, N = 40).

Of course, a change from a significant to a non-significant weighted mean effect is a crude measure of the importance of missing studies if the real aim is to see how robust the original estimate of effect size is to publication bias. We examined this by looking at the three main effect types separately. For Pearson's r, the mean percentage change (calculated as the absolute difference in r before and after recalculation divided by the absolute value of the original r) was 57.4 ± 27.1% for FE models and 36.5 ± 11.8% for RE models. This corresponded to mean absolute changes in r of 0.037 ± 0.008 and 0.039 ± 0.008, respectively (N = 21). For Hedges' d, the mean percentage change was 30.3 ± 7.9% for FE models and 17.0 ± 5.4% for RE models. This corresponded to absolute changes in d of 0.187 ± 0.053 and 0.153 ± 0.061, respectively (N = 10). For the natural log of the response ratio (lnRR), the mean percentage change was 20.9 ± 10.9% for FE models and 27.0 ± 10.2% for RE models. This corresponded to absolute changes in lnRR of 0.121 ± 0.086 and 0.135 ± 0.085, respectively (N = 7). The percentage change is large even though the absolute difference in effect size estimates is small, because mean effect sizes in evolution and ecology are weak (mean r = 0.21–0.27; A. P. Møller & M. D. Jennions, in preparation). Original and recalculated effect sizes were similar, but 15–28% of the variation in recalculated mean effect size is unexplained by the original estimate of effect size (Fig. 1).

When results are non-significant, researchers often use an a priori estimate of effect size to calculate power. In the absence of pilot studies, they may rely on a general estimate of effect sizes for relationships in their field of study (e.g. A. P. Møller & M. D. Jennions, in preparation). Most researchers, however, provide power for statistical tests assuming that the true effect is small, medium or large as defined by Cohen (1988) (e.g. small: r = 0.1, d = 0.2; medium: r = 0.2, d = 0.5; large: r = 0.3, d = 0.8). We therefore classified effect sizes before and after recalculation using the criteria of Cohen (1988).
We could only do this for 31 meta-analyses (those using r or Hedges' d). For RE models, the classification of the weighted mean effect remained unchanged for 28 estimates (large for seven, medium for 10 and small for 11). One large estimate became medium, and one medium estimate became small, while one small estimate became medium. Thus, 10% (3/31) of effect sizes had to be reclassified after correcting for publication bias.

Finally, we looked at the robustness of weighted mean effects by examining fail-safe numbers. By the convention that X > (5K + 10) is robust, 85% (34/40) of the original weighted mean effect estimates were robust. With FE models, 20.6% (7/34) of the estimates were no longer robust after being recalculated. For RE models, 14.7% (5/34) of the effect sizes were no longer robust after being recalculated. For the 34 original results that were robust, in 11 cases for FE models (32%) and in 21 cases for RE models (62%) the weighted mean effect was the same or more robust after recalculation (Fisher's Exact test, P = 0.028) (Fig. 2A, B).

IV. DISCUSSION

To start, there was broad agreement between FE and RE models as to whether or not weighted mean effects were significant, although estimates from RE models were approximately 29% larger. (The larger mean estimates for RE models may 'compensate' for their generally broader confidence intervals, which, all else being equal, should decrease the likelihood of reporting a significant effect size.) If anything, the initial use of FE models by biologists may have led to weighted mean effect sizes being slightly underestimated. The use of the 'trim and fill' method, however, showed a significant number of 'missing' studies for 38% of RE model meta-analyses and 50% of FE model meta-analyses. On average, the number of studies missing was significantly greater for RE models.

Previously, the Begg–Mazumdar correlation has been the main test used to detect possible publication bias. As expected, we found significant agreement between estimates of publication bias based on this correlation and those based on the 'trim and fill' approach. (At present, though, there are insufficient data to determine whether the Begg–Mazumdar correlation can be used as a 'short-cut' to predict the effect of adding 'missing' studies on the robustness of recalculated effect sizes.) Furthermore, individually significant Begg–Mazumdar correlations also suggest a publication bias in 35% of the 40 meta-analyses. The asymmetry in funnel graphs previously described for specific topics in ecology and evolution (e.g. Palmer, 2000; Jennions et al., 2001) therefore appears to reflect a more general phenomenon (Jennions & Møller, 2002). Although alternative explanations for a skewed funnel graph should not be neglected (Thornhill et al., 1999), publication bias could be a potential problem for at least one in three meta-analyses.

Correcting for publication bias using the 'trim and fill' method of Duval & Tweedie (2000a, b) led to three out of 35 and eight out of 38 initially significant weighted mean effects becoming non-significant for FE and RE models, respectively. The weighted mean effect size was no longer robust after being recalculated for 21% of FE models and 15% of RE models. If these findings can be generalized to future studies, then 15–21% of meta-analyses in ecology and evolution could reach erroneous conclusions if no correction is made for publication bias. By contrast, Sutton et al.
(2000a) suggested that only 5–10% of medical meta-analyses might have reached an incorrect interpretation because of publication bias.

There was one significant and unexpected difference between FE and RE models. For FE models, 71% of recalculated weighted means were less robust because missing studies reduced the mean effect size. By contrast, for RE models, 59% were the same or more robust. For RE models, the 'missing studies' were often larger than the initial weighted mean. This is not as predicted by publication bias, and suggests that asymmetry in funnel plots may have other, as yet unknown, causes. Sutton et al. (2000a) only reported decreases in weighted mean effect size in an analysis of 48 medical meta-analyses, although this is not the case when one analyses other data sets (R. Tweedie, personal communication). Chance asymmetry provides an insufficient explanation for our findings, because in eight of the 13 cases where missing studies had effect sizes larger than the weighted mean effect size, the number of studies missing was significantly more than expected by chance. The difference between the medical studies of Sutton et al. (2000a) and the ecological/evolutionary studies analysed here may relate to differences in the range of sample sizes. One or two studies with unusually strong effects and larger sample sizes can generate a weighted mean effect that is larger than most of the reported effect sizes. For example, the removal of three out of 84 cases with large effects and small variances from Gontard-Danek & Møller (1999) changes the situation from an estimate of 14 missing studies larger than the mean and a recalculated weighted mean of r = 0.43, to no missing studies and the original weighted mean of r = 0.35. The effect of publication bias in ecology and evolution when using RE models for meta-analysis, at least for the currently available data, is therefore idiosyncratic. In eight cases, the weighted mean became statistically non-significant, but in 13 cases the mean stayed significant and was actually more robust after correcting for publication bias. In general, though, we suggest that any increase in robustness be disregarded. The observed asymmetry in the funnel plots could even be due to selective reporting of studies with effects smaller than the average effect, perhaps because of a prejudice or 'backlash' against a well-established idea or so-called 'bandwagons' (Palmer, 2000; Poulin, 2000). For now, however, we think it is best to be conservative and simply ask whether results remain robust after correcting for funnel-plot asymmetry.

Sterne (2000) criticized Sutton et al. (2000a), saying that the 'trim and fill' method leads to false positive claims of missing studies. This is true, although the criticism can partly be answered by highlighting how often the number of missing studies is too large to be attributed to chance alone (i.e. P < 0.05). Here, this occurred for 38% of the random-effects models. More generally, Sutton et al. (2000b) emphasise that the critical aim of 'trim and fill' is not to quantify exactly how many studies are really missing. Rather, it is to test whether results are robust and conclusions unchanged when we correct for possible bias. Here, we show that for the preferred RE model approach, 21% (eight of 38 meta-analyses) were not robust to potential publication bias.
We therefore strongly recommend that authors of meta-analyses routinely include estimates of recalculated effect sizes using 'trim and fill' methods. These methods are easy to apply and, along with fail-safe numbers (Rosenthal, 1991) and Begg–Mazumdar correlations (Begg & Mazumdar, 1994), allow readers to decide for themselves how sensitive a reported significant relationship is to potential bias. Finally, reviewers should also be careful about drawing conclusions about factors that lead to heterogeneity in effect sizes. At present, there is no way of knowing what the effect of 'missing' studies will be on tests of significant variation in effect size among different groups. Significant between-group heterogeneity should therefore be assessed in the light of the number of missing studies. Perhaps 'missing' studies could be conservatively assigned so as to reduce between-group heterogeneity, to test whether group differences are still statistically significant.

V. CONCLUSIONS

(1) If asymmetry in a 'funnel plot' of effect size against sample size is due to publication bias, the 'trim and fill' method indicates that 38–50% of data sets have a significant publication bias, as indicated by an excess of 'missing' studies.

(2) The more familiar Begg–Mazumdar correlation also showed that, on average, effect sizes are closer to zero as sample size increases. In general, the findings of the 'trim and fill' method and the Begg–Mazumdar correlation were in agreement. 'Trim and fill', however, has the advantage that a 'correction' for possible publication bias can be made by adding 'missing' studies.

(3) In 75% of the cases analysed using fixed-effects models, the addition of missing studies moved the mean effect closer to zero. For random-effects models, the equivalent figure was 57%. More importantly, in 8–21% of data sets the original estimate of the average relationship, which had differed significantly from zero, became non-significant.

(4) Stated slightly differently, using the conventional definition of a 'robust' result based on Rosenthal's fail-safe numbers, 15–21% of effect size estimates were no longer robust after being recalculated to include 'missing' studies.

(5) We conclude that publication bias is a potential problem for reviewers. This is most clearly seen when a quantitative reviewing method like meta-analysis is used, but it is equally true for narrative reviews.

(6) Reviewers should always test for publication bias. Aside from established techniques, we specifically recommend the use of 'trim and fill'. It is the only method that allows one to estimate conservatively whether publication bias has influenced the conclusions reached by a reviewer. It is easy to use and does not require expensive software or advanced statistical skills.

VI. ACKNOWLEDGEMENTS

We thank Göran Arnqvist, Michael Brett, Isabelle Côté, Peter Curtis, Mark Forbes, Peter Hamback, Nick Jonsson, Julia Koricheva, Dean McCurdy, Fiorenza Micheli, Iago Mosqueira, Robert Poulin, Howie Riessen, Michael Rosenberg, Gina Schalk, Xianzhong Wang, Peter van Zandt and others we may have inadvertently omitted for kindly providing unpublished information. J. Shykoff, D. Pope, B. Backwell and J. Christy kindly discussed issues of meta-analysis and the general idea behind the present study. M.D.J. thanks the Director of STRI for bridging funding. Papers describing the 'trim and fill' method can be downloaded for free at: http://www.
biostat.umn.edu/~tweedie/documents/tweediecurrentpapers.html.

VII. REFERENCES

Alatalo, R. V., Mappes, J. & Elgar, M. A. (1997). Heritabilities and paradigm shifts. Nature 385, 402–403.
Arnqvist, G. & Wooster, D. (1995). Meta-analysis: synthesizing research findings in ecology and evolution. Trends in Ecology and Evolution 10, 236–240.
Bauchau, V. (1997). Is there a "file drawer problem" in biological research? Oikos 79, 407–409.
Begg, C. B. (1994). Publication bias. In The Handbook of Research Synthesis (eds H. Cooper and L. V. Hedges), pp. 399–409. Russell Sage Foundation, New York.
Begg, C. B. & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics 50, 1088–1101.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. L. Erlbaum, Hillsdale, New Jersey.
Csada, R. D., James, P. C. & Espie, R. H. M. (1996). The "file drawer problem" of non-significant results: does it apply to biological research? Oikos 76, 591–593.
DuMochel, W. & Harris, J. (1997). Comments on Givens, G. H., Smith, D. D. and Tweedie, R. L. (1997) Publication bias in meta-analysis: a Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate (with discussion). Statistical Science 12, 244–245.
Duval, S. & Tweedie, R. (2000a). A non-parametric 'trim and fill' method of assessing publication bias in meta-analysis. Journal of the American Statistical Association 95, 89–98.
Duval, S. & Tweedie, R. (2000b). Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56, 455–463.
Egger, M., Davey-Smith, G., Schneider, M. & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315, 629–634.
Fernandez-Duque, E. & Valeggia, C. (1994). Meta-analysis: a valuable tool in conservation research. Conservation Biology 8, 555–561.
Gleser, L. J. & Olkin, I. (1996). Models for estimating the number of unpublished studies. Statistics in Medicine 15, 2493–2507.
Gontard-Danek, M. C. & Møller, A. P. (1999). The strength of sexual selection: a meta-analysis of bird studies. Behavioral Ecology 10, 476–486.
Gurevitch, J. & Hedges, L. V. (1999). Statistical issues in ecological meta-analyses. Ecology 80, 1142–1149.
Jennions, M. D. & Møller, A. P. (2002). Relationships fade with time: a meta-analysis of temporal trends in publication in ecology and evolution. Proceedings of the Royal Society of London B (in press).
Jennions, M. D., Møller, A. P. & Petrie, M. (2001). Sexually selected traits and adult survival: a meta-analysis. Quarterly Review of Biology 76, 3–36.
Light, R. J. & Pillemer, D. B. (1984). Summing Up: The Science of Reviewing Research. Harvard University Press, Cambridge, Massachusetts.
Macaskill, P., Walter, S. D. & Irwig, L. (2001). A comparison of methods to detect publication bias in meta-analysis. Statistics in Medicine 20, 641–654.
Møller, A. P. & Jennions, M. D. (2001). Testing and adjusting for publication bias. Trends in Ecology and Evolution 16, 580–586.
N.R.C. Committee on Applied and Theoretical Statistics (1992). Combining Information: Statistical Issues and Opportunities for Research. National Academy Press, Washington, D.C.
Palmer, A. R. (1999). Detecting publication bias in meta-analysis: a case study of fluctuating asymmetry and sexual selection. American Naturalist 154, 220–233.
Palmer, A. R. (2000).
Quasireplication and the contract of error: lessons from sex ratios, heritabilities and fluctuating asymmetry. Annual Review of Ecology and Systematics 31, 441–480.
Poulin, R. (2000). Manipulation of host behaviour by parasites: a weakening paradigm? Proceedings of the Royal Society of London B 267, 787–792.
Rosenberg, M. S., Adams, D. C. & Gurevitch, J. (2000). MetaWin: Statistical Software for Meta-analysis. Version 2.0. Sinauer Associates, Massachusetts.
Rosenthal, R. (1991). Meta-analytic Procedures for Social Research. Sage Foundation, Newbury Park, California.
Simmons, L. W., Tomkins, J. L., Kotiaho, J. S. & Hunt, J. (1999). Fluctuating paradigm. Proceedings of the Royal Society of London B 266, 593–595.
Song, F., Eastwood, A. J., Gilbody, S., Duley, L. & Sutton, A. J. (2000). Publication and related biases. Health Technology Assessment 4 (10), 1–115.
Song, F. & Gilbody, S. (1998). Increase in studies of publication bias coincided with increasing use of meta-analysis. British Medical Journal 316, 471.
Sterne, J. A. C. (2000). High false positive rate for trim and fill method. British Medical Journal (electronic letter at URL: www.bmj.org/cgi/eletters/320/7249/1574).
Sutton, A. J., Duval, S. J., Tweedie, R. L., Abrams, K. R. & Jones, D. R. (2000a). Empirical assessment of effect of publication bias on meta-analyses. British Medical Journal 320, 1574–1577.
Sutton, A. J., Duval, S. J., Tweedie, R. L., Abrams, K. R. & Jones, D. R. (2000b). High false positive rate for trim and fill method. British Medical Journal (electronic letter at URL: www.bmj.org/cgi/eletters/320/7249/1574).
Thornhill, R., Møller, A. P. & Gangestad, S. (1999). The biological significance of fluctuating asymmetry and sexual selection: a reply to Palmer. American Naturalist 154, 234–241.
Tregenza, T. (1997). Darwin a better name than Wallace? Nature 385, 480.

VIII. APPENDIX: THE DATA SETS

Data sets were taken from the following 40 peer-reviewed meta-analyses. The necessary information was either in the published paper or generously made available to us by the authors.

Arnqvist, G. & Nilsson, T. (2000). The evolution of polyandry: multiple mating and female fitness in insects. Animal Behaviour 60, 145–164.
Arnqvist, G., Rowe, L., Krupa, J. J. & Sih, A. (1996). Assortative mating by size: a meta-analysis of mating patterns in water striders. Evolutionary Ecology 10, 265–284.
Boissier, J., Morand, S. & Mone, H. (1999). A review of performance and pathogenicity of male and female Schistosoma mansoni during the life cycle. Parasitology 119, 447–454.
Brett, M. T. & Goldman, C. (1996). A meta-analysis of the freshwater trophic cascade. Proceedings of the National Academy of Sciences U.S.A. 93, 7723–7726.
Côté, I. M. & Poulin, R. (1995). Parasitism and group size in social animals: a meta-analysis. Behavioral Ecology 6, 159–165.
Côté, I. M. & Sutherland, W. J. (1997). The effectiveness of removing predators to protect bird populations. Conservation Biology 11, 395–405.
Curtis, P. S. (1996). A meta-analysis of leaf gas exchange and nitrogen in trees grown under elevated carbon dioxide. Plant, Cell and Environment 19, 127–137. (Using the data file at URL: http://cdiac.esd.ornl.gov/epubs/ndp/ndp072/ndp072.html)
Curtis, P. S. & Wang, X. (1998). A meta-analysis of elevated CO₂ effects on woody plant mass, form, and physiology. Oecologia 113, 299–313. (Using the data file at URL: http://cdiac.esd.ornl.gov/epubs/ndp/ndp072/ndp072.html)
Fiske, P., Rintamäki, P.
& Karvonen, E. (1998). Mating success in lekking males: a meta-analysis. Behavioral Ecology 9, 328–338.
Gontard-Danek, M. C. & Møller, A. P. (1999). The strength of sexual selection: a meta-analysis of bird studies. Behavioral Ecology 10, 476–486.
Gurevitch, J. & Hedges, L. V. (1993). Meta-analysis: combining the results of independent experiments. In Design and Analysis of Ecological Experiments (eds S. Scheiner and J. Gurevitch), pp. 378–398. Chapman and Hall, New York. (Data as presented in Rosenberg et al., 2000.)
Harper, D. G. C. (1999). Feather mites, pectoral muscle condition, wing length and plumage coloration of passerines. Animal Behaviour 58, 553–562.
Järvinen, A. (1991). A meta-analytic study of the effects of female age on laying-date and clutch-size in the Great Tit Parus major and the Pied Flycatcher Ficedula hypoleuca. Ibis 133, 62–67.
Jennions, M. D., Møller, A. P. & Petrie, M. (2001). Sexually selected traits and adult survival: a meta-analysis of the phenotypic relationship. Quarterly Review of Biology 76, 3–36.
Koricheva, J. (2002). Meta-analysis of sources of variation in fitness costs of plant antiherbivore defenses. Ecology 83, 176–190.
Koricheva, J., Larsson, S. & Haukioja, E. (1998). Insect performance on experimentally stressed woody plants: a meta-analysis. Annual Review of Entomology 43, 195–216.
Koricheva, J., Larsson, S., Haukioja, E. & Keinanen, M. (1999). Regulation of woody plant secondary metabolism by resource availability: hypothesis testing by means of meta-analysis. Oikos 83, 212–226.
Leung, B. & Forbes, M. R. (1996). Fluctuating asymmetry in relation to stress and fitness: effects of trait type as revealed by meta-analysis. Ecoscience 3, 400–413. (Using the data file at URL: www.biology.ualberta.ca/palmer.hp/DataFiles.htm)
Micheli, F. (1999). Eutrophication, fisheries and consumer-resource dynamics in marine pelagic ecosystems. Science 285, 1396–1398.
Møller, A. P. (1999). Asymmetry as a predictor of growth, fecundity and survival. Ecology Letters 2, 149–156.
Møller, A. P. (2000). Developmental stability and pollination. Oecologia 123, 149–157.
Møller, A. P. & Alatalo, R. V. (1999). Good genes effects in sexual selection. Proceedings of the Royal Society of London B 266, 85–91.
Møller, A. P., Christe, P., Erritzøe, J. & Mavarez, J. (1998). Condition, disease and immune defence. Oikos 83, 301–306.
Møller, A. P., Christe, P. & Lux, E. (1999). Parasitism, host immune function, and sexual selection. Quarterly Review of Biology 74, 3–20.
Møller, A. P. & Ninni, P. (1998). Sperm competition and sexual selection: a meta-analysis of paternity studies of birds. Behavioral Ecology and Sociobiology 43, 345–358.
Møller, A. P. & Shykoff, J. A. (1999). Morphological developmental stability in plants: patterns and causes. International Journal of Plant Sciences 160, S135–S146.
Møller, A. P. & Thornhill, R. (1998). Bilateral symmetry and sexual selection: a meta-analysis. American Naturalist 151, 174–192.
Mosqueira, I., Côté, I. M., Jennings, S. & Reynolds, J. D. (2000). Conservation benefits of marine reserves for fish populations. Animal Conservation 3, 321–332.
Osenberg, C. W., Sarnelle, O., Cooper, S. D. & Holt, R. D. (1999). Resolving ecological questions through meta-analysis: goals, metrics, and models. Ecology 80, 1105–1117. (Data set 1.)
Osenberg, C. W., Sarnelle, O., Cooper, S. D. & Holt, R. D. (1999). Resolving ecological questions through meta-analysis: goals, metrics, and models.
Ecology 80, 1105–1117. (Data set 2.)
Poulin, R. (2000). Manipulation of host behaviour by parasites: a weakening paradigm? Proceedings of the Royal Society of London B 267, 787–792.
Poulin, R. (2000). Variation in the intraspecific relationship between fish length and intensity of parasitic infection: biological and statistical causes. Journal of Fish Biology 56, 123–137.
Riessen, H. P. (1999). Predator-induced life history shifts in Daphnia: a synthesis of studies using meta-analysis. Canadian Journal of Fisheries and Aquatic Sciences 56, 2487–2494.
Schmitz, O. J., Hamback, P. A. & Beckerman, A. P. (2000). Trophic cascades in terrestrial systems: a review of the effects of carnivore removals on plants. American Naturalist 155, 141–153.
Sokolovska, N., Rowe, L. & Johansson, F. (2000). Fitness and body size in mature odonates. Ecological Entomology 25, 239–248.
Thornhill, R. & Møller, A. P. (1998). The relative importance of size and asymmetry in sexual selection. Behavioral Ecology 9, 546–551. (Only effect sizes for size.)
Van der Werf, E. (1992). Lack's clutch size hypothesis: an examination of the evidence using meta-analysis. Ecology 73, 1699–1705.
Van Zandt, P. A. & Mopper, S. (1998). A meta-analysis of adaptive deme formation in phytophagous insect populations. American Naturalist 152, 595–604.
Vøllestad, L. A., Hindar, K. & Møller, A. P. (1999). A meta-analysis of fluctuating asymmetry in relation to heterozygosity. Heredity 83, 206–218.
Wang, X. & Curtis, P. S. (2002). A meta-analytical test of elevated CO₂ effects on plant respiration. Plant Ecology (in press).