A survey of the statistical power of research in behavioral ecology and animal behavior

Michael D. Jennions(a) and Anders Pape Møller(b)

(a) School of Botany and Zoology, Australian National University, Canberra, A.C.T. 0200, Australia, and Smithsonian Tropical Research Institute, Apartado 2072, Balboa, Republic of Panama
(b) Laboratoire d'Ecologie Evolutive Parasitaire, CNRS FRE 2365, Université Pierre et Marie Curie, Bât. A, 7ème étage, 7 quai St. Bernard, Case 237, F-75252 Paris Cedex 5, France

Address correspondence to M.D. Jennions at the School of Botany and Zoology, Australian National University, Canberra, A.C.T. 0200, Australia. E-mail: michael.jennions@anu.edu.au. Received 7 November 2001; revised 25 June 2002; accepted 2 September 2002. © 2003 International Society for Behavioral Ecology.

We estimated the statistical power of the first and last statistical test presented in 697 papers from 10 behavioral journals. First tests had significantly greater statistical power and reported more significant results (smaller p values) than did last tests. This trend was consistent across journals, taxa, and the type of statistical test used. On average, statistical power was 13–16% to detect a small effect and 40–47% to detect a medium effect. This is far lower than the general recommendation of a power of 80%. By this criterion, only 2–3%, 13–21%, and 37–50% of the tests examined had the requisite power to detect a small, medium, or large effect, respectively. Neither p values nor statistical power varied significantly across the 10 journals or 11 taxa. However, mean p values of first and last tests were significantly correlated across journals (r = .67, n = 10, p = .034), with a similar trend for mean power (r = .63, n = 10, p = .051). There is therefore some evidence that power and p values are repeatable among journals. Mean p values or power of first and last tests were, however, uncorrelated across taxa. Finally, there was a significant correlation between power and reported p value for both first (r = .13, n = 684, p = .001) and last tests (r = .16, n = 654, p < .0001). If true effect sizes are unrelated to study sample sizes, the average true effect size must be nonzero for this pattern to emerge. This suggests that failure to observe significant relationships is partly owing to small sample sizes, as power increases with sample size. Key words: effect size, meta-analysis, publication bias, sample sizes, statistical power. [Behav Ecol 14:438–445 (2003)]

The biological literature is dominated by reports of statistically significant patterns of association (Csada et al., 1996). This may partly reflect a publication bias toward significant findings (Kotiaho and Tomkins, 2002; Palmer, 1999, 2000). Recent reviews show that studies with both small sample sizes and nonsignificant results are underrepresented in the literature (Jennions and Møller, 2002a,b). This could bias our assessment of the average strength of biological relationships. Biologists need to ensure that studies are equally publishable whether their results are significant or not. This, however, raises a problem. Should the criterion for publication be an a priori minimal level of confidence in the conclusion of a study, in case the observed outcome turns out to be a nonsignificant result? If the answer is yes, then, to determine publishability, we must calculate our confidence in a conclusion that there is no significant effect. If the answer is no, we must still do this because nonsignificant results are then published, and readers need to assess how much confidence to place in a negative conclusion. Leaving aside whether a dichotomy into significant and nonsignificant results is appropriate for biologists (Stoehr, 1999), this question is best answered by statistical power analysis (but see Hoenig and Heisey, 2001).
Power is the probability of obtaining a significant result when the null hypothesis is false. Power increases as sample size, the α-level of significance, and effect size (the magnitude of the difference between the alternative and null hypotheses) increase, and decreases with greater variance in the study population. If the effect size is a standardized measure (e.g., the mean difference between two groups expressed in standard deviations, d, or the correlation coefficient, r), it is dimensionless, and there is no need to specify population variance to calculate statistical power (Thomas and Krebs, 1997). The use of standardized measures can, however, yield differences in observed effect size solely owing to predictable differences in the likelihood of measurement error (e.g., between laboratory and field studies; Hurlbert, 1994). In general, however, biologists report post-hoc statistical power to detect standardized measures of effect sizes of specific magnitude, conventionally referred to as small, medium, or large effects (Cohen, 1988). Despite being urged to incorporate power analysis into research design and presentation (see Greenwood, 1993; Peres-Neto and Olden, 2001; Stoehr, 1999; Thomas and Juanes, 1996; Thompson and Neill, 1993; Toft and Shea, 1983), most behavioral ecologists still report nonsignificant results without indicating a test's statistical power (Stoehr, 1999; this study). Since the first survey by Cohen (1962), those in the medical and social sciences have conducted surveys to estimate average power in specific fields or journals (see Chung et al., 1998; Kloster and Layne, 1997; Maddock and Rossi, 2001; Moher et al., 1994; additional examples in other disciplines are given by Cohen, 1988: xi). In biology, the effect of low power has been examined in a few specific areas. For example, Noor and Smith (2000) showed that low power might affect conclusions of studies on sexual isolation in Drosophila. Palmer (2000) recently pointed out that the ability to detect small to medium deviations from a one-to-one sex ratio in vertebrates with even moderate power requires sample sizes far larger than those in most published studies. To our knowledge, there has been no systematic attempt to conduct a power analysis survey of a broad area of research in biology.

Why quantify average statistical power? After all, interpreting the results of a specific statistical test depends solely on its own power. We propose four reasons. First, surveys invariably show that the general power to detect relationships is far lower than most researchers think (see Dickinson et al., 2000; Kazantzis, 2000). Ignorance of the relationship between sample size and power could explain why researchers often conduct studies with small sample sizes, when even modest increases could have greatly improved their statistical power (Thomas and Juanes, 1996). When confronted with the reality of low power, researchers may be encouraged to explicitly consider sample size and improve experimental design before conducting studies. Second, power surveys can determine whether researchers are becoming more aware. If they are, statistical power should increase through time.
For example, Rossi (1990) and Sedlmeier and Gigerenzer (1989) replicated Cohen's original 1962 survey and found no increase in power between 1960 and 1982–1984. In contrast, Moher et al. (1994) reported an increase in power in randomized control trials in medicine over a 25-year period. Stoehr (1999) disagreed with a reviewer who felt he had created a straw man by stating that most behavioral researchers do not take power analysis seriously. This difference in opinion is easily resolved by comparing the mean power of tests conducted now and in the future (or past). Third, if the literature is replete with tests with low power, this should influence how researchers interpret the literature. (Even if the main focus of a study is a clearly significant result, interpretation often relies on the absence of confounding variables. Their absence is usually based on subsidiary tests reporting nonsignificant results.) Many papers in behavioral ecology reach strongly worded conclusions after refuting an alternate hypothesis, even though the power to reject the null hypothesis was extremely low. This leads to fallacious vote counting if scientists simply tally the proportion of studies that detect a relationship without considering the influence of sample size (Cooper and Hedges, 1994). Fourth, behavioral ecologists may fail to report power for fear that this will reduce the likelihood of publication when reviewers see low values. The review process, however, involves assessment of manuscripts relative to hypothetical alternatives (e.g., is this paper in the top 25%?). These alternatives are unavailable, so a reviewer's assessment is usually based on the quality of previous studies. Knowing the statistical power of recently published studies will provide an empirical benchmark that allows better informed decisions. A statistical power of 30% to detect a small effect is actually impressively high when compared with the average.

Here we present a power survey of papers from 10 journals. Eight of these are devoted to ethological and behavioral ecological studies. The other two often contain studies directly reporting on animal behavior or the immediate consequences thereof (e.g., effect of habitat choice on spatiotemporal abundance).

METHODS

Analysis of statistical power requires knowledge of the effect size we wish to detect. For standardized measures of the magnitude of a relationship, Cohen (1988) has defined small, medium, and large effect sizes for several tests. For example, a small effect has a mean correlation coefficient, r, of .10 (i.e., it explains 1% of the variance because r² = 1%), a medium effect has r = .30, and a large effect has r = .50. Biologists usually perform analyses estimating the power to detect an effect of medium strength. Here we present data on statistical power if the effect size is small, medium, or large at the p = .05 level (two-tailed), as defined in chapters 2–8 of Cohen (1988). However, a recent survey of 44 biological meta-analyses examining 242 null hypotheses shows an average effect for ecological or evolutionary studies of r = .18–.19 (Møller and Jennions, 2002). Thus, the observed average effect size examined by biologists (at least for relationships subjected to meta-analysis) is below the medium effect of Cohen (1988). We do not know the true effect sizes for the tests we examined. As such, we cannot conclude there is a publication bias simply because most studies report significant results (Bauchau, 1997).
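To make these conventional benchmarks concrete, the following sketch (an illustration only, assuming Python with SciPy is available; it is not the method used for the survey itself, which relied on Cohen's tables and G*power) approximates the two-tailed power of a test of a correlation coefficient at α = .05 via the Fisher z transformation, for Cohen's small (r = .10), medium (r = .30), and large (r = .50) effects.

    # Illustrative only: approximate power to detect a true correlation r with
    # n pairs, via the Fisher z (normal) approximation (SciPy assumed).
    import math
    from scipy.stats import norm

    def correlation_power(n, r, alpha=0.05):
        """Two-tailed power for a test of H0: rho = 0 against a true correlation r."""
        z_crit = norm.ppf(1 - alpha / 2)
        delta = math.atanh(abs(r)) * math.sqrt(n - 3)   # standardized true effect
        # probability of landing beyond either critical value under the alternative
        return norm.sf(z_crit - delta) + norm.sf(z_crit + delta)

    for n in (20, 50, 100):
        print(n, [round(correlation_power(n, r), 2) for r in (0.10, 0.30, 0.50)])

With n = 50 pairs, for example, this approximation puts power at roughly 57% for a medium effect but only about 11% for a small one, in line with Cohen's tables.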
The expected proportion of studies reporting significant results depends on the effect size these studies were trying to detect (and the sample sizes). We estimated power for 1362 statistical tests from 697 original papers from 10 journals: American Naturalist (33), Animal Behaviour (187), Behaviour (69), Behavioral Ecology (68), Behavioral Ecology and Sociobiology (102), Behavioural Processes (39), Ethology (73), Journal of Insect Behaviour (54), Journal of Animal Ecology (50), and Ethology, Ecology and Evolution (22). The number of usable papers per journal is indicated in parentheses. In each case, we examined all issues of the journal with a 2000 publication date (only the Nov/Dec issue of Behaviour was unavailable). For each paper, we looked for the first and last statistical test presented in the text of the Results section. We defined a statistical test as having been presented if the author(s) reported a probability value (henceforth, p value) either exactly or using the phrase "p < .0X" or "p > .Y," or if the shorthand "N.S." or "n.s." was used to denote a nonsignificant p value, and it was clear which statistical test had been used. We did not consider a test to have been presented if the authors simply made a statement such as "there was no difference between X and Y" or "there was a significant correlation between X and Y." If there were fewer than two usable statistical tests in the main text, we then looked at tables and figure legends, reading from top left to bottom right. For 665 of the 697 papers, we obtained data for two tests. Use of the first and last test provided an objective way to collect data. Focusing on the so-called main test of a study may be misleading. Most papers emphasize statistically significant findings, even though these may not have been the original focus of the study (Csada et al., 1996; but see Bauchau, 1997).

For each test, we recorded the p value. If p was given as p < X, we set p = X. If given as p > Y, we only set p = Y if Y > .05. (In 24 of 1362 tests [1.8%], the only information was that p > .05.) For summary analyses, we converted p values into their associated standard normal deviates (z scores; z = 1.96 when p = .05). We also recorded the taxa/type of study using 11 categories: crustaceans, insects, spiders, other invertebrates (excluding insects, spiders, and crustaceans), fish, amphibians, reptiles, birds, mammals, plants, and species level (e.g., phylogenies). We analyzed mean power for different taxa because there is a general assumption that studies of some taxa (e.g., mammals) have smaller sample sizes than others (e.g., insects) and that these will therefore have reduced statistical power.

We calculated the power of the most commonly encountered statistical tests, specifically binomial tests; sign tests; χ² goodness-of-fit or R × C contingency table analyses; G tests (log-likelihood ratio tests for contingency tables); comparison of two proportions; comparison of two correlation coefficients; Fisher's Exact test for a 2 × 2 table; Friedman's nonparametric test; Kruskal-Wallis nonparametric one-way ANOVA; Mann-Whitney U test (two independent samples); paired t test; tests for significance of correlation coefficients; one-sample t test; two-sample t test; Wilcoxon's matched-pairs test; one-way ANOVA; and tests of main effects for fixed factors in ANOVAs with simple factorial designs (two-way or three-way ANOVAs), which includes tests for difference in elevation in ANCOVAs (Cohen, 1988: 379).
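For reference, the conversion of p values to standard normal deviates described above is the standard two-tailed transformation; a minimal sketch (assuming SciPy):

    # z score associated with a two-tailed p value (z = 1.96 when p = .05).
    from scipy.stats import norm

    def p_to_z(p):
        return norm.isf(p / 2)

    print(round(p_to_z(0.05), 2))    # 1.96
    print(round(p_to_z(0.001), 2))   # 3.29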
The only tests excluded with any regularity were tests for main effects in ANOVAs with complex designs (specifically repeated measures and nested factors), tests for interaction terms in all ANOVAs, logistic regressions, and tests based on maximum likelihood or restricted maximum likelihood approaches. These tests were excluded mainly for ease of power analysis. They were not excluded on a priori evidence that they had lower, or higher, statistical power than that of the included tests. Statistical power was calculated using tables in Cohen (1988). In addition, we used G*power to calculate power for sample sizes smaller than those presented in the tables (Erdfelder et al., 1996). G*power and Cohen's tables show close agreement, with the exception of F tests, in which the approximation method of Cohen can lead to differing results with complex experimental designs (Bradley et al., 1996). However, we only used Cohen (1988) to calculate power for simple one-way, two-way, and three-way ANOVAs. F tests of the significance of a regression were analyzed as tests of the significance of the regression coefficient. This is equivalent to testing the significance of the correlation coefficient (Cohen, 1988: 76–77).

To allow us to collect and analyze data on so many tests, we made a few simplifying assumptions that slightly inflated our estimates of power. They were as follows.

1. For four nonparametric tests, we calculated the power if the available data had been analyzed using the equivalent parametric test. The relative power (efficiency) of nonparametric tests is lower than that of their parametric counterparts because they make fewer assumptions (Siegel and Castellan, 1988: 21). However, with moderate to large sample sizes, the power of nonparametric tests becomes similar to that of the equivalent parametric tests. Specifically, for Wilcoxon tests we calculated power for a paired t test; for the Mann-Whitney U test, for a two-sample t test; and for Kruskal-Wallis and Friedman's tests, for parametric ANOVAs. For Mann-Whitney U tests, Wilcoxon tests, and Kruskal-Wallis tests, the statistical power reaches about 95.5% of that of the equivalent parametric t tests for moderate sample sizes. For Friedman's test, power is 64% of that of the equivalent F test when there are two groups, increasing to 87% for five groups (Siegel and Castellan, 1988). The estimated power we report for these nonparametric tests is therefore slightly larger than the actual power of the tests used by the original authors.

2. For two other nonparametric tests, we calculated power for equivalent tests. For G tests we calculated power for χ² tests. These two methods of analysis usually yield the same conclusions, and there is no clear agreement as to which is preferable (Zar, 1999: 475), so they tend to be used interchangeably by researchers. For Fisher's Exact test, we calculated power for a comparison of two proportions. This was the hypothesis tested by the original authors, as one of the margins has fixed totals (Cohen, 1988: Table 6.3.5).

3. In some two-sample t tests or Mann-Whitney U tests, only the total sample size was given. We assumed that group sample sizes were equal, which maximizes power. Again, this will slightly inflate our estimates.

We compared statistical power and z scores among journals, taxa, and statistical test types by using Kruskal-Wallis one-way ANOVA.
(Power for the equivalent parametric ANOVA to detect small, medium, and large effects is given in parentheses. In all cases, parametric tests yielded the same conclusions.) We compared first and last tests by using Wilcoxon's tests. When comparing power among groups, we used "power to detect a medium effect" as the dependent variable. This is likely to maximize the difference between groups because power is a percentage that shows asymptotic values at small and large sample/effect sizes. Following the method of Stoehr (1999), we also present the results of our statistical tests as observed effect sizes, by converting the test statistic or p value to r following the method of Cooper and Hedges (1994: 236–240). To avoid confusion, we denote these as Er. Rather than presenting the mean effect size, we present the 95% confidence intervals. These provide the clearest indication of the certainty with which we can conclude that an effect differed from zero (for an excellent review, see Hoenig and Heisey, 2001; see also our Discussion).

RESULTS

A few papers presented statistical power for the "main" biological hypothesis under test. However, power was not reported in any of the 533 tests with nonsignificant results. We had to calculate power by using the reported sample size. In many instances, even this was difficult. A close inspection of the paper was sometimes required to track down the information (e.g., in Journal of Animal Ecology, sample sizes were presented in the Methods section but not in the Results section). In some cases, we were unable to work out the sample size, either because of an ambiguity in the paper or because it was never provided.

First and last tests

There were clear differences between first and last statistical tests. First tests had significantly greater statistical power (Wilcoxon's test: n = 550 pairs, z = 6.72, p < .0001; Er = .208 to .362) and smaller p values (Wilcoxon's test: n = 584 pairs, z = 7.43, p < .0001; Er = .135 to .290); 68.4% of the 697 cases for first tests and 52.8% of the 665 cases for last tests were significant at the 0.05 level (Table 1). Across papers, however, the p values and power of first and last tests were both positively correlated (p values: r = .143, n = 644, p < .0001, Er = .066 to .218; power: r = .427, n = 665, p < .0001, Er = .363 to .487). Based on a comparison of average power and p values of first and last tests, this trend was consistent across journals (z = 2.80, p = .005, Er = .586 to .973; z = 2.701, p = .0069, Er = .485 to .965; both n = 10), taxa (z = 2.93, p = .0033, Er = .611 to .970; z = 2.76, p = .0058, Er = .753 to .983; both n = 11), and statistical test types (z = 2.42, p = .016, Er = .172 to .875; z = 3.17, p = .0015, Er = .579 to .951; both n = 14) (all Wilcoxon tests). The more important conclusion, however, is that statistical power is generally low. For first tests, mean power is 13–16% to detect a small effect and 40–47% to detect a medium effect. This is far lower than the generally recommended 80% (Cohen, 1988: 56) or 95% (Peterman, 1990). Using the 80% criterion for first statistical tests, only 2.9%, 21.2%, and 49.8% of 697 cases had the requisite power to detect a small, medium, or large effect, respectively. Likewise, for last tests, only 1.8%, 13.2%, and 36.5% of the 665 cases had sufficient power.
If we only consider those tests that reported nonsignificant relationships, the equivalent figures are 1.4%, 17.8%, and 47.0% for the 219 first tests and 1.3%, 10.8%, and 32.5% for the 314 last tests.

Table 1
Three power estimates and z scores for first and last statistical tests

                       First test            Last test
z score                2.30 ± 0.04 (684)     1.89 ± 0.04 (654)
Power (small) (%)      16.2 ± 0.68 (697)     12.8 ± 0.56 (665)
Power (medium) (%)     47.2 ± 1.13 (697)     39.4 ± 1.05 (665)
Power (large) (%)      72.3 ± 1.00 (697)     65.3 ± 1.04 (665)

Mean ± SE. Sample sizes are in parentheses.

Journal, taxa, and test type

Neither p values (first test: χ² = 10.48, p = .313; last test: χ² = 14.68, p = .10) nor statistical power (first test: χ² = 15.61, p = .08; last test: χ² = 6.78, p = .66) varied significantly among the 10 journals (all Kruskal-Wallis tests, df = 9; Table 2; power for one-way ANOVA: all >37%, >99.5%, >99.5%). This suggests that neither impact factor, journal policy, nor any other assessment of journal "quality" is related to the statistical significance of the results presented or the sample sizes on which conclusions are based. Neither p values (first test: χ² = 9.07, p = .53; last test: χ² = 7.59, p = .67) nor statistical power (first test: χ² = 9.58, p = .48; last test: χ² = 11.84, p = .30) varied significantly among the 11 taxa (all Kruskal-Wallis tests, df = 10; Table 3; power for one-way ANOVA: all >35%, >99.5%, >99.5%). We then reanalyzed the data looking only at the three taxa with large sample sizes (birds, mammals, and insects). Again, there was no difference among taxa in p values (first test: χ² = 2.30, p = .32; last test: χ² = 2.51, p = .29). For statistical power, there was no difference for the first test (χ² = 2.29, p = .32), but there was for the last test (χ² = 8.28, p = .016; all Kruskal-Wallis tests, df = 2; power for one-way ANOVA: all >46%, >99%, >99%). Bird studies had less power than insect studies (p < .05, post-hoc pair-wise comparison). Among the 14 statistical test types, p values did not differ significantly (first test: χ² = 15.69, p = .27; last test: χ² = 12.90, p = .46), but statistical power did (first test: χ² = 95.45, p < .0001; last test: χ² = 70.86, p < .0001; all Kruskal-Wallis tests, df = 13; Table 4; power for one-way ANOVA: >36%, >99.5%, >99.5%).

Mean p values of first and last tests were significantly correlated across journals (r = .670, n = 10, p = .034; Er = .070 to .914), as was mean power (r = .630, n = 10, p = .051; Er = .001 to .902). There is therefore some evidence that power and p values are repeatable among journals. Neither mean p values nor power of first and last tests was correlated across taxa (r = .232, p = .493, n = 11, Er = −.394 to .711; r = .18, p = .596, n = 11, Er = −.439 to .683; power: 6%, 16%, 40%); nor were first and last test p values correlated across statistical test types (r = .186, p = .525, n = 14, Er = −.652 to .382; power: 6%, 18%, 47%), although mean power was (r = .974, p < .0001, n = 14, Er = .917 to .992).

Power and p values

There was a significant positive correlation between power and the reported p value (expressed as a z score) for both first (r = .125, p = .001, n = 684; Er = .051 to .198) and last tests (r = .162, p < .0001, n = 654; Er = .086 to .236). Thus, the greater the statistical power, the more often the test reported a significant effect.
To determine whether this relationship was owing to combining different types of tests, we reexamined it for specific statistical tests. Only tests in which n ≥ 84 were used, because the power to detect a medium effect at α = 0.05 is then greater than 80%. For correlation coefficient tests, rs = .076 (p = .376, n = 138; Er = −.093 to .239) and rs = .267 (p < .001, n = 173; Er = .123 to .400); for χ² tests, rs = .299 (p = .004, n = 92; Er = .102 to .476); and for F tests (all types combined), rs = .411 (p < .001, n = 126; Er = .254 to .547) and rs = .011 (p = .91, n = 108; Er = −.178 to .200). A significant decrease in p value as statistical power increased was therefore also apparent for specific statistical tests in three of five cases.

Table 2
z scores and power to detect a medium effect for 10 biological journals

                                       z score                                   Power (medium)
Journal                                (First test)       (Last test)           (First test)       (Last test)
American Naturalist                    2.67 ± 0.19 (33)   2.33 ± 1.88 (32)      45.0 ± 5.2 (33)    42.0 ± 4.8 (32)
Animal Behaviour                       2.33 ± 0.08 (184)  1.88 ± 0.08 (176)     46.8 ± 2.2 (187)   37.6 ± 2.0 (179)
Behavioral Ecology                     2.13 ± 0.13 (68)   1.66 ± 0.13 (67)      51.2 ± 3.6 (68)    41.0 ± 3.3 (67)
Behavioral Ecology and Sociobiology    2.25 ± 0.11 (99)   1.75 ± 0.11 (98)      52.9 ± 2.9 (102)   42.2 ± 2.7 (99)
Behavioural Processes                  2.38 ± 0.17 (39)   1.98 ± 0.17 (38)      40.0 ± 4.8 (39)    33.5 ± 4.4 (38)
Behaviour                              2.45 ± 0.13 (67)   2.05 ± 0.14 (63)      44.6 ± 3.6 (69)    40.2 ± 3.4 (63)
Ethology, Ecology, and Evolution       2.15 ± 0.23 (22)   2.13 ± 0.24 (21)      58.0 ± 6.3 (22)    44.0 ± 5.9 (21)
Ethology                               2.24 ± 0.13 (73)   2.03 ± 0.13 (66)      43.0 ± 3.5 (73)    35.5 ± 3.3 (69)
Journal of Animal Ecology              2.28 ± 0.15 (49)   1.73 ± 0.16 (45)      49.3 ± 4.2 (50)    40.3 ± 4.0 (45)
Journal of Insect Behaviour            2.23 ± 0.15 (50)   1.85 ± 0.16 (48)      42.2 ± 4.0 (54)    43.1 ± 3.8 (52)

Mean ± SE. Sample sizes are in parentheses.

Table 3
z scores and power to detect a medium effect for 11 taxonomic categories

                       z score                                  Power (medium)
Taxa                   (First test)        (Last test)         (First test)       (Last test)
Crustaceans            2.00 ± 0.25 (18)    1.96 ± 0.26 (17)    52.1 ± 7.0 (18)    32.8 ± 6.3 (18)
Spiders                2.45 ± 0.25 (19)    2.07 ± 0.26 (18)    51.5 ± 6.8 (19)    41.0 ± 6.2 (19)
Insects                2.35 ± 0.09 (152)   2.00 ± 0.09 (143)   50.2 ± 2.4 (158)   45.5 ± 2.2 (148)
Other invertebrates    2.53 ± 0.23 (21)    2.21 ± 0.24 (21)    39.9 ± 6.3 (22)    40.5 ± 5.9 (21)
Amphibians             2.20 ± 0.19 (32)    1.86 ± 0.20 (31)    40.3 ± 5.3 (32)    37.2 ± 4.8 (31)
Reptiles               2.37 ± 0.23 (21)    2.11 ± 0.24 (21)    53.2 ± 6.5 (21)    40.2 ± 5.9 (21)
Fish                   2.44 ± 0.13 (64)    1.81 ± 0.14 (62)    42.5 ± 3.7 (64)    41.7 ± 3.4 (62)
Birds                  2.28 ± 0.08 (185)   1.79 ± 0.08 (180)   46.1 ± 2.2 (188)   35.1 ± 2.0 (182)
Mammals                2.18 ± 0.09 (138)   1.88 ± 0.10 (130)   46.0 ± 2.5 (141)   37.7 ± 2.4 (131)
Plants                 2.57 ± 0.40 (7)     1.88 ± 0.41 (7)     58.9 ± 11.2 (7)    39.8 ± 10.2 (7)
Species level          2.40 ± 0.22 (23)    1.72 ± 0.24 (21)    55.0 ± 6.2 (23)    44.6 ± 5.9 (21)

Mean ± SE. Sample sizes are in parentheses.

DISCUSSION

The statistical power of behavioral studies to detect relationships is low. For example, the power to detect a medium effect is about 39–47%. Only 10–20% of tests exceeded the recommended minimum criterion of 80% power (Cohen, 1988). This was true whether we considered all tests or only those reporting nonsignificant results. Authors still fail to report power for nonsignificant results (see Thomas and Juanes, 1996: 859), even in journals that require them to do so.
For example, since 1997, Animal Behaviour's Instructions for Authors has stated, "where a significance test based on a small sample size yields a non-significant outcome, the power of the test should normally be quoted." Examination of two recent issues of Animal Behaviour showed, however, that of 215 statistical tests in 17 papers in which p > 0.05, none reported statistical power (March 2001 issue), whereas only one of 22 empirical papers presented estimates of power in the January 2001 issue. By comparison, the power to detect medium effects in other fields (mainly medical) was 37% (Sedlmeier and Gigerenzer, 1989), 48% (Cohen, 1962), 57% (Rossi, 1990), <58% (Kazantzis, 2000), 62% (Chase et al., 1978), 71% (Polit and Sherman, 1990), and 77% (Maddock and Rossi, 2001), and the proportion of studies with a power greater than 80% to detect a medium effect was 0.36 (Moher et al., 1994), <0.50 (Chung et al., 1998), and >0.80 (Mengel and Davis, 1993). Mean statistical power in behavioral biology is therefore lower than that in medicine. These differences may be owing to the type of statistical test used, which is partly determined by experimental design (e.g., paired t tests are more powerful than two-sample t tests). They are, however, far more likely to reflect differences in sample sizes for studies conducted in the various research fields.

Stoehr (1999) noted that criticizing biologists' failure to consider power and interpret studies in terms of effect size is sometimes viewed as attacking a straw man. Surely biologists are aware of these issues? Our survey suggests otherwise. If they were, they would more often report statistical power. One solution is for editors and reviewers to ensure that all statistical tests are uniformly presented with full information on sample size (for each group if this influences power estimates, as in, say, a two-sample t test), degrees of freedom, exact p values (or as precise as possible, e.g., 0.5 > p > 0.20, but not just p = ns), and the statistical power to detect small and medium effects as conventionally defined (large effects are probably too rare to warrant presentation). If methods to determine power are not well established (this will be fairly rare, as most behavioral papers use a limited set of well-studied statistical tests), the authors should explicitly state this in their methods. Thomas and Juanes (1996) and Thomas and Krebs (1997) review and list several free or purchasable software programs that can be used to calculate a priori statistical power.

Stoehr (1999) recommended that authors report effect sizes (hence our own reporting of the 95% confidence intervals for Er here). Effect sizes are easily calculated and do not require expensive software, textbooks, or heavy investment in learning complex skills. To ensure ready access to the relevant information, journals could publish in print and on their Web sites the formulae to convert common statistics such as t, F, and χ², or p values, to Pearson's r. Effect size can then be calculated by using a handheld calculator, spreadsheet, or user-friendly effect-size calculators (e.g., MetaWin 2.0; Rosenberg et al., 2000). There are many effect sizes available, so it would be convenient if biologists agreed on which one to use (when possible). We suggest Pearson's r as the most useful because r² is the proportion of variance explained. It can be calculated whenever there is a directional trend (e.g., t tests, correlations, or χ² or F tests where df = 1).
For omnibus tests of variation among groups rather than tests for linear trends (e.g., F or χ² tests where the numerator df > 1), R² should be presented if possible. Stoehr (1999) has already listed some advantages of stating effect sizes. These are mainly related to ensuring that readers do not use p values when comparing the strength of relationships (which assumes equivalent sample sizes). We can add three other advantages. First, it would greatly facilitate the efforts of meta-analysts and reduce error rates when interpreting or transcribing statistical tests during a literature review (Cooper and Hedges, 1994). Second, stating effect sizes allows researchers planning to replicate or conduct similar studies to calculate more easily the sample size needed to achieve the desired statistical power. Knowing the probable effect size is a prerequisite to a well-designed experiment or data collection protocol. Third, reviewers often remind authors to report summary statistics and not just test statistics. In general, however, effect sizes may be easier to interpret. For example, stating the mean and SD for body size in two groups with different sample sizes requires the reader to somehow visualize the overlap between the groups. In contrast, stating the effect size r immediately tells the reader that group identity explains r² of the observed variation in body size.

Table 4
z scores and power to detect a medium effect for 14 different statistical tests

                           z score                                 Power (medium)
Test type                  (First test)        (Last test)        (First test)       (Last test)
Binomial                   2.37 ± 0.21 (26)    1.88 ± 0.26 (18)   35.1 ± 5.5 (26)    35.0 ± 6.2 (18)
χ²                         2.20 ± 0.11 (92)    1.82 ± 0.14 (61)   70.5 ± 2.9 (93)    58.7 ± 3.3 (62)
F test (one-way)           2.27 ± 0.12 (74)    1.94 ± 0.14 (60)   44.3 ± 3.3 (74)    37.2 ± 3.4 (60)
F test (other)             2.53 ± 0.15 (52)    1.74 ± 0.16 (48)   42.7 ± 3.8 (54)    37.5 ± 3.7 (49)
Fisher's Exact             2.56 ± 0.21 (27)    2.00 ± 0.22 (25)   40.6 ± 5.3 (28)    37.6 ± 5.1 (26)
Friedman's test            2.08 ± 0.36 (9)     2.86 ± 0.54 (4)    37.5 ± 8.9 (10)    32.3 ± 13.1 (4)
Kruskal-Wallis             2.26 ± 0.24 (20)    1.76 ± 0.25 (19)   36.5 ± 6.0 (22)    35.5 ± 6.0 (19)
Mann-Whitney               2.24 ± 0.14 (58)    2.09 ± 0.13 (69)   39.9 ± 3.9 (61)    31.9 ± 3.1 (73)
Paired t test              2.19 ± 0.15 (52)    1.92 ± 0.16 (44)   47.3 ± 3.9 (52)    44.1 ± 3.9 (45)
Correlation coefficient    2.47 ± 0.09 (138)   1.89 ± 0.08 (173)  42.8 ± 2.9 (138)   36.4 ± 2.0 (175)
Sign test                  2.15 ± 0.34 (10)    2.29 ± 0.41 (7)    26.1 ± 8.9 (10)    26.8 ± 9.9 (7)
t test (one-sample)        2.49 ± 0.31 (12)    2.33 ± 0.29 (14)   74.9 ± 7.8 (13)    60.7 ± 7.0 (14)
t test (two-sample)        1.97 ± 0.14 (57)    1.74 ± 0.14 (58)   44.7 ± 3.7 (58)    35.8 ± 3.4 (58)
Wilcoxon's paired test     2.29 ± 0.14 (57)    1.72 ± 0.15 (52)   50.4 ± 3.7 (58)    41.9 ± 3.6 (53)

Mean ± SE. Sample sizes are in parentheses.

If effect sizes and their confidence intervals are presented, power analysis gives no additional insights (Hoenig and Heisey, 2001). We have illustrated this approach here by converting our statistical results into the effect size r (written Er) and giving the 95% confidence intervals. Confidence intervals cover a set of nonrefuted values. If these values are tightly clustered around the null value (usually zero), then we are confident that the true value is near the null hypothesis. Conversely, a wide confidence interval, even if it includes the null value, is treated more cautiously because we know there is a good chance that the true effect size might lie far from the null value.
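To show how little machinery these conversions require, here is a short sketch of our own (based on standard conversion formulas of the kind collected by Cooper and Hedges, 1994; it is not the exact code behind the Er values reported in this paper, and it assumes Python with SciPy is available). It converts t statistics, single-df F and χ² statistics, or a two-tailed p value plus sample size, to r, and attaches an approximate 95% confidence interval via the Fisher z transformation.

    # Illustrative conversions from common test statistics (or p values) to r,
    # with an approximate confidence interval (SciPy assumed).
    import math
    from scipy.stats import norm

    def r_from_t(t, df):
        """|r| from a t statistic: sqrt(t^2 / (t^2 + df)); the sign of r follows
        the direction of the observed effect."""
        return math.sqrt(t * t / (t * t + df))

    def r_from_f(f, df_error):
        """|r| from an F statistic with 1 numerator df (F = t^2)."""
        return math.sqrt(f / (f + df_error))

    def r_from_chi2(chi2, n):
        """|r| (the phi coefficient) from a chi-square statistic with 1 df."""
        return math.sqrt(chi2 / n)

    def r_from_p(p, n):
        """|r| from a two-tailed p value and total sample size, via r = z / sqrt(n)."""
        return norm.isf(p / 2) / math.sqrt(n)

    def r_confidence_interval(r, n, level=0.95):
        """Approximate confidence interval for r using the Fisher z transformation."""
        z_crit = norm.isf((1 - level) / 2)
        half_width = z_crit / math.sqrt(n - 3)
        zr = math.atanh(r)
        return math.tanh(zr - half_width), math.tanh(zr + half_width)

    # Hypothetical example: a two-sample t test with t = 2.10 on 38 df (n = 40)
    r = r_from_t(2.10, 38)
    print(round(r, 2), [round(x, 2) for x in r_confidence_interval(r, 40)])

The width of the interval depends almost entirely on sample size: on the Fisher z scale the half-width is roughly 1.96/√(n − 3), which is what produces the tightly clustered versus wide contrast just described.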
A second advantage of the confidence interval approach is that it prevents researchers from erroneously concluding that, when studies fail to reject the null hypothesis, the greater the statistical power, the greater the likelihood that the null hypothesis is true. That this is flawed reasoning is easily visualized by noting that the observed p value (and hence the position of the estimated mean effect size relative to the null value: for a fixed sample size, effect size increases as the p value decreases) is also critical in determining the likelihood that the null hypothesis is correct. For example, an estimate of r = −.02 to .04 is more likely to lead to the conclusion that the true effect is close to the null value of zero than a study with lower statistical power (larger confidence intervals) where r = −.40 to .40.

Publication criteria

Should statistical power influence publication decisions? One view is that studies should not be published if they have low statistical power, because if they produce negative findings, there is little confidence in the oft-stated conclusion that the null hypothesis is correct. This would be fine if reviewers were equally likely to reject papers with low power that report significant results. There is, however, good evidence for a publication bias in biology toward significant results, even when sample sizes are small (Csada et al., 1996; Møller and Jennions, 2001, 2002; Palmer, 2000). This culture will be hard to eliminate. We believe it is more important that the literature as a whole is unbiased, so we take the pragmatic view that statistical power should not be a criterion for publication. In general, we think that synthesis of results from many studies is more valuable than a conclusion based on extrapolation from a few big studies. (For an excellent defense of the search for generalities in behavioral ecology rather than a focus on the peculiarities of specific systems, see Reeve, 2001.) Practically speaking, we believe that a requirement to present statistical power or confidence intervals for effect size would (1) ensure readers were fully aware of the weakness of a conclusion that there is no effect, (2) encourage researchers to increase sample sizes, and (3) be achievable without loss of print space if journals create on-line sections in which studies with negative results and low power are published. The drive to be cited and for peer acknowledgment of one's work would probably also encourage increased sample sizes as authors strove to publish in the more prestigious print section.

Specific findings

Between 53% and 68% of tests were significant at the 0.05 level. In contrast, the mean power to detect a medium strength effect was only 40–47%. Why the discrepancy? First, the true effects for the questions asked in first and last tests may be greater than r = .3. The estimate of a mean effect size in biology of r = .20 by Møller and Jennions (2002) was from meta-analyses that may deal with a different set of biological relationships. Second, mean estimates of effect size from biological meta-analyses take into consideration the direction of the effect. Thus, the mean effect is smaller than the magnitude of the absolute effect size. Third, there may be a publication bias toward significant results (Palmer, 2000). Fourth, authors may organize papers so as to present significant results at the beginning and end.
The difference between the p values of first and last tests suggests that authors do not present analyses randomly with respect to their statistical significance.

Significant variation in power among statistical tests was expected. For example, for a given sample size per group, the power of a one-sample t test is greater than that of a two-sample t test because, in the former case, the mean against which the data are compared is specified, whereas in the latter case, both means are estimated with error. Given the same underlying effect size being tested, this should result in lower p values for tests with greater statistical power. However, p values did not vary significantly among tests. This suggests either that the relationship is small and went undetected (power to detect a small effect was <36%) or that tests with greater power are used when testing for relationships with smaller actual effect sizes.

The first test per paper had significantly greater power and smaller p values than did the last test. This trend was consistent across journals, taxa, and statistical test types. In retrospect, random selection of tests from papers might have been desirable. This is, however, easier said than done, which was why we used the first and last test to remove any subjectivity on our part. In addition, despite these differences between first and last tests, both convey the same message: power to detect small and medium effects is low.

Neither p values nor power varied significantly among the 10 journals (although p < .10 for p values of last tests and power of first tests). This implies that impact factor or any assessment of journal quality is unrelated to the statistical power or effect sizes they generally report. However, mean p values and power of first and last tests were significantly correlated across journals, which does suggest some degree of repeatability among journals.

Somewhat unexpectedly, neither p values nor statistical power varied significantly among the 11 taxa. (There was, however, about 10% less statistical power in studies of birds compared with insects for the last test per study.) In addition, neither mean p values nor power of first and last tests was correlated across taxa. We had initially assumed that sample sizes, and hence power, would be larger for insects than, say, mammals. The lack of detectable variation may partly lie in the kinds of questions asked. For example, a study on primates may ask whether distance moved per day differed between two groups (a very specific question) based on 100 days of observations, whereas a study on insects might ask whether body size differed between mated and unmated males based on 100 males per group.

There was a significant negative correlation between mean power and p values. There are several possible explanations why smaller studies more often report nonsignificant results. First, true effect sizes may be smaller in study systems in which sample sizes tend to be smaller. We know of no reason why this should be true. Second, for a nonzero effect size, as sample size increases, power increases and p values decrease. So, if there is no underlying correlation between sample size and the true effect size, then one possible interpretation is that some tests fail to report a significant relationship because they lack statistical power.
One could argue that there is good evidence for this conclusion because there should be considerable variability in the true effect sizes the 1362 tests were trying to detect. This will greatly reduce the strength of the pattern (e.g., in a study in which the true effect was large, the reported p value will tend to be small even with a small sample size and low power), so detecting the correlation despite this noise strengthens the case. Third, researchers may adjust sample sizes based on their assessment of the likelihood of detecting an effect. For example, researchers may be disinclined to increase sample sizes when they infer that there is no significant effect to detect. This would also yield a negative correlation between power and p value. We believe there is a measure of truth to this because, for a given sample size, a researcher who finds that p = .07 is likely to continue collecting data in the hope of reaching p < .05, whereas a researcher faced with p = .57 is likely to conclude that she/he will not reach significance and therefore discontinue collecting data. This is a rational, but worrying, behavior because studies with significant results are more likely to be published than those without (Møller and Jennions, 2001; Palmer, 2000; Song et al., 2000).

Conclusions

Statistical power in behavioral ecology is distressingly low. The unavoidable solution is to increase sample sizes. It may be argued that logistical, ethical, conservation, and financial constraints make this impossible. It could be claimed with equal vigor that designing a study with low power is unethical and wasteful because nonsignificant findings are inconclusive. In many cases, especially when dealing with invertebrates, we suspect sample sizes can be increased. At present, many researchers decide on a sample size based on examination of previously published work (i.e., conventions among peers) rather than explicit consideration of power. If sample sizes are not readily increased, then it becomes even more important to conduct meta-analyses to detect broader trends (Cooper and Hedges, 1994). In medicine, meta-analysis of numerous small-scale studies (few of which are likely to detect significant trends) may provide a more cost-effective way of assessing the value of a new treatment than investing in a few large-scale studies (Song et al., 2000: 39). In addition, even in medicine, in which studies are on one species, there is the danger that a large study, no matter how well designed, may generate conclusions that cannot be extrapolated to society at large if the study population is unrepresentative (e.g., smoking may greatly increase mortality in long-lived Western societies, but have little effect in developing countries where life expectancy is already low). Biologists may have to show greater restraint when discussing the results of their own studies (when conclusions are hampered by low statistical power) and wait until sufficient studies have been conducted to determine general trends. Unfortunately, this is not how papers are currently written. A researcher whose modest conclusion is "more studies are needed until the results of my study can be interpreted" is probably less likely to be published than one who places a strong interpretation on his or her findings. This is true whether the results are significant or nonsignificant. Authors who extrapolate from a single significant study to the world at large commit an equal but opposite sin.
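To make concrete the point above that a synthesis of many small studies can reveal a trend that no individual study detects, here is a toy sketch with made-up numbers (our own hypothetical example in plain Python, not data from this survey). It pools correlation-based effect sizes by the standard inverse-variance weighting on the Fisher z scale; each study's weight, n − 3, is the inverse of its sampling variance, so small studies contribute little.

    # Toy example only: fixed-effect, inverse-variance pooling of correlations.
    import math

    def pool_correlations(studies):
        """studies: list of (r, n) pairs; returns pooled r and its 95% CI."""
        zs = [(math.atanh(r), n - 3) for r, n in studies]     # (Fisher z, weight)
        w_total = sum(w for _, w in zs)
        z_bar = sum(z * w for z, w in zs) / w_total
        se = 1 / math.sqrt(w_total)
        return tuple(math.tanh(v) for v in (z_bar, z_bar - 1.96 * se, z_bar + 1.96 * se))

    # Five hypothetical small studies, none individually significant at p < .05
    pooled, lo, hi = pool_correlations([(0.25, 20), (0.10, 15), (0.35, 30), (0.05, 25), (0.40, 18)])
    print(round(pooled, 2), round(lo, 2), round(hi, 2))

With these invented numbers, none of the five studies is individually significant, yet the pooled estimate (r of about .24, with a 95% confidence interval of roughly .04 to .42) excludes zero.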
The publication process needs to place greater emphasis on evaluation of the design and implementation of experiments or data collection protocols, and less on the p values (or even the power) of the relationships detected (Palmer, 2000). We believe there is a need to encourage the quantitative synthesis of the literature using modern meta-analytic techniques. Of course, as with any form of review, conclusions are only as reliable as the studies on which they are based. Those who are concerned that meta-analysis leads to "rubbish in, rubbish out" should be emphasizing the importance of a priori "quality" criteria for the inclusion of studies in meta-analyses. It is important to note that, from a meta-analysis perspective, poor studies are not those with small sample sizes. These studies have little bearing on the outcome of a meta-analysis because effect sizes are weighted by the inverse of their sampling variance, which is itself inversely related to sample size. In contrast, because narrative reviews do not explicitly consider the influence of sample size, they are far more likely to incorrectly estimate trends by giving equal weighting to studies that differ greatly in the extent to which we can trust their conclusions.

We thank Patricia Backwell and John Christy for advice and support. M.D.J. extends special thanks to Dr. Ira Rubinoff and the Smithsonian Tropical Research Institute for bridging funding during the course of this study. This work was partly funded by the Australian Research Council (grants A00104301 and F00104500).

REFERENCES

Bauchau V, 1997. Is there a "file drawer problem" in biological research? Oikos 79:407–409.
Bradley DR, Russell RL, Reeve CP, 1996. Statistical power in complex experimental designs. Behav Res Methods Instrum Comput 24:190–204.
Chase LJ, Chase LR, Tucker RK, 1978. Statistical power in physical anthropology: a technical report. Am J Phys Anthropol 49:133–137.
Chung KC, Kalliainen LK, Hayward RA, 1998. Type II (b) errors in the hand literature: the importance of power. J Hand Surg 23:20–25.
Cohen J, 1962. The statistical power of abnormal-social psychological research: a review. J Abnorm Soc Psychol 65:145–153.
Cohen J, 1988. Statistical power analysis for the behavioural sciences, 2nd ed. Hillsdale, New Jersey: Lawrence Erlbaum.
Cooper H, Hedges LV, eds, 1994. The handbook of research synthesis. New York: Russell Sage Foundation.
Csada RD, James PC, Espie RHM, 1996. The "file drawer problem" of non-significant results: does it apply to biological research? Oikos 76:591–593.
Dickinson K, Bunn F, Wentz R, Edwards P, Roberts I, 2000. Size and quality of randomised controlled trials in head injury: review of published studies. Br Med J 320:1308–1311.
Erdfelder E, Faul F, Buchner A, 1996. G*power: a general power analysis program. Behav Res Methods Instrum Comput 28:1–11.
Greenwood JJD, 1993. Statistical power. Anim Behav 46:1011.
Hoenig JM, Heisey DM, 2001. The abuse of power: the pervasive fallacy of power calculation for data analysis. Am Stat 55:19–24.
Hurlbert SH, 1994. Old shibboleths and new syntheses. Trends Ecol Evol 9:495–496.
Jennions MD, Møller AP, 2002a. Publication bias in ecology and evolution: an empirical assessment using the "trim and fill" method. Biol Rev 77:211–222.
Jennions MD, Møller AP, 2002b. Relationships fade with time: a meta-analysis of temporal trends in publication in ecology and evolution. Proc R Soc Lond Ser B 269:43–48.
Kazantzis N, 2000. Power to detect homework effects in psychotherapy outcome research. J Consult Clin Psychol 68:166–170.
Kloster KL, Layne BH, 1997. Low power, Type II errors, and other statistical problems in recent cardiovascular research. Am J Physiol 273:487–493.
Kotiaho JS, Tomkins JL, 2002. Meta-analysis, can it ever fail? Oikos 96:551–553.
Maddock JE, Rossi JS, 2001. Statistical power of articles published in three health psychology-related journals. Health Psychol 20:76–78.
Mengel MB, Davis AB, 1993. The statistical power of family practice research. Family Pract Res 13:105–111.
Moher D, Dulberg CS, Wells GA, 1994. Statistical power, sample size, and their reporting in randomized controlled trials. J Am Med Assoc 272:122–124.
Møller AP, Jennions MD, 2001. Testing and adjusting for publication bias. Trends Ecol Evol 16:580–586.
Møller AP, Jennions MD, 2002. How much variance can be explained by ecologists and evolutionary biologists? Oecologia 132:492–500.
Noor MA, Smith KR, 2000. Recombination, statistical power, and genetic studies of sexual isolation in Drosophila. J Hered 91:99–103.
Palmer AR, 1999. Detecting publication bias in meta-analyses: a case study of fluctuating asymmetry and sexual selection. Am Nat 154:220–233.
Palmer AR, 2000. Quasireplication and the contract of error: lessons from sex ratios, heritabilities and fluctuating asymmetry. Ann Rev Ecol Syst 31:441–480.
Peres-Neto PR, Olden JD, 2001. Assessing the robustness of randomization tests: examples from behavioural studies. Anim Behav 61:79–86.
Peterman RM, 1990. Statistical power analysis can improve fisheries research and management. Can J Fish Aquatic Sci 47:2–15.
Polit DF, Sherman RE, 1990. Statistical power in nursing research. Nursing Res 39:365–369.
Reeve HK, 2001. In search of unified theories in sociobiology: help from social wasps. In: Model systems in behavioral ecology (Dugatkin LA, ed). Princeton, New Jersey: Princeton University Press; 57–71.
Rosenberg MS, Adams DC, Gurevitch J, 2000. MetaWin: statistical software for meta-analysis, version 2.0. Sunderland, Massachusetts: Sinauer Associates.
Rossi JS, 1990. Statistical power of psychological research: what have we gained in 20 years? J Consult Clin Psychol 58:646–656.
Sedlmeier P, Gigerenzer G, 1989. Do studies of statistical power have an effect on the power of studies? Psychol Bull 105:309–316.
Siegel S, Castellan NJ Jr, 1988. Nonparametric statistics for the behavioural sciences, 2nd ed. Singapore: McGraw-Hill.
Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ, 2000. Publication and related biases. Health Technol Assess 4(10):1–115.
Stoehr AM, 1999. Are significance thresholds appropriate for the study of animal behaviour? Anim Behav 57:F22–F25.
Thomas L, Juanes F, 1996. The importance of statistical power analysis: an example from Animal Behaviour. Anim Behav 52:856–859.
Thomas L, Krebs CJ, 1997. A review of statistical power analysis software. Bull Ecol Soc Amer 78:126–139.
Thompson CF, Neill AJ, 1993. Statistical power and accepting the null hypothesis. Anim Behav 46:1012.
Toft CA, Shea PJ, 1983. Detecting community-wide patterns: estimating power strengthens statistical inference. Am Nat 122:618–625.
Zar JH, 1999. Biostatistical analysis, 4th ed. London: Prentice-Hall.