Replications in Comparative Cognition: What Should We Expect and How Can We Improve?

Direct replication studies follow an original experiment’s methods as closely as possible. They provide information about the reliability and validity of an original study’s findings. The present paper asks what comparative cognition should expect if its studies were directly replicated, and how researchers can use this information to improve the reliability of future research. Because published effect sizes are likely overestimated, comparative cognition researchers should not expect findings with p-values just below the significance level to replicate consistently. Nevertheless, there are several statistical and design features that can help researchers identify reliable research. However, researchers should not simply aim for maximum replicability when planning studies; comparative cognition faces strong replicability-validity and replicability-resource trade-offs. Next, the paper argues that it may not even be possible to perform truly direct replication studies in comparative cognition because of: 1) a lack of access to the species of interest; 2) real differences in animal behavior across sites; and 3) sample size constraints producing very uncertain statistical estimates, meaning that it will often not be possible to detect statistical differences between original and replication studies. These three reasons suggest that many claims in the comparative cognition literature are practically unfalsifiable, and this presents a challenge for cumulative science in comparative cognition. To address this challenge, comparative cognition can begin to formally assess the replicability of its findings, improve its statistical thinking, and explore new infrastructures that can help the field create and combine the data necessary to understand how cognition evolves.

The difference between Population 1 and Population 2 was calculated so as to give the desired power for a one-tailed, two-sample t-test with n = 10 per group. 10,000 samples were then drawn from each population and compared with each other, and the p-value and mean difference for each pair of samples were recorded. The proportion of p-values below .05 was calculated, and the mean difference between samples associated with these p-values was compared with the mean difference across all samples to calculate the unstandardized effect-size inflation.
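The sampling procedure above can be sketched as follows. The original simulations were run in R (see the GitHub link below); this is a minimal stdlib-Python sketch under assumed parameters: unit population SDs and a true mean difference of 0.775, which gives roughly 50% power for this design (the critical t of 1.734 is the one-tailed value for df = 18 at α = .05).

```python
# Sketch of Simulation 1: repeatedly sample two groups, test them, and
# compare the mean difference among "significant" samples to the overall
# mean difference. Parameter values are illustrative, not the paper's.
import random
import statistics

random.seed(1)
N_SIMS = 10_000
N = 10                 # per-group sample size, as in the text
T_CRIT = 1.734         # one-tailed critical t for df = 18, alpha = .05
TRUE_DIFF = 0.775      # assumed true difference, giving ~50% power

diffs, sig_diffs = [], []
for _ in range(N_SIMS):
    g1 = [random.gauss(0, 1) for _ in range(N)]
    g2 = [random.gauss(TRUE_DIFF, 1) for _ in range(N)]
    diff = statistics.mean(g2) - statistics.mean(g1)
    # pooled two-sample t statistic (equal group sizes)
    sp = ((statistics.variance(g1) + statistics.variance(g2)) / 2) ** 0.5
    t = diff / (sp * (2 / N) ** 0.5)
    diffs.append(diff)
    if t > T_CRIT:     # "published": significant in the predicted direction
        sig_diffs.append(diff)

power = len(sig_diffs) / N_SIMS
inflation = statistics.mean(sig_diffs) / statistics.mean(diffs)
print(f"power ~ {power:.2f}, effect-size inflation ~ {inflation:.2f}x")
```

Because publication selects for significance, the mean difference among the "significant" samples overestimates the true difference; the ratio of the two means is the unstandardized effect-size inflation described in the text.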
Next, the expected number of exact replication studies producing a significant result in the same direction as the original was calculated by multiplying the number of significant results from the simulation by the power of the test. This was done for a range of p-values (Table 2), as well as overall.
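Because an exact replication is statistically independent of the original study, the probability that a published (significant) original is followed by a significant same-direction replication is simply the power of the test. A minimal stdlib-Python sketch under assumed parameters (unit SDs and a true difference of 0.775, giving roughly 50% power) illustrates this:

```python
# Sketch of the replication step: for each "published" (significant)
# original, simulate an exact replication with the same design; the
# proportion of significant same-direction replications should
# approximate the test's power. Parameter values are illustrative.
import random
import statistics

def one_study(diff, n=10, rng=random):
    """Return the pooled two-sample t statistic for one simulated study."""
    g1 = [rng.gauss(0, 1) for _ in range(n)]
    g2 = [rng.gauss(diff, 1) for _ in range(n)]
    d = statistics.mean(g2) - statistics.mean(g1)
    sp = ((statistics.variance(g1) + statistics.variance(g2)) / 2) ** 0.5
    return d / (sp * (2 / n) ** 0.5)   # df = 18 here

random.seed(2)
T_CRIT, TRUE_DIFF = 1.734, 0.775       # one-tailed crit t (df = 18), assumed effect
n_sig = n_rep = 0
for _ in range(10_000):
    if one_study(TRUE_DIFF) > T_CRIT:          # original is "published"
        n_sig += 1
        if one_study(TRUE_DIFF) > T_CRIT:      # exact replication succeeds
            n_rep += 1

rep_rate = n_rep / n_sig
print(f"replication rate ~ {rep_rate:.2f} (should approximate the power)")
```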
Finally, although not included in the manuscript, the exact replication studies were also simulated. Predictably, the results were consistent with the mathematical derivation, and the p-value distributions of the original published studies and the replication studies are given below in Figures A1 and A2.

Figure A2. Density distributions of the p-values from the first simulation study. Blue plots represent the p-value distribution of the original studies, which all fall below .05 due to a simulated publication bias. Red plots represent the p-value distribution of the exact replication studies. The four plots are arranged by the power of the studies, from 5% in the uppermost panel, through 20% and 50%, to 80% in the lowermost panel.

Simulation Study 2 Details
Again, the code for the second simulation can be found at: https://github.com/BGFarrar/P-valuesimulations/blob/master/CCreplicationsV1.R. The data were simulated using edited code from DeBruine and Barr (2019). Data were simulated from the following model:

LT_si = β_0 + S_0s + I_0i + (β_1 + S_1s) · X_i + e_si

This model is identical to DeBruine & Barr (2019, p. 7), but we swapped LT (Looking Time) for RT (Reaction Time). The looking time for subject s on item i, LT_si, is composed of a population grand mean β_0, a by-subject random intercept S_0s, a by-item (either physically possible or impossible image) random intercept I_0i, a fixed slope β_1, a by-subject random slope S_1s, and a trial-level residual e_si. X_i is the condition.
Across all of our simulations, subjects were simulated with a correlation of 0.2 between their random intercepts and random slopes, meaning that subjects with longer looking times showed, on average, larger looking-time differences.
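One standard way to generate by-subject random intercepts and slopes with a correlation of 0.2 is a Cholesky-style construction. The sketch below (in stdlib Python rather than the R of the original code, with illustrative SDs that are not the paper's values) checks that the empirical correlation comes out near 0.2:

```python
# Sketch of correlated random effects: the slope shares a rho-weighted
# standard-normal component with the intercept, so corr(intercept, slope)
# = rho. TAU_0 and TAU_1 are assumed SDs for illustration only.
import math
import random

random.seed(3)
RHO = 0.2                     # intercept-slope correlation from the text
TAU_0, TAU_1 = 100.0, 40.0    # illustrative random-effect SDs

intercepts, slopes = [], []
for _ in range(20_000):
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    intercepts.append(TAU_0 * u1)
    slopes.append(TAU_1 * (RHO * u1 + math.sqrt(1 - RHO**2) * u2))

# empirical Pearson correlation, computed by hand from the samples
mi = sum(intercepts) / len(intercepts)
ms = sum(slopes) / len(slopes)
cov = sum((a - mi) * (b - ms) for a, b in zip(intercepts, slopes)) / len(slopes)
sd_i = math.sqrt(sum((a - mi) ** 2 for a in intercepts) / len(intercepts))
sd_s = math.sqrt(sum((b - ms) ** 2 for b in slopes) / len(slopes))
r = cov / (sd_i * sd_s)
print(f"empirical correlation ~ {r:.2f}")
```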
Across the simulations we varied the main effect of condition (200, 100, 0) and the number of trials per condition (1, 5, 100) in a 3 × 3 design. 10,000 datasets were simulated for each design, and the analyses differed slightly between the designs to avoid singular fits. For the single-trial designs, the data were analyzed using paired t-tests, and for the five- and one-hundred-trial designs, the data were analyzed using a mixed-effects model with the following structure:

lmer(LookingTime ~ Condition + (1 | subj_id), simulateddata, REML = FALSE)

Finally, for the five-trial designs, the calculated p-values might be slightly inaccurate, as a small proportion of simulations still led to singular fits. This may also be something to consider when interpreting the analysis of Bird and Emery (2010), which has even more parameters in the model.
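In the single-trial designs, each subject contributes one looking time per condition, so the by-subject random intercept cancels in the within-subject difference and a paired t-test is appropriate. A minimal stdlib-Python sketch under assumed parameter values (n = 25 subjects, grand mean 800, condition effect 200, illustrative SDs; the critical t of 2.064 is the two-tailed value for df = 24 at α = .05):

```python
# Sketch of the single-trial analysis: simulate one looking time per
# subject per condition from the model in the text (deviation-coded
# condition, X = +/-0.5) and run a paired t-test. All parameter values
# are illustrative assumptions, not the paper's.
import math
import random

random.seed(4)
N_SUBJ = 25
BETA_0, BETA_1 = 800.0, 200.0   # grand mean and condition effect (assumed)
TAU_1, SIGMA = 40.0, 100.0      # random-slope and residual SDs (assumed)
T_CRIT = 2.064                  # two-tailed critical t, df = 24, alpha = .05

diffs = []
for _ in range(N_SUBJ):
    slope = random.gauss(0, TAU_1)   # by-subject random slope
    # by-subject random intercept would cancel in the difference, so omitted
    lt_impossible = BETA_0 + 0.5 * (BETA_1 + slope) + random.gauss(0, SIGMA)
    lt_possible = BETA_0 - 0.5 * (BETA_1 + slope) + random.gauss(0, SIGMA)
    diffs.append(lt_impossible - lt_possible)   # within-subject difference

mean_d = sum(diffs) / N_SUBJ
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (N_SUBJ - 1))
t = mean_d / (sd_d / math.sqrt(N_SUBJ))
print(f"paired t = {t:.2f}, significant: {abs(t) > T_CRIT}")
```

With more than one trial per condition the subject-level structure no longer cancels, which is why the five- and one-hundred-trial designs are analyzed with the mixed-effects model instead.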