The Status and Value of Replications in Animal Behavior Science

Replications are widely considered an essential tool for evaluating scientific claims. However, many fields have recently reported that replication rates are low and that, when replications are conducted, many findings do not successfully replicate. These circumstances have led to widespread debate about the value of replications for research quality, the credibility of research findings, and the factors contributing to current problems with replicability. This special issue brings together researchers from various areas within the field of animal behavior to offer their perspectives on the status and value of replications in animal behavior science.


What is a Replication in Animal Behavior?
Another problem raised by the contributions to this issue, and one which might account for the low publication rates of replications, is how to categorize replication studies and interpret their results. The traditional view differentiates between direct and conceptual replications, with direct replications acting as tests of reliability and conceptual replications acting as tests of generalizability. The Open Science Collaboration (Open Science Collaboration, 2015, p. aac4716-1) defines a direct replication as "the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding." In contrast, Nosek and Errington (2017, p. 1) define a conceptual replication as a study that uses "a different methodology…to test the same hypothesis." Some of the papers in this special issue fit neatly into this framework. For example, Lawson et al. (2021) tested whether yellow warblers' alarm calls varied based on different types of threats to their nests, namely brood parasites and nest predators. Previous studies used visual or combined visual and auditory stimuli, but Lawson et al. (2021) used only acoustic playbacks of brood parasites or nest predators to investigate the yellow warblers' response to these nest threats. The use of a methodology different from previous studies would typically mean this experiment is classified as a conceptual replication, one that demonstrates that responses to threat stimuli presented in one modality generalize to another.
However, the traditional categorization of direct and conceptual replication is not always suitable. If direct replications are demarcated based on researchers' beliefs about what conditions are sufficient for obtaining the previous finding, then the researchers' judgements rely on a complex theoretical framework underpinning those beliefs, one that may contain competing theories with similar or varying degrees of empirical support. Such judgements are required for practically every decision made when planning a replication, ranging from what taxonomic rank constitutes the population in which a behavior/ability is believed to be present (both Boyle, 2021, and Halina, 2021, allude to this) to whether placing a speaker 15 m from a focal individual is a substantial methodological difference from placing a speaker at a 30 m distance (Salis et al., 2021). An example of how difficult this decision-making process can be is seen in Lundgren, Gómez Dunlop et al. (2021), who ran a study investigating the influence of monoaminergic gene expression on red jungle fowl chicks' interindividual behavioral differences. The authors explicitly noted that there is little consensus about how to measure interindividual differences in animal behavior and that the influence of gene expression can be tested using correlations or through active manipulation (such as knock-out techniques). The combination of the complexity of the systems in question, the limits of current understanding, and the multiple techniques for investigating the same effect makes it difficult to establish which methodological changes could impact the results.
The need to make a range of different decisions regarding methodology is not limited to empirical replications but also affects attempts to replicate computational models of behavior. For instance, Invernizzi and Ruxton (2021) wrote new code to reproduce the behavior of a computational model of ant behavior. The originally reported model (Franks & Deneubourg, 1997) did not indicate how time was simulated; one possibility is that the authors used computational time, a measure that has changed substantially in the roughly 25 years since the publication of the original paper. Thus, Invernizzi and Ruxton chose to simulate time based on the number of rounds that were computed, where during a round each simulated ant moved one unit. Given the absence of information about the original method, it is impossible to judge whether the change should make a substantial difference to the outcome of the model.
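To make the time-keeping choice concrete, consider the following minimal sketch. It is our illustration, not Invernizzi and Ruxton's code or the Franks and Deneubourg model; the population size, one-dimensional track, and random movement rule are all hypothetical. It shows what simulating time as discrete rounds looks like: time advances only when every simulated ant has moved one unit, independent of how fast the computer happens to run.

```python
import random

N_ANTS = 20     # hypothetical number of simulated ants
N_ROUNDS = 100  # hypothetical number of simulated rounds

def step(position: float) -> float:
    """Move one ant a single unit in a random direction along a 1-D track."""
    return position + random.choice((-1.0, 1.0))

positions = [0.0] * N_ANTS
for _ in range(N_ROUNDS):
    # One unit of simulated time: every ant moves exactly one unit,
    # regardless of how fast the hardware executes this loop.
    positions = [step(p) for p in positions]

mean_displacement = sum(abs(p) for p in positions) / N_ANTS
print(f"Mean displacement after {N_ROUNDS} rounds: {mean_displacement:.2f}")
```

Under this convention, two implementations written decades apart step through identical simulated time, which is exactly the property that computational time lacks.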
However, knowing an original study's method does not necessarily help establish whether a replication should be considered direct or conceptual. For many studies, whether a future study's method is believed sufficient to replicate the original depends on one's theoretical position and current knowledge. For example, Becker et al. (2021) tested dogs' perception of the Ebbinghaus-Titchener visual illusion using a spontaneous choice task, whereas previous research employed training-based protocols. In training protocols, the animals first learned to choose the larger (or smaller) of two circles before being presented with the illusory stimuli during test trials. Thus, a priori, the method used should have had minimal influence on perception per se and on the likelihood of obtaining the previously observed finding. The potential for a replication to be motivated by competing theoretical positions is highlighted by Silva, Faragó et al. (2021), who tested whether a Portuguese sample could perceive the emotional state conveyed by dog barks and categorize the contexts in which the barks were made. The original study was conducted on a Hungarian sample and, at one level, this ability was expected to be fundamental and to span a range of human populations. Under this theoretical position, other nationalities should perform similarly to the original Hungarian participants. However, the authors were also motivated by evidence of cultural differences in emotion recognition that could lead to differences between the performance of the original and new populations. Thus, at a broad theoretical level, the study may count as a direct replication because such emotion perception is theorized to be universal; but when the different sample is considered in light of differences in emotion recognition between populations, it is more likely to be considered, under traditional views of replication, either a conceptual replication or an extension of the original research.
The ambiguity created by misaligned theories for replication studies is not simply a definitional issue (e.g., whether a study is a direct or conceptual replication). These issues have important consequences for what might be considered a failed or successful replication. The definition of a "direct replication" states that the conditions used should be believed sufficient to obtain the previously observed findings. Critically, theory influences what constitutes obtaining the previous findings, in the same way that theory guides the interpretation of what changes to conditions constitute a direct vs conceptual replication. For example, Lawson et al. (2021) demonstrate that, as in previous research, yellow warblers are more likely to make seet calls to brood parasites than to nest predators, but there was no statistically significant difference in how closely the yellow warblers approached the brood parasites compared to the nest predators, a difference that had been reported in previous research. In this case, the call type was of primary theoretical importance and so the replication can be considered successful. However, it also highlights that, when studies have multiple dependent variables, replications may not replicate every comparison in the original study. A further issue with interpreting the success of replications is that even the closest possible replication is unlikely to produce a result that precisely matches the results of the previous study, and there remain questions over what degree of similarity constitutes success. In light of this question, Salis et al. (2021) discuss how the interpretation of whether their results matched the previous study's results largely depends on whether the comparison is based on p-values or effect sizes. Relatedly, O'Neill et al. (2021) did not find a statistically significant difference between the control and experimental conditions, but it remains unclear whether this is linked to the small sample size/low power and, consequently, due to a type II error. To this end, many of the large-scale replication attempts conducted in psychology and other fields have opted to compare effect sizes between the original study and the replication.
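The difference between the two comparisons can be made concrete with a small sketch. The summary statistics below are entirely made up, and the formulas are the standard Cohen's d and its large-sample confidence interval, not a calculation from any of the papers discussed here. The point is that a replication with a smaller sample can fail to reach significance while estimating an effect size close to the original's:

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def d_confint(d, n1, n2, z=1.96):
    """Approximate 95% CI for d (large-sample normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical summary statistics (mean1, mean2, SD1, SD2, n1, n2);
# the original study has a larger sample than the replication.
studies = {
    "original":    (10.0, 8.0, 3.0, 3.0, 20, 20),
    "replication": (9.5, 8.4, 3.2, 3.1, 15, 15),
}

for label, (m1, m2, s1, s2, n1, n2) in studies.items():
    d = cohens_d(m1, m2, s1, s2, n1, n2)
    lo, hi = d_confint(d, n1, n2)
    verdict = "significant" if lo > 0 or hi < 0 else "not significant"
    print(f"{label}: d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}] ({verdict})")
```

Read as a significance test, the hypothetical replication "fails"; read as an effect-size estimate with its confidence interval, it may be entirely consistent with the original result.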
With the ever-increasing prevalence and relevance of replications, the question of what exactly constitutes a (successful) replication will likely remain a topic of discussion. Recently, novel perspectives on the debate have been introduced (e.g., Machery, 2020; Nosek & Errington, 2020). Farrar et al. (2021b) and Halina (2021) specifically discuss the resampling account (Machery, 2020) as an alternative framework for thinking about replications in animal behavior research. The usefulness of this account is that it highlights the similarity between the replicability and the generalizability of results.
Replication studies resample from a particular population of participants, but resampling is also possible from populations of all other aspects of the original experiment's methodology, such as sites, measures, and experimental manipulations. Critically, in all these aspects, researchers are interested in generalizing from their sample to the population. Each of these aspects of an experiment has a specific instance (token) in the original experiment, and a replication must use a token from the same class (type or population) of tokens. This means that an experiment is no longer considered a replication if it samples from outside the types (populations) used in the original methodology (Machery, 2020). Perhaps the most important aspect of the resampling view of replications, albeit a pragmatic one, is that it forces researchers to consider the theoretical basis of their resampling, because the population that can be resampled from must be constrained by statements in the original experiment and/or theory (Halina, 2021). For example, a study designed to test the hypothesis that pigs can fly is not replicated by a study testing whether fish can fly, because the original hypothesis is clearly confined to the population of pigs, the even-toed ungulates of the genus Sus.
In view of this, Halina (2021) argues that, at least in primate theory of mind research, replications are more common than usually assumed. This is because what are typically considered divergent or novel experiments can be regarded as tests of the same underlying hypothesis on the same population. For example, according to Halina's argument, the hypothesis that chimpanzees are sensitive to what conspecifics can perceive has been tested using a range of different measurements, treatments, and populations; that is, by sampling from these different methods and populations, these studies (reviewed in Krupenye & Call, 2019) have replicated the original finding that chimpanzees respond to what others orient towards (Hare et al., 2000).
To assess whether Experiment B is a replication of Experiment A, we need to consider the theoretical claim made by Experiment A in order to decide which experimental components come from the same population (Farrar et al., 2021b). For example, in O'Neill et al.'s (2021) replication of Taylor et al. (2012), one important control counterbalanced the order of conditions, something that was missing in the original study. This counterbalancing means the replication does not have the same procedure as the original study. However, whether New Caledonian crows (NCCs) reason about hidden causal agents should not be influenced by the order of the tests conducted (Boogert et al., 2013; Dymond et al., 2013). Thus, the change in order can be understood as a resampling of treatment to test the hypothesis that NCCs reason about the hidden causal agent (the theoretical claim in question sensu Farrar et al., 2021b). Another example is the paper by Salis et al. (2021) who, unknowingly, replicated a study investigating how great tits perceive the order of calls of an allopatric species (Dutour et al., 2020). Both sets of authors collected data in the same territory but in different breeding seasons. Changes were made to the distance of the playback from the tested birds and to some of the parameters of the calling sequence presented. Such deviations in methodology are often reasonable. However, it commonly remains open to debate whether these changes resample from the same population (here, of treatments) as the original study.

The Value of Replications in Animal Behavior Research
The traditional distinction between direct and conceptual replications often highlights that the type of replication conducted is motivated by different epistemological reasons. Direct replications are thought to test an effect's reliability, whereas conceptual replications are considered tests of its generalizability because they involve changes to the conditions such that the explanation of the effect is unlikely to be due to just one specific methodology (Nosek & Errington, 2017). However, one consequence of considering the resampling approach is that it becomes apparent that all replications inevitably resample at least one token, namely, time (this issue is at least implicitly acknowledged in Farrar et al., 2021b). Thus, at a minimum, all replications test an effect's generalizability over time. The resampling approach thereby illustrates how the question of replicability is always a question of generalizability (Farrar et al., 2021b). The link between replication and generalizability is further highlighted when the terminology of the resampling account is linked with statistical terminology (as explained in Halina, 2021): a token that is resampled is one that is treated as a random factor (Machery, 2020) and thus stands as an example of a broader class (see also Yarkoni, 2020). Halina (2021), on the other hand, argues that replications that resample from only one component should be distinguished from replications that resample from many components at the same time, because in the latter situation the core theoretical claim depends on a larger number of auxiliary hypotheses. In the case of a failed replication, it becomes necessary to test which hypothesis (including the core claim) led to an incorrect prediction, which is harder to establish in the latter case. Until each of the auxiliary hypotheses has been tested, there is nothing to suggest the failed replication actually contradicts the original result.
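To illustrate this statistical reading with a concrete (and entirely hypothetical) example, the sketch below treats one resampled token, the test site, as a random factor in a mixed model, so that the fixed effect of interest is estimated as something that generalizes across sites rather than holding for one site only. The simulated data, effect sizes, and the use of Python's statsmodels package are our assumptions for illustration, not an analysis from any study in this issue:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_sites, n_per_cell = 6, 20

# Simulate scores for two conditions at several sites; each site gets its
# own random offset, mimicking a token (site) drawn from a wider population.
rows = []
for site in range(n_sites):
    site_offset = rng.normal(0.0, 1.0)
    for condition in (0, 1):
        scores = 5.0 + 1.5 * condition + site_offset + rng.normal(0.0, 2.0, n_per_cell)
        rows += [{"site": site, "condition": condition, "score": s} for s in scores]
data = pd.DataFrame(rows)

# Mixed model: condition as a fixed effect, site as a random (grouping)
# factor, so the condition effect is estimated across sites.
result = smf.mixedlm("score ~ condition", data, groups=data["site"]).fit()
print(result.summary())
```

Treating site as a fixed effect instead (or ignoring it) would answer a narrower question about those particular sites, which is precisely the distinction the resampling account draws between a token and the class it stands for.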
Critically, the same issue applies to original experiments as well as replications (Boyle, 2021). Boyle (2021) argues that the problem of auxiliary hypotheses obscuring the interpretation of empirical results is not specific to replications but can be seen throughout the literature. Most notably, negative results are often attributed to the exact methodology used in an experiment rather than to the absence of an effect or ability in the sample. The issue is also relevant for positive results, given that data are regularly compatible with multiple theoretical explanations (Boyle, 2021). According to Boyle (2021), this theoretical openness and uncertainty complicates the role of replications because both the status of a study as a replication and its acceptance as successful are theoretically driven, yet there are few commonly accepted theories. In light of this, there may be little benefit in distinguishing replications from original research because they do not differ in epistemic value.
According to Boyle (2021), this theoretical openness is simply a part of the epistemic circumstances of our field. If we accept that, Boyle suggests we could redefine progress in our field in line with Andrews' (2014) suggestion of progress by calibration, where the scientific process resembles putting together a puzzle. Rather than seeing this as a problem of demarcation and calling into question the 'scientificness' of our field (Chambers, 2013; Schmidt, 2009; Zwaan et al., 2018), Boyle suggests that it can constitute our starting point. In doing so, both original studies and replication studies need to be conducted and, importantly, also published, to allow this calibration.
Unfortunately, as emphasized by Shaw et al. (2021) and Khan and Wascher (2021), it is still difficult to receive funding for and to publish replication studies in animal behavior research. Recent developments make us hopeful, however. For example, the last two years have seen more than half a dozen papers reporting (successful and unsuccessful) attempts to replicate the influential mirror mark test in corvids (Brecht et al., 2020; Buniyaadi et al., 2020; Clary et al., 2020; Parishar et al., 2021; Soler et al., 2020; Vanhooland et al., 2020; Wang et al., 2020). Moreover, registered replications are becoming more common (for a recent example see Motes-Rodrigo et al., 2021). Animal Behavior and Cognition specifically states that empirical replication studies will be published, and authors are encouraged to use the Pre-registered Replication Articles format the journal offers (Beran, 2020). Researchers in the field seem to recognize the need for replication studies (Farrar et al., 2021b; Fraser et al., 2020), and multiple publications discuss recommendations for increasing the replicability of specific fields (Farrar et al., 2020, for comparative cognition; O'Dea et al., 2021, for ecology and evolutionary biology).
Additional recommendations are discussed by several authors of this special issue. Farrar et al. (2021b) and Nawroth and Gygax (2021) argue that conducting studies that increase the heterogeneity of samples will lead to more generalizable, and thus more replicable, results (see also Voelkl & Würbel, 2021; von Kortzfleisch et al., 2020). This has clear benefits for the animal behavioral sciences in general but may be of particular importance for fields like animal welfare, where policies and treatments need to be applied to a wide range of settings that can vary considerably from lab conditions. It may also be important for fields where replication studies are not possible for various reasons (as discussed, for example, in Shaw et al., 2021) and where such heterogenization can lead to better inferences (but see Farrar et al., 2021b, for a discussion of when increasing homogeneity may be a better way forward). Relatedly, there were also calls for the use of multi-lab or multi-setting experiments to help both increase sample sizes for difficult-to-access species and improve the heterogeneity of settings (Khan & Wascher, 2021). Within the field of primatology, the ManyPrimates project has been founded to facilitate collaborative research, ease replication projects, and test the relationship between ecology, behavior, and cognition (ManyPrimates et al., 2019a, 2019b, 2021). More recently, ManyDogs has been established for similar purposes within canine science and is currently recruiting contributors for the first ManyDogs study on dogs' ability to follow human pointing (https://manydogsproject.github.io/).
Finally, some authors highlight that much of animal cognition research lacks a core theory and that many competing perspectives exist (e.g., Boyle, 2021), which makes classifying replications and interpreting their results challenging. Researchers may be able to employ multiple approaches in a field with such theoretical openness. In some research areas, the theoretical openness may be reduced through theory development, aided by formalization (Allen, 2014; Farrar et al., 2021b; Guest & Martin, 2020; Lee et al., 2019; Lind, 2018; Smith et al., 2012; van Rooij & Baggio, 2021; for a study that set out to reproduce a model of ant nest wall building in this issue, see Invernizzi & Ruxton, 2021). In areas in which this is not possible, claims may have to be adjusted to reflect the epistemic circumstances (Yarkoni, 2020), and criteria other than replications can be employed to evaluate the quality of research and the reliability of claims (Leonelli, 2018).
In conclusion, although barriers remain to conducting and publishing replications, it is clear that they are a valuable component of empirical research. This special issue highlights that the value of a replication is more than a single result and that, in practice, replications often play a pragmatic role: they help researchers recognize the theoretical assumptions underpinning their choice of experimental design, which pushes forward both experimental design and theory building.