Animal Behaviour
journal homepage: www.elsevier.com/locate/anbehav
Commentary
Article info

Article history:
Received 28 October 2013
Initial acceptance 19 November 2013
Final acceptance 14 January 2014
Available online 12 March 2014
MS. number: 13-00901

Keywords:
animal welfare
behaviour scoring
cognitive bias
confirmation bias
double-blind experiment
expectation bias
information bias
observer bias
panting score
qualitative behaviour assessment

Abstract

Most observers in behaviour studies are aware of relevant information about the animals being observed. We investigated whether observer expectations influence subjective scoring methods during a class practicum. Veterinary students were trained in recording negative and positive interactions between pigs, in scoring the degree of panting in cattle and in applying qualitative behaviour assessment (QBA) using a fixed set of terms for assessing hens’ behaviour. The students applied these methods in three trials in which they were shown duplicated video recordings of the same animals: the original and a slightly modified version (to prevent recognition at second viewing). When scoring the duplicated recordings they were told either correct or false information about the conditions in which the animals had been filmed. The false information reflected plausible study scenarios in ethology and was used to create expectations about the outcome. As in reality the students scored the identical behaviour twice, the difference in the scores for the original and modified recordings reflects expectation bias due to providing different contextual information. In all trials there was evidence of expectation bias: students scored the ratio of positive to negative interactions higher when told that the observed pigs had been selected for high social breeding value, they scored cattle panting higher when told that the ambient temperature was 5 °C higher than in reality, and in the QBA they indicated more positive and fewer negative emotions when told that the hens were from an organic instead of a conventional farm. The magnitude of the bias in the QBA trial was related to the opinion of the students about hen welfare in organic versus conventional farms. Although veterinary students may not be representative of practising ethologists, these findings do indicate that observer bias could influence subjective scores of animal behaviour and welfare.
© 2014 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.
Scientific research that relies on observation and interpretation by the investigator has long been confronted with a well-known and fundamental problem: humans cannot be assumed always to process information objectively and accurately. Natural selection has shaped the human sensory processing system to promote behaviour that enhances the spread of our genes, not necessarily to provide us with a complete and correct picture of reality. There is ample evidence that human perception can be selective and biased, and that a healthy human brain often makes incorrect associations and deductions (Braeckman & Boudry, 2011). For example, psychologists have long recognized that people are prone to expectation bias, which refers to the psychological sway towards one opinion versus another as a result of possessing information extraneous to the task at hand (Page, Taylor, & Blenking, 2012). Information that confirms one’s beliefs or hypotheses is often favoured (Nickerson, 1998; Wason, 1960). Investigators and research staff are also susceptible to these pitfalls of the human brain. Often, they carry out experiments while they are predisposed by strong expectations about the outcome and deep-rooted assumptions about what is and what is not possible. These expectations may lead to conscious or unconscious biases in observation and recording of data.
The risk of these types of observer bias can be reduced by ensuring that the person collecting the data is unaware of which treatment each subject has received until after the experiment. Such blind trials are widely considered the best study design to minimize observer bias and are often required to attain regulatory approval for new drugs, dietary supplements and medical

* Correspondence: F. Tuyttens, Animal Sciences Unit, Institute for Agricultural and Fisheries Research (ILVO), Scheldeweg 68, 9090 Melle, Belgium.
E-mail address: frank.tuyttens@ilvo.vlaanderen.be (F. A. M. Tuyttens).
http://dx.doi.org/10.1016/j.anbehav.2014.02.007
0003-3472/© 2014 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.
F. A. M. Tuyttens et al. / Animal Behaviour 90 (2014) 273-280
students were falsely informed that one of the video clips was of a control group and the other of a group selected for high social breeding value, and that the aim of the trial was to test whether the ratio of positive to negative social interactions differed between the two clips. Group A was told that the original clip was of the high breeding value group and the subsequent modified clip was the control group; the opposite was told to group B. The true research aim, however, was to test whether telling the students that one of the clips was of high social breeding value pigs affected the number of negative and positive interactions they recorded.

Trial 2: Panting Score

The students were informed about thermoregulation and about the signs and consequences of thermal stress in cattle. The panting score was introduced as a measure of heat stress based on behavioural signs including the rate and depth of respiration and the amount of drooling (Gaughan, Mader, Holt, & Lisle, 2008; Mader, Davis, & Brown-Brandl, 2006). They were told how to score panting using a tagged visual analogue scale labelled with descriptors of an increasing degree of panting (Fig. 1). Such tagged visual analogue scales do not limit the precision and sensitivity with which observers can distinguish different degrees of severity of the condition concerned. They have been shown to have equal or superior interobserver reliability in comparison to ordinal scales for scoring lameness (Nalon et al., in press; Tuyttens, Sprenger, Van Nuffel, Maertens, & Van Dongen, 2009). During the training phase, the students were shown six video clips of cattle with varying degrees of panting.
Subsequently, the students were instructed to score 24 video clips of cattle using the tagged visual analogue scale. During this trial they could use a printed hand-out of the scale with the different descriptors (Fig. 1). A vertical coloured bar at the right-hand side of each video clip, as well as a video stamp in the upper right-hand corner, indicated ambient temperature at the time of recording. In half of the video clips, however, not the true temperature but a 5 °C higher temperature was indicated. Seven video clips were shown with the correct temperature to group A and with the elevated temperature to group B, and seven other video clips were shown with the correct temperature to group B and with the elevated temperature to group A. In addition, five video clips were shown twice, once with the correct temperature and once with the elevated temperature, to both groups. To reduce the risk of students recognizing the video clip when viewing it for the second time, one clip was shown as a mirror image and was slightly modified (by zooming in or out and by adjusting the brightness or image contrast). We tested whether the students recorded a higher panting score when an elevated ambient temperature was falsely indicated on the clips.

Trial 3: Qualitative Behaviour Assessment

QBA was introduced to the students as an integrative methodology that characterizes animal behaviour as a dynamic, expressive body language (Wemelsfelder, 2007; Wemelsfelder, Hunter, Lawrence, & Mendl, 2001). It was explained that it was originally developed as a free-choice profiling in which the assessors can generate their own descriptors for describing the expressive quality of how animals behave and how they interact with each other and the environment. For the purpose of the practicum, however, we worked with a fixed list of descriptors as in the Welfare Quality (2009) protocol in which QBA is used as an animal welfare measure at group level. After the hens were observed (from video for the sake of this practical) their behavioural expression was scored on a visual analogue scale for each of the 19 fixed qualitative descriptors. For each term (such as ‘calm’, ‘content’ and ‘frustrated’; see Table 1 for all terms) we used a 100 mm visual analogue scale defined by the left-hand point, which represented ‘minimum’ (i.e. the expressive quality indicated by the term is entirely absent in any of the animals seen) and the right-hand point which represented ‘maximum’ (i.e. the expressive quality is dominant across all observed animals).
For the trial proper we used a 5 min video recording of laying hens in a conventional commercial aviary. This 5 min video was split into two clips (clip A and clip B). One of these video clips (clip B) was mirrored and slightly modified (made brighter) to give the impression that it was filmed on a different farm from clip A. The students were told that one of the video clips (clip B for group A and clip A for group B) was from an organic laying hen farm. Prior to the QBA trial, the main differences between organic and conventional egg production were briefly explained. To test their pretest attitude towards organic farming, students were instructed to indicate on a 100 mm visual analogue scale whether, in their opinion, hen welfare was much worse (left side of the scale), equal (midpoint) or much better (right side of the scale) on organic versus conventional farms. Subsequently, to explain what they would see in the video clips, the students were shown a demonstration video of hens from a different farm but with a similar housing system. The layout of the
[Figure 1: horizontal scale marked every 10 mm from 0 to 100 mm, with panting score (PS) marks from 0 to 5. Descriptor tags along the scale, from left (least panting) to right (most panting):]
No panting, normal respiration, difficult to see chest movement.
Slight panting, mouth closed, no drool, easy to see chest movement.
Fast panting, drool present, no open mouth.
As for 2 but occasional open mouth panting, tongue not extended.
Open mouth and excessive drooling, neck extended, head held up.
As for 3 but with the tongue out slightly and occasionally fully extended for short periods.
Open mouth with tongue fully extended for prolonged periods with excessive drooling, neck extended and head up.
As for 4 but head held down, cattle “breathe” from the flank, drooling may cease.
Figure 1. The 100 mm tagged visual analogue scale labelled with descriptors used for scoring panting in cattle during trial 2.
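The counterbalanced presentation described for trial 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clip identifiers and the function name `viewings` are hypothetical, and the order in which clips were actually shown is not given in the text. Only the counts (seven clips per crossed set, five clips shown twice, a 5 °C offset) come from the description above.

```python
# Hypothetical clip identifiers; only the counts (7 + 7 + 5) come from the text.
SET_1 = [f"s1_{i}" for i in range(1, 8)]      # correct for group A, elevated for group B
SET_2 = [f"s2_{i}" for i in range(1, 8)]      # correct for group B, elevated for group A
REPEATED = [f"rep_{i}" for i in range(1, 6)]  # shown twice to both groups

ELEVATION_C = 5  # false offset added to the indicated temperature, in degrees C


def viewings(group):
    """Return (clip, indicated_offset_in_degC) pairs for one student group.

    The list is not in presentation order; the source does not specify one.
    """
    if group not in ("A", "B"):
        raise ValueError("group must be 'A' or 'B'")
    plan = []
    for clip in SET_1:
        plan.append((clip, 0 if group == "A" else ELEVATION_C))
    for clip in SET_2:
        plan.append((clip, ELEVATION_C if group == "A" else 0))
    for clip in REPEATED:
        plan.append((clip, 0))            # once with the correct temperature
        plan.append((clip, ELEVATION_C))  # once with the elevated temperature
    return plan


plan_a = viewings("A")
plan_b = viewings("B")
elevated_a = sum(1 for _, offset in plan_a if offset > 0)
```

Under these assumptions each group scores 24 viewings, half of them with the elevated indication, and every clip in the two crossed sets is seen with the correct temperature by one group and with the elevated temperature by the other.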
aviary housing system (litter area, elevated platforms with nestboxes, feeding chains and drinker lines) was explained. After this explanation, the students then watched and scored video clip A first and then clip B. We tested whether the QBA terms were scored differently if the students had been told that the hens were filmed on an organic farm and whether this difference was related to their opinion about hen welfare on organic as compared to conventional farms.

Statistical Analysis

The results of all trials were analysed using a mixed regression model with a random effect for student to correct for repeated measures. In each model, treatment (prior knowledge: e.g. control or high breeding value group, true or elevated temperature indication, conventional or organic production system), group (A or B) and video clip (original or modified) and all interactions were introduced as fixed effects. Interactions and fixed factors (except main effect of treatment and video clip) were removed from the final model if the estimated effect was not significant. The model assumptions were evaluated using graphical procedures (normality of the residuals was evaluated with a histogram and QQ-plot and the homogeneity of variance was evaluated on a plot of the residuals versus fitted values).
For the panting score trial, the data set was divided into two parts: (1) the 14 video clips that were shown only once in the same group combined with the first video clip from those shown twice to both groups (five video clips); and (2) the five video clips that were shown twice to both groups. The unique video clip number or true temperature was also introduced in the model. For the 19 video clips of the first part, we also tested whether the difference between the scores with correct and elevated temperature indication was associated with the corresponding standard deviation of the scores (as an indication of the variation between observers when scoring the same clip). A large standard deviation could indicate that these clips were ambiguous and therefore hard to score consistently by the students.
For the QBA trial, the QBA terms were handled using principal component analysis (PCA), with no rotation, using the princomp function in R 2.15.1 (The R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org). The first two PCA components and the individual QBA terms were analysed using a mixed regression model.
All tests were two tailed at a significance level of 5% and all calculations were performed using the lme function from the nlme package in R 2.15.1 for Windows.

RESULTS

Trial 1: Negative and Positive Social Interactions

A higher number of positive (ANOVA: F(1,153) = 29.6, P < 0.001) and fewer negative (ANOVA: F(1,153) = 75.7, P < 0.001) social interactions were counted when the students were informed that the pigs had been selected for high social breeding value as compared to the control pigs (Fig. 2). Modifying the video clips itself did not affect the recorded number of positive (ANOVA: F(1,153) = 1.6, P = 0.212) or negative interactions (ANOVA: F(1,153) = 0.6, P = 0.437). Also, no differences between the two groups of students were found (linear regression: F(1,153) = 1.0, P = 0.309 and F(1,153) = 0.00, P = 0.978).

Trial 2: Panting Score

For the 19 cattle recordings that were shown to each group only once, the students that had been led to believe that ambient temperature was 5 °C higher than in reality recorded a higher panting score on average than the students that had been shown the real temperature (ANOVA: F(1,2772) = 8.5, P = 0.004; Fig. 3). The magnitude of this difference was small, however (3.1% at 20 °C). The direction of the difference and its magnitude were not consistent for the various video clips (Fig. 3). The average panting score increased with the true ambient temperature at the time of video recording (linear regression: F(1,2772) = 984.0, P < 0.001), but did not differ between the two groups of students (linear regression: F(1,153) = 0.12, P = 0.729). There was a nonsignificant trend for the difference between the panting score allocated to the clips with the indication of an elevated temperature versus that for the real temperature to decrease with increasing true ambient temperature (ANOVA: F(1,2772) = 3.0, P = 0.081). This difference was also not related to the standard deviation of the scores (as an indicator of the amount of variation between observers when scoring the same clips; ANOVA: F(1,17) = 0.7, P = 0.41).
For the five cattle recordings that had been shown twice to each group (once with the correct temperature and once with an elevated temperature of 5 °C), the panting score was 5.0% higher (at 20 °C) on average for the indication of an elevated temperature versus the real temperature (ANOVA: F(1,1384) = 54.6, P < 0.001; Fig. 4). The average panting score increased with the true ambient temperature at the time of video recording (linear regression: F(1,619) = 1070.3, P < 0.001), but did not differ between the two student groups (linear regression: F(1,153) = 1.1, P = 0.287). The difference between the panting score allocated to the clips with the indication of an elevated temperature versus the real temperature

[Figure 2: bar charts in two panels, (a) positive social behaviour and (b) negative social behaviour. Vertical axis: number of scored behaviours (0 to 12). Legend: Control, SBV+. Both panels are marked ***.]
Figure 2. The mean number (with SE) of (a) positive and (b) negative behaviours scored during the 5 min video clip of the control group and the video clip where the students were told that the animals were selected for high social breeding value (SBV+). ***P < 0.001.
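The repeated-measures idea behind the analysis described under Statistical Analysis can be illustrated with a much simpler stand-in. The sketch below is not the authors' lme model: it assumes the simplest balanced case, in which every student scores the same recording once under correct and once under false information, and it uses within-student differences so that each student's general tendency to score high or low cancels out, which is roughly the role the random effect for student plays in the mixed model. The function name `paired_bias` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_bias(scores_true, scores_false):
    """Within-student expectation-bias estimate for one duplicated recording.

    scores_true[i] and scores_false[i] are the scores the same student gave
    under correct and false information. Returns (mean difference, standard
    error, t statistic); compare t with a t distribution on n - 1 df.
    """
    if len(scores_true) != len(scores_false):
        raise ValueError("each student needs one score per condition")
    diffs = [f - t for t, f in zip(scores_true, scores_false)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two students")
    mean_diff = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    if se == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff, se, mean_diff / se


# Hypothetical scores (mm on a 100 mm scale) from four students.
true_info = [30, 42, 25, 38]
false_info = [34, 45, 25, 43]
mean_diff, se, t_stat = paired_bias(true_info, false_info)
```

A positive mean difference here means that scores were higher under the false information. The full model in the paper additionally includes group, video clip and their interactions as fixed effects, which this sketch leaves out.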
[Figure 3: bar chart of panting score (vertical axis, 0 to 100) per video clip, with separate bars for the real temperature and the elevated temperature indication. Several clips carry significance markers (* or **).]
Video no.: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Real temperature (°C): 19 20 20 20 20 23 23 23 23 23 32 32 32 34
Figure 3. The mean + SE panting score (0-100 visual analogue score) given to the 14 video clips of cattle shown to one group of students with the correct ambient temperature indication (real temperature) versus the other group of students with an elevated temperature indication (elevated temperature). The real ambient temperature at the time of filming is indicated at the bottom of the figure. *P < 0.05; **P < 0.01.
[Figure residue: bar chart with a vertical axis from 0 to 60 and video nos. 15 to 19 on the horizontal axis, with significance markers ** and ***. Caption not recovered.]
clips were recorded in the same aviary at almost exactly the same time and without any clear disturbance of the flock during recording.
The PCA was carried out to elucidate the relationship between the 19 descriptors and revealed two major components. The first component explained 42.9%, and the second component 12.8%, of
Table 1
Estimates and significance levels of linear mixed-effects models on each of the 19 fixed descriptors used in the qualitative behaviour assessment trial
Descriptor | Opinion | System | Video | Opinion*System | System*Video
[Figure residue: plot of the QBA descriptors on the principal components. Vertical axis: standardized PC2 (12.8% explained variance), ticks from -1 to 2. Legible descriptor labels include Calm, Depressed, Bored, Unsure, Frustrated, Scared, Relaxed, Fearful, Friendly, Distressed, Tense, Happy, Content, Nervous, Agitated, Confident, Active and Energetic. Caption not recovered.]
their scoring in all three trials. The number of positive social interactions was increased and the number of negative interactions decreased when the assessors were told that the pigs had been selected for high social breeding value (trial 1); the panting score allocated to cattle was higher when the observers were led to believe that the ambient temperature was higher than in reality (trial 2); and the QBA scoring indicated more positive and fewer negative emotions when the observers were told that the hens were kept in an organic instead of a conventional farm. Thus, misleading background information affected scores in all three trials, showing direct evidence of expectation bias by the assessors. Because we gathered additional pretest data quantifying observers’ views on animal welfare on organic farms, the QBA trial was able to provide the most convincing evidence that the bias was related to expectations of the student observers.
Although students are often involved in collecting data in animal behaviour, an obvious question in the present study is to what
argued that such subjective observer ratings may offer not only practical (e.g. money-saving) but also scientific advantages (e.g. for integrating multimodal information across time and situations, and for constructs that are otherwise difficult to assess such as pain). Moreover, she claims that such ratings can be both reliable and valid conditional upon careful experimental design to minimize observer bias.
Textbooks on behavioural recording methods therefore highlight the need to reduce observer bias by properly training assessors, by testing inter- and intraobserver reliability, and by randomly assigning observers to different groups or treatments (Dawkins, 2007; Lehner, 1979; Martin & Bateson, 1993). Another common recommendation is to use precise and unambiguous operational definitions of behavioural categories (Lerman et al., 2010). Very explicit criteria for each value on a scale make the ratings less dependent on personal experience or other factors that will make the judgement vary greatly between individuals (Meagher, 2009). Of the three scoring scales that were used in our study, the panting score was the most precisely defined. Although the indication of an elevated temperature did significantly increase the students’ average panting score, the absolute size of this effect was small. The specific descriptors on the scale may have forced the observers to check for the appropriate signs by closely observing the animals. This may have reduced any expectation bias effect. With regard to QBA it should be noted that asking observers to score one video clip in isolation as was done in this study is not representative of its methods. QBA trials normally involve comparative assessment of 20 video clips or more, within a specific investigative context presented by the experimenter, which allows for a more carefully weighted assessment of individual clips in relation to each other. Although expectation bias may still be present in observers, this design offers greater opportunity for managing the context in which clips are assessed.
Such recommendations and actions to reduce interobserver variation may not suffice, however, to prevent all possible sources of observer bias. For example, Hróbjartsson et al. (2013) reviewed 16 randomized clinical trials with subjective outcome assessment by both blinded and nonblinded assessors. The trial with the largest degree of observer bias actually had a very good interobserver agreement. A high inter- and intraobserver agreement is by no means a guarantee of absence of expectation bias, as observers may have been biased in the same direction because they hold similar expectations about the outcome of the study. Also, in our panting score trial we found no indication that expectation bias was lowest (i.e. small difference between the scores with the correct and elevated temperature indication) for those video clips that were scored similarly between students (as indicated by a low standard deviation).
Observer blinding, that is, withholding information about the differences between the animals or the experimental treatments from the observer, is one of the most effective and powerful solutions to avoid the conscious or unconscious suppositions of the assessor influencing the scoring outcomes. For example, a ‘blinded’ QBA study that compared pigs treated with the anxiolytic drug Azaperone with nontreated pigs in an elevated plus-maze found that observers, although unaware of the treatment, could clearly distinguish between the expressions of treated and untreated pigs, indicating that when observer bias is managed adequately, methods such as QBA can produce valuable results (Rutherford, Donald, Lawrence, & Wemelsfelder, 2012). Our findings thus call for the more widespread use of blind experiments in animal behaviour research whenever possible and for placing greater credence in results when observers were blinded as compared to not blinded. Withholding information from researchers and their data collection staff becomes an increasingly important part of preparing well-designed studies as the collection and recording of data become more subjective (Schulz et al., 2002). Although such blind trial designs are likely to increase the complexity and cost of research, the technology now available to record behaviour can in theory make this possible in most experimental trials (Boutron et al., 2007; Day & Altman, 2000).
Nevertheless, we do acknowledge that blinding the assessors can be difficult to achieve in certain research set-ups. Blinding observers may be hindered where treatment effects are easy to see or guess, or in less experimental research settings such as observational studies. An example in place is the on-farm monitoring of animal welfare. Farm animal welfare monitoring and auditing protocols increasingly apply behavioural observations and ratings similar to those used in the present study. There is growing consensus that such protocols ought to use animal-based welfare measures where possible (Anonymous, 2012; Blokhuis, Veissier, Miele, & Jones, 2010; Main, Whay, Leeb, & Webster, 2007). These measures are believed to be more directly linked to the true welfare status of the animals than resource-based measures that describe the housing and management conditions in which the animals are kept (Roe, Buller, & Bull, 2011; Webster, Main, & Whay, 2004), but may also involve a greater level of subjectivity. Animal welfare assessment schemes such as the Welfare Quality (2009) protocols not only have to be valid and reliable, they also ought to be feasible and cost-effective if a wide uptake by the food industry is to be achieved. The latter requirement favours assessments based on observer ratings that are relatively nonintrusive, inexpensive and can be used to integrate multimodal information across time and situations at the expense of tests that are perhaps more accurate but that are expensive, time consuming or dependent on laboratory or technical equipment (Meagher, 2009). With manpower being the main cost of such animal-based monitoring schemes, it is likely that farms are visited and assessed by only a single auditor. Consequently, interobserver reliability has been rightly considered in the selection of the welfare indicators and the training of auditors (Dalmau et al., 2010; Gibbons, Vasseur, Rushen, & de Pasillé, 2012; Phytian et al., 2013). In addition to these precautions, our findings point to the need to pay more attention to how the expectations and predispositions of the auditors may bias their recordings when they assess the welfare of the animals on the farms they visit.

Acknowledgments

We thank C. Moons, T. Decroos, J. Vander Linden and T. Martens for their help with the student trials; M. Levenson for English revision; D. Lapage for the data entry; and S. Buijs, S. De Campeneere and F. Wemelsfelder for commenting on the manuscript. We thank the veterinary students of Ghent University who participated in the trial.

References

Anonymous. (2012). EFSA recommends use of animal-based measures when assessing welfare. Veterinary Record, 170, 112.
Bergsma, R., Kanis, E., Knol, E. F., & Bijma, P. (2008). The contribution of social effects to heritable variation in finishing traits of domestic pigs (Sus scrofa). Genetics, 178, 1559-1570.
Blokhuis, H. J., Veissier, I., Miele, M., & Jones, B. (2010). The Welfare Quality® project and beyond: safeguarding farm animal well-being. Acta Agriculturae Scandinavica Section A: Animal Science, 60, 129-140.
Boutron, I., Guittet, L., Estellat, C., Mohar, D., Hróbjartsson, A., & Ravaud, P. (2007). Reporting methods of blinding in randomized trials assessing non-pharmacological treatments: a systematic review. PLoS Med, 4, e61.
Braeckman, J., & Boudry, M. (2011). De ongelovige thomas heeft een punt: Een handleiding voor kritisch denken. (The reasonableness of doubting Thomas: An invitation to critical thinking). Antwerp, Belgium: Houtekiet.
Burghardt, G. M., Bartmess-LeVasseur, J. N., Browning, S. A., Morrison, K. E., Stec, C. L., Zachau, C. E., et al. (2012). Perspectives: minimizing observer bias in behavioral studies: a review and recommendations. Ethology, 118, 511-517.
Dalmau, A., Geverink, N. A., Van Nuffel, A., Van Reenen, K., Hautekiet, V., Vermeulen, K., et al. (2010). Repeatability of lameness, fear and slipping scores to assess animal welfare upon arrival in pig slaughterhouses. Animal, 4, 804-809.
Dawkins, M. S. (2007). Observing animal behaviour: Design and analysis of quantitative data. Oxford, UK: Oxford University Press.
Day, S. J., & Altman, D. G. (2000). Statistics notes: blinding in clinical trials and other studies. British Medical Journal, 321, 504.
Gaughan, J. B., Mader, T. L., Holt, S. M., & Lisle, A. (2008). A new heat load index for feedlot cattle. Journal of Animal Science, 86, 226-234.
Gibbons, J., Vasseur, E., Rushen, J., & de Pasillé, A. M. (2012). A training programme to ensure high repeatability of injury scoring of dairy cows. Animal Welfare, 21, 379-388.
Goldstein, M. D., Hopkins, J. R., & Strube, M. J. (1994). ‘The eye of the beholder’: a classroom demonstration of observer bias. Teaching of Psychology, 21, 154-157.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: a meta-analysis. Psychological Methods, 4, 403-424.
Hróbjartsson, A., Thomsen, A. S. S., Emanuelsson, F., Tendal, B., Hilden, J., Boutron, I., et al. (2012). Observer bias in randomised clinical trials with binary outcomes: a systematic review of trials with both blinded and non-blinded outcome assessors. British Medical Journal, 344, e1119.
Hróbjartsson, A., Thomsen, A. S. S., Emanuelsson, F., Tendal, B., Hilden, J., Boutron, I., et al. (2013). Observer bias in randomized clinical trials with measurement scale outcomes: a systematic review of trials with blinded and non-blinded assessors. Canadian Medical Association Journal, 185, E201-E211.
Kaptchuk, T. J. (2001). The double-blind, randomized, placebo-controlled trial: gold standard or golden calf? Journal of Clinical Epidemiology, 54, 541-549.
Lehner, P. N. (1979). Handbook of ethological methods. New York: Garland STPM Press.
Lerman, D. C., Tetreault, A., Hovanetz, A., Bellaci, E., Miller, J., Karp, H., et al. (2010). Applying signal-detection theory to the study of observer accuracy and bias in behavioural assessment. Journal of Applied Behaviour Analysis, 43, 195-213.
Mader, T. L., Davis, M. S., & Brown-Brandl, T. (2006). Environmental factors influencing heat stress in feedlot cattle. Journal of Animal Science, 84, 712-719.
Main, D. C. J., Whay, H. R., Leeb, C., & Webster, A. J. F. (2007). Formal animal-based welfare assessment in UK certification schemes. Animal Welfare, 16, 233-236.
Marsh, D. M., & Hanlon, T. J. (2004). Observer gender and observer bias in animal behaviour research: experimental tests with red-backed salamanders. Animal Behaviour, 68, 1425-1433.
Martin, P., & Bateson, P. (1993). Measuring behaviour: An introductory guide (2nd ed.). Cambridge, UK: Cambridge University Press.
Meagher, R. K. (2009). Observer ratings: validity and value as a tool for animal welfare research. Applied Animal Behaviour Science, 119, 1-14.
Miller, L. E., & Stewart, M. E. (2011). The blind leading the blind: use and misuse of blinding in randomized controlled trials. Contemporary Clinical Trials, 32, 240-243.
Nalon, E., Maes, D., Van Dongen, S., van Riet, M. M. J., Janssens, G. P. J., Millet, S., et al. (2014). Comparison of the inter- and intra-observer repeatability of three gait-scoring scales for sows. Animal (in press). http://dx.doi.org/10.1017/S1751731113002462.
Nickerson, R. S. (1998). Confirmation bias: a ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175-220.
Page, M., Taylor, J., & Blenking, M. (2012). Context effects and observer bias: implications for forensic odontology. Journal of Forensic Sciences, 57, 108-112.
Phytian, C. J., Toft, N., Cripps, P. J., Michalaopolou, E., Winter, A. C., Jones, P. H., et al. (2013). Inter-observer agreement, diagnostic sensitivity and specificity of animal-based indicators of young lamb welfare. Animal, 7, 1182-1190.
Risinger, D. M., Saks, M. J., Thompson, W. C., & Rosenthal, R. (2002). The Daubert/Kumho implications of observer effects in forensic science: hidden problems of expectation and suggestion. California Law Review, 90, 1-56.
Roe, E., Buller, H., & Bull, J. (2011). The performance of farm animal assessment. Animal Welfare, 20, 69-78.
Rosenthal, R. (1966). Experimenter effects in behavioral research. New York: Appleton-Century-Crofts.
Rutherford, K. M. D., Donald, R. D., Lawrence, A. B., & Wemelsfelder, F. (2012). Qualitative behavioural assessment of emotionality in pigs. Applied Animal Behaviour Science, 139, 218-224.
Schulz, K. F., Chalmers, I., & Altman, D. G. (2002). The landscape and lexicon of blinding in randomized trials. Annals of Internal Medicine, 136, 254-259.
Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273, 408-412.
Schulz, K. F., & Grimes, D. A. (2002). Blinding in randomised trials: hiding who got what. Lancet, 359, 696-700.
Tuyttens, F. A. M., Sprenger, M., Van Nuffel, A., Maertens, W., & Van Dongen, S. (2009). Reliability of categorical versus continuous scoring of welfare indicators: lameness in cows as a case study. Animal Welfare, 18, 399-405.
Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129-140.
Webster, A. J. F., Main, D. C. J., & Whay, H. R. (2004). Welfare assessment: indices from clinical observations. Animal Welfare, 13, S93-S98.
Welfare Quality. (2009). Welfare Quality® assessment protocol for poultry (broilers, laying hens). Lelystad, Netherlands: Welfare Quality® Consortium.
Wemelsfelder, F. (2007). How animals communicate quality of life: the qualitative assessment of animal behavior. Animal Welfare, Supplement, 16, 25-31.
Wemelsfelder, F., Hunter, E. A., Lawrence, A. B., & Mendl, M. T. (2001). Assessing the ‘whole-animal’: a free-choice-profiling approach. Animal Behaviour, 62, 209-220.