
Animal Behaviour 90 (2014) 273-280


Animal Behaviour
journal homepage: www.elsevier.com/locate/anbehav

Commentary

Observer bias in animal behaviour research: can we believe what we score, if we score what we believe?
F. A. M. Tuyttens a, b, *, S. de Graaf a, J. L. T. Heerkens a, L. Jacobs a, E. Nalon a, b, S. Ott b, c,
L. Stadig a, E. Van Laer a, B. Ampe a
a Animal Sciences Unit, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
b Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
c Faculty of Bioscience Engineering, Katholieke Universiteit Leuven, Belgium

Article info

Article history:
Received 28 October 2013
Initial acceptance 19 November 2013
Final acceptance 14 January 2014
Available online 12 March 2014
MS. number: 13-00901

Keywords: animal welfare; behaviour scoring; cognitive bias; confirmation bias; double-blind experiment; expectation bias; information bias; observer bias; panting score; qualitative behaviour assessment

Most observers in behaviour studies are aware of relevant information about the animals being observed. We investigated whether observer expectations influence subjective scoring methods during a class practicum. Veterinary students were trained in recording negative and positive interactions between pigs, in scoring the degree of panting in cattle and in applying qualitative behaviour assessment (QBA) using a fixed set of terms for assessing hens’ behaviour. The students applied these methods in three trials in which they were shown duplicated video recordings of the same animals: the original and a slightly modified version (to prevent recognition at second viewing). When scoring the duplicated recordings they were told either correct or false information about the conditions in which the animals had been filmed. The false information reflected plausible study scenarios in ethology and was used to create expectations about the outcome. As in reality the students scored the identical behaviour twice, the difference in the scores for the original and modified recordings reflects expectation bias due to providing different contextual information. In all trials there was evidence of expectation bias: students scored the ratio of positive to negative interactions higher when told that the observed pigs had been selected for high social breeding value, they scored cattle panting higher when told that the ambient temperature was 5 °C higher than in reality, and in the QBA they indicated more positive and fewer negative emotions when told that the hens were from an organic instead of a conventional farm. The magnitude of the bias in the QBA trial was related to the opinion of the students about hen welfare in organic versus conventional farms. Although veterinary students may not be representative of practising ethologists, these findings do indicate that observer bias could influence subjective scores of animal behaviour and welfare.
© 2014 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.
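
The idea that the difference between two scorings of identical footage reflects expectation bias can be illustrated with a minimal sketch. It is not the authors' analysis (they report mixed regression models fitted in R); the function name and all scores below are hypothetical, and the sketch only shows the paired-difference logic under the assumption that each observer scores the same recording once with correct and once with false information.

```python
from statistics import mean


def expectation_bias(scores_correct_info, scores_false_info):
    """Mean within-observer difference: false-information score minus
    correct-information score for the same footage (hypothetical helper)."""
    if len(scores_correct_info) != len(scores_false_info):
        raise ValueError("each observer needs one score per condition")
    diffs = [f - c for c, f in zip(scores_correct_info, scores_false_info)]
    return mean(diffs)


# Hypothetical scores (0-100 mm scale) from four observers for identical footage.
correct = [40, 55, 30, 62]
false_info = [46, 58, 35, 64]
print(expectation_bias(correct, false_info))
```

A value above zero in this sketch would mean that, on average, the observers scored the footage higher when given the false information; a value near zero would suggest no systematic shift.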

* Correspondence: F. Tuyttens, Animal Sciences Unit, Institute for Agricultural and Fisheries Research (ILVO), Scheldeweg 68, 9090 Melle, Belgium.
E-mail address: frank.tuyttens@ilvo.vlaanderen.be (F. A. M. Tuyttens).

http://dx.doi.org/10.1016/j.anbehav.2014.02.007
0003-3472/© 2014 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.

Scientific research that relies on observation and interpretation by the investigator has long been confronted with a well-known and fundamental problem: humans cannot be assumed always to process information objectively and accurately. Natural selection has shaped the human sensory processing system to promote behaviour that enhances the spread of our genes, not necessarily to provide us with a complete and correct picture of reality. There is ample evidence that human perception can be selective and biased, and that a healthy human brain often makes incorrect associations and deductions (Braeckman & Boudry, 2011). For example, psychologists have long recognized that people are prone to expectation bias, which refers to the psychological sway towards one opinion versus another as a result of possessing information extraneous to the task at hand (Page, Taylor, & Blenking, 2012). Information that confirms one’s beliefs or hypotheses is often favoured (Nickerson, 1998; Wason, 1960). Investigators and research staff are also susceptible to these pitfalls of the human brain. Often, they carry out experiments while they are predisposed by strong expectations about the outcome and deep-rooted assumptions about what is and what is not possible. These expectations may lead to conscious or unconscious biases in observation and recording of data.

The risk of these types of observer bias can be reduced by ensuring that the person collecting the data is unaware of which treatment each subject has received until after the experiment. Such blind trials are widely considered as the best study design to minimize observer bias and are often required to attain regulatory approval for new drugs, dietary supplements and medical
devices (Kaptchuk, 2001; Miller & Stewart, 2011). Meta-analyses have convincingly shown that randomized trials have substantially larger treatment effects if the assessor is not blinded (Hróbjartsson et al., 2012, 2013; Schulz, Chalmers, Hayes, & Altman, 1995). Nevertheless, the use of nonblinded assessors in nonpharmacological trials remains common (Hróbjartsson et al., 2013; Schulz et al., 1995), particularly in animal behaviour studies. Burghardt et al. (2012) reviewed several hundred articles published in five leading animal behaviour journals during the last five decades. They found that, despite the numerous and widely used texts on research methods in animal behaviour that advocate researchers should minimize potential observer bias in their studies, only 6.3% of the empirical research articles reported that at least one component of the research was conducted blind. This percentage was much higher in two more human-focused comparison journals that publish research based on similar behavioural observations and coding strategies (25% for Behavioral Neuroscience and 47.5% for Infancy). We checked the 2012 volumes of Animal Behaviour and Applied Animal Behaviour Science; only 15.3% (37/242) and 9.9% (13/131) of the papers for which we judged it relevant reported that they were conducted blind.

Marsh and Hanlon (2004) remarked that while the potential for observer biases in animal behaviour research has often been discussed, there have been few quantitative analyses of the kinds of biases that may affect behavioural data. We cannot find any scientific reason to justify the limited attention being paid to minimizing observer bias in animal behaviour science. Observer bias is particularly likely when the investigator has strong preconceptions or a vested interest in the outcome, when the underlying data are ambiguous and when the scoring method is subjective (Hoyt & Kerns, 1999; Page et al., 2012; Risinger, Saks, Thompson, & Rosenthal, 2002; Rosenthal, 1966; Schulz, Chalmers, & Altman, 2002). In our opinion, all three predisposing factors are commonly present in animal behaviour research.

In this study, we investigated the potential of observer bias in animal behaviour studies. Veterinary medicine students unknowingly participated in an experiment during a class practicum. The students were first given a brief demonstration of several subjective (in the sense that they rely on an individual’s perception and judgement, and can therefore be influenced by experience or personal views, cf. Meagher, 2009) animal-based scoring methods commonly used in ethology. They then received a short training session on using these methods. Last, they applied the methods to score video clips of farm animals that they were led to believe had been subjected to different conditions. In reality, however, they scored the same video clips twice, with the second clip being slightly modified to trick the students into thinking that the videos were of different animals subjected to the mock conditions. The false information was specifically chosen to create expectations among the students about the outcome of the observations. The research objective was to calculate the magnitude of observer bias as the difference between the first and second scoring of the video clips. In addition to this research objective, the practicum was also designed to meet two educational goals: to learn and practise a limited but diverse set of scoring methods used in ethological research, and to raise the students’ awareness about observer bias by allowing them to experience how prior expectations affected their scoring. Goldstein, Hopkins, and Strube (1994) argued that a classroom demonstration in which the students personally experience the powerful effects of previous expectations on perception should improve their learning and memory about observer bias. We do not include that aspect of this study in the current paper and report on the research objective only.

METHODS

Subjects and Overall Experimental Set-up

The trials were conducted using third-year veterinary medicine students at Ghent University during the practicum of the Ethology, Ethics and Animal Welfare course. The students (N = 157) were informed that the aim of the practicum was to learn some techniques and methods commonly used in ethological research: (trial 1) identifying negative and positive social interactions in pigs; (trial 2) scoring panting as an indicator of heat stress in cattle; (trial 3) qualitative behaviour assessment (QBA) of laying hens. For logistic reasons, the training and trials were based on photographs and video recordings rather than live observations. These video recordings were obtained from recent or ongoing research projects approved by the ILVO ethical committee for experiments on animals. All methods concerned animal-based measures that were subjective in that they required some interpretation and judgement by the students when giving scores or values to their observations of the animals. The students were not informed about the research aim of the practicum, namely to investigate the extent to which they were affected by misleading background information when applying the three scoring methods they had learned.

After the students had been informed about the content and organization of the practicum, they were randomly split into two groups of approximately equal size. For the training and the three trials group A (N = 80) remained in the auditorium, whereas group B (N = 77) moved to another nearby auditorium. Each trial was chaired by a different designated pair of researchers and lasted 30-40 min followed by a 10 min break. The order of the sessions differed to enable the designated researchers to repeat their trial in both places with both groups. After the scoring sessions, the two groups were reunited for a plenary session during which they were informed about the hidden research aim of the practicum and the reason why we investigated this subject (i.e. to raise awareness about the pitfalls of their own senses and brain when applying subjective animal-based measures for assessing animal behaviour and welfare).

Trial 1: Negative and Positive Social Interactions

The students were taught how to recognize negative and positive social interactions in fattening pigs (using video recordings of various interactions) and were briefly informed about the relevance of these interactions for animal welfare and farm management. They were taught an ethogram that included four types of negative interactions (head butt, belly nosing, ear or tail biting, biting other body parts) and three types of positive interactions (play, sniffing/nosing the body of a pen mate, nose-to-nose contact). During a video-based training session, the students then practised counting social interactions, differentiating between positive and negative ones.

Subsequently, for the trial proper, the students were instructed to tally each negative and positive social interaction in a pen of six fattening pigs during two 5 min video clips. Both video clips were the same, but the second clip was shown in a mirror image, its brightness was slightly adjusted and fictional pen numbers and dates were shown. The goal of showing the same clip twice (once in the original version and once in the slightly modified version) was to mislead the students into believing that it concerned two different groups of pigs from an experiment on social breeding value and housed in an opposite pen of the same pig stable. The theory of social breeding value, in which pigs with a high social breeding value have a positive effect on the growth of their pen mates (Bergsma, Kanis, Knol, & Bijma, 2008), was explained. The
students were falsely informed that one of the video clips was of a control group and the other of a group selected for high social breeding value, and that the aim of the trial was to test whether the ratio of positive to negative social interactions differed between the two clips. Group A was told that the original clip was of the high breeding value group and the subsequent modified clip was the control group; the opposite was told to group B. The true research aim, however, was to test whether telling the students that one of the clips was of high social breeding value pigs affected the number of negative and positive interactions they recorded.

Trial 2: Panting Score

The students were informed about thermoregulation and about the signs and consequences of thermal stress in cattle. The panting score was introduced as a measure of heat stress based on behavioural signs including the rate and depth of respiration and the amount of drooling (Gaughan, Mader, Holt, & Lisle, 2008; Mader, Davis, & Brown-Brandl, 2006). They were told how to score panting using a tagged visual analogue scale labelled with descriptors of an increasing degree of panting (Fig. 1). Such tagged visual analogue scales do not limit the precision and sensitivity with which observers can distinguish different degrees of severity of the condition concerned. They have been shown to have equal or superior interobserver reliability in comparison to ordinal scales for scoring lameness (Nalon et al., in press; Tuyttens, Sprenger, Van Nuffel, Maertens, & Van Dongen, 2009). During the training phase, the students were shown six video clips of cattle with varying degrees of panting.

Subsequently, the students were instructed to score 24 video clips of cattle using the tagged visual analogue scale. During this trial they could use a printed hand-out of the scale with the different descriptors (Fig. 1). A vertical coloured bar at the right-hand side of each video clip, as well as a video stamp in the upper right-hand corner, indicated ambient temperature at the time of recording. In half of the video clips, however, not the true temperature but a 5 °C higher temperature was indicated. Seven video clips were shown with the correct temperature to group A and with the elevated temperature to group B, and seven other video clips were shown with the correct temperature to group B and with the elevated temperature to group A. In addition, five video clips were shown twice, once with the correct temperature and once with the elevated temperature, to both groups. To reduce the risk of students recognizing the video clip when viewing it for the second time, one clip was shown as a mirror image and was slightly modified (by zooming in or out and by adjusting the brightness or image contrast). We tested whether the students recorded a higher panting score when an elevated ambient temperature was falsely indicated on the clips.

Trial 3: Qualitative Behaviour Assessment

QBA was introduced to the students as an integrative methodology that characterizes animal behaviour as a dynamic, expressive body language (Wemelsfelder, 2007; Wemelsfelder, Hunter, Lawrence, & Mendl, 2001). It was explained that it was originally developed as a free-choice profiling in which the assessors can generate their own descriptors for describing the expressive quality of how animals behave and how they interact with each other and the environment. For the purpose of the practicum, however, we worked with a fixed list of descriptors as in the Welfare Quality (2009) protocol in which QBA is used as an animal welfare measure at group level. After the hens were observed (from video for the sake of this practical) their behavioural expression was scored on a visual analogue scale for each of the 19 fixed qualitative descriptors. For each term (such as ‘calm’, ‘content’ and ‘frustrated’; see Table 1 for all terms) we used a 100 mm visual analogue scale defined by the left-hand point, which represented ‘minimum’ (i.e. the expressive quality indicated by the term is entirely absent in any of the animals seen) and the right-hand point which represented ‘maximum’ (i.e. the expressive quality is dominant across all observed animals).

For the trial proper we used a 5 min video recording of laying hens in a conventional commercial aviary. This 5 min video was split into two clips (clip A and clip B). One of these video clips (clip B) was mirrored and slightly modified (made brighter) to give the impression that it was filmed on a different farm from clip A. The students were told that one of the video clips (clip B for group A and clip A for group B) was from an organic laying hen farm. Prior to the QBA trial, the main differences between organic and conventional egg production were briefly explained. To test their pretest attitude towards organic farming, students were instructed to indicate on a 100 mm visual analogue scale whether, in their opinion, hen welfare was much worse (left side of the scale), equal (midpoint) or much better (right side of the scale) on organic versus conventional farms. Subsequently, to explain what they would see in the video clips, the students were shown a demonstration video of hens from a different farm but with a similar housing system. The layout of the

[Figure 1: horizontal scale from 0 to 100 mm with tick marks every 10 mm and a panting score (PS) axis marked 0 to 5. Descriptors along the scale, from left to right:]
  No panting, normal respiration, difficult to see chest movement.
  Slight panting, mouth closed, no drool, easy to see chest movement.
  Fast panting, drool present, no open mouth.
  As for 2 but occasional open mouth panting, tongue not extended.
  Open mouth and excessive drooling, neck extended, head held up.
  As for 3 but with the tongue out slightly and occasionally fully extended for short periods.
  Open mouth with tongue fully extended for prolonged periods with excessive drooling, neck extended and head up.
  As for 4 but head held down, cattle “breathe” from the flank, drooling may cease.

Figure 1. The 100 mm tagged visual analogue scale labelled with descriptors used for scoring panting in cattle during trial 2.
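
The clip allocation described for trial 2 (seven clips with the correct temperature for group A and the elevated one for group B, seven with the reverse, and five shown twice to both groups) can be checked with a short sketch. The clip labels and the function name are hypothetical; only the counts come from the text, and the order of presentation is not modelled.

```python
# Hypothetical clip labels; only the set sizes (7, 7 and 5) come from the text.
ONCE_SET_1 = [f"S1_{i}" for i in range(1, 8)]  # correct for group A, elevated for group B
ONCE_SET_2 = [f"S2_{i}" for i in range(1, 8)]  # correct for group B, elevated for group A
TWICE_SET = [f"T_{i}" for i in range(1, 6)]    # shown twice to both groups


def viewings(group):
    """Return (clip, temperature indication) pairs seen by group 'A' or 'B'."""
    if group not in ("A", "B"):
        raise ValueError("group must be 'A' or 'B'")
    first, second = ("correct", "elevated") if group == "A" else ("elevated", "correct")
    plan = [(clip, first) for clip in ONCE_SET_1]
    plan += [(clip, second) for clip in ONCE_SET_2]
    for clip in TWICE_SET:
        # Each of these clips is seen once per temperature indication.
        plan += [(clip, "correct"), (clip, "elevated")]
    return plan


for group in ("A", "B"):
    plan = viewings(group)
    n_elevated = sum(1 for _, label in plan if label == "elevated")
    n_unique = len({clip for clip, _ in plan})
    print(group, len(plan), n_elevated, n_unique)
```

Under these assumptions each group sees 7 + 7 + 2 x 5 = 24 clip presentations of 19 distinct recordings, half of them with the elevated temperature indication, which agrees with the numbers given in the text.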
aviary housing system (litter area, elevated platforms with nestboxes, feeding chains and drinker lines) was explained. After this explanation, the students then watched and scored video clip A first and then clip B. We tested whether the QBA terms were scored differently if the students had been told that the hens were filmed on an organic farm and whether this difference was related to their opinion about hen welfare on organic as compared to conventional farms.

Statistical Analysis

The results of all trials were analysed using a mixed regression model with a random effect for student to correct for repeated measures. In each model, treatment (prior knowledge: e.g. control or high breeding value group, true or elevated temperature indication, conventional or organic production system), group (A or B) and video clip (original or modified) and all interactions were introduced as fixed effects. Interactions and fixed factors (except main effect of treatment and video clip) were removed from the final model if the estimated effect was not significant. The model assumptions were evaluated using graphical procedures (normality of the residuals was evaluated with a histogram and QQ-plot and the homogeneity of variance was evaluated on a plot of the residuals versus fitted values).

For the panting score trial, the data set was divided into two parts: (1) the 14 video clips that were shown only once in the same group combined with the first video clip from those shown twice to both groups (five video clips); and (2) the five video clips that were shown twice to both groups. The unique video clip number or true temperature was also introduced in the model. For the 19 video clips of the first part, we also tested whether the difference between the scores with correct and elevated temperature indication was associated with the corresponding standard deviation of the scores (as an indication of the variation between observers when scoring the same clip). A large standard deviation could indicate that these clips were ambiguous and therefore hard to score consistently by the students.

For the QBA trial, the QBA terms were handled using principal component analysis (PCA), with no rotation, using the princomp function in R 2.15.1 (The R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org). The first two PCA components and the individual QBA terms were analysed using a mixed regression model.

All tests were two tailed at a significance level of 5% and all calculations were performed using the lme function from the nlme package in R 2.15.1 for Windows.

RESULTS

Trial 1: Negative and Positive Social Interactions

A higher number of positive (ANOVA: $F_{1,153} = 29.6$, P < 0.001) and fewer negative (ANOVA: $F_{1,153} = 75.7$, P < 0.001) social interactions were counted when the students were informed that the pigs had been selected for high social breeding value as compared to the control pigs (Fig. 2). Modifying the video clips itself did not affect the recorded number of positive (ANOVA: $F_{1,153} = 1.6$, P = 0.212) or negative interactions (ANOVA: $F_{1,153} = 0.6$, P = 0.437). Also, no differences between the two groups of students were found (linear regression: $F_{1,153} = 1.0$, P = 0.309 and $F_{1,153} = 0.00$, P = 0.978).

[Figure 2: bar charts of the number of scored behaviours (y-axis, 0 to 12) for the control and SBV+ clips; panel (a) positive social behaviour, panel (b) negative social behaviour; the difference is marked *** in both panels.]

Figure 2. The mean ± SE number of (a) positive and (b) negative behaviours scored during the 5 min video clip of the control group and the video clip where the students were told that the animals were selected for high social breeding value (SBV+). ***P < 0.001.

Trial 2: Panting Score

For the 19 cattle recordings that were shown to each group only once, the students that had been led to believe that ambient temperature was 5 °C higher than in reality recorded a higher panting score on average than the students that had been shown the real temperature (ANOVA: $F_{1,2772} = 8.5$, P = 0.004; Fig. 3). The magnitude of this difference was small, however (3.1% at 20 °C). The direction of the difference and its magnitude were not consistent for the various video clips (Fig. 3). The average panting score increased with the true ambient temperature at the time of video recording (linear regression: $F_{1,2772} = 984.0$, P < 0.001), but did not differ between the two groups of students (linear regression: $F_{1,153} = 0.12$, P = 0.729). There was a nonsignificant trend for the difference between the panting score allocated to the clips with the indication of an elevated temperature versus that for the real temperature to decrease with increasing true ambient temperature (ANOVA: $F_{1,2772} = 3.0$, P = 0.081). This difference was also not related to the standard deviation of the scores (as an indicator of the amount of variation between observers when scoring the same clips; ANOVA: $F_{1,17} = 0.7$, P = 0.41).

For the five cattle recordings that had been shown twice to each group (once with the correct temperature and once with an elevated temperature of 5 °C), the panting score was 5.0% higher (at 20 °C) on average for the indication of an elevated temperature versus the real temperature (ANOVA: $F_{1,1384} = 54.6$, P < 0.001; Fig. 4). The average panting score increased with the true ambient temperature at the time of video recording (linear regression: $F_{1,619} = 1070.3$, P < 0.001), but did not differ between the two student groups (linear regression: $F_{1,153} = 1.1$, P = 0.287). The difference between the panting score allocated to the clips with the indication of an elevated temperature versus the real temperature
[Figure 3: bar chart of panting score (y-axis, 0 to 100) for video nos 1 to 14, with bars for the real and the elevated temperature indication; real temperature (°C) per video, in order: 19, 20, 20, 20, 20, 23, 23, 23, 23, 23, 32, 32, 32, 34.]

Figure 3. The mean + SE panting score (0-100 visual analogue score) given to the 14 video clips of cattle shown to one group of students with the correct ambient temperature indication (real temperature) versus the other group of students with an elevated temperature indication (elevated temperature). The real ambient temperature at the time of filming is indicated at the bottom of the figure. *P < 0.05; **P < 0.01.
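
The mean + SE summaries plotted per clip can be illustrated with a minimal sketch. The scores and the helper name below are hypothetical, and the sketch only computes descriptive statistics for two independent groups scoring one clip; it does not reproduce the mixed regression models used for the reported tests.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(scores):
    """Return (mean, standard error of the mean) for at least two scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Hypothetical panting scores (0-100) for one clip, one list per student group.
real_label = [20, 24, 22, 26]      # group shown the real temperature
elevated_label = [23, 27, 25, 29]  # group shown the elevated temperature

m_real, se_real = mean_and_se(real_label)
m_elev, se_elev = mean_and_se(elevated_label)
print(m_real, round(se_real, 2))
print(m_elev, round(se_elev, 2))
print(m_elev - m_real)  # difference between the two group means for this clip
```

The standard error here is the sample standard deviation divided by the square root of the number of observers, which is one common convention and an assumption about how the plotted error bars were obtained.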

decreased with increasing true ambient temperature (linear regression: $F_{1,767} = 22.6$, P < 0.001).

[Figure 4: bar chart of panting score (y-axis, 0 to 100) for video nos 15 to 19, with bars for the real and the elevated temperature indication; real temperature (°C) per video, in order: 20, 23, 29, 32, 32.]

Figure 4. The mean + SE panting score (0-100 visual analogue score) given to the five video clips of cattle shown twice to all students: once with the correct ambient temperature indication (real temperature) and once slightly modified and with an elevated temperature indication (elevated temperature). The real ambient temperature at the time of filming is indicated at the bottom of the figure. *P < 0.05; **P < 0.01; ***P < 0.001.

Trial 3: Qualitative Behaviour Assessment

The students had a rather positive opinion about laying hen welfare on organic versus conventional egg production units, as indicated by their average score being above the neutral point of the scale (50 mm). The opinion of female students was more positive than that of male students (females: 66 mm (95% CI: 63.3;67.8); males: 58 mm (95% CI: 54.8;62.11); ANOVA: $F_{1,153} = 10.6$, P = 0.001). When the students were told that the hens were video recorded on an organic farm they gave, or tended to give, a lower score to all negative descriptors and a higher score to all positive descriptors with the exception of the descriptor ‘active’ (effect of system in Table 1). The size of these effects increased along with a more positive opinion about hen welfare in organic versus conventional systems (opinion*system interactions in Table 1). The scores for most terms also differed between video clips A and B (effect of video in Table 1). This was surprising, because both video clips were recorded in the same aviary at almost exactly the same time and without any clear disturbance of the flock during recording.

Table 1
Estimates and significance levels of linear mixed-effects models on each of the 19 fixed descriptors used in the qualitative behaviour assessment trial

Descriptor            Opinion    System     Video      Opinion*System   System*Video
Active                0.1        3.6        9.3**      0.1              1.4
Relaxed               0.3*       48.4***    2.0        0.8***           12.8*
Fearful               0.3*       27.6**     2.9        0.5***           0.7
Agitated              0.1        26.6*      2.4        0.5**            8.2
Confident             0.3*       37.7***    12.0***    0.7***           9.0(*)
Depressed             0.3*       32.4***    2.2        0.5***           8.2
Calm                  0.2        25.6*      8.3*       0.5**            1.5
Content               0.3**      41.5***    9.5**      0.7***           14.0**
Tense                 0.3*       58.1***    7.4*       0.9***           17.5**
Unsure                0.3*       33.2**     7.6*       0.6***           7.8
Energetic             0.3*       19.3(*)    12.0***    0.4**            1.1
Frustrated            0.3*       57.8***    7.8*       0.9***           18.0**
Bored                 0.3*       45.4***    10.4**     0.8***           10.6(*)
Friendly              0.4**      32.7***    1.5        0.6***           1.1
Positively occupied   0.3*       39.0***    6.3*       0.8***           7.1
Scared                0.2(*)     17.4(*)    2.4        0.4**            3.0
Nervous               0.1        51.6***    9.4*       0.9***           12.6*
Happy                 0.4***     55.2***    7.4*       0.9***           6.6
Distressed            0.2        63.2***    13.1***    1.0***           20.1**

Variables included in the models are self-reported opinion about hen welfare in organic versus conventional production systems (Opinion), whether the students were deceived into believing that the video was recorded in an organic versus conventional farm (System), video clip A or B (Video), and interactions between Opinion and System and between System and Video.
(*)P < 0.1; *P < 0.05; **P < 0.01; ***P < 0.001.

The PCA was carried out to elucidate the relationship between the 19 descriptors and revealed two major components. The first component explained 42.9%, and the second component 12.8%, of
the total variance. Other components explained only a small part of the variance and are not considered further for analysis. The loadings (magnitude and direction) of the different descriptors to both components are shown in Fig. 5. In general, the first component seems to describe the affective states of the hens (the higher the score, the more positive the affective state), whereas the second component seems to describe the level and type of activity (the lower the score, the more active and energetic the hens). The students’ score for component 1 was higher if they had been told that the video was recorded on an organic farm instead of a conventional farm (ANOVA: $F_{1,148} = 54$, P < 0.001; Fig. 5). The magnitude of this effect was positively correlated with the students’ opinion about hen welfare in organic versus conventional farms (ANOVA: opinion*system interaction: $F_{1,148} = 75.9$, P < 0.001). Deceiving the students into believing that the hens were filmed on an organic

[Figure 5: scatter plot of standardized PC1 (42.9% explained variance; x-axis, -2 to 2) against standardized PC2 (12.8% explained variance; y-axis, -2 to 2), with the loadings of the 19 descriptors and two data ellipses.]

Figure 5. A plot of scores showing the relation between principal component 1 and principal component 2 and the loadings of the 19 fixed descriptors used in the qualitative behaviour assessment session. Diamonds indicate the scores for students who were correctly informed that the video was recorded in a conventional aviary. Rectangles indicate the scores for students who were told that the video was recorded in an organic aviary system. The black (conventional) and grey (organic) ellipses are the normal data ellipses which contain 70% of the data of the two groups.

their scoring in all three trials. The number of positive social interactions was increased and the number of negative interactions decreased when the assessors were told that the pigs had been selected for high social breeding value (trial 1); the panting score allocated to cattle was higher when the observers were led to believe that the ambient temperature was higher than in reality (trial 2); and the QBA scoring indicated more positive and fewer negative emotions when the observers were told that the hens were kept in an organic instead of a conventional farm. Thus, misleading background information affected scores in all three trials, showing direct evidence of expectation bias by the assessors. Because we gathered additional pretest data quantifying observers’ views on animal welfare on organic farms, the QBA trial was able to provide the most convincing evidence that the bias was related to expectations of the student observers.

Although students are often involved in collecting data in animal behaviour, an obvious question in the present study is to what extent the documented observer biases of undergraduate veterinary science students are similar to the biases of practising ethologists. Probably in most research the data recorders will be better trained as observers and therefore have more experience than the students used in our trials. Although we are not aware of studies that have shown that observer training successfully reduces expectation bias, the possibility exists that the observed biases would be less prevalent in practising scientists than in our sample of university students. Undergraduate students may be especially prone to see, or even pretend to see, what they are told to expect. On the other hand, undergraduate students may not be as strongly personally committed to particular views of reality as experienced scientists who might have firm expectations and large investments in particular outcomes of their research (Marsh & Hanlon, 2004). Indeed, research hypotheses are often also personal prophesies which most researchers like to see come true in order to strengthen confidence in their scientific insights (Lehner, 1979, p. 129). This could make experienced researchers particularly subject to prediction or confirmatory bias. Moreover, misleading the students by providing them with false contextual information about the treatments may not evoke equally strong expectations as in real situations. For example, the effect of different thermal conditions when scoring the degree of panting in cattle is likely to be stronger when the observer is personally exposed to these conditions during live scoring in the field instead of being given this information when scoring from video.

Our three trials with farm animals all required some degree of interpretation and subjective judgement when recording or scoring the behavioural observations from video. As a consequence, these
trials are not representative for trials with clearly objective mea-
farm did not affect component 2 (ANOVA: F1,148 ¼ 0.4, P ¼ 0.504). surement scale outcomes which leave little room for bias. Examples
of such measures could be the measurement of body temperature
using a thermometer, the concentration of cortisol in plasma as
DISCUSSION analysed in the laboratory or the automated recording of activity
level using pedometers. Although some would like science to focus
This study has demonstrated that observer bias is likely to be a exclusively on empirical facts that can be observed and quantified
more important problem in applied animal behaviour science than objectively, in reality many recording methods and scales used in
is often realized. Observer bias was evaluated in the three student animal behaviour science are subjective to some extent. Gradations
trials by manipulating the contextual information about the ani- or categories in the outcome of behavioural studies as well as
mals they observed and scored. The false contextual information clinical inspections are rarely cut-and-dried or easy to define in a
was intended to simulate a variety of plausible, realistic and precise way. Recording methods often require subjective judge-
contemporary study scenarios that an animal behaviour researcher ments by the assessor to categorize behaviours (e.g. differentiating
may encounter. In addition, these scenarios were deliberately between play and aggression in ethograms), to interpret the
chosen to evoke expectations about the likely scoring outcomes. outcome of behavioural interactions (e.g. indicating the winners
We were specifically interested in evaluating the extent to which and losers of agonistic interactions for calculating social hierarchy),
such a priori expectations influence the interpretation and scoring to characterize the personality of an animal, or to quantify the
of animal behaviour. Although in reality the students unknowingly severity of a physical or clinical condition (e.g. locomotion scoring
scored the same (group of) animals twice, their beliefs influenced for assessing the severity of lameness). Meagher (2009) has even
argued that such subjective observer ratings may offer not only practical (e.g. money-saving) but also scientific advantages (e.g. for integrating multimodal information across time and situations, and for constructs that are otherwise difficult to assess such as pain). Moreover, she claims that such ratings can be both reliable and valid, conditional upon careful experimental design to minimize observer bias.

Textbooks on behavioural recording methods therefore highlight the need to reduce observer bias by properly training assessors, by testing inter- and intraobserver reliability, and by randomly assigning observers to different groups or treatments (Dawkins, 2007; Lehner, 1979; Martin & Bateson, 1993). Another common recommendation is to use precise and unambiguous operational definitions of behavioural categories (Lerman et al., 2010). Very explicit criteria for each value on a scale make the ratings less dependent on personal experience or other factors that will make the judgement vary greatly between individuals (Meagher, 2009). Of the three scoring scales that were used in our study, the panting score was the most precisely defined. Although the indication of an elevated temperature did significantly increase the students’ average panting score, the absolute size of this effect was small. The specific descriptors on the scale may have forced the observers to check for the appropriate signs by closely observing the animals. This may have reduced any expectation bias effect. With regard to QBA it should be noted that asking observers to score one video clip in isolation as was done in this study is not representative of its methods. QBA trials normally involve comparative assessment of 20 video clips or more, within a specific investigative context presented by the experimenter, which allows for a more carefully weighted assessment of individual clips in relation to each other. Although expectation bias may still be present in observers, this design offers greater opportunity for managing the context in which clips are assessed.

Such recommendations and actions to reduce interobserver variation may not suffice, however, to prevent all possible sources of observer bias. For example, Hróbjartsson et al. (2013) reviewed 16 randomized clinical trials with subjective outcome assessment by both blinded and nonblinded assessors. The trial with the largest degree of observer bias actually had a very good interobserver agreement. A high inter- and intraobserver agreement is by no means a guarantee of absence of expectation bias, as observers may have been biased in the same direction because they hold similar expectations about the outcome of the study. Also, in our panting score trial we found no indication that expectation bias was lowest (i.e. small difference between the scores with the correct and elevated temperature indication) for those video clips that were scored similarly between students (as indicated by a low standard deviation).

Observer blinding, that is, withholding information about the differences between the animals or the experimental treatments from the observer, is one of the most effective and powerful solutions to avoid the conscious or unconscious suppositions of the assessor influencing the scoring outcomes. For example, a ‘blinded’ QBA study that compared pigs treated with the anxiolytic drug azaperone with nontreated pigs in an elevated plus-maze found that observers, although unaware of the treatment, could clearly distinguish between the expressions of treated and untreated pigs, indicating that when observer bias is managed adequately, methods such as QBA can produce valuable results (Rutherford, Donald, Lawrence, & Wemelsfelder, 2012). Our findings thus call for the more widespread use of blind experiments in animal behaviour research whenever possible and for placing greater credence in results when observers were blinded as compared to not blinded. Withholding information from researchers and their data collection staff becomes an increasingly important part of preparing well-designed studies as the collection and recording of data become more subjective (Schulz et al., 2002). Although such blind trial designs are likely to increase the complexity and cost of research, the technology now available to record behaviour can in theory make this possible in most experimental trials (Boutron et al., 2007; Day & Altman, 2000).

Nevertheless, we do acknowledge that blinding the assessors can be difficult to achieve in certain research set-ups. Blinding observers may be hindered where treatment effects are easy to see or guess, or in less experimental research settings such as observational studies. A case in point is the on-farm monitoring of animal welfare. Farm animal welfare monitoring and auditing protocols increasingly apply behavioural observations and ratings similar to those used in the present study. There is growing consensus that such protocols ought to use animal-based welfare measures where possible (Anonymous, 2012; Blokhuis, Veissier, Miele, & Jones, 2010; Main, Whay, Leeb, & Webster, 2007). These measures are believed to be more directly linked to the true welfare status of the animals than resource-based measures that describe the housing and management conditions in which the animals are kept (Roe, Buller, & Bull, 2011; Webster, Main, & Whay, 2004), but may also involve a greater level of subjectivity. Animal welfare assessment schemes such as the Welfare Quality (2009) protocols not only have to be valid and reliable, they also ought to be feasible and cost-effective if a wide uptake by the food industry is to be achieved. The latter requirement favours assessments based on observer ratings that are relatively nonintrusive, inexpensive and can be used to integrate multimodal information across time and situations at the expense of tests that are perhaps more accurate but that are expensive, time-consuming or dependent on laboratory or technical equipment (Meagher, 2009). With manpower being the main cost of such animal-based monitoring schemes, it is likely that farms are visited and assessed by only a single auditor. Consequently, interobserver reliability has been rightly considered in the selection of the welfare indicators and the training of auditors (Dalmau et al., 2010; Gibbons, Vasseur, Rushen, & de Pasillé, 2012; Phytian et al., 2013). In addition to these precautions, our findings point to the need to pay more attention to how the expectations and predispositions of the auditors may bias their recordings when they assess the welfare of the animals on the farms they visit.

Acknowledgments

We thank C. Moons, T. Decroos, J. Vander Linden and T. Martens for their help with the student trials; M. Levenson for English revision; D. Lapage for the data entry; and S. Buijs, S. De Campeneere and F. Wemelsfelder for commenting on the manuscript. We thank the veterinary students of Ghent University who participated in the trial.

References

Anonymous. (2012). EFSA recommends use of animal-based measures when assessing welfare. Veterinary Record, 170, 112.
Bergsma, R., Kanis, E., Knol, E. F., & Bijma, P. (2008). The contribution of social effects to heritable variation in finishing traits of domestic pigs (Sus scrofa). Genetics, 178, 1559–1570.
Blokhuis, H. J., Veissier, I., Miele, M., & Jones, B. (2010). The Welfare Quality® project and beyond: safeguarding farm animal well-being. Acta Agriculturae Scandinavica Section A: Animal Science, 60, 129–140.
Boutron, I., Guittet, L., Estellat, C., Mohar, D., Hróbjartsson, A., & Ravaud, P. (2007). Reporting methods of blinding in randomized trials assessing nonpharmacological treatments: a systematic review. PLoS Med, 4, e61.
Braeckman, J., & Boudry, M. (2011). De ongelovige thomas heeft een punt: Een handleiding voor kritisch denken. (The reasonableness of doubting Thomas: An invitation to critical thinking). Antwerp, Belgium: Houtekiet.
Burghardt, G. M., Bartmess-LeVasseur, J. N., Browning, S. A., Morrison, K. E., Stec, C. L., Zachau, C. E., et al. (2012). Perspectives: minimizing observer bias in behavioral studies: a review and recommendations. Ethology, 118, 511–517.
Dalmau, A., Geverink, N. A., Van Nuffel, A., Van Reenen, K., Hautekiet, V., Vermeulen, K., et al. (2010). Repeatability of lameness, fear and slipping scores to assess animal welfare upon arrival in pig slaughterhouses. Animal, 4, 804–809.
Dawkins, M. S. (2007). Observing animal behaviour: Design and analysis of quantitative data. Oxford, UK: Oxford University Press.
Day, S. J., & Altman, D. G. (2000). Statistics notes: blinding in clinical trials and other studies. British Medical Journal, 321, 504.
Gaughan, J. B., Mader, T. L., Holt, S. M., & Lisle, A. (2008). A new heat load index for feedlot cattle. Journal of Animal Science, 86, 226–234.
Gibbons, J., Vasseur, E., Rushen, J., & de Pasillé, A. M. (2012). A training programme to ensure high repeatability of injury scoring of dairy cows. Animal Welfare, 21, 379–388.
Goldstein, M. D., Hopkins, J. R., & Strube, M. J. (1994). ‘The eye of the beholder’: a classroom demonstration of observer bias. Teaching of Psychology, 21, 154–157.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: a meta-analysis. Psychological Methods, 4, 403–424.
Hróbjartsson, A., Thomsen, A. S. S., Emanuelsson, F., Tendal, B., Hilden, J., Boutron, I., et al. (2012). Observer bias in randomised clinical trials with binary outcomes: a systematic review of trials with both blinded and non-blinded outcome assessors. British Medical Journal, 344, e1119.
Hróbjartsson, A., Thomsen, A. S. S., Emanuelsson, F., Tendal, B., Hilden, J., Boutron, I., et al. (2013). Observer bias in randomized clinical trials with measurement scale outcomes: a systematic review of trials with blinded and non-blinded assessors. Canadian Medical Association Journal, 185, E201–E211.
Kaptchuk, T. J. (2001). The double-blind, randomized, placebo-controlled trial: gold standard or golden calf? Journal of Clinical Epidemiology, 54, 541–549.
Lehner, P. N. (1979). Handbook of ethological methods. New York: Garland STPM Press.
Lerman, D. C., Tetreault, A., Hovanetz, A., Bellaci, E., Miller, J., Karp, H., et al. (2010). Applying signal-detection theory to the study of observer accuracy and bias in behavioural assessment. Journal of Applied Behaviour Analysis, 43, 195–213.
Mader, T. L., Davis, M. S., & Brown-Brandl, T. (2006). Environmental factors influencing heat stress in feedlot cattle. Journal of Animal Science, 84, 712–719.
Main, D. C. J., Whay, H. R., Leeb, C., & Webster, A. J. F. (2007). Formal animal-based welfare assessment in UK certification schemes. Animal Welfare, 16, 233–236.
Marsh, D. M., & Hanlon, T. J. (2004). Observer gender and observer bias in animal behaviour research: experimental tests with red-backed salamanders. Animal Behaviour, 68, 1425–1433.
Martin, P., & Bateson, P. (1993). Measuring behaviour: An introductory guide (2nd ed.). Cambridge, UK: Cambridge University Press.
Meagher, R. K. (2009). Observer ratings: validity and value as a tool for animal welfare research. Applied Animal Behaviour Science, 119, 1–14.
Miller, L. E., & Stewart, M. E. (2011). The blind leading the blind: use and misuse of blinding in randomized controlled trials. Contemporary Clinical Trials, 32, 240–243.
Nalon, E., Maes, D., Van Dongen, S., van Riet, M. M. J., Janssens, G. P. J., Millet, S., et al. (2014). Comparison of the inter- and intra-observer repeatability of three gait-scoring scales for sows. Animal (in press). http://dx.doi.org/10.1017/S1751731113002462.
Nickerson, R. S. (1998). Confirmation bias: a ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175–220.
Page, M., Taylor, J., & Blenking, M. (2012). Context effects and observer bias: implications for forensic odontology. Journal of Forensic Sciences, 57, 108–112.
Phytian, C. J., Toft, N., Cripps, P. J., Michalaopolou, E., Winter, A. C., Jones, P. H., et al. (2013). Inter-observer agreement, diagnostic sensitivity and specificity of animal-based indicators of young lamb welfare. Animal, 7, 1182–1190.
Risinger, D. M., Saks, M. J., Thompson, W. C., & Rosenthal, R. (2002). The Daubert/Kumho implications of observer effects in forensic science: hidden problems of expectation and suggestion. California Law Review, 90, 1–56.
Roe, E., Buller, H., & Bull, J. (2011). The performance of farm animal assessment. Animal Welfare, 20, 69–78.
Rosenthal, R. (1966). Experimenter effects in behavioral research. New York: Appleton-Century-Crofts.
Rutherford, K. M. D., Donald, R. D., Lawrence, A. B., & Wemelsfelder, F. (2012). Qualitative behavioural assessment of emotionality in pigs. Applied Animal Behaviour Science, 139, 218–224.
Schulz, K. F., Chalmers, I., & Altman, D. G. (2002). The landscape and lexicon of blinding in randomized trials. Annals of Internal Medicine, 136, 254–259.
Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273, 408–412.
Schulz, K. F., & Grimes, D. A. (2002). Blinding in randomised trials: hiding who got what. Lancet, 359, 696–700.
Tuyttens, F. A. M., Sprenger, M., Van Nuffel, A., Maertens, W., & Van Dongen, S. (2009). Reliability of categorical versus continuous scoring of welfare indicators: lameness in cows as a case study. Animal Welfare, 18, 399–405.
Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129–140.
Webster, A. J. F., Main, D. C. J., & Whay, H. R. (2004). Welfare assessment: indices from clinical observations. Animal Welfare, 13, S93–S98.
Welfare Quality. (2009). Welfare Quality® assessment protocol for poultry (broilers, laying hens). Lelystad, Netherlands: Welfare Quality® Consortium.
Wemelsfelder, F. (2007). How animals communicate quality of life: the qualitative assessment of animal behavior. Animal Welfare, Supplement, 16, 25–31.
Wemelsfelder, F., Hunter, E. A., Lawrence, A. B., & Mendl, M. T. (2001). Assessing the ‘whole-animal’: a free-choice-profiling approach. Animal Behaviour, 62, 209–220.
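
As an illustrative aside to the Discussion's argument that high interobserver agreement is no guarantee of absence of expectation bias, the following Python sketch simulates observers who all shift their scores in the same direction when given false information. It is not the analysis used in this study: the clip values, the number of observers, the size of the shared bias and the noise level are all invented, and the function names are hypothetical. Agreement is summarized, as in the panting trial, by the standard deviation of scores between observers.

```python
import random
import statistics


def simulate_scores(true_scores, n_observers, shared_bias, noise_sd, seed):
    """Return one list of scores per observer for the same set of clips.

    Each score = true clip value + a bias shared by every observer
    + small independent observer noise. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    return [
        [t + shared_bias + rng.gauss(0.0, noise_sd) for t in true_scores]
        for _ in range(n_observers)
    ]


def mean_between_observer_sd(scores):
    """Average, over clips, of the SD of scores across observers."""
    per_clip = zip(*scores)  # regroup: one tuple of observer scores per clip
    return statistics.mean(statistics.stdev(clip) for clip in per_clip)


def grand_mean(scores):
    """Mean of all scores from all observers and clips."""
    return statistics.mean(s for observer in scores for s in observer)


# Invented "true" panting-like values for six clips (not data from the study).
true_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

# Same clips scored under correct and under false (expectation-inducing) information.
correct = simulate_scores(true_scores, 20, shared_bias=0.0, noise_sd=0.1, seed=1)
misled = simulate_scores(true_scores, 20, shared_bias=0.4, noise_sd=0.1, seed=2)

sd_correct = mean_between_observer_sd(correct)
sd_misled = mean_between_observer_sd(misled)
bias = grand_mean(misled) - grand_mean(correct)

print("between-observer SD, correct information:", round(sd_correct, 3))
print("between-observer SD, false information:  ", round(sd_misled, 3))
print("mean shift attributable to expectation:  ", round(bias, 3))
```

In this toy setting the between-observer standard deviation stays small under both conditions, so the observers appear to agree well, yet the mean score shifts by roughly the shared bias. Only a comparison against blinded or correctly informed scoring reveals the shift, which is the reason agreement statistics alone cannot substitute for blinding.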
