
Improving Wikipedia's Accuracy: Is Edit Age a Solution?

Brendan Luyt, Tay Chee Hsien Aaron, Lim Hai Thian and Cheng Kian Hong
Wee Kim Wee School of Communication & Information, Nanyang Technological University, Singapore 637718.
E-mail: {Brendan, TAY0015, LIMH0050, W060021}@ntu.edu.sg

Wikipedia is fast becoming a key information source for many despite criticism that it is unreliable and inaccurate. A number of recommendations have been made to sort the chaff from the wheat in Wikipedia, among which is the idea of color-coding article segment edits according to age (Cross, 2006). Using data collected as part of a wider study published in Nature, this article examines the distribution of errors throughout the life of a select group of Wikipedia articles. The survival time of each error edit in terms of edit counts and days was calculated, and the hypothesis that surviving material added by older edits is more trustworthy was tested. Surprisingly, we find that roughly 20% of errors can be attributed to surviving text added by the first edit, which confirms the existence of a first-mover effect (Viegas, Wattenberg, & Kushal, 2004) whereby material added by early edits is less likely to be removed. We suggest that the sizable number of errors added by early edits is simply a result of more material being added near the beginning of the life of the article. Overall, the results do not provide support for the idea of trusting surviving segments attributed to older edits, because such edits tend to add more material and hence contain more errors, which do not seem to be offset by greater opportunities for error correction by later edits.
Introduction
Wikipedia: Success or Failure?

Wikipedia is a huge online encyclopaedia of free content ranging from serious academic topics to pop culture. But what really sets Wikipedia apart from other reference sources is that it is an online resource that anyone can edit.1 Wikipedia began life as a complement to Nupedia, a free encyclopaedia with content vetted by scholars. Wiki technology, which allows visitors to easily edit Web site content, was initially added to help speed up collaboration between contributors. But the Nupedia Advisory Board ultimately rejected the Wiki tool and the project was spun off in January 2001 to become the Wikipedia we know today (Sanger, 2005). While Nupedia eventually failed due to the slow peer review process, Wikipedia grew rapidly. Today, the scope of Wikipedia is staggering. As of September 2006, the English version alone has roughly 1.4 million articles and 43,000 active editors (Wikipedia, 2007a). Modelling of Wikipedia's growth shows that, as of 2006, the number of articles and the number of edits were still growing exponentially (Buriol, Castillo, Donato, Leonardi, & Millozzi, 2006).

Supporters (many of whom call themselves Wikipedians) have hailed Wikipedia as yet another success of the open source movement, involving new 'commons-based peer production' methods (Benkler, 2002, para. 2). Others have invoked the idea of a 'wisdom of the crowds' (Surowiecki, 2005): situations where supposedly a mob of people, collectively, prove wiser than any individual expert. However, on the face of it, the idea of a collaborative encyclopaedia project that anyone can edit seems to be an impossible public good (Ciffolilli, 2003). Yet a study by the journal Nature seems to indicate that not only are people willing to contribute articles but the quality of those articles is close to that found in Encyclopaedia Britannica (Giles, 2005). This conclusion was corroborated by a similar study done on the German Wikipedia (Beesley, 2004).

Nevertheless, critics have attacked Wikipedia on a number of fronts, including Wikipedia's ability to handle vandalism and special interest groups (Brandt, 2005), its creeping bureaucracy and growing instances of infighting among editors (Scott, 2004), and the community's anti-intellectual attitude (Sanger, 2004). One critic even accuses Wikipedia supporters of engaging in 'digital Maoism' (an overconfidence in online collectivism and aggregation; Lanier, 2006). McHenry (2004) has branded Wikipedia a 'faith-based encyclopaedia' and claims that there is no reason to believe that newer edits are improving articles in the long run. As Andrew Orlowski points out, the comparison to open source software projects is misleading because, unlike Wikipedia edits, code actually has to work, not merely be written (Orlowski, 2005).
Received May 25, 2007; revised August 3, 2007; accepted August 4, 2007

1 http://en.wikipedia.org/wiki/Main_Page

© 2007 Wiley Periodicals, Inc. Published online 3 December 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20755

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 59(2):318–330, 2008

Studying the Accuracy of Wikipedia

Studying the accuracy of Wikipedia is not easy. However, there are now a number of studies that have attempted to do so.
In the above-mentioned study by Nature, 42 Wikipedia and Britannica articles in the domains of science and math were reviewed by scholars (Giles, 2005). The reviewers found a total of 162 and 123 factual errors, omissions, or misleading statements in Wikipedia and Britannica, respectively, but classified only eight (four for each resource) of these as major (Giles). In a reply to Nature, Britannica pointed out a number of supposed flaws in the study and disputed specific errors noted by Nature's reviewers (Encyclopaedia Britannica, Inc., 2006). Nature, however, refused to withdraw the article, stating that although their reviewers were by no means perfect, any errors made by them were as likely to impact Wikipedia as Britannica (Nature, 2006).

A study published in c't, a German computer technology trade magazine, backs up Nature's claim. It compared 66 entries in the German version of Wikipedia with Microsoft's Encarta as well as Brockhaus (a popular German encyclopaedia). Experts were asked to rate these articles from one to five in terms of correctness, comprehensibility, breadth, and depth. Wikipedia scored full marks for 24 articles as compared to 17 for Brockhaus and 12 for Encarta (Beesley, 2004).

Although there are many more examples of direct reviews of Wikipedia content, most are anecdotal in nature (Magnus, 2006; Read, 2006; Rosenzweig, 2006). A different and less demanding approach involves surveying people about whether they think sample articles are accurate. One survey of 50 respondents found that 76% agreed or strongly agreed that the article in their area of expertise was accurate (Press, 2006). Another survey had 258 academics compare Wikipedia articles in their area of expertise against articles outside it. They found that the former were generally more credible (Chesney, 2006).

Even the worst critics of Wikipedia would probably concede that there are some high-quality articles in Wikipedia, but the difficulty lies in being able to separate the wheat from the chaff. Needless to say, there has been no shortage of proposals on how to help users assess the quality of individual Wikipedia articles.2 These can be generally split into two categories.

2 Wikipedia's leadership is also aware of the problem of ensuring quality entries and has supported a number of efforts to implement systems of quality control. However, only a few of these have borne much fruit over the long term. The exception is the Anti-Vandalism Project which, as the name suggests, cleans up the work of vandals. This work, although essential, does not deal with more subtle errors in entries that are much harder to spot.

The simplest proposals are those based on explicit article validation (Wikipedia, 2007b) whereby a trusted user (defined using various criteria) explicitly marks an article as 'good'. Although anyone can still edit the page, unregistered users will be presented with the last version marked 'good'. This is currently being considered as an experimental feature for the German version of Wikipedia (Slashdot, 2006). Competitors with Wikipedia, like Sanger's Citizendium and Scholarpedia, already display pages approved by experts (Anderson, 2007). A peer-based explicit system (Jensen, 2003) would, in addition, allow users to choose which of their peers to trust, thus providing different results for each user. More complicated proposals would involve reviewers rating articles and aggregating the results. The main disadvantage of such systems is that they require explicit input from reviewers and can be easily abused by spamming.

The second class of proposals does not require editors to explicitly rate articles, but attempts to automatically assess information quality by calculating metrics based on metadata recorded and stored by Wikipedia. Owing to the difficulty of getting large samples of articles reviewed by humans, most research has focused on this second class of proposals by suggesting and validating suitable proxies for quality in lieu of human judgement.

Lih (2004) was the first to suggest Rigor (the total number of edits made for the article so far) and Diversity (the total number of unique editors for the article) as metrics to measure quality. However, Lih was interested in studying changes in Wikipedia articles after they were cited in the press rather than predicting information quality per se, and he did not attempt to validate these metrics other than by a priori reasoning. Also, information quality itself is a multidimensional concept and can be measured in terms of scope, accuracy, stability, and so on (Stvilia, Twidale, Gasser, & Smith, 2005), so it is unclear what aspects should be assessed.

One common method used by researchers investigating Wikipedia's accuracy involves the assumption that featured articles are of higher quality than nonfeatured articles. The aim then becomes to see what distinguishes the two classes of pages and to use those differences to predict levels of accuracy. Wilkinson and Huberman (2007) found that after taking into account age and visibility (using Pagerank as a proxy), featured article status could be predicted by an increased number of edits or number of editors. Stvilia et al. (2005) also found that featured articles have more edits on average and higher readability scores. They also used factor analysis to identify factors that successfully grouped 91% of the featured articles correctly. A study done on the German version of Wikipedia using a topic-attention-quality model found that the number of authors is, by far, the most important factor for predicting quality (Brändle, 2006).

McGuinness et al. (2006) advocated link ratio analysis, using the number of inbound links to each article as a measure much like the well-known Pagerank algorithm (Brin, 1998). The basic idea is that an internal link from another Wikipedia article to the article in question is a vote for quality. Each editor's trust value is then based on the trust values of all the entries edited. McGuinness also proposed a trust tab on each Wikipedia article: when pressed, it color codes each fragment of the article depending on the trust level of the editor who added it. McGuinness et al. found that featured articles have the highest link ratio, followed by normal articles and then clean-up articles (articles tagged by editors as being low quality and hence requiring attention).

Anthony, Smith, and Williamson (2005) took a different tack. Instead of trying to assess the quality of Wikipedia articles and then indirectly assessing the quality of editors as McGuinness does, they obtained a proxy for the quality of

each editor's contribution by deriving the percentage retained of each editor's contribution to the current version. They argued that an editor whose edits are consistently reverted is likely to be contributing low-quality edits, while editors who are contributing high-quality edits tend to have higher survival rates for the text they contribute. Anthony et al. found that for registered users, the level of contribution was directly related to quality. Surprisingly, unregistered users with low levels of contribution (dubbed 'good Samaritans') had a higher quality level of edits, even slightly higher on average than that of high-level contributing registered members.

Variants of Anthony et al.'s (2005) methodology were also used by Zeng, Alhossaini, Ding, Fikes, and McGuinness (2006) as well as Adler and de Alfaro (2007) to assess the trustworthiness or reputation of authors and articles. Zeng et al. found that their measure of article trustworthiness gave high scores for featured articles, as expected. Adler and de Alfaro's study was notable because instead of using a reputation system (based on text and edit survival) to predict featured article status, they used volunteers to rate the edits as high or low quality to validate the metric. They found that while short-lived edits by low-reputation editors were judged low quality 75% of the time, short-lived edits by high-reputation editors were considered poor only 9.2% of the time. This showed that edits could be removed for reasons other than quality. Dondio, Barrett, Weber, and Seigneur (2006) evaluated the trustworthiness of Wikipedia articles using various factors such as 'leadership', 'stability', 'length', and 'importance', and they found that 77.8% of the featured articles are distributed in the region with trust values greater than 70%, while only 13% of standard articles had trust values at the same level.

The above-mentioned studies attempted to evaluate accuracy on a per article basis; however, it is likely that some segments in the same article are more trustworthy or accurate than others, and, if so, a method that could automatically identify these segments would be very useful. Cross (2006) proposed color coding sentence fragments according to how old they are in terms of the number of edits they have survived without being removed. He argues that such an approach would make use of the main strength of Wikipedia (its ability to allow large numbers of people to view and amend articles) to give users some idea of the trustworthiness of the text. To demonstrate his ideas, Cross modified the open source MediaWiki program and tested it against a well-known Internet error accusing Glenn Harlan Reynolds, law professor and operator of the popular Instapundit blog, of drinking blended puppy energy drinks. At one point, this error crept onto the Wikipedia page devoted to Reynolds. Using his age-based algorithm, Cross was able to show that the error would have been identified.

Cross's (2006) approach is interesting, not least due to its intuitive appeal. However, it assumes that the age of a surviving text segment is consistently related to its accuracy. Is this a valid assumption? The rest of this article will, in fact, argue, with the aid of data collected as part of the Nature study, that this is not the case. It is made possible because the Nature study not only rated Wikipedia and Britannica articles in terms of errors found per article but also released the experts' exact comments. This detailed critique provides a unique opportunity to do an analysis at the level of the errors. Given the reviewers' comments, we pinpoint the exact edit that introduced the error (in effect assigning 'blame' for the error) and then study the distribution of the edits that cause errors over time. Edits causing errors, or 'error edits', are then analysed in two major ways: their survival time in terms of total edits (the number of edits they survived without being removed) as well as their survival time in terms of time (the number of days they existed up to the review date without being removed). The ultimate goal is to test the feasibility of trusting text edits based on their age (whether age is calculated in terms of total raw edits or time). If the Cross thesis stands, one would expect more of the errors to be attributed to surviving material from the more recent edits that have not undergone as much scrutiny as surviving material from older edits.

Method

Determine the Version

The first step was to determine the exact dates and times at which samples were extracted from Wikipedia for review by Nature's reviewers, as this was not available from the published results. The 42 reviewed entries were matched with the corresponding version of each article on Wikipedia to obtain the exact date and time stamp. Unfortunately, nine samples were not available and so an approximate date was used for these. In some cases, even with the samples on hand, it was not possible to isolate the exact version because in a small number of cases some material, such as reference lists, was removed to bring the lengths of the entries closer together (Giles, 2005). Also, samples sent to reviewers were stripped of certain identifying material to prevent them from being able to differentiate between Wikipedia's and Britannica's articles. For example, some excluded 'see also' sections, categories that the article was in and, because the samples were sent in text form, hotlinks. As a result, in many cases, more than one version of the article matched the sample Nature gave the reviewers. This was particularly common in cases where edits involved nothing more than adding categories, external links, or hotlinks. Lastly, in two cases, more than one version fitted the reviewed sample because of reverts due to vandalism. However, in looking at the overall review dates, it was possible to conclude that the articles were harvested between October 20, 2005, and November 1, 2005.3 In some cases, the versions were from as early as September 2005, but this reflected a lack of edits between that date and the day the articles were harvested.

3 The articles do not seem to have all been harvested on the same day. Among articles in which an unambiguous match can be determined, there is no single date that reconciles everything.

FIG. 1. Example of a Wikipedia diff comparing an edit of the Cavity magnetron with the original version.

In general, despite these difficulties, we believe that these differences do not substantially affect the precision of the study, as they result in a difference of a few edits at most, or roughly a few hours to about a week's difference, given that the average age of the sample articles is more than 1,000 days.

Assigning Errors to Edits

To ensure that the error coding was as consistent and reliable as possible, two people were assigned to do the coding. There were two difficulties involved in this task. First, the coder had to determine which phrases in the text were responsible for the error. This was easy when the reviewer quoted the segment and then commented on it. For example, the second error for Royal Greenwich Observatory states: 'the castle now houses the International Study Centre . . . Herstmonceux Science Centre is housed in the observatory buildings next to the castle, on the castle grounds, not in the castle itself.' In other cases, the error was fairly clear even without a quote, as in the following comment on the Mendeleev, Dmitry article: 'He got the job at the Technological Institute in 1864, not 1863.' However, to ensure consistency, the coders referred to the Wikipedia page that tracked, point by point, the corrections editors made in response to Nature's review.4 On that page, editors who corrected the errors linked to a diff of their edit, showing what was added and removed to deal with the review point. By looking at the diffs of the correcting edits, the coders were able to identify the segments that were the problem even without much subject knowledge.

4 http://en.wikipedia.org/wiki/Wikipedia:External_peer_review/Nature_December_2005/Errors

FIG. 2. Interface of Wikiblame showing partial search results.

Figure 1 shows the diff for the correcting edit made on December 23, 2005, in response to the first error identified in the Cavity Magnetron article. The phrase 'that was completely unanticipated at the time' contained the error. The next step involved searching past revisions of the articles for the first time the phrase appears. The helpful tool Wikiblame5 was used to do this. Coders would enter the search phrase and instruct the tool to search past versions of the article for the first appearance of the phrase 'that was completely unanticipated at the time', starting from the date of the review and working backwards.

5 http://wikipedia.ramselehof.de/wikiblame.php

Figures 2 and 3 are screen shots taken from Wikiblame. The first shows the interface and partial results of the search. The second shows that the phrase did not exist on February 25, 2002, but was in the versions after that. To verify that this was indeed the case, the link of the edit made on June 10, 2002 (the first time it is identified by Wikiblame), was selected and the diff between the version on that date and the older version was checked (Figure 4).

FIG. 3. Wikiblame finds the searched-for phrase after February 25, 2002.

This additional step was important because the tool shows only exact matches of the phrase searched; in some cases, the phrase that was searched appeared only because of a minor word or sentence rewrite, and as such the error should not be assigned to that edit. In such cases, the new phrase is searched again until the coder is satisfied that the error can be assigned to the edit.
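The backward phrase search that Wikiblame performs can also be scripted. The sketch below is a minimal illustration of the idea written against the public MediaWiki revisions API, not the tool the coders actually used; the endpoint and parameter names follow the API documentation as we understand it, and the example title and phrase are for illustration only.

```python
# Sketch: find the first revision of an article whose text contains a phrase,
# scanning oldest-to-newest via the MediaWiki API (cf. Wikiblame's search).
import requests

API = "https://en.wikipedia.org/w/api.php"

def first_revision_containing(title, phrase):
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp|user|content",
        "rvslots": "main", "rvdir": "newer", "rvlimit": "50",
    }
    while True:
        data = requests.get(API, params=params).json()
        for rev in data["query"]["pages"][0].get("revisions", []):
            if phrase in rev["slots"]["main"].get("content", ""):
                return rev["revid"], rev["timestamp"]
        if "continue" not in data:       # exhausted the revision history
            return None
        params.update(data["continue"])  # standard API continuation

# Hypothetical usage with the Cavity magnetron example from the text:
# print(first_revision_containing("Cavity magnetron",
#                                 "completely unanticipated at the time"))
```

As with the manual procedure, a hit from such an exact-match search still has to be verified against the diff, since minor rewording can shift where the phrase first appears.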

FIG. 4. Diff demonstrating that the edit identified by Wikiblame did, in fact, occur.

Once the error has been assigned to an edit, the coder noted the date and time of the edit and converted it to an ordinal number. For example, in the case described above, the very first edit for the article was on February 9, 2002, while the edit on June 10, 2002, which was assigned the error, was the third edit, and hence the number three was recorded. This is done to facilitate intercoder reliability comparisons later. It should be noted that the reviewers identified not only factual errors in the articles but also significant omissions. In such cases, as there is usually no one edit that can be blamed for the error, the coder entered zero for the entry.

Results of Intercoder Reliability

Using Scott's Pi and Cohen's Kappa measures of intercoder reliability, fairly good values of 0.744 and 0.745 were obtained, respectively. Given that both measures are known to be conservative (Lombard, Snyder-Duch, & Bracken, 2005), the values obtained show a substantial degree of intercoder reliability (Landis & Koch, 1977, p. 165). Reviewing the differences, it was found that the coders disagreed in 30 cases about what constituted an omission, and, out of these 30, 15 mismatches (50%) were attributed to a disagreement over whether an error was an omission or whether an edit could be blamed for it. A representative example is the first error noted by the reviewer on the article on Lomborg, Bjorn: the Copenhagen Consensus project is mentioned but with no explanation of what it actually was. One coder coded this as an omission (zero), while the other assigned the blame to the edit that first mentioned the Copenhagen project. It was decided to drop all such cases from the study, as well as cases in which both coders agreed that the error was an omission (18 cases). A total of 18 + 15 = 33 cases were dropped, leaving 129 errors. Of the remaining disagreements (15 cases), at least four were a result of the reviewer identifying what could arguably be two error edits instead of one. For example, the fifth error noted for Prion stated that the sentence linking prions to memory and cellular differentiation is 'extremely misleading'. However, it was found that two different edits added the link to memory and cellular differentiation at different times, and so, arguably, there were two errors. Similarly, in the case of the third error in the Cambrian explosion article, it was stated that the use of the phrase 'germ lines' was incorrect for diploblastic/triploblastic, but the terms diploblastic and triploblastic were added by different edits after the term 'germ lines' was already included.
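For readers wishing to replicate the agreement figures, a minimal sketch using scikit-learn's implementation of Cohen's Kappa is shown below; the two label lists are invented stand-ins for the coders' actual assignments, not the study's data.

```python
# Sketch: intercoder agreement on the ordinal edit numbers assigned to errors.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 3, 38, 0, 12, 7]  # edit blamed by coder A (0 = omission)
coder_b = [1, 3, 38, 5, 12, 7]  # edit blamed by coder B

print(cohen_kappa_score(coder_a, coder_b))  # 1.0 would be perfect agreement
```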

In cases where there was disagreement, the analysis was run twice. The 'later' model took the error edit to be the later of the two proposed error edits, while the 'earlier' model took it to be the earlier of the two. The results are not appreciably different, and so we will present results only from the 'later' model.

Findings

Analysis of Error Edit by Ordinal Position of Edits

Cross (2006) suggested color coding edits in terms of how old they are. The rationale is that older edits are more trustworthy. However, there are several ways to make use of this idea. One way is to trust only material added in the first n edits. Another way of looking at it is in terms of the number of edits survived. Lastly, using absolute edit counts without taking into account the total number of edits in the article might be unproductive, and so it might be better to scale by the total number of edits such that only the first n percent of edits are trusted.

With an eye toward evaluating these three methods, we analyse the occurrence of the error edits using three measures. Error edit (absolute position) measures the absolute ordinal position of the error edit. This variable ranges from one (the very first edit) to N, where N is the total number of edits made before the article was reviewed. Although this gives us the absolute position of the error edit, the relative position of the error edit in terms of total edits made is also calculated. For example, if the error edit is in the tenth position and there are 100 total edits for that article, the variable Error edit (relative position) is 10/100 × 100% = 10%. This is interpreted as saying that, among all the edits in the life of the article, the error edit appeared in the first tenth of edits. Alternatively, 90 more edits were made without anyone spotting the error, which is what the last statistic, Edits survived, measures.
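For concreteness, the relationship between the three measures can be expressed in a few lines of code. This is a minimal sketch under the assumption that only the error edit's ordinal position and the article's total edit count are known; the function and variable names are ours, not part of the study's tooling.

```python
# Sketch of the three ordinal measures used in this section.
def ordinal_measures(error_position: int, total_edits: int) -> dict:
    """error_position: 1-based ordinal position of the error edit.
    total_edits: edits made to the article before it was reviewed."""
    return {
        # Error edit (absolute position)
        "absolute_position": error_position,
        # Error edit (relative position), as a percentage of all edits
        "relative_position_pct": error_position / total_edits * 100,
        # Edits survived: later edits that left the error in place
        "edits_survived": total_edits - error_position,
    }

# The worked example from the text: an error in the 10th of 100 edits.
print(ordinal_measures(10, 100))
# {'absolute_position': 10, 'relative_position_pct': 10.0, 'edits_survived': 90}
```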
Table 1 shows that the median Error Edit (absolute posi-
tion) was 38 (median number of total edits is 107). In terms
of relative position, the median error edit occurred slightly
after 31% of all the edits were made. This seems to indicate
that most error edits occurred relatively early. Also, note that
the mode of the absolute error edit is one, meaning that the
most common error edit is attributed to the first edit! In
terms of edits that occurred after the error edit, half of the FIG. 6. Error edits (absolute position) cumulative percentage.
edit errors survived 63 edits without being removed.

The histogram of Error Edit (absolute position; Figure 5)


TABLE 1. Descriptive statistics of error edits (absolute position).
is highly skewed to the right (skewness  2.032, SD  0.213).
Descriptive Statistics In fact, the cumulative percentage graph in Figure 6 shows
this is attributable to the fact that slightly under 20% of
Error edit Error edit Edits errors could be attributed to the first edit! By the ninth edit,
(absolute position) (relative position) survived
over 30% of the error edits have been made.
N 129.0 129.00 129.0 In terms of error edits relative to total edits or Error edit
Mean 54.3 35.80 88.1 (relative position), the histogram (Figure 7) is also right
Median 38.0 31.30 63.0 skewed (skewness  0.513, SD  0.213) but to a lesser ex-
Mode 1.0 0.51 195.0 tent. The cumulative percentage graph (Figure 8) shows that
Minimum 1.0 0.00 0.0
roughly 11% of error edits occurred in the first 1% (or less)
Maximum 347.0 100.00 312.0
of edits.

Last, in terms of Edits survived, the graph (Figure 9) shows that if an arbitrary cut-off of roughly 30 edits is chosen, so that material in the most recent 30 edits is considered not trustworthy, this would reduce the number of errors by 25%.

FIG. 9. Edits survived cumulative percentage.

Next, we study the distribution of edit errors by dividing them into three groups. The first group consists of errors that appeared in the first third of the life of the article (in terms of total edits), or Error edit (relative position) between zero and 33%. The second group consists of errors that appeared during the next third, or Error edit (relative position) from 34 to 66%. And the last consists of errors that appeared during the last third of the life of the article (in terms of edits), or Error edit (relative position) from 67 to 100%. The assumption of proposals that seek to color code segments based on edit age is that text segments added by older edits are more trustworthy. Here, we test the weaker null hypothesis:

H0: Error edits are evenly distributed between the first third, second third, and last third of the article's life.

TABLE 2. Chi-square test results, H0: Error edits are evenly distributed.

                       Observed N   Expected N   Residual
First third of edits   69           43.0         26.0
Second third of edits  32           43.0         -11.0
Last third of edits    28           43.0         -15.0
Total                  129

Test statistics (thirds): Chi-square(a) = 23.767, df = 2, Asymp. Sig. = .000.
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 43.0.

The results of the chi-square test (Table 2) show that the null hypothesis is rejected at less than the 1% significance level. This can be attributed to the excess of cases in the first third (69 obtained, 43 expected), precisely the group that, if Cross is correct, should show the least number. In fact, contrary to Cross, there are fewer errors than expected in the last third of edits.
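Because the test in Table 2 is a standard one-way chi-square against a uniform expectation, it can be reproduced from the observed counts alone. The sketch below assumes scipy is available; the counts are those reported in Table 2.

```python
# Sketch: chi-square test of evenly distributed error edits (Table 2).
from scipy.stats import chisquare

observed = [69, 32, 28]        # first, second, last third of edits (N = 129)
chi2, p = chisquare(observed)  # expected defaults to the mean, 129/3 = 43
print(round(chi2, 3), p)       # 23.767, p ~ 7e-6: rejected well below 1%
```

The same call with the day-based counts reported later in Table 5, chisquare([50, 28, 51]), reproduces the chi-square of 7.860 given there.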

Next, three error rates were calculated for each article: the error rate for the first third, second third, and last third of edits. For example, the article Agent Orange had two error edits, one with an Error edit (relative position) of 30% (first third) and another of 51% (second third). There are a total of 225 edits for the article, so each third segment has 225/3 = 75 edits. Hence, the error rates are 1/75 for the first third, 1/75 for the second third, and 0/75 for the last third.

An ANOVA test was then carried out to test the null hypothesis that the mean error rates for articles in all three groups were equal. However, as the results in Table 3 show, we are unable to reject the null hypothesis.

TABLE 3. ANOVA results, H0: Mean error rates for all three groups are equal.

ANOVA: Single factor

Summary
Groups        Count   Sum       Average    Variance
First third   37      2.75792   0.074538   0.013845
Second third  37      7.386     0.199622   1.076185
Last third    37      0.71343   0.019282   0.000613

ANOVA
Source of variation   SS         df    MS         F          P-value    F crit
Between groups        0.631732   2     0.315866   0.868843   0.422347   3.080388
Within groups         39.26315   108   0.363548
Total                 39.89488   110
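The single-factor ANOVA in Table 3 can be reproduced with scipy's f_oneway, given the per-article error rates for each third. The three lists below are invented stand-ins for the study's 37 values per group, so only the call pattern (not the printed output) matches the paper.

```python
# Sketch: one-way ANOVA on per-article error rates by third (cf. Table 3).
from scipy.stats import f_oneway

# Each list would hold 37 per-article rates (errors in that third / edits
# in that third); these short lists are invented stand-ins.
rates_first  = [1/75, 0.0, 2/80]
rates_second = [1/75, 0.1, 0.0]
rates_last   = [0.0,  0.0, 1/80]

F, p = f_oneway(rates_first, rates_second, rates_last)
print(F, p)  # with the study's data: F = 0.869, p = 0.422 (Table 3)
```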
Analysis of Error Edit by Length of Time

Analysing error edits by raw counts does not take into account when the edit was done. The most obvious way in which this can have an impact is in the case of articles having very different ages. Consider a scenario in which two articles with very different ages have the same number of total edits.6 They will have the same absolute and relative edit error positions; however, one article was created 3 years ago and the other was created just last week. Obviously, the nth edits of the former would have been created a long time ago as compared to the latter.7 Arguably, one would be more inclined to trust an edit that was created a year ago than one just created an hour ago, because more people have looked at it even though they did not change it.

Even if two articles are created at roughly the same time with the same number of total edits, different editing distribution patterns can cause the edits with the same absolute and relative edit positions to be added at very different times. If lots of edits are made in a short span of time after the error edit was made, the life of the edit error would be shorter than if the same number of edits was spaced out more. As a result, it would be better to analyse the error edits in terms of the time they have existed, or when they first appeared.

Parallel to the work done in the last section, we calculate three variables. First, we analyse the occurrence of the error edit in terms of how early it was created (in numbers of days) after the creation of the article, which we will call Error edit (days after creation). This is analogous to Error edit (absolute position) except that it measures how early the edit was created in days and not in edit counts. This variable ranges from 0 (the very first day) to N, where N is the total number of days in the life of the article. Next, we calculate Error edit (days after creation as a percentage of article life), which scales the above-mentioned variable by the age of the article. Last, we calculate Days survived, the number of days the error existed without being removed.

For example, if an error edit appeared 30 days after the article was created and the age of the article as of the day it was assessed is 300 days, then Error edit (days after creation) will be 30, Error edit (days after creation as a percentage of article life) will be 30/300 = 0.1,8 and Days survived will be 300 − 30 = 270.

6 It is unlikely that an older article will have the same number of edits. Rather, it is more likely to have more edits; hence using relative edit count position might compensate for some, but not all, of the difference.
7 Assuming an approximately equal time distribution of edits.
8 Alternatively, and perhaps more intuitively, the edit error existed for 270/300 = 90% of the life span of the article on average. Results were presented the way they were in the text to facilitate comparison with Error edit (relative position).

TABLE 4. Descriptive statistics of Error edits (days).

          Error edit       Error edit (days after     Days
          (days after      creation as percentage     survived
          creation)        of article life)
N         129              129                        129
Mean      628              47.4                       585
Median    695              62.1                       486
Mode      0                0.0                        1,207
Minimum   0                0.0                        0
Maximum   1,668            100.0                      1,670

The second column in Table 4 shows that the median edit error appeared 695 days (23 months) after the creation of the article. The median edit error survived for 486 days (16 months) from the time it was first added until it was checked by the expert. The third column is calculated by normalizing the days the errors existed by the total age of the article. In terms of the article's age, the median error edit appears after 62% of the article's life has elapsed. Though it might seem counterintuitive, these results are fully consistent with the fact that slightly less than 25% of errors occurred on the first day (the vertical intercept in Figure 10 shows that slightly less than 25% of errors existed at the very start of the life of the article, on the first day). These results are also consistent with Figure 6, which shows that slightly less than 20% of error edits can be attributed to the first edit.9

9 The difference is due to the fact that more than one edit is done on the first day.

FIG. 10. Error edits (days after creation) cumulative percentage.

The error edits were then divided into three groups. The first group consisted of errors that appeared during the first third of the life of the article. The second group consisted of errors that appeared during the next third. And the last consisted of errors that appeared during the last third of the life of the article. A chi-square test was conducted to see if we can reject the following null hypothesis:

H0: Error edits are evenly distributed between the first third, second third, and last third of an article's life (in terms of days).

TABLE 5. Chi-square test, H0: Error edits (by age) are evenly distributed.

              Observed N   Expected N   Residual
First third   50           43.0         7.0
Second third  28           43.0         -15.0
Last third    51           43.0         8.0
Total         129

Test statistics: Chi-square(a) = 7.860, df = 2, Asymp. Sig. = .020.
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 43.0.

As the results presented in Table 5 show, the null hypothesis is rejected (P < 0.05). The cross tab shows that this can be attributed to the excess of cases in the first third (50 obtained, 43 expected).

The analysis mentioned above shows that once again we can reject the null hypothesis that the errors are evenly distributed throughout the life of the articles (in days). What is different (when compared to the analysis using edit counts) is that although there are still more cases occurring in the first third than expected, there are also more cases than expected occurring in the last third. This has also increased the median relative error edit from 31% (Table 1) to 62% (Table 4). We will investigate why in the next section.

Discussion

The analysis of error edits by ordinal position shows that a sizeable number of error edits occur in the very first edit and that error rates among the three segments are not significantly different, thus casting doubt on schemes to code reliability based on edit age. Viegas et al. (2004), using the history flow visualization technique, identified what they call the 'first-mover advantage', whereby text added at the beginning of an article's life survives longer and undergoes fewer modifications. They speculate that this is because the initial author sets the tone of the article, and as such their edits are seldom removed compared to later edits. This effect would not be a serious problem if it was restricted to tone and direction and if the initial edits were free of errors, but our analysis suggests that the initial edits (of which typically the first few are by the same editor) are often riddled with errors. In terms of absolute edits, marking edits after the first 38 or the last 63 edits as untrustworthy would cut out 50% of the errors. However, considering that the average article in the sample has 135 edits, this would mean ignoring most of the edits! Moreover, it is unlikely that using absolute edit counts as cut-offs would be useful, because it would unfairly penalise articles with many more or fewer total edits than average if the cut-offs are done using the first n or last n edits, respectively. As such, using a cut-off relative to the total number of edits might be better. As shown previously, a cut-off that treats edits after the first 31% as untrustworthy would remove 50% of the errors. However, taking into account that the average number of errors per article is 3.6, such a measure would have a very high cost associated with it, as very little of the article would be left if such a policy was implemented.

It is somewhat counterintuitive to find that when error edits are analysed in terms of length of time, they are shifted forward as compared to when they are grouped in terms of edit counts. In Table 6, the 129 error edits are coded as one when they are in the first third (whether in terms of edits or days), two when they are in the next third, and three when they are in the last third. Each edit error is hence grouped in two ways, and the two groupings are then compared to see if there is any difference.

TABLE 6. Comparing Error edits when classified by edit and by age.

Difference (time - edits)   Number of edits   % of total
-1                          3                 2
Same                        85                66
+1                          37                29
+2                          4                 3
Total                       129               100

Note. Differences when classifying Error edit by edit count and by time.

As Table 6 shows, 85 of the edits (66%) show no difference whether they are grouped using total edits or by age. In three cases (2%), grouping the error by time rather than raw edits causes the error edit to fall into an earlier group. But in almost a third of the cases (29 + 3 = 32%), the opposite occurs, and grouping by age rather than edit counts causes the edit to fall into a later group. An analysis of the articles in question indicates that this is largely a result of uneven editing distributions. In most cases, the articles began to accumulate edits at a faster rate as time went by, probably due to the growth in the number of Wikipedia editors. For example, in the case of Agent Orange, the sixty-seventh edit occurred roughly after 33 months. However, the next sixty-seven edits (up to the one hundred and thirty-fourth edit) took only 10 months. As a result, although the edit appears relatively early (first third) in terms of edits, in terms of length of time it is actually in the last third. Cases in which the reverse happens are rarer, typically occurring when the article is very new, with a large number of early edits taking place at the same time followed by few edits. For example, the article on Kinetic isotope effect was created only on October 27, 2004. It had a total of 18 edits at the time of harvesting. An error edit occurred at the ninth position, while edits one to nine were all by the same editor on the same day, and so the first nine edits took place in less than a month. The other nine edits took another 12 months. As a result, in terms of raw edits, the error edit was in the second group, but in terms of length of time, it was actually in the earlier first group.
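The comparison behind Table 6 amounts to assigning each error edit to a third twice, once by edit count and once by elapsed days, and tallying the difference. A sketch follows; the per-error tuples are invented stand-ins for the study's 129 cases.

```python
# Sketch of the Table 6 cross-classification.
from collections import Counter

def third(position: float, total: float) -> int:
    """Return 1, 2, or 3 for the third of [0, total] that position falls in."""
    return min(int(position / total * 3) + 1, 3)

# Each tuple: (edit position, total edits, days after creation, article age).
errors = [(10, 100, 300, 400), (67, 225, 1000, 1100), (9, 18, 20, 400)]

diffs = Counter(third(days, age) - third(pos, total)
                for pos, total, days, age in errors)
print(diffs)  # with the study's data: {0: 85, +1: 37, +2: 4, -1: 3}
```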
We can now understand why classifying edit errors based on time as opposed to edit counts yields different results. To reiterate, because of the increasing number of edits per unit time (due to the growth of Wikipedia), when comparing edit errors in terms of raw counts and in terms of age, the same edit will often appear in a later group. This accounts for the greater number of cases in the last third. However, as mentioned previously, 20% of errors fall on the very first edit, and this will in almost all cases still be ranked in the first third of edits, even by time.

Despite this caveat, it still does not seem productive to use edit age as a measure of trustworthiness. The median error survived over 486 days (16.2 months; Table 4). Unlike acts of vandalism, which have mean survival times measured in minutes, honest errors are much more difficult to catch. Our findings support researcher DFNfrozenNorth (2004), who found that plausible errors were not removed within a week (after which he reverted the errors), in contrast to obvious vandal edits. Based on the results here, it is likely that if he had not reverted them, they would have persisted for a fairly long time. If the criterion is set relative to the age of the article, one would need to discard edits made after the first 62% of the article's life to remove half of the errors.

Why do so many errors accumulate at the beginning, when the article is new, rather than at the end? This is somewhat puzzling because we have no evidence to believe that the first or early editors are inherently less knowledgeable than later editors. Besides the first-mover advantage already noted, one possibility is that more content is simply added at the beginning because there is more scope for adding new material. As such, the bulk of errors will come from material contributed early on, even after taking into account more chances for error correction.

Another striking fact is that while there are 129 error edits identified, 54 (41%) stem from edits that are responsible for more than one error. For example, in 12 cases the same edit accounts for two errors in the same article, in 6 cases one edit accounts for three errors, in 1 case an edit accounts for four errors, and in another case one edit accounts for an astonishing eight errors! The article on Mendeleev is the biggest culprit here, with the very first edit accounting for eight errors, while a later, eighty-sixth edit (by a different editor), which added a lot of biographical detail, accounts for four. This analysis does not take into account that some editors might space out their contributions into smaller subsequent edits that are essentially the same edit.10 For instance, all seven errors in Acheulean can be considered part of a 'super edit' by one editor who made 17 edits in a row (of which 12 were within half an hour) from the creation date. All this seems to indicate that material is added unevenly, such that certain edits add the bulk of the material and hence result in the bulk of the errors. This is consistent with other findings that a relatively small number of elite editors add the bulk of material (Kittur, Chi, Pendleton, Suh, & Mytkowicz, 2007; Spek, 2006).

The fact that content-adding edits are unevenly distributed, coupled with the assumption that most large contributions occur in the initial stages of an article's life when there is scope for expansion, probably explains why most error edits occur early and why most of the edits after that are minor rewrites to improve language rather than fact checking and error correction. It would be tempting to say that the results here show that Wikipedia's much vaunted strength of self-healing and error correction does not seem to be working; however, such a view should be resisted. This is because the analysis is based on errors that existed at a certain snapshot in time. We have no indication of the actual total number of errors that existed throughout the life of the articles and were removed, just those errors that existed at the time of the assessment.

10 One possibility for such behavior is that editors fear that the system will stall before they can finish the edit and hence prefer to lock in their edits as soon as possible to avoid losing their work.

Conclusion

In this article we analysed the distribution of errors by survival time in terms of edits and time. The general finding is that a significant number of error edits occur in the very first few edits: roughly 20% of error edits appeared on the first day and on the first edit.

The ANOVA tests of the mean error rates of the three sections (divided by edit counts) show that there is no significant difference between them, and so classifying edit age by edit counts does not seem useful. Analysing the error edits in terms of survival time shows that things are not as bleak as they seem: in terms of survival time scaled by the age of the article, many more error edits appear in the last third, mostly due to the fact that more edits were made in a shorter span of time as Wikipedia's popularity grew. However, this still does not give much support to the feasibility of trusting edit contributions based on survival time. The main problem is that a fifth of the errors are attributable to the very first edit, so that any measure of trustworthiness based on the chronological order of edits (whether by time, edits, or editors) will not be able to filter out this sizeable number of errors.

It should be noted that this conclusion is particularly significant in light of the fact that the method used has a built-in bias toward locating error edits later rather than earlier. First, in cases of disagreement, we used the 'later' model rather than the 'earlier' model. Second, our method of locating the error edit is based on searching from the most recent edits back to the earliest. Although reasonable care was taken to ensure that we located the edit that added the substance of the error rather than its exact wording, it is likely that in some cases we settled on an edit too soon, missing an earlier edit from a rewrite that actually implied or made the same erroneous point in a different way.

The findings of this article have serious implications for the Wikipedia community. They suggest that authors and editors need to become more familiar with the possible existence of first-mover effects in entries and, as a result, pay more attention to critically editing the existing core material rather than blindly adding more. For users of Wikipedia, they suggest that age-based mechanisms for coding trustworthiness will not be capable of elevating Wikipedia's status as a reference tool. For the foreseeable future, users will have to continue to follow the maxim caveat emptor in their dealings with this online encyclopaedia.

References

Adler, B.T., & de Alfaro, L. (2007). A content-driven reputation system for the Wikipedia. Proceedings of the 16th International World Wide Web Conference, Banff, Alberta, Canada: IW3C2. Retrieved November 1, 2007, from http://www2007.org/proceedings.html

Anderson, N. (2007). 2007: The year of the expert wiki? Retrieved January 22, 2007, from http://arstechnica.com/news.ars/post/20070112-8604.html

Anthony, D., Smith, S.W., & Williamson, T. (2005). Explaining quality in Internet collective goods: Zealots and good Samaritans in the case of Wikipedia. Retrieved July 25, 2007, from http://web.mit.edu/iandeseminar/Papers/Fall2005/anthony.pdf

Beesley, A. (2004). Wikipedia triumphs in c't study. Retrieved January 26, 2007, from http://www.wikisearch.org/2004/10/wikipedia-triumphs-in-ct-study.htm

Benkler, Y. (2002). Coase's Penguin, or Linux and the nature of the firm. The Yale Law Journal, 112(3), 369–446.

Brändle, A. (2006). Too many cooks don't spoil the broth. Retrieved January 21, 2007, from http://meta.wikimedia.org/wiki/Transwiki:Wikimania05/Paper-AB1

Brandt, D. (2005). Screen shots of Wikipedia vandalism. Retrieved January 22, 2007, from http://www.wikipedia-watch.org/vandals.html

Brin, S. (1998). The anatomy of a large-scale hypertextual Web search engine. Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia: IW3C2.

Buriol, L.S., Castillo, C., Donato, D., Leonardi, S., & Millozzi, S. (2006). Temporal analysis of the Wikigraph. Proceedings of the International Conference on Web Intelligence (pp. 45–51), Hong Kong: IEEE/WIC/ACM. Retrieved November 1, 2007, from http://www.dcc.uchile.cl/~ccastill/papers/buriol_2006_temporal_analysis_wikigraph.pdf

Chesney, T. (2006). An empirical examination of Wikipedia's credibility. First Monday, 11(11). Retrieved July 25, 2007, from http://www.firstmonday.org/issues/issue11_11/chesney

Ciffolilli, A. (2003). Phantom authority, self-selective recruitment and retention of members in virtual communities: The case of Wikipedia. First Monday, 8(12). Retrieved July 25, 2007, from http://www.firstmonday.org/issues/issue8_12/ciffolilli

Cross, T. (2006). Puppy smoothies: Improving the reliability of open, collaborative wikis. First Monday, 11(9). Retrieved July 25, 2007, from http://www.firstmonday.org/issues/issue11_9/cross

Dondio, P., Barrett, S., Weber, S., & Seigneur, J.M. (2006). Extracting trust from domain analysis: A case study on the Wikipedia project. Paper presented in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Retrieved November 1, 2007, from http://cui.unige.ch/~seigneur/publications/ExtractingTrustfromDomainAnalysisfinalcorrected.pdf

DFNfrozenNorth. (2004). How authoritative is Wikipedia? Retrieved January 15, 2007, from http://www.frozennorth.org/C2011481421/E652809545/index.html

Encyclopaedia Britannica, Inc. (2006). Refuting the recent study on encyclopedic accuracy by the journal Nature. Retrieved November 1, 2007, from http://corporate.britannica.com/britannica_nature_response.pdf

Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900–901.

Jensen, C., Davis, J., & Farnham, S. (2003). Finding others online: Reputation systems for social online spaces. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing Our World, Changing Ourselves. Minneapolis, Minnesota: ACM.

Kittur, A., Chi, E., Pendleton, A., Suh, B., & Mytkowicz, T. (2007). Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. Proceedings of the 25th Annual Conference on Human Factors in Computing Systems (CHI 2007). San Jose, California: ACM. Retrieved November 1, 2007, from http://www.viktoria.se/altchi/submissions/submission_edchi_1.pdf

Landis, R.J., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

Lanier, J. (2006). Digital Maoism: The hazards of the new online collectivism. Retrieved January 20, 2007, from http://www.edge.org/3rd_culture/lanier06/lanier06_index.html

Lih, A. (2004). Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource. Proceedings of the International Symposium on Online Journalism, Austin: University of Texas. Retrieved November 1, 2007, from http://jmsc.hku.hk/faculty/alih/publications/utaustin-2004-wikipedia-rc2.pdf

Magnus, P.D. (2006). Epistemology and the Wikipedia. North American Computing and Philosophy Conference. Troy, New York: Rensselaer Polytechnic. Retrieved November 1, 2007, from http://www.fecundity.com/job/wikipedia.pdf

McGuinness, D., Zeng, H., da Silva, P.P., Li, D., Narayanan, D., & Bhaowal, M. (2006). Investigations into trust for collaborative information repositories: A Wikipedia case study. Retrieved July 25, 2007, from http://www.l3s.de/~olmedilla/events/MTW06_papers/paper23.pdf

McHenry, R. (2004, November 15). The faith-based encyclopedia. TCS Daily. Retrieved January 24, 2007, from http://www.techcentralstation.com/111504A.html

Nature. (2006, March 23). Encyclopaedia Britannica and Nature: A response. Retrieved January 10, 2007, from http://www.nature.com/nature/britannica/index.html

Orlowski, A. (2005, October 27). Why Wikipedia isn't like Linux. The Register. Retrieved January 23, 2007, from http://www.theregister.co.uk/2005/10/27/wikipedia_britannica_and_linux/page2.html

Press, L. (2006). Survey of Wikipedia accuracy and completeness. Retrieved January 20, 2007, from http://bpastudio.csudh.edu/fac/lpress/wikieval/

Read, B. (2006). Can Wikipedia ever make the grade? Retrieved January 20, 2007, from http://chronicle.com/free/v53/i10/10a03101.htm

Rosenzweig, R. (2006). Can history be open source? Wikipedia and the future of the past. The Journal of American History, 93(1), 117–146.

Sanger, L. (2004). Why Wikipedia must jettison its anti-elitism. Retrieved January 21, 2007, from http://www.kuro5hin.org/story/2004/12/30/142458/25

Sanger, L. (2005). The early history of Nupedia and Wikipedia: A memoir. Retrieved January 21, 2007, from http://features.slashdot.org/article.pl?sid=05/04/18/164213

Scott, J. (2004). The great failure of Wikipedia. Retrieved January 22, 2007, from http://ascii.textfiles.com/archives/000060.html

Slashdot. (2006). More wiki than ever. Retrieved January 20, 2007, from http://yro.slashdot.org/article.pl?sid=06/08/31/2224230

Spek, S. (2006). Wikipedia: Organisation from a bottom-up approach. International Symposium on Wikis, Odense, Denmark: WikiSym. Retrieved November 1, 2007, from http://arxiv.org/PS_cache/cs/pdf/0611/0611068v2.pdf

Stvilia, B., Twidale, M.B., Gasser, L., & Smith, L.C. (2005). Information quality in a community-based encyclopedia. Proceedings of the International Conference on Knowledge Management, Charlotte, North Carolina: iKnow. Retrieved November 1, 2007, from http://www.isrl.uiuc.edu/~stvilia/papers/quantWiki.pdf

Surowiecki, J. (2005). The wisdom of crowds. New York: Random House.

Viegas, F., Wattenberg, M., & Kushal, D. (2004). Studying cooperation and conflict between authors with history flow visualizations. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vienna, Austria: ACM. Retrieved November 1, 2007, from http://alumni.media.mit.edu/~fviegas/papers/history_flow.pdf

Wikipedia. (2007a). Wikipedia: About. Retrieved January 20, 2007, from http://en.wikipedia.org/wiki/Wikipedia:About

Wikipedia. (2007b). Article validation. Retrieved January 21, 2007, from http://meta.wikimedia.org/wiki/Article_validation

Wilkinson, D.M., & Huberman, B.A. (2007). Assessing the value of cooperation in Wikipedia. First Monday, 12(4). Retrieved July 25, 2007, from http://www.firstmonday.org/issues/issue12_4/wilkinson

Zeng, H., Alhossaini, M., Ding, L., Fikes, R., & McGuinness, D.L. (2006). Computing trust from revision history. Proceedings of the International Conference on Privacy, Security and Trust, University of Ontario Institute of Technology. Retrieved November 1, 2007, from http://ebiquity.umbc.edu/_file_directory_/papers/302.pdf

