Corpora
Todd Suomela
Geoffrey Rockwell
Ryan Chartier
University of Alberta
todd.suomela@ualberta.ca
Outline
Defining web archiving
What is a just-in-time corpus?
Case studies
Gamergate
Fort McMurray 2016 Wildfire
Web archiving
Web archiving is the process of gathering up data that
has been recorded on the World Wide Web, storing it,
ensuring the data is preserved in an archive, and
making the collected data available for future research.
Niu 2012
Pioneered by the Internet Archive starting in 1996
Many national libraries and archives have followed since
IIPC - International Internet Preservation Consortium
founded in 2003
The audience:
Researchers in any field who wish to use contemporary
cultural materials for analysis.
The tools:
Web archiving writ large: harvesting material from websites,
online news sources, social media, and multimedia.
Accountability
#BlackLivesMatter
2013 US Government shutdown
Charlie Hebdo
Jasmine Revolution, Tunisia, 2011
WikiLeaks 2010 document release
Memory
Bright, Arthur. Web Evidence Points to pro-Russia Rebels in Downing of MH17 (+video). Christian Science Monitor, July 17, 2014.
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video.
Archive-It - Internet Archive Global Events. Accessed May 28, 2016. https://archive-it.org/organizations/89.
Gamergate
Focal point for arguments about the culture and future
of gaming
Started in 2014
Outgrowth of continuing debates around ethics of game
journalism, social justice, feminism
Connections to the harassment of women in online culture
Building infrastructure
Tool Name | Author | URL - Source
Twarc | Ed Summers | https://github.com/edsu/twarc
json-extractor | Ryan Chartier | https://github.com/recrm/ArchiveTools
warc-extractor | Ryan Chartier | https://github.com/recrm/ArchiveTools
imageboard-scraper | Ryan Chartier | https://github.com/recrm/ArchiveTools
wget | GNU | https://www.gnu.org/software/wget/
Archive-It | Internet Archive | https://archive-it.org/
warcbase | Jimmy Lin, et al. | https://github.com/lintool/warcbase
Compute Canada | - | https://www.computecanada.ca/
R Project | - | https://www.r-project.org/
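Twarc, listed among the tools above, writes tweets as line-oriented JSON (one tweet object per line). The following is a minimal sketch, not the project's own extractor, of how such a file might be flattened into a CSV that R or Excel can open. The field names (`id_str`, `created_at`, `user.screen_name`, `text`, `full_text`) are assumed from the Twitter v1.1 tweet format, and the sample records are invented for illustration.

```python
import csv
import io
import json


def tweets_to_rows(lines):
    """Yield (id, created_at, screen_name, text) for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        user = tweet.get("user") or {}
        # Extended payloads carry "full_text"; older ones use "text" (assumption).
        text = tweet.get("full_text") or tweet.get("text", "")
        yield (
            tweet.get("id_str", ""),
            tweet.get("created_at", ""),
            user.get("screen_name", ""),
            text.replace("\n", " "),  # keep each tweet on one CSV line
        )


def write_csv(lines, out):
    """Write a header row followed by one row per tweet."""
    writer = csv.writer(out)
    writer.writerow(["id", "created_at", "screen_name", "text"])
    writer.writerows(tweets_to_rows(lines))


# Invented sample records standing in for a harvested file.
sample = [
    json.dumps({"id_str": "1", "created_at": "Mon Sep 01 12:00:00 +0000 2014",
                "user": {"screen_name": "example_user"}, "text": "first\nline"}),
    "",
    json.dumps({"id_str": "2", "full_text": "second"}),
]
buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

In practice the `sample` list would be replaced by an open file handle, since the function only needs an iterable of lines.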
Ryan Chartier
[Workflow diagrams; recoverable labels, grouped by source]
Twitter: Twarc, Twitter Search API, JSON, cloud, extract, R scripts, Excel, desktop Python
Websites identified by researchers: copy and paste, sync, file sharing, HTML and text, Voyant, content analysis
4chan/8chan: identify channel, custom Python script, file storage (mostly images), cloud, "should we use?", ultimately deleted
Websites identified by research group: URL seed list, Archive-It, wget, harvest, WARC, HTML files, process on desktop, "maybe"
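The website workflow (URL seed list, wget harvest, WARC output) could be sketched roughly as follows. This is an illustrative assumption about how such a harvest might be scripted, not the team's actual procedure. The helper names and the file names `seeds.txt` and `example-harvest` are hypothetical. The wget options used (`--input-file`, `--warc-file`, `--page-requisites`, `--wait`) are standard GNU wget flags. The command is only built and printed here, not executed.

```python
import shlex


def normalize_seeds(urls):
    """Strip whitespace, drop blanks, and remove duplicates, keeping order."""
    seen = set()
    seeds = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


def build_wget_command(seed_file, warc_name):
    """Return a wget argument list that harvests the seeds into a WARC."""
    return [
        "wget",
        "--input-file=" + seed_file,   # read URLs from the seed list
        "--warc-file=" + warc_name,    # record the crawl as a WARC
        "--page-requisites",           # also fetch images, CSS, scripts
        "--wait=1",                    # pause one second between requests
    ]


# Hypothetical seed URLs identified by the research group.
seeds = normalize_seeds([
    "https://example.org/news/ ",
    "",
    "https://example.org/news/",
    "https://example.org/updates/",
])
command = build_wget_command("seeds.txt", "example-harvest")
print("\n".join(seeds))
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting problems.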
Solutions
Conclusions
How to deal with the ongoing tension?
Best effort v. comprehensiveness
Access to so much material and magic technologies, like
Google, gives the impression that comprehensiveness is
possible
Information overload is not a new phenomenon
Parnassan (distilled) v. the universal (comprehensive) library
Collaborative infrastructure
Many hands must be involved to even attempt to gather
material from the web.
Dependent upon a wide variety of skills and backgrounds