
Creating Just-in-Time Corpora
Todd Suomela
Geoffrey Rockwell
Ryan Chartier
University of Alberta

todd.suomela@ualberta.ca

Outline
Defining web archiving
What is a just-in-time corpus?
Case studies
Gamergate
Fort McMurray 2016 Wildfire

Building infrastructure and workflows
Social and technical solutions

Web archiving
Web archiving is the process of gathering up data that
has been recorded on the World Wide Web, storing it,
ensuring the data is preserved in an archive, and
making the collected data available for future research.
Niu 2012
Pioneered by the Internet Archive starting in 1996
Many national libraries and archives have followed since
IIPC - International Internet Preservation Consortium
founded in 2003

What is a just-in-time archive?


The need:
To capture rapidly evolving social phenomena as they are expressed on the Internet and World Wide Web

The audience:
Researchers in any field who wish to use contemporary
cultural materials for analysis.

The tools:
Web archiving writ large: harvesting material from websites, online news sources, social media, and multimedia.

Why can a just-in-time corpus matter?
In 2014 Malaysia Airlines flight MH17 was shot down over Ukraine
Ukrainian separatist leaders claimed responsibility online but later erased the message
The Internet Archive Wayback Machine recorded the original post

Accountability

An anti-"memory hole"

Cultural history (from IA)

#BlackLivesMatter
2013 US Government shutdown
Charlie Hebdo
Jasmine Revolution, Tunisia 2011
Wikileaks 2010 document release

Memory

Bright, Arthur. "Web Evidence Points to pro-Russia Rebels in Downing of MH17 (+video)." Christian Science Monitor, July 17, 2014.
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video.
"Archive-It - Internet Archive Global Events." Accessed May 28, 2016. https://archive-it.org/organizations/89.

Gamergate
Focal point for arguments about the culture and future of gaming
Started in 2014
Outgrowth of continuing debates around the ethics of games journalism, social justice, and feminism
Connections to the harassment of women in online culture
Mobilized through a variety of internet tools
Twitter, 4chan, 8chan, Reddit, and other social media
Significant mainstream media coverage in the fall of 2014

Fort McMurray Wildfire 2016


Fire began on May 1, 2016
Community evacuated on May 3, 2016
Significant economic, cultural, and environmental impacts
University of Alberta Libraries began collecting news media, social media, and multimedia on May 4
Crawls were optimized during the first two weeks of activity
Selecting appropriate seeds
Establishing data limits and crawl scopes to manage the amount of data collected
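The seed selection and scoping step above can be sketched as a simple domain filter; the seed domains and the subdomain rule here are illustrative assumptions, not the library's actual crawl scope:

```python
from urllib.parse import urlparse

# Illustrative seed domains, not the actual Fort McMurray seed list.
SEED_DOMAINS = {"fortmcmurraytoday.com", "rmwb.ca"}

def in_scope(url: str, seed_domains=SEED_DOMAINS) -> bool:
    """Accept a URL when its host is a seed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return any(host == d or host.endswith("." + d) for d in seed_domains)

print(in_scope("http://www.rmwb.ca/news/update.html"))  # True
print(in_scope("http://example.com/unrelated"))         # False
```

A real scope would also cap total bytes and per-host document counts, which is what the "data limits" above refer to.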

Building infrastructure
Tool Name | Author | URL / Source
twarc | Ed Summers | https://github.com/edsu/twarc
json-extractor | Ryan Chartier | https://github.com/recrm/ArchiveTools
warc-extractor | Ryan Chartier | https://github.com/recrm/ArchiveTools
imageboard-scraper | Ryan Chartier | https://github.com/recrm/ArchiveTools
wget | GNU | https://www.gnu.org/software/wget/
Archive-It | Internet Archive | https://archive-it.org/
warcbase | Jimmy Lin et al. | https://github.com/lintool/warcbase
Cloud-based virtual server | Compute Canada | https://www.computecanada.ca/
R Project | R Core Team | https://www.r-project.org/
Javascript for visualization | Ryan Chartier |

Workflow for Twitter hashtag

Twitter Search API → twarc → JSON files
Cloud: extraction and R scripts
Desktop: Python and Excel
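A minimal sketch of the extraction step in the workflow above, assuming twarc's line-oriented output where each line is one tweet in Twitter's v1.1 JSON shape; the sample record is invented for illustration:

```python
import json

def extract_row(line: str) -> dict:
    """Reduce one line of twarc output (one tweet as JSON) to a flat row.

    Field names follow Twitter's v1.1 JSON; adjust for other API versions.
    """
    t = json.loads(line)
    return {
        "id": t["id_str"],
        "created_at": t["created_at"],
        "text": t.get("full_text") or t.get("text", ""),
        "user": t["user"]["screen_name"],
        "hashtags": [h["text"] for h in t.get("entities", {}).get("hashtags", [])],
    }

# Illustrative record, not a real tweet:
sample = json.dumps({
    "id_str": "1", "created_at": "Wed May 04 00:00:00 +0000 2016",
    "text": "Evacuation update #ymmfire", "user": {"screen_name": "example"},
    "entities": {"hashtags": [{"text": "ymmfire"}]},
})
print(extract_row(sample)["hashtags"])  # ['ymmfire']
```

Flat rows like this are what the downstream R scripts and Excel consume.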

Workflow for text corpora

Websites identified by researchers → copy and paste → HTML and text files → file sync and sharing → Voyant → content analysis

Workflow for 4chan and 8chan

Identify channel → 4chan/8chan → custom Python script → cloud file storage (mostly images)
Should we use the images? Ultimately they were deleted.
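The custom scraping step might include a flattening pass like the following, assuming the JSON layout of 4chan's public read-only API (a catalog is a list of pages, each holding a "threads" array); 8chan's layout was similar but field names may differ:

```python
def flatten_catalog(catalog: list) -> list:
    """Collect (thread number, comment HTML) pairs from a catalog dump."""
    rows = []
    for page in catalog:
        for thread in page.get("threads", []):
            # "com" holds the post body as HTML and may be absent.
            rows.append((thread["no"], thread.get("com", "")))
    return rows

# Invented sample in the catalog.json shape, for illustration only:
sample = [{"page": 1, "threads": [{"no": 100, "com": "op post"},
                                  {"no": 101}]}]
print(flatten_catalog(sample))  # [(100, 'op post'), (101, '')]
```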

Workflow wget and Archive-IT

Websites identified by research group → URL seed list → harvest with wget or Archive-It → WARC and HTML files → process on desktop
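The wget side of this workflow can be sketched as command assembly; `--warc-file`, `--input-file`, `--recursive`, `--wait`, and `--page-requisites` are real wget options, while the seed file name, crawl depth, and politeness delay here are illustrative choices:

```python
def wget_command(seed_file: str, warc_name: str, depth: int = 2) -> list:
    """Assemble a wget invocation that writes a WARC alongside the HTML."""
    return [
        "wget",
        "--input-file=" + seed_file,   # one seed URL per line
        "--warc-file=" + warc_name,    # writes warc_name.warc.gz
        "--recursive", "--level=" + str(depth),
        "--wait=1",                    # be polite to the host
        "--page-requisites",           # also fetch images, CSS, JS
    ]

cmd = wget_command("seeds.txt", "ymmfire-2016-05-04")
print(" ".join(cmd))
```

The resulting WARC can then be unpacked on the desktop with warc-extractor.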

Can we make this easier?


No
Potential information sources are growing but the identification of those sources is lagging
Web technologies are constantly changing and harvesting tools inevitably lag behind
Web archiving retrospectively is impossible
Commercial silos like Facebook, YouTube, etc.

Maybe
Integration between twarc and web archive harvesters
Improving analysis tools: warcbase, ArchiveSpark
Collaboration between libraries, archives, and researchers

Social and technical problems / solutions
Problems
Dealing with infinite scroll, active page refreshes, and other technical changes in web design
Scoping crawls to fit within data budgets
Solutions
Need for collaboration between libraries and researchers
Destroy the silos
Automated or algorithmic collection building

Conclusions
How to deal with the ongoing tension?
Best effort v. comprehensiveness
Access to so much material, and magic technologies like Google, gives the impression that comprehensiveness is possible
Information overload is not a new phenomenon
Parnassian (distilled) v. the universal (comprehensive) library

Collaborative infrastructure
Many hands must be involved to even attempt to gather material from the web.
Dependent upon a wide variety of skills and backgrounds
