Suomelarockwellchartier Jitcorpora

Creating Just-in-Time Corpora
Todd Suomela
Geoffrey Rockwell
Ryan Chartier
University of Alberta
todd.suomela@ualberta.ca
Outline
Defining web archiving
What is a just-in-time corpus?
Case studies
Gamergate
Fort McMurray 2016 Wildfire
Building infrastructure and workflows

Social and technical solutions
Web archiving
Web archiving is the process of gathering up data that
has been recorded on the World Wide Web, storing it,
ensuring the data is preserved in an archive, and
making the collected data available for future research.
Niu 2012
Pioneered by the Internet Archive starting in 1996
Many national libraries and archives have followed since
IIPC - International Internet Preservation Consortium
founded in 2003
What is a just-in-time archive?

The need:
To capture rapidly evolving social phenomena as they are
expressed on the Internet and World Wide Web
The audience:
Researchers in any field who wish to use contemporary
cultural materials for analysis.
The tools:
Web archiving writ large: harvesting material from websites,
online news sources, social media, and multimedia.
Why can a just-in-time corpus matter?
Accountability
An anti-memory hole
In 2014 Malaysia Airlines flight MH17 was shot down over Ukraine
Ukrainian separatist leaders claimed responsibility online but later erased the message
The Internet Archive Wayback Machine recorded the original post
Memory
Cultural history (from IA)
blacklivesmatter
2013 US Government shutdown
Charlie Hebdo
Jasmine Revolution, Tunisia, 2011
Wikileaks 2010 document release
Bright, Arthur. Web Evidence Points to pro-Russia Rebels in Downing of MH17 (+video). Christian Science Monitor, July 17, 2014.
http://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17-video.
Archive-It - Internet Archive Global Events. Accessed May 28, 2016. https://archive-it.org/organizations/89.
Gamergate
Focal point for arguments about the culture and future
of gaming
Started in 2014
Outgrowth of continuing debates around ethics of game
journalism, social justice, feminism
Connections to the harassment of women in online culture
Mobilized with a variety of internet tools
Twitter, 4chan and 8chan, Reddit, social media, and more
Significant mainstream media coverage in the fall of 2014
Fort McMurray Wildfire 2016

Fire began on May 1, 2016
Community evacuated on May 3, 2016
Significant economic, cultural, and environmental impacts
University of Alberta libraries began collecting news media, social media, and multimedia on May 4
Optimization of crawls during first 2 weeks of activity
Selecting appropriate seeds
Establishing data limits and crawl scopes to manage the
amount of data collected
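The scoping and data-limit steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the configuration used for the Fort McMurray crawls: the function names `in_scope` and `plan_crawl`, the idea of a per-URL size estimate, and the example hosts are all hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, seed_hosts):
    """Return True if the URL's host is a seed host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in seed_hosts)


def plan_crawl(candidates, seed_hosts, budget_bytes):
    """Select in-scope URLs, in order, until the data budget would be exceeded.

    candidates: iterable of (url, estimated_size_in_bytes) pairs.
    The size estimates are an assumption of this sketch.
    Returns the selected URLs and the total estimated bytes used.
    """
    selected, used = [], 0
    for url, size in candidates:
        if not in_scope(url, seed_hosts):
            continue  # outside the crawl scope, skip it
        if used + size > budget_bytes:
            break  # the next item would exceed the budget, stop here
        selected.append(url)
        used += size
    return selected, used
```

In practice a hosted service such as Archive-It expresses scope and limits through its own settings; the sketch only shows the underlying decision.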
Building infrastructure
Tool Name                      Author              URL - Source
Twarc                          Ed Summers          https://github.com/edsu/twarc
json-extractor                 Ryan Chartier       https://github.com/recrm/ArchiveTools
warc-extractor                 Ryan Chartier       -
imageboard-scraper             Ryan Chartier       -
wget                           GNU                 https://www.gnu.org/software/wget/
Archive-IT                     Internet Archive    https://archive-it.org/
warcbase                       Jimmy Lin, et al.   https://github.com/lintool/warcbase
Cloud-based virtual server     Compute Canada      https://www.computecanada.ca/
R Project                      -                   https://www.r-project.org/
Javascript for visualization   Ryan Chartier       -
Workflow for Twitter hashtag
[Workflow diagram. Steps and tools: Twitter Search API, Twarc, json, extract, R scripts, Excel, Python. Locations: Cloud, Desktop.]
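As a rough illustration of the extract step in this workflow, the sketch below counts hashtags in line-delimited tweet JSON. It assumes each line is one tweet object with hashtags under `entities.hashtags[].text`, as in the Twitter v1.1 JSON format; the function name `hashtag_counts` is hypothetical and this is not the json-extractor tool listed earlier.

```python
import json
from collections import Counter


def hashtag_counts(lines):
    """Count hashtags (case-insensitively) in line-delimited tweet JSON.

    Assumes one tweet object per line, with hashtags stored under
    entities.hashtags[].text. Tweets without that field are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, and `counts.most_common()` gives a table that can be exported for R or Excel.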
Workflow for text corpora
[Workflow diagram. Labels: Websites, Identified by researchers, Copy and paste, HTML and text, Sync, File sharing, Voyant, Content, Analysis.]
Workflow for 4chan and 8chan
[Workflow diagram. Labels: identify channel, 4/8chan, custom Python script, file storage, mostly images, Cloud, should we use? ??, ultimately deleted.]
Workflow for wget and Archive-IT
[Workflow diagram. Labels: Websites, Identified by research group, URL seed list, Archive-IT, wget, Harvest, WARC and html files, process on desktop.]
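For the wget branch of this workflow, a harvest from a seed list into a WARC file might be assembled as below. This is a sketch under assumptions, not the command the authors ran: the helper name `wget_warc_command`, the file names, and the depth and wait values are hypothetical. The flags themselves (`--input-file`, `--warc-file`, `--recursive`, `--level`, `--page-requisites`, `--wait`) are standard GNU wget options.

```python
def wget_warc_command(seed_file, warc_name, depth=1, wait_seconds=1):
    """Build a wget argument list that harvests seed URLs into a WARC file.

    seed_file: text file with one seed URL per line.
    warc_name: base name for the WARC output; wget appends the extension.
    depth and wait_seconds are illustrative defaults, not recommendations.
    """
    return [
        "wget",
        "--input-file=" + seed_file,    # read seed URLs from a file
        "--warc-file=" + warc_name,     # also write the crawl to a WARC file
        "--recursive",                  # follow links from each seed
        "--level=" + str(depth),        # limit recursion depth to bound the crawl
        "--page-requisites",            # fetch images, CSS, and scripts for pages
        "--wait=" + str(wait_seconds),  # pause between requests
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where wget is installed.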
Can we make this easier?

No
Potential information sources are growing but the identification of those sources is lagging
Web technologies are constantly changing and harvesting tools inevitably lag behind
Retrospective web archiving is impossible
Commercial silos like Facebook, YouTube, etc.
Maybe
Integration between twarc and web archive harvesters
Improving analysis tools: warcbase, ArchiveSpark
Collaboration between libraries, archives, researchers
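One possible shape for the integration between twarc and web archive harvesters is to turn the links shared in collected tweets into a URL seed list. The sketch below assumes line-delimited tweet JSON with links under `entities.urls[].expanded_url` (the Twitter v1.1 format); the function name `seed_urls` is hypothetical and no such feature is claimed for twarc itself.

```python
import json


def seed_urls(lines):
    """Collect unique expanded URLs from line-delimited tweet JSON.

    Assumes links are stored under entities.urls[].expanded_url.
    URLs are returned in order of first appearance.
    """
    seen = set()
    seeds = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        tweet = json.loads(line)
        for link in tweet.get("entities", {}).get("urls", []):
            url = link.get("expanded_url")
            if url and url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds
```

Writing the result one URL per line produces a seed file of the kind a harvester can read, though a person would still need to review it before crawling.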
Social and technical problems / solutions
Problems
Dealing with infinite scroll, active page refreshes and other technical changes in web design
Scoping crawls to fit within data budgets
Solutions
Need for collaboration between libraries and researchers
Destroy the silos
Automated or algorithmic collection building
Conclusions
How to deal with the ongoing tension?
Best effort v. comprehensiveness
Access to so much material and magic technologies, like
Google, gives the impression that comprehensiveness is
possible
Information overload is not a new phenomenon
Parnassan (distilled) v. the universal (comprehensive) library
Collaborative infrastructure
Many hands must be involved to even attempt to gather
material from the web.
Dependent upon a wide variety of skills and backgrounds
