Académique Documents
Professionnel Documents
Culture Documents
2016 3 International Conference on Computing for Sustainable Global Development, 16th - 18th March, 2016
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
rd
Lisha Singh
I INTRODUCTION
We all use Web Browser to extract the needed information
from the Web Sites, if you think this is the only way to access
information from internet then you are missing out a huge
range of available possibilities. Web Scraper is one of those
possibilities, in which we are accessing the information on the
internet using some programs and pre written libraries.
[1]Wikipedia Says Web scraping (web harvesting or web data
extraction) is a computer software technique of extracting
information from websites. Usually, such software programs
simulate human exploration of the World Wide Web by either
implementing low-level Hypertext Transfer Protocol (HTTP),
or embedding a fully-fledged web browser, such as Mozilla
Firefox. Web scraping is closely related to web indexing,
which indexes information on the web using a bot or web
crawler and is a universal technique adopted by most search
engines. In contrast, web scraping focuses more on the
transformation of unstructured data on the web, typically in
HTML format, into structured data that can be stored and
analyzed in a central local database or spreadsheet. Web
scraping is also related to web automation, which simulates
human browsing using computer software. Uses of web
scraping include online price comparison, contact scraping,
weather data monitoring, website change detection, research,
web mashup and web data integration.
[2] Web Harvy says Web Scraping (also termed Screen
Scraping, Web Data Extraction, Web Harvesting etc.) is a
technique employed to extract large amounts of data from
c
978-9-3805-4421-2/16/$31.00 2016
IEEE
689
Proceedings of
o the 10th IND
DIACom; INDIA
ACom-2016; IEEE
I
Conferennce ID: 37465
2016 3 Interrnational Confe
ference on Com
mputing for Suustainable Globbal Developmeent, 16th - 18thh March, 2016
rd
law
wsuits that tie any web scraping with a direct copyright
viollation and veryy clear monetarry damages. There
T
is a casee of
AT&
&Ts where Andrew Au
uernheimer was convicted of
hacking based onn the act of weeb scraping. Although
A
the data
d
a publically available via AT&Ts
A
website,
wass unprotected and
the fact that he wrrote web scrap
pers to harvest that data in mass
m
amoounted to brutte force attack. He did not have
h
to consentt to
term
ms of service to deploy his bots and conduct
c
the web
w
scraaping. The datta was not av
vailable for puurchase. It wassnt
behhind a login. He
H did not even financiallly gain from the
agggregation of the
t
data. Mosst importantlyy, it was bugggy
proggraming by AT
T&T that expo
osed this inform
mation in the first
f
placce. Yet Andreew was at fau
ult. This isnt just a civil suit
s
anyymore. This chharge is a felon
ny violation thhat is on par with
w
hacking or denial of service attaacks and carriees up to a 15-yyear
senttence for each charge.
[4] A Quora user answered Th
he key part is what
w
you wantt to
w the scrapped data. If you
u use it for yoour own, persoonal
do with
use, then it is legal as it falls under fair use doctrine.
d
Theree is
nothhing special inn accessing daata for yourselff with a browsser,
youu can use other means i.e. scraping.The com
mplications starrt if
youu want to use scraped
s
data fo
or other, especiially commerccial,
purpposes. Howeveer even then you
y may be abble to do a lot. A
goood example is deep linking, a practice wheere links to pagges
withhin a site are placed on ano
other site, bypassing target site
s
hom
me page. Theree have been seeveral legal preecedents for suuch
casees and in severral of them cou
urts have ruled that deep linkking
is leegal, even inclluding short descriptions andd meta data frrom
targget pages, as loong as it is cleaar that the site where deep linnks
are placed is not claiming
c
ownerrship of the datta.
Thee legality of thhe scrapper is still
s unclear baased on the vieews
of different
d
peoplle, the only thiing we can stilll conclude about
the legality is thatt if we are nott doing any harrm to the webssite
i is
andd not selling the scrapped datta then it is leggal otherwise it
not any more illeggal.
III DE
ESIGNING A CUSTOM
C
SCR
RAPPER
A Web
W Scrapper broadly
b
compo
osed of two parrts they are:
Webb crawler for crawling
c
links + Data Extracctor from craw
wled
linkks.
A
Web Crawlerr
690
Daata Extractor
are found. Then for added layer filter we can check the new
link with all already found links, but this will add some more
complexity into the code.
D Crawling The Deep Web
Deep Web is the part of internet that is not indexed by even
massive search engines such as google. The surface web is just
a small part of whole internet we have. The sites on Deep Web
contain much more information than Surface Web. So
sometimes we need to crawl and extract the data from Deep
Web too. But the structure of websites on Deep Web is much
challenging, so we need to design our crawler by doing careful
analysis of the structure of website we are focusing on.
Websites placed on Deep Web often goes offline, some
websites are there which comes online at specific time, so to
extract content from such website we need to take into account
lots of issues that can be faced by us.
Browser Identification
Using the USER AGENT field, the HTTP Headers the Web
Crawler is identified to a Web Server. Web Crawler
identification is helpful for Web Administrators to
communicate to the Crawler owner. But some website doesnt
allow those crawlers which dont identify themselves as some
famous Web Browsers, so for these the Crawlers USER
AGENT field is spoofed to a Web Browser so that crawler can
do its work easily. This spoofing sometimes prevents Crawl
traps. So we have to pose like a genuine user.
V DATA EXTRACTOR
The Crawled List of links now available to us in which we are
going to find the required information. If we start to gather
whole website and save it as a copy, this wont be that much
useful. So we have to extract the useful information and
convert that into needed format. All these are done step by
step.
i.
We find out the content we are going to approach;
this can be done using the inspect element button
provided in the context menu of majority of Web
Browser.
ii.
We will find out the common pattern of CSS
Selectors which is same on all other similar pages.
iii.
Check out which method you are going to use you
have support for Xpath or Simple CSS Selectors,
depending upon we can carry our work further.
iv.
We have extractor ready, but we have to store the
data in a well formatted manner which should be very
common to get converted into any other format very
easily, one such format is CSV, check for the support
of CSV into your programming language library.
v.
We have Data Extractor ready, do a recheck by
extracting content on the temporary file so that CSV
format gets a well formatted required output.
Now we have required data, this data can be easily converted
into any other format as CSV is so common to all. We can
convert it into JSON or SQL format.
VI IMPLEMENTAION OF CRAWLER USING PYTHON
691
Proceedings of
o the 10th IND
DIACom; INDIA
ACom-2016; IEEE
I
Conferennce ID: 37465
2016 3 Interrnational Confe
ference on Com
mputing for Suustainable Globbal Developmeent, 16th - 18thh March, 2016
rd
VII IMPLE
EMENTAION OF EXTRACTOR USING
PYTH
HON
Datta can be extraacted using do
oing simple HTML
H
parsing by
usinng python moddule named BeautifulSoup
B
w
which
is goodd in
creaating tree struccture of html elements. Usingg tree structuree of
BeaautifulSoup, wee can navigate through the ellements and reaach
the needed contennt easily.
692
IX FUTURE SCOPE
While throwing lightts on the Futurre of Web Scraaper it can be
seen thhat people willl start to deveelop many kinnd of services
such as Price Compaarison Websitees, Big-Data Analysis.
A
With
Interneet and web tecchnology spreaading, massivee amounts of
data will
w be accessibble on the weeb. Since big data can be
both, structured
s
and unstructured; web scrapers will have to
get shharper and inncisive. With the rise of open source
languaages like Pythoon, R & Ruby,, Customized scraping
s
tools
will onnly flourish brringing in a neew wave of daata collection
and agggregation metthods. The weeb scraping caan be seen as
providing solution too those who want
w
data for their
t
big data
analysiis.
X CONCLU
USION
If proogramming is magic, then web
w scraping is wizardry.
says Ryan
R
Mitchell, author of Book
B
Web Sccraping using
Pythonn. From the diiscussed topicss we can concclude that the
scraping-illegal-depends-on-what-the-meaning-of-theword-is-is/181642
[4]. https://www.quora.com/What-is-the-legality-of-web-scraping
[5]. https://en.wikipedia.org/wiki/Web_crawler
[6]. Kolari , P. and Joshi, A. , Web mining : research and practice ,
Computing in Science & Engineering, IEEE Transactions on
Knowledge and Data Engineering, vol. 6, no. 2,Vol. 6 , No. 4 , 2004
[7]. Malik , S.K. and Ravi , S.M. , Information Extraction Using
693