
NSSPL-HP @vishnu simmha

Web Crawling
Research conducted on web crawling and open source frameworks across languages

Open Source Platforms

Web Crawler
(also known by other terms such as ants, automatic indexers, bots, web spiders, web robots or web scutters)

Top 5 Web Programming Languages

JAVA
PYTHON
RUBY
PHP
C#, C++, CROSS PLATFORM

Open source frameworks in each language:

1. PYTHON BASED
APACHE NUTCH
SCRAPY
KIMONO
SCRAPING HUB
IMPORT.IO
GRUB

2. JAVA BASED

WEBCOLLECTOR
CRAWLER4J
EX-CRAWLER
BIXO
WEB-HARVEST
JOBO
ARACHNID
SMART AND SIMPLE WEB CRAWLER
WEBLECH
CAPEK
GRUNK
LARM
ARALE
SPINDLE
METIS
APERTURE
HOUNDER
WEB EATER
ANDJING
PYCREEP
LUCENE
3. PHP BASED
SPHIDER
OPEN WEB SPIDER

4. RUBY BASED
ANEMONE
CLOUD-CRAWLER
5. C#, C++ AND CROSS PLATFORM
DATAPARK SEARCH
GNU WGET
GRU
HT://DIG
HTTRACK
ICDL CRAWLER
MNOGOSEARCH
OPEN SEARCH SERVER
ASPSEEK
HYPER ESTRAIER
OPEN WEB SPIDER
PAVUK
XAPIAN
ARACHNODE.NET
CRAWWWLER
OPESE
CCRAWLER
CONCLUSION:
Python is the most widely used language for web crawling.
Reasons:
It is efficient and well suited to highly distributed crawling.
The requests library is very powerful while being extremely simple to use. Python also has an excellent HTML/XML parser in lxml; an alternative to lxml is Beautiful Soup.
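As a minimal sketch (the URL and selectors below are placeholder assumptions, not taken from this research), fetching a page with requests and extracting links with lxml or Beautiful Soup looks like this:

# Fetch a page with requests and extract links with lxml; Beautiful Soup shown as an alternative.
# The URL is a placeholder for illustration only.
import requests
from lxml import html
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# lxml: parse the HTML and pull out every link target with XPath.
tree = html.fromstring(response.content)
print(tree.xpath("//a/@href"))

# Beautiful Soup: the same extraction with the alternative parser.
soup = BeautifulSoup(response.content, "html.parser")
print([a.get("href") for a in soup.find_all("a")])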
A scripting language like Python or Perl offers excellent text-processing abilities in the form of regular expressions and low-level string operations. Handling character encodings (which can be a pain with web crawling) is also very easy to do in Python; one of my favourite libraries for this is Unidecode.
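A small illustration of this kind of text processing, assuming the third-party Unidecode package is installed (the sample text is invented for the example):

# Regular expressions for extraction plus Unidecode for taming character encodings.
import re
from unidecode import unidecode

raw = "Café «Crème» – menu at http://example.com/menu"

# Transliterate accented and other non-ASCII characters to plain ASCII.
ascii_text = unidecode(raw)
print(ascii_text)

# Pull URLs out of the text with a simple regular expression.
print(re.findall(r"https?://\S+", ascii_text))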
With a web crawler, most of your time is spent on network I/O, so making it non-blocking is very important for good throughput. Python has many libraries and frameworks available off the shelf to support this.
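One off-the-shelf option is asyncio with the third-party aiohttp library; the sketch below (with placeholder URLs) fetches several pages concurrently instead of blocking on each one in turn:

# Non-blocking page fetching with asyncio + aiohttp; URLs are placeholders.
import asyncio
import aiohttp

URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async def fetch(session, url):
    # Each request yields control while waiting on the network.
    async with session.get(url) as response:
        body = await response.text()
        return url, len(body)

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in URLS))
        for url, size in results:
            print(url, size)

asyncio.run(main())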

Scrapy would be a great choice for building a scalable, distributed crawler. It is built on top of Twisted (an event-driven networking engine) and is used by a few big companies in production systems. It might be overkill if you are doing a weekend project.
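For a sense of scale, a minimal Scrapy spider is only a few lines; the spider name, start URL and selectors below are illustrative assumptions:

# Minimal Scrapy spider sketch; run with: scrapy runspider title_spider.py
import scrapy

class TitleSpider(scrapy.Spider):
    name = "title_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit the page title, then follow every link found on the page.
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)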
Mechanize is another powerful library that can do pretty much anything a user can do when browsing; it was originally built in Perl and now comes in Ruby and Python flavours, among others.
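A short sketch of Mechanize-style browsing in Python (the login URL and form field names are hypothetical):

# Stateful browsing with the Python mechanize port: open a page, fill a form, submit it.
# URL and field names are hypothetical placeholders.
import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)  # for this toy example only; a real crawler should honour robots.txt
browser.open("https://example.com/login")

browser.select_form(nr=0)         # pick the first form on the page
browser["username"] = "alice"
browser["password"] = "secret"
response = browser.submit()
print(response.geturl())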

It is widely believed that a majority of the Googlebot is written in Python.
Python is a "scripting language" (an interpreted language), which makes it an excellent fit for crawling the web: it has its own built-in memory management and good facilities for calling and cooperating with other programs.

Excellent for beginners, yet superb for experts
Highly scalable
Suitable for large projects as well as small ones
Rapid development
Portable and cross-platform
Embeddable
Easily extensible
Object-oriented
Simple yet elegant
Stable and mature
Powerful standard library
Wealth of third-party packages
Java is used where great security and portability are needed; certain kinds of work are best done by certain languages, and for crawling Python is the best fit.

Bibliography:
www.quora.com
http://stackoverflow.com/questions/5555930/is-there-any-javascript-web-crawler-framework
http://forums.udacity.com/questions/19039/java-vs-python-forwriting-a-web-crawler
http://en.wikipedia.org/wiki/Web_crawler
https://www.coursera.org/
www.google.com
http://opendata-tools.org/en/data/
http://www.garethjames.net/a-guide-to-web-scrapping-tools/
