
A REPORT

ON
WikiData: Harvesting data dumps into MongoDB and
constructing its knowledge graph
BY

ADITYA MANGLA

(2012A7PS209P)

AT

Homi Bhabha Centre for Science Education, TIFR


A Practice School-I station of
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
(23 MAY, 2014 - 17 JULY, 2014)


A REPORT
ON
WikiData: Harvesting data dumps into MongoDB and
constructing its knowledge graph
BY

ADITYA MANGLA

2012A7PS209P

B.E. (Hons.) Computer Science

Prepared in partial fulfillment of the


Practice School-I Course
AT
Homi Bhabha Centre for Science Education, TIFR
A Practice School-I station of
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
(23 MAY, 2014 - 17 JULY, 2014)


ACKNOWLEDGEMENTS
Research opportunities and industrial exposure are the primary means for students to
understand and appreciate the practical applications of theoretical concepts. The
successful realization of any project is the outcome of the consolidated effort of a team
comprising mentors and protégés. It is only with their support, guidance, inspiration
and encouragement that any student can achieve his or her goals.
I would never have succeeded in completing my training without the cooperation and
encouragement provided by various people. Firstly, my sincere thanks to the
Gnowledge Lab team for their help during this internship.
I would like to take this opportunity to express my heartfelt gratitude to my project
mentor, Dr. Nagarjuna G., senior scientist and in-charge of the GLab at HBCSE, TIFR,
for his constant guidance and overwhelming support. His wisdom, clarity of thought and
persistent encouragement motivated me not only to take up this project but also to bring
it to its present state. Working with him has been a great learning experience.
I would like to thank my PS-1 faculty in-charge, Dr Bibhas Sarkar, for his constant
support, caring nature and guidance at each stage of the internship. Special thanks to
the student mentor, Mr Akshay Hoshing, for his cordial support and guidance.
This project would not have been possible without the constant guidance and support
of Mr Sunny Chaudhary, Mr Avadoot Nachankar, Mr Kedar Aitawdekar and Mr Dhiru
Singh. A special word of thanks to all my fellow research interns at the institute for their
constant support and willingness to discuss and deliberate on all issues.
A special thanks to my colleague and friend Rohan Badlani, with whom I did this
project. In the spirit of open source development, my gratitude goes to all those developers
who have contributed to this project, and best wishes to all those who will do so in the
future.
Working at the Homi Bhabha Centre for Science Education (TIFR) as a research intern has
been an enriching experience for me, and I would like to express my deep gratitude
towards everyone associated with the project. I look forward to such golden
opportunities in the future.


BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE
PILANI (RAJASTHAN)
Practice School Division

Station: Homi Bhabha Centre for Science Education (TIFR), Mumbai
Centre: Anushakti Nagar, Mumbai
Mentor: Dr Nagarjuna G.
        Developer GNOWSYS, GLab In-charge, HBCSE
        M.Sc. in Biology, Delhi University
        M.A. in Philosophy, Delhi University
        Ph.D. in Philosophy of Science, IIT Kanpur
Duration: 8 weeks
Date of start: 23rd May, 2014
Date of submission: 17th July, 2014
Title of project: Wikidata project for MetaStudio
Name: Aditya Mangla
ID: 2012A7PS209P
PS Faculty: Dr Bibhas Ranjan Sarkar
Student Coordinator: Mr Akshay Hoshing

Key Words
Wikidata, data dumps, data harvesting, incremental dumps, triples, N-Triples, Turtle
triples, JSON, RDF, NDF architecture, knowledge graph, MetaStudio, GitHub,
Python, Django, MongoDB, django-mongokit, D3JS, Topic, Theme, Freebase, Yago
database.
Project Areas
The entire project is based on open source development and is a part of the
MetaStudio platform. The areas are as follows:
Website development
- Front end: HTML5, JavaScript, CSS, D3JS
- Back end: Django, Python, MongoDB
Database handling
- django-mongokit, MongoDB

Algorithms
- Iterative algorithms
- Recursive algorithms
- Depth-first search algorithm

Abstract :
The aim of the project was to harvest an open-source data dump such as Wikidata or
the Yago data dump into the MongoDB structure used by MetaStudio. The project
involved harvesting big data from online data stores in a dynamic manner through a
Python script, which was to be run in a robust and optimized form on the dedicated
server.
A log file was to be maintained throughout the run of the script to keep track of all
errors and exceptions (if any) thrown during its lifetime.
Subsequently, a Django app called Wikidata was to be developed to provide an
intuitive front-end interface for MetaStudio users to access and browse the harvested
data.
Finally, a knowledge graph was to be constructed from the harvested data for easy
visualization using D3JS.

_________________________________
PS-1 Coordinator, HBCSE (TIFR)
BITS Pilani

_________________________________
Project In-charge, HBCSE (TIFR)
Mentor In-charge

_________________________________
Software Developer, GLab
Mentor


TABLE OF CONTENTS

ACKNOWLEDGEMENTS -------------------------------------------------------- ii
ABSTRACT ---------------------------------------------------------------- iv
1. Introduction ----------------------------------------------------------------------------- 12
1.1 About the Institute -------------------------------------------------------------------- 12
1.2 Project MetaStudio -------------------------------------------------------------------- 13
1.3 MetaStudio Framework --------------------------------------------------------------- 16
1.4 Motivation behind the Project ------------------------------------------------------- 18
1.5 Aim of the Project --------------------------------------------------------------------- 19

2. Contents ---------------------------------------------------------------------------------- 20
2.1 Previous work done in data harvesting in MetaStudio ---------------------------- 20
2.2 Approach to the Project --------------------------------------------------------------- 21
2.3 Design Steps ---------------------------------------------------------------------------- 24
2.3.1 Choice of data dump ----------------------------------------------------------- 24
2.3.2 Choice of mapping ------------------------------------------------------------- 27
2.3.3 Choice of algorithm ------------------------------------------------------------ 29
2.4 Timeline of the project ---------------------------------------------------------------- 31
2.5 Final contribution to MetaStudio ---------------------------------------------------- 34
2.6 Screenshots ----------------------------------------------------------------------------- 35
2.6.1 iterative_script running on local host ---------------------------------------- 35
2.6.2 Front end of Wikidata app ----------------------------------------------------- 37
2.7 Technical skills acquired -------------------------------------------------------------- 41
2.8 Life skills acquired -------------------------------------------------------------------- 44
3. Conclusion -------------------------------------------------------------------------------- 45
4. Future Scope ------------------------------------------------------------------------------ 46
5. Bibliography ------------------------------------------------------------------------------ 47
6. References -------------------------------------------------------------------------------- 48
7. Glossary ----------------------------------------------------------------------------------- 49


TABLE OF FIGURES

Figure 1: MetaStudio class hierarchy from Node ---------------------------------------- 16
Figure 2: MetaStudio class hierarchy from Triple --------------------------------------- 17
Figure 3: Basic approach for the project (3-step approach) ----------------------------- 22
Figure 4: Freebase logo ------------------------------------------------------------------- 23
Figure 5: Yago and Wikidata logos ------------------------------------------------------- 24
Figure 6: Tag cloud representing Big Data ----------------------------------------------- 25
Figure 7: 400 MB text file required to harvest hundreds of GB of Wikidata ----------- 25
Figure 8: JSON of Japan ------------------------------------------------------------------- 26
Figure 9: Structure of harvested data ----------------------------------------------------- 27
Figure 10: Specific example of the class hierarchy -------------------------------------- 27
Figure 11: Depth-first search -------------------------------------------------------------- 28
Figure 12: Script running on local machine ---------------------------------------------- 35
Figure 13: Log files created after the script has run ------------------------------------- 36
Figure 14: Wikidata app on MetaStudio -------------------------------------------------- 36
Figure 15: Number and names of all objects harvested by the script ------------------- 37
Figure 16: Hover and click on objects to view their details ----------------------------- 37
Figure 17: Display details of an object --------------------------------------------------- 38
Figure 18: View of object details --------------------------------------------------------- 38
Figure 19: Tag-based navigation ---------------------------------------------------------- 39
Figure 20: Quoting reference/source of information and license ----------------------- 39
Figure 21: Knowledge graph -------------------------------------------------------------- 40
Figure 22: Location ------------------------------------------------------------------------ 40
Figure 23: Python and Ubuntu logos ----------------------------------------------------- 41
Figure 24: GitHub and MongoDB logos -------------------------------------------------- 41
Figure 25: Wikidata logo ------------------------------------------------------------------ 42
Figure 26: Life skills and soft skills ------------------------------------------------------ 43


HOMI BHABHA CENTRE FOR SCIENCE EDUCATION

ABOUT THE INSTITUTE


Established in 1974, the Homi Bhabha Centre for Science Education (HBCSE) is a National
Centre of the Tata Institute of Fundamental Research (TIFR), Mumbai. The broad goals
of the Centre are to promote equity and excellence in science and mathematics
education from primary school to undergraduate college level, and to encourage the
growth of scientific literacy in the country.
Activities:
To these ends it carries out a wide spectrum of inter-related activities, which may be
viewed under three broad categories:
- Research and Development
- Teacher Education and Science Popularisation
- Olympiads, NIUS and other Students' Nurture Programmes
Three decades after its inception, HBCSE has emerged as the premier institution in the
country for research and development in science, technology and mathematics
education. It is India's nodal centre for Olympiad programmes in Mathematics, Physics,
Chemistry, Biology and Astronomy.
Graduate School:
HBCSE (TIFR) runs a Graduate School in Science Education. Students admitted to the
HBCSE Graduate School work towards the Ph.D. degree of the Tata Institute of
Fundamental Research (TIFR), which is a Deemed University.


PROJECT MetaStudio
(A web portal for making, sharing and seeking knowledge)
Building the ship while sailing on it
Vision: A free, open-source platform to MAKE, SHARE and SEEK.

About: It is a collaborative workspace for shaping education and research. Here
we design and execute a virtual peer-to-peer academy to make learning and
research a way of life. Another mission of the project is to inculcate a scientific
temperament in the masses, drawn from diverse backgrounds.

Involve Teachers, Students and Citizens
You can discuss, design and construct various educational initiatives collaboratively
and create a social network with members of a community by (a) campaigning about
your educational initiatives and (b) broadcasting your work to the world through blogs,
articles, shared resources and responses.

User Groups for Collaboration
Groups are an easy way to share content and conversation, either privately or with
the world. Many times a group may already exist within metastudio.org for a
specific project or topic. In case you can't find one that you like, feel free to start your
own and invite others.

Design Close-To-Life Learning Contexts
You can do this by involving any citizen in a research project or by designing a
project that provides concrete learning experiences to participants. It is the vision of
the creators of MetaStudio to embark on several long and extensive Citizen Science
Projects, to reinstill in the public mindset the belief that science can and should be
an integral part of every citizen's life, for the betterment of society and human life.
This ensures the spread of scientific temperament to every corner of the country
and eventually the world.


WHAT IS METASTUDIO ?
This is an initiative of the Homi Bhabha Centre for Science Education, TIFR,
Mumbai, India for establishing collaboration among students, teachers, researchers
or anyone else interested, to shape education and research in strikingly different
ways. Why the platform is called "metaStudio" is described in the article
metaStudio. However, joining this portal as a registered user does not imply
that you endorse the idea of studio-based education.

What facilities does this platform offer?

The site offers a multitude of features for collaboration of all sorts:

- a wiki: you can create wiki-style pages collaboratively on topics and subjects of
your choice.
- form a group: a private, public, invite-only, visible or invisible group.
- send messages to members within the group (social networking).
- write about a topic, inviting discussion.
- ask questions and also respond to questions asked by others (you will get points
for your contributions).
- upload useful resources such as lessons, documents, pictures, videos, articles,
software, etc. (you will get points for your contributions).
- post announcements of events as well as reports on them, keeping everyone in the
group up to date with experiments, observations, hypotheses and results in the
scientific world online.
- create a profile of your own and upload your bio-data, almost as on a social
networking platform. This ensures transparency and trust in the scientific
community, also making the user experience a lot more vibrant and dynamic.
- write blogs on topics of your interest.
- start collaborative research projects on any area of interest under the Creative
Commons license.
All in all, it is the vision of the makers of MetaStudio to make it a complete
package serving the primary purpose of a common platform for scientific
interaction, sharing and learning. At the same time, this path-breaking
initiative aims to be different in its approach by including many user-friendly
and attractive features, taking the experience of a school, a college, a science
lab or even a natural observatory and putting it online, all for free. This way
science learning will no longer be limited by the physical barriers of time and space,
and will reach the grassroots of humanity, bringing people closer
irrespective of their diversity.

Site Protocol: Creative Commons using open standards

As a project inspired by the free software philosophy, this site encourages
members to adhere to principles that protect freedom, sharing, collaboration
and socially acceptable protocols.

When you upload resources (digital documents and software), please ensure that
you are uploading them under the Creative Commons license, another copyleft
license, or into the public domain.

Another essential point to remember in the open source community is that, even
though all information, code and data is open for access, it is ethically and legally
mandatory to cite references and acknowledge the source of any open
source information.

It is also important to note that the Wikidata dumps published by Wikimedia
are released under the Creative Commons CC0 license, so the appropriate
reference, acknowledgement, and terms and conditions of usage
have to be cited wherever necessary.


METASTUDIO FRAMEWORK
The MetaStudio framework is an NDF (Node Description Framework), in which a
generic class called Node describes the basic structure of the objects present on the
website. There is a dense object-oriented architecture.


Fig 1 MetaStudio class hierarchy of Node class


Fig 2 MetaStudio class hierarchy of Triple class

There is also a separate generic class called Triple. Two classes inherit from
Triple, namely GRelation and GAttribute. A Triple is based on the concept of
defining a subject and its associated value (be it the value of an attribute or a
relation type).
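To make the hierarchy above concrete, the following is a minimal Python sketch of the idea, not the actual MetaStudio models.py: a generic Node, a generic Triple, and the two Triple subclasses GAttribute and GRelation that attach a literal value or a related object to a subject. Field and parameter names here are illustrative.

class Node(object):
    """Generic base class: every object on the site carries these basic fields."""
    def __init__(self, name, content="", altnames=None, tags=None):
        self.name = name                  # display label of the object
        self.content = content            # description text
        self.altnames = altnames or []    # alternative names / aliases
        self.tags = tags or []

class Triple(object):
    """A statement about a subject node: <subject> <predicate> <value/object>."""
    def __init__(self, subject):
        self.subject = subject            # the node the statement is about

class GAttribute(Triple):
    """The subject has an attribute of some AttributeType with a literal value."""
    def __init__(self, subject, attribute_type, value):
        Triple.__init__(self, subject)
        self.attribute_type = attribute_type
        self.object_value = value         # e.g. topic_id = "Q17"

class GRelation(Triple):
    """The subject is linked to another node through a RelationType."""
    def __init__(self, subject, relation_type, right_subject):
        Triple.__init__(self, subject)
        self.relation_type = relation_type
        self.right_subject = right_subject  # e.g. Delhi --(capital of)--> India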


MOTIVATION BEHIND THE PROJECT


The motto of MetaStudio right from its inception has been Make, Share, Seek.
Through this project the Seek aspect of MetaStudio is to be enhanced. The vision
behind the conception of a platform like MetaStudio is greater collaboration
between mentors and protégés in the field of science. But a website like this can
only attract learners and the general public if its database holds enough
science-related information that even beginners can spend some quality
time on the website and boost their own knowledge.
Let's consider a scenario: a student is reading recently uploaded observations
about the Koel bird's nesting patterns, posted by a prominent scientist on
MetaStudio, and wishes to know more about the Koel, or perhaps doesn't
know anything about the Koel at all. He or she will then do the most obvious thing:
search for that object on Wikipedia or a search engine like Google. That is indeed a
drawback of the site, and here lies the importance of the project.
It also needs to be appreciated that no popular site provides (so far) the facility to
visualize an open source knowledge graph that highlights the relations existing
between the multitudes of objects on the web (which continue to grow at
an unimaginably high speed).
Now, if MetaStudio itself had a collection of data like Wikipedia, the user would
not have to go to any other website to explore information about an object. That is
what we plan to do: harvest the data from an open source data dump and construct
an intuitive knowledge graph from it so as to support easy visualization of the data. This
way the user will not only get all open source information regarding an item such as
the Koel (including the Wikipedia link and a page preview for the same) but also the
position of this object in the knowledge map. Hence, as a result of our project, the user
not only learns the meaning of an object but also gets to know
the relative position of that object in the overall scheme of things. As a result
of this display, the user can well appreciate the kinds of relationships that exist
between objects and various aspects of these relations. Such graphs between objects
are the driving principle behind many social networking platforms and even the
page-ranking algorithms used by search engines like Google, and they form the basis
of the Semantic Web envisioned and popularized by Tim Berners-Lee.

AIM OF THE PROJECT


1) The aim of the project was to harvest an open-source data dump such as Wikidata or
the Yago data dump into the MongoDB structure used by MetaStudio. The project
involved harvesting big data from online data stores in a dynamic manner through a
Python script, to be run in a robust and optimized form on the dedicated server.
2) A log file was to be maintained throughout the run of the script to keep track of all
errors and exceptions (if any) thrown during its lifetime.
3) Subsequently, a Django app called Wikidata was to be developed to provide an
intuitive front-end interface for MetaStudio users to access and browse the harvested
data.
4) Finally, a knowledge graph was to be constructed from the harvested data for easy
visualization using D3JS.


PREVIOUS WORK DONE IN THE FIELD OF DATA HARVESTING IN METASTUDIO

A group of students worked on data harvesting in 2012-13, in which they tried to
harvest Yago data into PostgreSQL. The database of MetaStudio at that time was
maintained in PostgreSQL, a typical SQL database.
A Python script had been written by the students and the limited objective at that
time was achieved. It must be noted that considerable changes have since taken
place in the framework and the code of the MetaStudio open source project. Most
importantly, the choice of database shifted from PostgreSQL to MongoDB, an
open source NoSQL data store. Hence the old script has been rendered
redundant and there is a need to approach this problem in a new and robust
manner.
A few limitations to note are:
1) The database was subsequently changed to a NoSQL store, namely MongoDB.
2) As is characteristic of any open source project, the complexity of the use cases
and the data flow diagrams increases with time. The script coded previously was
quite simplistic, but the same task now needs greater coding skill and time, as the
system design has become considerably more complex and the choice of database
has shifted to an open source NoSQL database, i.e. MongoDB. Essentially, the
complexity of the mapping between the data dump and the MongoDB structure has
increased.
3) The group had not included code to process and include incremental data dumps
available in the future.
4) There was no user interface developed by the group to make any real sense of the
harvested data. As a result, even though the data was being harvested, it could not
really be seen or utilized.
Documentation is available on http://www.beta.metastudio.org
The project was done under the guidance of Mr Dheerendra Singh.


APPROACH TO THE PROJECT (OUTLINE)

After several brainstorming sessions, the following 4-step approach was developed
for the problem statement at hand.

Step 1: Write a Python script to harvest the big data available in any of the
data dumps and store it in the MongoDB structure being used in MetaStudio.

Step 2: Write a log script to keep track of exceptions, errors and the overall
progress of the harvesting script. The logging code is also Python, and the log file
is created dynamically when the harvesting script is run (a rough sketch of such a
logger follows this list).

Step 3: Develop a Django app called Wikidata (the app's member_of has to be
the metaType GApp, as it is supposed to appear as a tab) and develop the
front-end template so as to provide an intuitive interface for the user to access
and utilize the data harvested from Wikidata.

Step 4: Develop a knowledge graph based on the data harvested from Wikidata
and incorporate it as a display option in the Wikidata app itself. This takes the
entire user experience of the website to a whole new level. As a research- and
study-oriented topic, the knowledge graph also provides a perfect case study for
ontology and the study of semantic relationships, and it forms the basis of
the Semantic Web. The project thus adheres to the fundamental ideology and core
beliefs on which the entire MetaStudio project is being developed. This is also
what uniquely identifies and highlights the very purpose of this project: to be able
to visualize the relationships between various objects.
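As a rough illustration of Step 2, the sketch below shows one way such per-run logging can be written in Python. The file name, function name and the process_object() call are hypothetical; the actual log_script.py in the repository may be organised differently.

import logging

def get_harvest_logger(iteration, log_dir="."):
    """Create a logger that writes one log file per harvesting iteration."""
    logger = logging.getLogger("wikidata_harvest_%d" % iteration)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler("%s/harvest_iteration_%d.log" % (log_dir, iteration))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

# Illustrative use inside the harvesting loop:
# log = get_harvest_logger(1)
# try:
#     process_object("Q17")                  # hypothetical harvesting call
# except Exception as exc:
#     log.error("failed while harvesting Q17: %s", exc)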


3 Fundamental Steps in the Project

Fig 3 Basic Approach for project(3 step approach)


DESIGN STEPS
1) Choice of data dump
2) Choice of mapping
3) Choice of algorithm

Choice of Data Dump
The most crucial choice before starting the project is the choice of data dump. All
data dumps are essentially triples in one form or another, with minor differences in
organization, content, amount of information, etc.
A choice had to be made among the three most popular data dumps:
1) Freebase API: an open source project founded in 2004 and acquired by Google in
2010. It provides RDF data in a serialized N-Triples format.

Fig 4 Freebase (Google) logo

2) Yago database: a semantic knowledge base derived from Wikipedia, WordNet and
GeoNames, processed and structurally formatted by the Max Planck Institute for
Informatics in Germany.
Format: processed data available in XML format, with fields and specific values.
Interestingly, all previous work in this field for MetaStudio had been done using this
as the primary database. This data dump is pre-processed and structured, so it would
be easier to harvest, but the consensus was that it would be better to harvest raw data
from a source like Wikidata.

Fig 5 Yago datadump and Wikidata logo

3) Wikidata: a free knowledge base that can be read and edited by humans and
machines alike. It is for data what Wikimedia Commons is for media files: it
centralizes access to and management of structured data, such as interwiki
references and statistical information.
Format:
Q<id_of_topic>, e.g. Q1 - universe, Q100 - Boston
Advantages:
1. Only the list of all topic ids is required. There is no need to process data in RDF
format in order to harvest it.
2. Wikidata also provides updated dumps along with statistical information
about the items.
The final choice of data dump was Wikidata, for the following reasons:
a) As per our discussion with Prof. GN, Wikidata is one of the biggest and most
extensive databases, and that justifies the choice. Interestingly, that's not all:
it was during the course of our groundwork research for the project that we realised
that, unlike with Yago, the entire database need not be downloaded. Gone
are the days of first downloading hundreds of GB of data onto the servers and then
processing the big data.
b) All we need is a file containing a list of all objects, like Q2 - Earth, Q100 - Boston, etc.,
and a working internet connection, and our script is GOOD TO GO! The system holds
just a 400 MB file containing object ids.
Herein lies the advantage of Wikidata. All information regarding the triples, which are
basically relations and attributes, is found in JSON files available online at URLs like
http://www.wikidata.org/wiki/Special:EntityData/Q17.json


c) So, to access the data for an object, all we need to do is access the URL dynamically
and start parsing its JSON using built-in Python modules like urllib2, json and
csv (csv is used to parse the text file).
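As a minimal sketch of this access pattern (Python 2 style, since the report names urllib2; the real script adds error handling and logging), fetching and parsing one item's JSON looks roughly like this:

import json
import urllib2

def fetch_entity(qid):
    """Download and parse the JSON for one Wikidata item, e.g. qid = 'Q17'."""
    url = "http://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
    response = urllib2.urlopen(url)
    data = json.load(response)
    # The payload is keyed by the item id and holds labels, descriptions,
    # aliases, claims and sitelinks.
    return data["entities"][qid]

# entity = fetch_entity("Q17")                       # Japan
# english_label = entity["labels"]["en"]["value"]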

Fig 6 Tag cloud representing Big Data

It is as a result of this choice that we can proudly say: big data is BIG no more.

Fig 7 The only text file, of around 400 MB, needed to harvest hundreds of GB of data


Choice of Mapping
Basically, there was the obvious need to map the fields of the Wikidata JSON to the
MongoDB structure of MetaStudio. For example:
- Aliases (from the Wikidata JSON) -> altnames (of MongoDB class Node)
- Label (extracted in English from the Wikidata JSON) -> name (of MongoDB class Node)
- Descriptions (extracted in English from the Wikidata JSON) -> content (of MongoDB class Node)
- Q<id> (from the Wikidata JSON) -> an attribute called topic_id (via an AttributeType called topic_id)
- Globe coordinates -> stored as standard GeoJSON in location (of MongoDB class Node)
All the relations and attributes are present as claims in the Wikidata JSON. For the
attributes that could not be harvested directly, a suitable AttributeType was created
first, and then the attribute itself.
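A small sketch of this mapping, using the field names from the list above (the actual create_GSystem() call and field types in system_script2.py may differ):

def map_entity_to_node(entity, qid):
    """Pick the English label, description and aliases out of a parsed entity JSON."""
    labels = entity.get("labels", {})
    descriptions = entity.get("descriptions", {})
    aliases = entity.get("aliases", {})
    return {
        "name": labels.get("en", {}).get("value", qid),           # Label -> name
        "content": descriptions.get("en", {}).get("value", ""),   # Descriptions -> content
        "altnames": [a["value"] for a in aliases.get("en", [])],  # Aliases -> altnames
        "topic_id": qid,                                          # Q<id> -> topic_id attribute
    }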
The JSON of Japan (Q17):
http://www.wikidata.org/wiki/Special:EntityData/Q17.json

Fig 8 JSON of Japan



Fig 9 Structure of harvested data in MongoDB

Fig 10 Specific example of the class hierarchy of harvested data


Choice of Algorithm
According to our research, the knowledge graph can be developed from the data
dump using two fundamental algorithmic approaches:
1) Recursive: processing each object and its relations and attributes one at a time,
almost dynamically (on the fly), e.g. following the links of a statement such as
"Ramesh is a student of BITS Pilani". Time complexity: O(n).
2) Iterative: first creating nodes for all objects, then processing all relationships
and attributes for each object. Time complexity: O(n²). Quite simple and intuitive.
For Wikidata, the recursive algorithm is the more optimized of the two, as it works
in a DFS-like manner in linear time.

The problem with the recursive algorithm:
The recursive algorithm works in a depth-first search (DFS) like manner. It keeps
going deeper and deeper into the tree until it bottoms out at a leaf and only then
starts returning.

Fig 11 Depth First Search


In the recursive algorithm the STACK OVERFLOWS!

This was one of the most interesting moments of the project: a completed piece of
code fails because Python's stack size allows only about 1000 recursive calls.
We suggested using Stackless Python, but the idea of including newer
imports and dependencies was unanimously rejected.
A list of close to 6 objects was leading to the creation of more than 1200 objects;
such is the density of the knowledge network available in the Wikidata dumps.
The code is available as final_script.py and works for a small number of objects,
and only for certain objects: if the objects are too densely entangled in the graph,
the call stack will overflow.
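The sketch below illustrates this failure mode rather than the real final_script.py: parents_of() is a hypothetical stand-in for reading an item's instance-of/subclass-of claims, and on a densely linked item the chain of pending calls exceeds CPython's default recursion limit.

import sys

limit = sys.getrecursionlimit()       # typically 1000 in CPython

def parents_of(qid):
    """Hypothetical stand-in: return the instance-of / subclass-of targets of qid."""
    return []                         # the real script reads these from the item's claims

def harvest_recursive(qid, seen):
    """Create the node for qid, then descend into every ancestor class before returning."""
    if qid in seen:
        return
    seen.add(qid)
    # ... create the GSystem for qid here ...
    for parent in parents_of(qid):
        # Each nested call adds a stack frame; once the pending chain passes the
        # limit, Python raises "maximum recursion depth exceeded".
        harvest_recursive(parent, seen)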

THE FINAL MODIFIED ALGORITHM

The final algorithm is a combination of the two approaches and uses 3 iterations to
harvest the data completely. There is a separate log file for each iteration.
1) Iteration 1: the file is processed and objects are created, along with
attribute types and attributes.
2) Iteration 2: the file is processed again; this time the objects have already been
created, so this iteration generates the entire hierarchy of instance-of and subclass-of
relations.
instance of (P31) is stored in member_of.
subclass of (P279) is stored in type_of.
It is in this iteration that the code takes a DFS-like approach, so every GSystem's
member_of holds a list of the ObjectIds of its class and of all its ancestors, and
every GSystemType's type_of holds a list of the ObjectIds of all its ancestor classes.
3) Iteration 3: in the third iteration the file is processed to create relations and
relation types, as in:
Delhi --(capital of)--> India
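A skeleton of this three-pass structure is sketched below; the helper names are illustrative placeholders for the functions split across system_script2.py and iterative_script.py.

def create_gsystem_with_attributes(qid):
    """Iteration 1: create the GSystem for qid along with its attribute types and attributes."""
    pass  # delegated to the database functions in system_script2.py

def link_class_hierarchy(qid):
    """Iteration 2: fill member_of from instance of (P31) and type_of from subclass of (P279),
    descending depth-first so each node holds the ObjectIds of all its ancestor classes."""
    pass

def create_relations(qid):
    """Iteration 3: create relation types and relations, e.g. Delhi --(capital of)--> India."""
    pass

def harvest(object_ids):
    for qid in object_ids:            # pass 1: objects and attributes
        create_gsystem_with_attributes(qid)
    for qid in object_ids:            # pass 2: class hierarchy
        link_class_hierarchy(qid)
    for qid in object_ids:            # pass 3: relations between existing nodes
        create_relations(qid)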

TIMELINE OF THE PROJECT

The duration of PS-1 was close to 8 weeks (23rd May, 2014 - 17th July, 2014).
The following is a brief account of the progress made by me, documented week by
week.

Week 1 (23rd May, 2014 to 29th May, 2014): The first 2 days were primarily used
for orientation programmes. This included an enriching talk about the organization
in general (HBCSE, TIFR) and the GLab in particular, an open source lab
established on the principles of open source development laid down by notable
names like Richard Stallman and Linus Torvalds. The project topics and the required
skill sets were also discussed in depth.
In the following week there were sessions held by the PS-1 instructor, Dr Bibhas
Sarkar, and the student mentor, Mr Akshay Hoshing, to enlighten us about the
guidelines regarding the PS-1 program.
In the first week itself, cordial discussions were organised with our peers regarding
every project. The details of every project were deliberated on at length with our
mentor, Dr Nagarjuna, as was the required skill set of every project.
Most importantly, this was the period when most of us were beginning our
journey into the vast and amazing world of open source development. The Ubuntu
operating system was to be installed by all. In the words of our extremely
accommodating mentor, the choice of project was to be made based on the level of
interest and enthusiasm; everybody's opinion was accommodated, queries were
resolved, and only then was the allocation process completed. Some concepts that
were clarified were:
1) All students at HBCSE (TIFR) for PS-1 would be working on an open source
project, which is basically a science learning platform following the NDF
architecture.
2) The common skill set for almost all projects included Python programming,
the Django framework, the MongoDB NoSQL database and JavaScript (for front-end
work, if any).
3) All projects added features or functionalities to the MetaStudio framework,
which was already live at that time. So it was quite motivating to know that all our
work would finally contribute to a real-life, live project up on the internet for
everybody to see.

Week 2 (30th May, 2014 - 6th June, 2014): The entire batch of interns started off
by understanding the fundamental aspects and principles of the open source
project MetaStudio. Some essential and common skills had to be acquired
before any real contribution to the code could happen, so we started by
primarily mastering the following concepts and skills:
1) The Python language: an open source, object-oriented, high-level language
created by Guido van Rossum. I referred to multiple sources like books, official
documentation and blogs to acquire a working knowledge of the language.

2) The Django framework: a robust and popular open source web framework written in
Python. The whole of MetaStudio is built using the Django framework, and
I followed the official Django documentation to learn how to develop apps in this
framework. This was a relatively challenging but immensely satisfying task.
Learning the Django framework involved making apps, running the development server
on the local host, and so on. This part of the project led to greater learning of back-end
web development, interaction with the database, administrator management and query
optimization.

3) MongoDB: this aspect deserves special attention, as NoSQL databases are one of
the most recent developments in modern computer science. The concept, though
covered briefly in class, is still quite challenging and interesting. I got to pursue my
academic interest in databases by covering this hitherto new area of NoSQL data
stores. There are various NoSQL data stores available for use, like Cassandra and
MongoDB. The database being used in MetaStudio is MongoDB, and this proved to
be a very enriching and valuable part of the PS. It should be noted that NoSQL
databases are schema-less and hence very flexible, and query execution on them is
highly optimized and dynamic.

4) The MetaStudio framework: understanding the existing code is an essential,
indispensable step before contributing to any open source project. It is quite a
herculean task to attempt to understand such a vast project with innumerable
features, and many more being added each day, but that effort made me realise the
true meaning and hidden power of open source development. This step involved
understanding models.py, and I primarily focussed on understanding the use cases
of the project and the data flow diagram. It was at this stage, in the words of
Prof. GN, that we could truly say: "Open source development is like building a ship
while sailing on it."
This also contributed to a sense of fulfilment of working not on a theoretical or
redundant project but on a live, growing, up-to-date project led by such a wonderfully
creative, supportive and ever-helpful mentor.

Week 3 (7th June, 2014 - 13th June, 2014): The momentum had now been
gathered and all of us were confident about the true extent and requirements
of our respective projects. It was now that we faced another unexpected challenge,
called Git.
Developed by the legendary programmer and 'messiah' of the open source
movement, Linus Torvalds, Git is what is essentially called a VCS (Version
Control System). It enables multiple contributors/developers to contribute
simultaneously to an ongoing project from their own locations without any physical
contact whatsoever. The interesting part is that each change, however small and
insignificant it may seem, is always stored as a version (think of it like a snapshot at a
moment in time), and as the changes keep coming, the author or administrator of
the project may, at any instant, revert to an older version. This way the
system becomes foolproof: no code is ever lost, as each change is recorded as a
version. It also maintains records of contributions by each developer, hence giving a
true picture and almost the entire life story of any project. It has several variations
and various other features. We worked on GitHub, and it is at this stage that we
learnt its terminology: repository, clone, fork, push, pull, etc.

Week 4 (14th June, 2014 - 20th June, 2014), PHASE 1: The first part of the project
involved studying the various data dumps available and choosing the one that best
suited our needs. The essential choice was among:
1) the Yago database
2) Freebase
3) Wikidata
Samples from all three were studied, and after multiple deliberations the final choice,
by consensus, was Wikidata.

The concept of triples as <subject> <predicate> <object> was understood. Work on
the basic script, called system_script2.py, was started; it handles all the query-based
methods and is the script that interacts with the MongoDB database.

Week 5 (21st June, 2014 - 27th June, 2014), PHASE 2: The next crucial step was to
start coding the script to do the JSON parsing from the Wikidata URLs and pass all the
harvested data in a systematic manner to the functions in system_script2.py. I
started work on the script that would eventually run on the server, called
final_script.py. One major hurdle here was the choice of algorithm.
We had 2 major options:
1) the iterative algorithm
2) the recursive algorithm
The general consensus was that the recursive algorithm, which was theoretically
supposed to execute in a depth-first-search-like fashion, was the more sophisticated
one, giving O(n²) for the Yago database but only O(n) for Wikidata.
It was GN's idea to try out the recursive algorithm, if not for the actual exercise then
as a test of our theoretical knowledge, and to add to our pursuit of academic
excellence.
As expected, the code ran fine until we faced a stack overflow, because Python's stack
size is limited to about 1000 recursive calls and our script was running on far more
complex data.

Week 6 (28th June, 2014 - 3rd July, 2014), PHASE 3: The algorithm was changed to
the iterative one, but it was realized that a combination of the two was needed to
optimize and successfully achieve the aim of the project. While we were facing
problems with the script and the subsequent changes, it was GN sir who suggested
the use of a log file, written by another Python script that keeps track of all changes
that happen once the harvesting script is run. These file-handling functions were
written in a file called log_script.py. Work on the Django app called Wikidata was
also started.


Week 7 (4th July, 2014 - 10th July, 2014), PHASE 4: The groundwork was now all
done, as all basic portions had been coded and tested. It was around this time that our
mentor pointed out some gaping flaws in the system design, so the code needed
restructuring. The script final_script2 gave way to the final script, called
iterative_script.py, with a finally working algorithm. Server access to a public IP was
given and the script was run on that server.

Week 8 (11th July, 2014 - 17th July, 2014): The front end of the app was improved by
adding features like tag-based navigation as in Wikipedia, location mapping based on
globe coordinates, and most importantly the concept graph (a kind of knowledge
graph).
Reports, presentations and documentation were prepared and a pull request was
sent to the mentors. This marked a successful end to the project and to the PS-1
program as well.

FINAL CONTRIBUTION TO METASTUDIO

No external dependency is required to run the script or the Django app.
1) system_script2.py: this script has all the functions that interact with MongoDB
and run queries, etc. It has functions like create_GSystem().
2) log_script.py: a supporting script needed to write a log file dynamically, to keep
track of all objects created.
3) iterative_script.py: the final script, which uses the functions of the first two and is
responsible for all the JSON-parsing functions.
4) The complete Django app wikidata: a view (wikidata.py), a URL configuration
(wikidata.py) and a template (wikidata.html).
5) list_of_objects: contains a sample list of 10 objects that anybody can run to test
the script.
6) wikidata_instructions.txt: contains instructions for running the script from the
command line.


SCREENSHOTS
(I) The Python script iterative_script.py running on the local machine

Fig 12 Script running on local machine


Three log files, one for each iteration, and the file list_of_objects on which
iterative_script.py will run.

Fig 13 Log files created after the script has run entirely, and the list_of_objects on
which the script was run

(II) Front-end display of the Django app Wikidata

Fig 14 Wikidata app on MetaStudio

The tab of the Wikidata app; click on this tab.


Fig 15 Number and list of all objects harvested by the script

A search bar has been provided to search by topic name; its functionality still needs
to be integrated with the search team's work.
The script was run to harvest only 7 objects, say; the page displays that 7 objects
were found and then shows the list of those 7 objects.

Fig 16 Hover on an object and click on it to view its details

Fig 17 Display details of an object

A preview of the Wikipedia page

Fig 18 View of Wikipedia page


Tag based navigation

Fig 19 Tag based navigation

Reference and acknowledgement of sources

Fig 20 Quoting reference, source and license


Knowledge map

Fig 21 Knowledge graph

Location on the world map (using OpenStreetMap)

Fig 22 Location


TECHNICAL SKILLS ACQUIRED

Ubuntu: This was my first experience of open source development. Not only was
I introduced to the concept and mantra of open source development, but I also
embraced the Ubuntu OS. Ubuntu is a Debian-based Linux operating system, and I was
introduced to the pleasure of working in the terminal. I have now come to
appreciate how powerful, safe from viruses and efficient Ubuntu can be, and that you
can always modify its code, develop apps for it and contribute to the community.
Most importantly, one and all can use it for free. I also read about various
visionaries of the open source movement and was inspired by the life stories of Linus
Torvalds, Richard Stallman, Aaron Swartz and even Prof. G Nagarjuna, a leading
figure of the free software movement in India. I also learnt about SSH commands for
accessing a server remotely over the internet.

Fig 23 Python and Ubuntu

Python: Python is an object-oriented, high-level language created by Guido
van Rossum. It is an extremely powerful, developer-friendly, close-to-English
language. The amount of documentation present on the internet about
the language is exhaustive, which makes the task of learning Python quite easy.
The number of libraries available in Python makes the programming
experience even easier, as the coder can then focus on the higher-level
requirements of a program rather than a line-by-line reimplementation of the library.
It is indeed a technical skill which is a must for any app developer or project-oriented
coder.


MongoDB: MongoDB is a cross-platform, document-oriented database, an example
of a recent area of research and a current trend in computer science. Working
on a NoSQL data store like MongoDB gave me the opportunity to study an
extension of the database systems course. Traditional databases, called relational
databases, can become extremely slow when the amount of data on the servers
begins to increase; thus, a new type of database, known as NoSQL databases, was
developed. There is no fixed schema for the documents present in MongoDB, or we
can also say that MongoDB documents have a dynamic schema. MongoDB stores its
documents as JSON-style key-value pairs, and it provides high performance, high
availability and automatic scaling.
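A small, generic example of this schema-less behaviour, using the standard pymongo driver rather than MetaStudio's own wrapper code (the database and collection names here are made up):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["demo_db"]["nodes"]

# Two documents in the same collection need not share a schema:
collection.insert_one({"name": "Japan", "topic_id": "Q17", "tags": ["country"]})
collection.insert_one({"name": "Koel", "content": "a bird of the cuckoo family"})

print(collection.find_one({"topic_id": "Q17"})["name"])   # -> Japan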

GitHub: Since this was my first open source project, I learnt how people
sitting in different parts of the world come together to contribute to a single
project. This is done using a platform called GitHub, which uses Git as its
version control system. With the help of Git, people can contribute to a
particular project in such a manner that they do not change the original code:
they can copy the entire code, improve it, and then ask the owner of the project to
accept the changes they have made.

Fig 24 GitHub and MongoDB

MetaStudio: The collaborative open source platform being developed by HBCSE,
meant as a platform for discussion, learning and exploration by science students,
teachers and even common citizens. All BITS students contributed to this
platform in some way or another.

Wikidata: a free knowledge base that can be read and edited by humans and
machines alike. It is for data what Wikimedia Commons is for media files: it
centralizes access to and management of structured data, such as interwiki
references and statistical information.

Fig 25 Wikidata

LIFE SKILLS ACQUIRED

Collaborating towards a common cause: I learnt that even though one must
contribute to the growth of the group, at the same time one should not lose one's
personal qualities and sense of individuality.
In terms of soft skills, this was an enlightening experience. I learnt how to present
my points calmly but firmly, and how to present things in a way that does not
offend other people but at the same time does not undermine my own opinions.
Managerial skills were learnt by the group. Through multiple group discussions,
presentations and brainstorming sessions with our mentors, the dynamics of the
mentor-student relationship grew richer and stronger in our hearts. We learnt
essential life skills like leading by example while still giving every team member the
importance and respect they deserve.
Time management was reinforced as a crucial skill for success, be it in the GDs or in
report submission; PS-1 was a great teacher of effective utilization of time.

Fig 26 Life skills and soft skills



CONCLUSION
The aim of the project was to write a Python script to harvest big data dumps from
an open source data collection like Wikidata or the Yago database and store them
in the MongoDB structure being used for the MetaStudio project.
The next part of the project was to code a log script in Python to maintain a
record of all exceptions, errors or warnings encountered dynamically while the
script runs; all such messages are written into a text file.
Then, from the harvested data, a Django app called Wikidata was developed to
display the harvested big data in an intuitive and comprehensive manner.
This includes navigation through tags and the display of a knowledge graph in which
every item (GSystem or GSystemType) is a node and the links between nodes are the
relationships that exist between these items.
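As an illustration of the node/link structure behind that graph (names and identifiers below are only examples; the actual view code in the Wikidata app may build this differently), harvested items and relations can be flattened into the JSON a D3JS force layout consumes:

import json

def to_d3_graph(items, relations):
    """items: list of (topic_id, name); relations: list of (source_id, label, target_id)."""
    index = dict((qid, i) for i, (qid, _) in enumerate(items))
    nodes = [{"id": qid, "name": name} for qid, name in items]
    links = [{"source": index[s], "target": index[t], "label": label}
             for s, label, t in relations
             if s in index and t in index]
    return json.dumps({"nodes": nodes, "links": links})

# Example: Delhi --(capital of)--> India
# print(to_d3_graph([("delhi", "Delhi"), ("india", "India")],
#                   [("delhi", "capital of", "india")]))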
The algorithm being used is a custom, modified one which is primarily iterative in
nature but contains functions within it that progress in a depth-first-search-like
manner.
The skill set developed during the project includes Python, the Django web
framework, MongoDB, front-end web development (HTML, CSS, JavaScript) and
working with Git. The project also inculcated some essential life skills like teamwork,
confidence, and comprehensive articulation both in writing and in speaking.
Most importantly, the experience of PS-1 gave me an opportunity to explore the
application of classroom knowledge to real-life, live projects, be it in the areas of
database management systems, query optimization, data structures and
algorithms, data mining or even operating systems (OS). It gave me an
opportunity to work in a leading centre for computer science research, understand
its pedagogy and work culture, meet leading researchers and scientists, work on
state-of-the-art platforms and explore cutting-edge technology. Last but not least,
it gave me an opportunity to explore my own interests in the field of computer
science and showed me my weaknesses and strengths. This has surely proved to be
much more than an internship and will definitely be cherished fondly as a life
experience.


FUTURE SCOPE
It is a defining characteristic of any good and successful open source project that it
clearly identifies its flaws and limitations and lays out a vision for the next
programmer/contributor, so that the voyage continues and the platform keeps
developing. I have tried to do the same.
1) Need for greater integration
A tag-cloud-like feature or other D3JS data representation schemes, such as bar
graphs, may be integrated into the front end to lead to a better understanding of
the data. Humans respond better and faster to visual representations of data, so
the whole user experience will be taken to a whole new level if the harvested data
(big data) is displayed in an intuitive fashion.
A search bar has been provided by us in the UI, and the search team's work should
be integrated into it so as to make accessing objects easier and faster.
Both these things lie outside the scope of our project.

2) The script running on the server can be modified slightly to work in an
incremental fashion, so as to update the database periodically. The required
parameters have been passed. This is an extension of the project.

3) A possible extension is to create a UI based on filters and options in which the
user can select options to see certain pages of Wikidata and explore them in a
well-thought-out manner, for example:
Theme 1: Topic 1, Topic 2, ..., Topic N
Theme 2: Topic 1, Topic 2, ..., Topic M


BIBLIOGRAPHY
1) Beginning Python: From Novice to Professional, by Magnus Lie Hetland
2) www.djangoproject.com - the official documentation of Django:
https://docs.djangoproject.com/en/1.6/intro/tutorial01
https://docs.djangoproject.com/en/1.6/intro/tutorial02
https://docs.djangoproject.com/en/1.6/intro/tutorial03
https://docs.djangoproject.com/en/1.6/intro/tutorial04
https://docs.djangoproject.com/en/1.6/intro/tutorial05
https://docs.djangoproject.com/en/1.6/intro/tutorial06
3) Python video tutorials by New Boston
4) www.freebase.com - Freebase API and official documentation
5) www.wikidata.org/wiki/Wikidata:Main_Page
6) dumps.wikimedia.org
7) www.tutorialspoint.com - for Python and MongoDB
8) www.github.com - GitHub
9) www.d3js.org - official page for Data-Driven Documents (D3), a powerful
JavaScript library for creating knowledge graphs and other visual representations
of data
10) https://bost.ocks.org/mike/ - tutorials and samples of D3JS code
11) https://github.com/peterbe/django-mongokit - open source documentation
of django-mongokit

REFERENCES
[1] Singhal, Amit (May 16, 2012). "Introducing the Knowledge Graph: Things,
Not Strings". Official Google Blog. Retrieved May 18, 2012.
[2] Waters, Richard (May 16, 2012). "Google To Unveil Search Results
Overhaul". Financial Times. Retrieved May 16, 2012.
[3] http://en.wikipedia.org/wiki/Turtle_(syntax)
[4] http://en.wikipedia.org/wiki/N-Triples
[5] http://en.wikipedia.org/wiki/Semantic_Web
[6] http://www.wikidata.org/wiki/Special:EntityData/Q17.json


GLOSSARY
Data dumps: downloadable versions of big data, available in a specific format and
maintained as different versions according to updates in the data. These are
available from various sources in various formats, like the Yago database, Freebase,
Wikidata, etc. The choice of data dump is a crucial decision that had to be taken
before starting the coding part.
Knowledge graph: a semantic methodology to structure data as a collection of
nodes joined by links (which are relationships). The Knowledge Graph is also a
knowledge base used by Google to enhance its search engine's results with
semantic-search information gathered from a wide variety of sources.
Knowledge Graph display was added to Google's search engine in 2012, starting in
the United States, having been announced on May 16, 2012.[1] It provides
structured and detailed information about the topic in addition to a list of links to
other sites. The goal is that users should be able to use this information to resolve
their query without having to navigate to other sites and assemble the information
themselves.[2]


N-Triples: N-Triples[4] is a format for storing and transmitting data. It is a line-based,
plain text serialisation format for RDF (Resource Description Framework)
graphs, and a subset of the Turtle (Terse RDF Triple Language) format.

Resource Description Framework (RDF): RDF is a way of keeping information so that
we (essentially, the machines) are able to make statements about the data. When we
keep information in subject-predicate-object form, it helps provide the user
with the information with greater efficiency and certainty.

RDF triples: Encoded as an RDF triple, the subject and predicate have to be
resources named by URIs. The object may be a resource/URI or a literal element. For
example, in the N-Triples form of RDF, the statements might look like:
<http://www.w3.org/People/EM/contact#me>
<http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me>
<http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:em@w3.org> .
<http://www.w3.org/People/EM/contact#me>
<http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .


Semantic Web: The term was coined by Tim Berners-Lee for a web of data that can
be processed by machines. The internet, ever since its conception, has been a
network of computers comprising primarily human-understandable data, structures
and relations, but with the multitudes of data in recent times we have progressed
into the era of bot users and crawlers. Search engines help to present the desired
data from the vast storehouses of big data available online, and this task is
increasingly implemented using the concept of the Semantic Web. It aims to create a
web of data understandable not only by humans but also by machines. This way the
crawlers, bots and search engines will become a lot faster, more efficient and more
useful, hence reducing wasted human time and effort. This has been a growing area
in computer science, and it borrows ideas from the fields of machine learning, data
mining and, most importantly, Artificial Intelligence (AI).

Triples: <subject> <predicate> <object>: a triple is the smallest irreducible
representation of a binary relationship.
E.g. "Ramesh is a student of BITS Pilani."
Ramesh - subject
is a student of - predicate
BITS Pilani - object

Turtle triples: Turtle[3] was defined by Dave Beckett as a subset of Tim Berners-Lee
and Dan Connolly's Notation3 (N3) language, and a superset of the minimal
N-Triples format.

