IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

A Study on Various Sorting Algorithms in Information Retrieval Systems
Ankita Arora1, D Sumathi2, S.Kannan3
1M.Tech Scholar, 2Professor, 3Associate Professor, Dept of CSE, Malla Reddy Engineering College, Hyderabad
1ankita201706@gmail.com, 2sumathi.research28@gmail.com,36kannan6@gmail.com

ABSTRACT

In recent decades, a huge volume of data has been generated around the world. The design and
implementation of an Information Retrieval System (IRS) is concerned with sorting methods, with organizing the
sorted data in a specific order, and with searching for information in a large collection of records. Retrieving
relevant information accurately and quickly from a large network is a complex task for an IRS. Search engines are
widely used for effective content searching.

Search engines use sorting and searching algorithms to retrieve the web pages that are most closely related to
the searched keywords, so sorting algorithms play a crucial role in retrieving documents efficiently. Considering
the importance of sorting algorithms in Information Retrieval Systems, this paper discusses the origin and
improvement of different sorting algorithms together with their advantages and disadvantages. It also gives an
insight into considering multiple dimensions when sorting microblog data.

Keywords- sorting algorithm, retrieval, document structure, Query, Page Rank

INTRODUCTION
Information retrieval has become a hot research topic in recent years because rapid technological development
produces data of various types from various sources. The components of an information retrieval system are:

Crawling- Documents are fetched by browsing the document collection.
Indexing- An index is built from the documents.
Querying- The user issues a query.
Ranking- Documents that are pertinent to the query are displayed to the user.
Relevance Feedback- After finding the relevant information, the user might give relevance feedback to the search
engine.

Traditional retrieval algorithms use the vector space model, which is based on index terms. Assigning weights
to the terms is called term weighting. Two factors, Term Frequency (TF) and Inverse Document Frequency (IDF), are
used to weight a term: TF reflects how often the term occurs in a document, and IDF reflects how rare the term is
across the collection. When the user enters a query, billions of available web pages might match, and these results
must be ranked so that the most relevant pages are displayed first. Search engines have used several schemes for
ranking the results.
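
The term weighting described above can be illustrated with a minimal sketch. It assumes one common variant (term frequency normalised by document length, and IDF computed as log(N / df)); the function name `tf_idf` and the sample documents are hypothetical and not taken from any of the surveyed systems.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Assumed variant: tf is the term count
    divided by the document length, idf is log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    "sorting algorithms rank web pages".split(),
    "search engines rank web pages".split(),
    "users issue a query".split(),
]
weights = tf_idf(docs)
```

In this toy collection, "sorting" occurs in only one document and therefore receives a higher weight in the first document than "rank", which occurs in two.
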
The Google search engine works on the principle of PageRank, which is based on the Random Surfer model. In
this model, the user moves from one page to another by following links, and jumps to some other page when the
current page offers no information relevant to the search. The PageRank of a page is defined as the probability
that the user is visiting that page. It is computed by analyzing the in-links and out-links of the pages on the
web, and the pages are ranked accordingly.

VOLUME 4, ISSUE 6, NOV/2017 157 http://iaetsdjaras.org/



The HITS algorithm (Hypertext Induced Topic Selection) uses the link structure of the web to identify the
pages that are relevant to a topic. A page that provides relevant content for the query is considered an
authority, and a page that links to many authorities is called a hub. The results returned by a text-based search
engine form the root set. The root set is then expanded with a predefined number of pages that are linked to the
root set, and the expanded set is called the base set. Hub and authority scores are computed on the base set.

There is a difference between a popular page and an authoritative page: a popular page might be highly
referenced without being related to the topic. Pages that point to many of the pages related to the same topic
overlap in their links, and such pages are treated as hubs. Good hubs and good authorities therefore reinforce
each other, and both have to be identified. Hilltop is another algorithm; it first identifies the expert pages for
the query topic and then uses these experts to compute the ranking of the target pages. It returns the relevant
expert pages on the query topic, and the authoritative pages on that topic are then identified by tracing the
out-links of the expert pages.

[Figure: flow diagram with the boxes Information Need, Query, Matching Rules, Query Refinement and Results]

Figure 1: Classical Information Retrieval

Frameworks such as Lucene, Solr and Elasticsearch are used for data retrieval. Solr acts as a wrapper over
the Lucene library and uses Lucene classes to create an index known as an inverted index, which maintains a list
of words, terms or phrases together with the places where they occur. Recently, social networking sites have
generated an exceptional amount of social media data, and this data is immensely valuable to applications such as
event detection, financial microblogs and targeted marketing. Data retrieval sorting algorithms depend on factors
such as the true intention of the user. The most important factor is the set of results obtained for the query,
which must be relevant to the search. This has paved the way for models of multidimensional data retrieval sorting
optimization. Most recently developed microblog IR systems work on the principle of TF-IDF, BM25 and
probabilistic models.
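
To make the inverted index mentioned above concrete, the following minimal sketch maps each term to the documents and positions where it occurs. It is only an in-memory illustration; it does not reproduce the API or storage format of Lucene, Solr or Elasticsearch, and the names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Solr wraps the Lucene library",
    2: "Lucene builds an inverted index",
    3: "Elastic Search also builds on Lucene",
}
index = build_inverted_index(docs)
matches = search(index, "lucene builds")
```

A query is answered by intersecting the posting lists of its terms, so only documents containing all of the terms are returned; ranking the matches would be a separate step.
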

LITERATURE REVIEW
Usually, data retrieval algorithms work on the principle of identifying the frequency and position of the
query words in the documents. Searching can be improved by considering the structure of the document. In
AltaVista, HotBot and Yahoo, the score of a document can be higher when the query words or phrases are found in
the title of the web page [2]. In Lycos, the position at which a query term occurs plays a vital role in the rank
function. The Direct Hit algorithm focuses on data quality and user feedback; its basic idea is to trace the
user's click behavior and dwell time on every search result [3]. PageRank is a link analysis algorithm that
predicts the significance of a document by examining the link structure of a hyperlinked set of documents [4].

The PageRank of a page is divided evenly among its outgoing links. The HITS algorithm relies on the in-links
and out-links of a page to determine the importance of web pages [5] [6]. The T2S algorithm (Top-k by Table Scan)
focuses on computing top-k results efficiently on large amounts of data [7]. The Hilltop algorithm considers a
document that is relevant to the retrieval theme to hold a higher relevance than a document that is irrelevant to
the retrieval theme [8].
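
For orientation, the top-k problem itself can be sketched as a single sequential pass over a table that keeps only the k best rows in a bounded heap. This is a simple illustration of top-k selection, not the T2S algorithm of [7]; the function name `top_k_by_scan`, the row layout and the score function are all assumptions made for the example.

```python
import heapq


def top_k_by_scan(rows, k, score):
    """Return the k highest-scoring rows found in one sequential pass.

    A min-heap of at most k (score, position, row) entries is kept, so the
    extra memory is proportional to k rather than to the table size.
    """
    if k <= 0:
        return []
    heap = []
    for position, row in enumerate(rows):
        entry = (score(row), position, row)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new row beats the weakest row kept so far.
            heapq.heapreplace(heap, entry)
    return [row for _, _, row in sorted(heap, reverse=True)]


# Hypothetical table: (document id, attribute 1, attribute 2).
table = [("d1", 3, 1), ("d2", 5, 4), ("d3", 2, 2), ("d4", 4, 4)]
top2 = top_k_by_scan(table, 2, score=lambda row: row[1] + row[2])
```
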

Stages of Information Retrieval System


The development of data retrieval systems has gone through the following three stages.

i. Based on Word Frequency

The fundamental idea is that if a retrieval entry occurs with a higher frequency in a document and appears at a
more important position, the document is considered to be more relevant to the retrieval entry. TF-IDF is the most
important invention of this stage; it relates frequency and location and calculates the relevance used for
sorting.

ii. Based on Link Analysis

Retrieval based on link analysis originates from the citation index mechanism, which considers a document to
hold a greater value if it is cited by more documents or by more authoritative documents.

iii. Based on Intelligent Retrieval

In this stage, retrieval systems based on intelligent retrieval came into being. They provide personalized
services to address the problem of undifferentiated retrieval results and to realize intelligent retrieval.

INTRODUCTION TO SORTING ALGORITHMS


i. Page Rank Algorithm

The concept of PageRank comes from the citation frequency of a paper: the more often a paper is cited, the
higher its authority [9]. The PageRank of page A depends primarily on three factors:

The PageRank values of the in-degree pages.
The number of out-degree pages of page A.
The number of in-degree pages of page A.

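The damping factor d is subtracted from 1 (and in some variations of the algorithm, the result is divided by
the number of documents N in the collection), and this term is then added to the product of the damping factor
and the sum of the incoming PageRank scores. That is,

$$PR(A) = \frac{1-d}{N} + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots\right)$$

where B, C, D, ... are the pages that link to page A and L(X) is the number of outgoing links of page X.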

So any page's PageRank is derived in large part from the PageRanks of other pages. PageRank can be computed
either iteratively or algebraically.
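
The iterative computation can be sketched as repeated application of the formula above until the values settle. The sketch assumes the variant that divides (1 - d) by N, a damping factor of 0.85 and a fixed number of iterations; the handling of pages without outgoing links (spreading their rank over all pages) is a common convention assumed here, and the graph is a made-up example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) / N + d * sum(PR(B) / L(B)) over in-links B.

    links maps each page to the list of pages it links to. Every page must
    appear as a key. Pages without out-links spread their rank evenly over
    all pages (an assumed convention).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page's rank is divided evenly among its outgoing links.
                share = d * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = d * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this example page C is linked from A, B and D and ends up with the highest value, while D has no in-links and keeps only the baseline term (1 - d) / N.
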

[Figure: example link graph of pages A to F with PageRank values shown as percentages, from 1.6% to 38.4%]

Fig.2 Page Ranking Example

The biggest advantage of the PageRank algorithm is that it is a static algorithm that is independent of the
retrieval entry. All PageRank values are therefore calculated offline, which reduces the amount of online
computation and makes the response time very short compared with other retrieval systems. However, the biggest
disadvantage of the PageRank algorithm is that it ignores the theme relevance of pages.

ii. Hill Top Algorithm

HillTop and PageRank both determine the retrieval results from page link information. However, HillTop
considers a document relevant to the retrieval theme to hold a higher relevance than a document irrelevant to the
retrieval theme. Bharat called a document relevant to the theme an "expert document" [10]. The calculation of
expert documents is performed online, and thus it affects the retrieval response time. As the expert page set
grows, serious limitations appear in the scalability of the algorithm [11].

iii. HITS algorithm

The HITS algorithm analyzes the importance of pages according to the in-degree and out-degree of a page.
Linking to other pages contributes to a page's Hub value, and being linked from other pages contributes to its
Authority value. The relationship between Hub and Authority is the basis of the HITS algorithm. The basic idea is
that when the user issues a query, the search engine retrieves a group of pages relevant to the search and then
produces an authority ranking and a hub ranking of these pages. Hubs and Authorities have a mutually reinforcing
relationship.
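
The mutual reinforcement of hubs and authorities can be sketched as an alternating update on a small base set. The sketch assumes the usual iteration in which a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both are normalised in each round; the function name `hits` and the page names are hypothetical.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a small link graph.

    links maps each page to the list of pages it links to; every page
    must appear as a key.
    """
    pages = list(links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub update: sum of the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in links[page]) for page in pages}
        # Normalise so that the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


base_set = {
    "portal": ["paper1", "paper2", "paper3"],
    "blog": ["paper1", "paper2"],
    "paper1": [],
    "paper2": [],
    "paper3": [],
}
hub, auth = hits(base_set)
```

Here "portal" becomes the strongest hub because it links to all three papers, and "paper1" and "paper2" become stronger authorities than "paper3" because two hubs point to them.
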

iv. Page Rank using Machine Learning

A suitably trained machine learning model known as the Graph Neural Network (GNN) is capable of generating a
common model that covers different types of numerical page ranking methods for web pages. The GNN belongs to a new
class of neural network based algorithms. The information attached to a node in the graph, its neighborhood and
its links is used to compute the outputs of the graph. At each node, the information related to the node and the
inputs arriving at the node are processed. Each node uses a multilayer perceptron (MLP) architecture to process
this information as inputs, except that the MLP does not have an output layer. Each node passes the outputs of
its hidden layer neurons to the other internal nodes.

The output layer of the MLP exists only at the output nodes of the graph. The same number of hidden layer
neurons is used in the MLP of every node in order to reduce the number of variables.

Table 1. Comparison of different Algorithms

| Algorithm | Advantages | Disadvantages |
|---|---|---|
| Page Rank Algorithm | Combines the advantages of hyperlink structure and web content | Elimination of interference of garbage information and determination of proportion of objective and subjective is little bit complex. |
| Hill TOP algorithm | | Calculation of expert documents is operated online. Affects the retrieval response time. |
| HITS algorithm | Convergence speed is faster than Page Rank | With increase in expert page set, scalability of algorithm is complex. It is difficult to apply on dataset. |
| A Novel Table scan based T2S algorithm | Runs 5.59 times faster than naive algorithm | Performs sequential scan on the table |
| Genetic Algorithms in Search, Optimization and Machine Learning | Retrieval effectiveness can be improved by 39.6% or higher | Assigning importance factors to different classes is little bit complex and time taking. |
| Learning based algorithm | In contrast to the time-consuming and labor intensive manual assessment, this approach is efficient and it can process about 400 topics automatically in merely one hour. | Only restricted to navigational type queries |

MULTIDIMENSIONAL DATA RETRIEVAL SORTING OPTIMIZATION MODEL
Different users might have different intentions, and the way they search for information might also vary.
Capturing the user's true intention and obtaining the results most suitable for the query play a vital role in an
efficient data retrieval system. The main expectation of a user who searches for information is to obtain the
information relevant to the search. Hence, the focus is on data retrieval sorting algorithms. A multidimensional
data retrieval sorting optimization model might therefore consider metrics such as data characteristics, user
characteristics and application characteristics, and the sorted results must be optimized in these three
dimensions.
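
One simple way to picture sorting along three dimensions is a weighted combination of one score per dimension. The paper does not specify how the dimensions are combined, so the weighted sum below, the weights, the class `Candidate` and the sample scores are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    data_score: float         # assumed: relevance derived from data characteristics
    user_score: float         # assumed: match with user characteristics
    application_score: float  # assumed: fit with application characteristics


def rank_candidates(candidates, weights=(0.5, 0.3, 0.2)):
    """Sort candidates by a weighted sum of the three dimension scores."""
    w_data, w_user, w_app = weights

    def combined(c):
        return (w_data * c.data_score
                + w_user * c.user_score
                + w_app * c.application_score)

    return sorted(candidates, key=combined, reverse=True)


results = rank_candidates([
    Candidate("post-1", 0.9, 0.1, 0.2),
    Candidate("post-2", 0.6, 0.8, 0.7),
    Candidate("post-3", 0.2, 0.3, 0.9),
])
ordered_ids = [c.doc_id for c in results]
```

With these hypothetical weights, "post-2" is ranked first because it scores reasonably well in all three dimensions, although "post-1" has the highest data score.
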
CONCLUSION AND FUTURE WORK
This paper presents the basics of retrieval systems, gives an overview of data retrieval algorithms and analyzes
several ranking algorithms. The study shows that the sorting of the results obtained while searching for
information needs to be optimized. As future work, various data retrieval sorting optimization models could be
studied and machine learning could be integrated for further optimization.

REFERENCES
[1]. Lu Z X. Research and Improvement of PageRank Sort Algorithm Based on Retrieval Results. International Conference on Intelligent
Computation Technology and Automation. IEEE, 2014: 468-471.
[2]. Michal Cutler, Yungming Shih, Weiyi Meng. Using the structures of HTML documents to improve retrieval, in Proc of Usenix
Symp on Internet Technologies and Systems (USITS 97). Piscataway, NJ: IEEE, 1997: 241-251.
[3]. Cutler M, Deng H, Maniceam S S, et al. A new study on using HTML structures to improve retrieval, in Proc of the 11th IEEE
Conf on Tools with Artificial Intelligence (ICTAI). Piscataway, NJ: IEEE, 1999: 406-409.
[4]. Lawrence Page, Sergey Brin, Rajeev Motwani. The PageRank citation ranking: Bringing order to the Web, SIDL-WP-1999-0120.
Stanford: Stanford InfoLab Publication Server, 1999.
[5]. Sergey Brin, Larry Page. The Anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems,
1998, 30: 107-117.
[6]. Jon M Kleinberg. Authoritative sources in a hyperlinked environment, in Proc of ACM-SIAM Symposium on Discrete Algorithms.
New York: ACM, 1998: 668-677.
[7]. Han X, Li J, Gao H. Efficient Top-k Retrieval on Massive Data. IEEE Transactions on Knowledge & Data Engineering, 2015,
27(10): 1-1.
[8]. Bharat K, Mihaila G A. When Experts Agree: Using Non-affiliated Experts to Rank Popular Topics[C]. Proceedings of the 10th
International Conference on World Wide Web. ACM, 2001: 597-602.
[9]. http://baike.baidu.com/view/1518.html?fromTaglist.
[10]. Shaohua L, Wenyu G. Survey of Page-ranking Algorithms [J]. Application Research of Computers, 2007, 24(6): 4-7.
[11]. McSherry F. A Uniform Approach to Accelerated PageRank Computation[C]. Proceedings of the 14th International Conference on
World Wide Web. ACM, 2005: 575-582.
