
International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, Sep 2013
ISSN: 2231-5381, http://www.ijettjournal.org



A Distance Cache Mining by Metric Access Methods
Rajnish Kumar, Pradeep Bhaskar Salve
Department of Computer Engineering
Sir Visvesvaraya Institute of Technology, Nasik

Abstract— This work aims to increase DBMS performance and resolve the related issues and risks. We implement new caching and buffering techniques that reduce I/O cost. The previous system operates on complex databases: a user submits a query whose results may be present in several distributed databases, and a similarity operation extracts the results from all of them, displaying every related or relevant result. Its retrieval performance is very low, its I/O and CPU costs are high, and its computation cost yields only minor performance. Reformulating the query as a k-nearest-neighbour search displays at least 80% of the results, but it also returns non-relevant information and remains an expensive form of query-based data extraction. We propose structures based on cached distances: whatever query a user submits, the system automatically searches at runtime and displays the results, performing the analysis while optimizing I/O cost. The implementation works on distance-based caches and provides useful, effective results for both indexing and querying. Compared with all previous schemes, the pivot-based query approach provides the most effective results and the best performance.

Keywords— Distance cache, complex databases, indexing, database technology, k-nearest neighbour query, metric access method, M-tree
I. INTRODUCTION
This paper belongs to the data mining domain. The World Wide Web hosts more and more online databases, and their number increases day by day, so extracting the effective data has become very difficult. When a query is submitted to a database, the database retrieves the matching information and displays the result. Another important point is that, as distance increases, extraction suffers. A server always tries to extract data that is close to it, but on the World Wide Web, when a user searches for data from a distant location, such as the USA, there is a chance that the data is not extracted properly because of the distance.
Let us take an example for better understanding, using a website such as Dell or Microsoft. When we type this keyword into the search box of an internet interface, the relevant URLs appear on the results page. What do we see on that page? The nearest Microsoft server, the India server, appears first. This means Google uses a distance-related technique to extract data distance-wise. It is a simple example of a D-cache, which we discuss further below. Our aim is to develop a technique with which distance will not matter.
II. EXISTING SYSTEM
The following are the main problems in the existing system:
2.1 Problem of deep extraction based on distance:
Extraction already faces a number of problems, such as web-page programming dependency, scripting dependency, and version dependency, but nowadays many techniques have been released, such as page-level extraction, FiVaTech extraction, vision-based extraction, and genetic programming, with which efficient extraction can be done. The main remaining problem is extraction based upon distance: much of the time, we do not find the type of result we want.
Suppose there is a courier-service website in the United States (e.g. Trackon Limited). The courier company also has branches in other locations such as India, China, and Russia, and all of those branches may have relevant websites of their own. The problem is that when a user searches for a branch of that courier company, he sometimes finds only the main (USA) branch, which is a failure of extraction: the extraction approach has trouble finding the nearer branch, i.e. distance has not been considered in the extraction tool.
Another example: suppose we want information about the Java programming language. We type this keyword into a search box; what does the server do? It tries to find the information containing "Java programming language", and at this point the concept of similarity is used: the server first matches the data and then extracts it. Here too, does distance matter? Should the nearer data or the more distant data be presented? Which data has sufficient information for the user? Problems of this type exist in the existing system.
2.2 Problem of data retrieval based on duplication and web dependency:
Another problem is duplication. When data is uploaded from different locations, there is a chance of duplicated data. If we consider a digital library or search site such as Google, Yahoo, or Wikipedia, too much unwanted data exists: one link may occur many times. Since every link carries some information behind it, a link that occurs more than once takes extra space, so performance automatically decreases and response time increases, which is not acceptable for good extraction. This type of problem exists in the existing system. Because of these problems, the main disadvantages are low performance, high computational cost, and long processing time.
III. PROPOSED SYSTEM
Here we remove whatever problems exist in the present system. We introduce a new extraction approach with distance caching, called a disk-based cache. The user enters a distance-range search and finds the results. We use a parsing technique that extracts the results from the desired caches and distances, giving faster extraction. The main advantages of the proposed approach are high performance, low computational cost, and low processing time.
IV. D CACHE
The main concept of this project is the D-cache. The D-cache is a technique/tool for general metric access methods that helps to reduce the cost of both indexing and querying. Its main task is to determine tight lower and upper bounds on an unknown distance between two objects.
First we have to understand metric access methods. Metric access methods are techniques used in situations where similarity searching can be applied; e.g. a search for "SBI" can search across the entire country, i.e. a similarity search has been invoked. Consider first the concept of similarity searching. When a user submits a query to a search box or a database, the process of responding to such queries is termed similarity searching. Given a query object q, this involves finding the objects in a database S of N objects that are similar to q, based on some similarity measure. Both q and the objects of S are drawn from some universe U of objects, but q is generally not in S. We assume that the similarity measure can be expressed as a distance metric d such that d(o1, o2) becomes smaller as o1 and o2 become more similar; (S, d) is then said to be a finite metric space.
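For reference, such a distance function d is a metric when it satisfies the standard axioms, which the paper assumes but does not spell out; they are included here for completeness:

```latex
\begin{align}
  d(o_1, o_2) &\ge 0, \quad d(o_1, o_2) = 0 \iff o_1 = o_2 && \text{(non-negativity and identity)} \\
  d(o_1, o_2) &= d(o_2, o_1) && \text{(symmetry)} \\
  d(o_1, o_3) &\le d(o_1, o_2) + d(o_2, o_3) && \text{(triangle inequality)}
\end{align}
```

The triangle inequality is what makes the lower and upper bounding used by the D-cache possible.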
A metric access method then facilitates the retrieval process by building an index on the various features, which are analogous to attributes. These indexes treat the records as points in a multidimensional space and use point access methods.
Metric access methods use a structure for caching the distances computed during the current runtime session. The distance cache is meant as an analogy to the classic disk cache widely used in a DBMS to optimize I/O cost: instead of sparing I/O, the distance cache spares distance computations. The main idea behind distance caching is to approximate a requested distance by providing lower and upper bounds on it.
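A minimal sketch of this bounding idea in Python (the class name, key scheme, and `bounds` method are illustrative assumptions, not the paper's actual interface): if the cache already holds the distances from a pivot p to both objects a and b, the triangle inequality gives |d(p,a) - d(p,b)| <= d(a,b) <= d(p,a) + d(p,b).

```python
class DistanceCache:
    """Illustrative distance cache: maps a pair of object ids to a distance.
    Object ids are assumed to be comparable (e.g. integers)."""

    def __init__(self):
        self._cache = {}

    def insert(self, a, b, dist):
        # Store under a canonical key so the argument order does not matter.
        self._cache[(min(a, b), max(a, b))] = dist

    def get(self, a, b):
        return self._cache.get((min(a, b), max(a, b)))

    def bounds(self, a, b, pivots):
        """Lower/upper bounds on the unknown d(a, b), derived via the
        triangle inequality from pivots whose distances to a and b are cached."""
        lower, upper = 0.0, float("inf")
        for p in pivots:
            d_pa, d_pb = self.get(p, a), self.get(p, b)
            if d_pa is None or d_pb is None:
                continue  # this pivot cannot help: both distances are needed
            lower = max(lower, abs(d_pa - d_pb))  # |d(p,a) - d(p,b)| <= d(a,b)
            upper = min(upper, d_pa + d_pb)       # d(a,b) <= d(p,a) + d(p,b)
        return lower, upper
```

If no cached pivot covers both objects, the method returns the trivial bounds (0, infinity) and the exact distance must be computed.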
The project as a whole uses two main operations, one on the user side and one on the administrator side, each implemented by a different algorithm.
4.1 Distance calculation: The distance calculation operation is performed on the user side. When a user types a keyword, the distance is calculated. The main D-cache functionality is provided by the methods get_distance and get_lower_bound_distance; during distance retrieval, the distance is found first and the lower-bound distance is allocated first. The Algorithm for Distance Calculation is used for this purpose. The number of dynamic pivots used to evaluate get_lower_bound_distance is set by the user; this parameter is an exact analogy to the number of pivots in pivot tables.
First we should understand pivot tables and the M-tree.
Pivot tables: A simple but efficient solution to similarity search is represented by methods called pivot tables, or distance-matrix methods. In general, a set of p objects (called pivots) is selected from the database, and then, for every database object, a p-dimensional vector of its distances to the pivots is created and stored as a row of a table, termed the pivot table.
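A brief sketch of a pivot table used to filter a range query (a standard construction; the function names are illustrative): each object's distance vector is precomputed once, and at query time any object whose lower bound already exceeds the query radius is discarded without computing its true distance.

```python
def build_pivot_table(database, pivots, d):
    """Precompute, for every (hashable) object, its distances to the pivots."""
    return {o: [d(o, p) for p in pivots] for o in database}

def pivot_range_query(q, radius, database, pivots, table, d):
    """Range query filtered through the pivot table."""
    q_vec = [d(q, p) for p in pivots]  # only len(pivots) exact computations
    results = []
    for o in database:
        # Lower bound on d(q, o): max over pivots of |d(q, p) - d(o, p)|.
        lb = max((abs(qd - od) for qd, od in zip(q_vec, table[o])), default=0.0)
        if lb > radius:
            continue               # safely pruned: no call to d(q, o)
        if d(q, o) <= radius:      # verify the surviving candidates
            results.append(o)
    return results
```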
M-tree: The M-tree is a dynamic index structure that provides good performance in secondary memory. The M-tree is a hierarchical index in which some of the data objects are selected as the centres (local pivots) of ball-shaped regions, while the remaining objects are partitioned among the regions so as to build up a balanced and compact hierarchy of data regions.
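A schematic of the routing information such a hierarchy keeps per region (a simplified sketch of a typical M-tree routing entry, not the full insertion and split machinery):

```python
from dataclasses import dataclass, field

@dataclass
class RoutingEntry:
    """Simplified M-tree routing entry for one ball-shaped region."""
    pivot: object             # the local pivot (centre of the region)
    covering_radius: float    # every object in the subtree lies within this radius
    dist_to_parent: float     # d(pivot, parent pivot), reusable for pruning
    children: list = field(default_factory=list)  # sub-entries or data objects

def can_prune(entry, d_q_pivot, radius):
    # The whole region is skipped when even its closest possible point
    # is farther from the query than the search radius.
    return d_q_pivot - entry.covering_radius > radius
```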
So, the distance is retrieved with the help of pivot tables and the M-tree construction.
4.2 Distance insertion operation: This operation is performed on the administrator side. Every time a distance is computed by the MAM, it is inserted into the D-cache database.
In particular, we consider two policies for replacement by a new entry:

Obsolete: The first obsolete entry in the collision interval (one not containing the id of a current dynamic pivot) is replaced.

Obsolete percentile: This policy has two steps. First, we try to replace the first obsolete entry, as in the obsolete policy. If none of the entries is obsolete, we replace the entry with the least useful distance: among all entries in the collision interval, the entry closest to the middle distance is the least useful, so it is replaced. Entries that are not obsolete are otherwise kept as they are.

The Algorithm for Distance Insertion is used for this operation.
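A hedged sketch of the insertion step with these two policies (the collision-interval layout and entry format are simplifying assumptions, not the paper's exact design):

```python
def insert_distance(interval, new_entry, current_pivot_ids,
                    policy="obsolete_percentile"):
    """Insert new_entry into a full collision interval, evicting one entry.

    `interval` is a list of (id_a, id_b, dist) tuples; an entry is
    'obsolete' when neither id belongs to a current dynamic pivot.
    """
    # Step 1 (both policies): replace the first obsolete entry, if any.
    for i, (a, b, _) in enumerate(interval):
        if a not in current_pivot_ids and b not in current_pivot_ids:
            interval[i] = new_entry
            return
    if policy == "obsolete_percentile":
        # Step 2: nothing is obsolete, so evict the least useful distance,
        # i.e. the entry closest to the middle (median) distance.
        dists = sorted(e[2] for e in interval)
        middle = dists[len(dists) // 2]
        victim = min(range(len(interval)),
                     key=lambda i: abs(interval[i][2] - middle))
        interval[victim] = new_entry
    # Under the plain 'obsolete' policy, non-obsolete entries are kept as-is.
```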
Two further algorithms are used in this project to enhance sequential search: the Algorithm for Range Query and the Algorithm for Dynamic Similar Search. Both algorithms show that the D-cache together with a sequential search can be used as a standalone metric access method that requires no indexing at all, which is useful in situations where indexing is not possible or is too expensive.
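A minimal sketch of such a standalone use, reusing the DistanceCache sketch above (names remain illustrative): the scan asks the cache for cheap bounds first and falls back to the real distance function only when the bounds cannot decide.

```python
def dcache_sequential_range_query(q, radius, database, cache, pivots, d):
    """Sequential scan accelerated by cached distance bounds.
    q and the database elements are comparable object ids; d is the metric."""
    for p in pivots:                  # prime the cache with the query's
        cache.insert(q, p, d(q, p))   # distances to the dynamic pivots
    results = []
    for o in database:
        lower, upper = cache.bounds(q, o, pivots)
        if lower > radius:
            continue                  # certainly outside: d(q, o) is saved
        if upper <= radius:
            results.append(o)         # certainly inside: d(q, o) is saved
            continue
        dist = d(q, o)                # bounds inconclusive: compute exactly
        cache.insert(q, o, dist)      # and cache the result for later reuse
        if dist <= radius:
            results.append(o)
    return results
```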
We use a different algorithm to enhance the M-tree, termed the Algorithm for M-Tree Range Query. In this algorithm the D-cache is used to speed up the construction of the M-tree, using both the exact retrieval of distances (the get_distance method) and the lower-bounding functionality. During node splitting, the distance matrix of all pairs of node entries is computed; the values of this matrix can be stored in the D-cache and some of them reused later, when node splitting is performed on the child nodes of the previously split node.



V. PROJECT MODULES
We use five modules in our project:

5.1 Suitability of D-cache:
A user can submit any type of distance-based query, which starts the searching process and creates the runtime object and the database object. The session time and index of every object are calculated here for that particular distance-based query. When another user submits the same query, the results are extracted from the previously computed distances, and the index value increases automatically. This is the D-cache procedure: the D-cache starts the searching process and quickly displays the results. It calculates the lower and upper bounds, the results of the nearest locations are displayed as the final results, and only relevant distance-based cached results appear in the output.

Example: When a user searches for data from the search box (i.e. from the database), our project detects whether the D-cache is applicable. If we type "1+1", there is no need for the D-cache concept, because an online calculator can handle that type of search automatically. There are many such examples, e.g. "1 $ = ? Rs" or "1 feet = ? inch": if a converter is already provided, there is no need for the D-cache. But if we type "java" in the search box, the principle of the D-cache is applied, because it tries to retrieve "java" from the nearer server. Hence, the first module works on the suitability of the D-cache.

5.2 Selection of dynamic pivots:
This module takes the output of the first module as its input, called the preprocessed or indexed data. The similarity search operations are performed only on this particular data. The dynamic pivot calculation is created automatically and the final results are displayed as output. This makes the extraction of results very cheap and provides a minimized result set as output.

5.3 D-cache alteration:
In this module the search is based on a radius, which means a range operation is performed; both of the algorithms mentioned above work here. The search covers the data within the region, across all dimensions, and displays the result after collecting the multidimensional objects.

5.4 Approximate similarity search:
This module starts the search with an approximate similarity search, which saves cost during the extraction of results while still retrieving exact results. It is a good incremental search that works without lower- and upper-bound distances, and it relates to a good hierarchy-based search mechanism.

Example: suppose we type "java" in the search box; this module then gives the similar results for "java".

5.5 D-cache performance:
For better D-cache performance, we use three more algorithms apart from extraction searching: two algorithms, the D-file range query algorithm and the D-file kNN query algorithm, for enhancing sequential search, and one algorithm, the D-M-tree range query algorithm, for fast M-tree formation.
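For the D-file kNN case, a hedged sketch of how the same bounds can drive a nearest-neighbour scan (illustrative only, again reusing the DistanceCache sketch from Section IV): the distance of the current k-th best candidate plays the role of a shrinking query radius.

```python
import heapq

def dcache_knn_query(q, k, database, cache, pivots, d):
    """Sequential kNN scan; cached lower bounds prune objects that cannot
    beat the current k-th best distance (a shrinking search radius)."""
    heap = []                                    # max-heap via negated distances
    for o in database:
        radius = -heap[0][0] if len(heap) == k else float("inf")
        lower, _ = cache.bounds(q, o, pivots)
        if lower > radius:
            continue                             # cannot enter the top k: skip d()
        dist = d(q, o)
        cache.insert(q, o, dist)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, o))
        elif dist < radius:
            heapq.heapreplace(heap, (-dist, o))
    return sorted((-neg, o) for neg, o in heap)  # (distance, object id) pairs
```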

VI. FUTURE ENHANCEMENTS
Many things can be done in the future to enhance this project. The first concerns performance: other algorithms, tools, or extraction approaches can be used to increase it. The second concerns tree formation: other techniques can be used for fast M-tree formation.
VII. CONCLUSIONS
Using this project, a user can extract data based upon distance. Dependencies have also been considered, which is why dependencies such as web-page dependency, scripting dependency, and version dependency have been removed; the data-duplication removal process also works here, so the user gets effective, non-duplicated data after extraction.