
Genetic Programming Approach for Record De-Duplication

AIM
To identify record replicas in digital libraries and other types of digital repositories in order to improve the quality of their content and services and to support eventual sharing efforts, and to manage data growth so as to save time, money, and valuable resources.

ABSTRACT
Several systems that rely on consistent data, such as digital libraries, may be affected by the existence of duplicates. Several deduplication strategies are available, but they rely on manually chosen settings to combine the evidence used to decide whether records are replicas. A genetic programming approach is applied to record deduplication; this method effectively identifies replicas in the repositories using genetic operations.

EXISTING SYSTEM

In existing systems, data may contain replicas and lack a standardized representation; such data is called dirty data. Dirty data is caused by the integration of distinct data sources. To handle it, data cleaning, record linkage, and record matching techniques are used.

DISADVANTAGES
Performance degradation: even a simple query takes more time to answer.
Quality loss: replicas lead to distortions in conclusions.
Increased operational cost: an additional volume of useless data must be stored and processed.

PROPOSED SYSTEM
In the proposed system, a genetic programming approach is used for record deduplication. A gene value is generated for each record using genetic operations; if that gene value matches the gene value of any other record, the record is considered a duplicate. The genetic operations are reproduction, crossover, and mutation. These operations are used to enhance the attributes of a given record.
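The exact internal representation is not fixed by the description above; as a hedged sketch, one common genetic programming formulation represents each individual as an expression tree that combines per-attribute similarity scores into a single replica decision for a pair of records. The class names, attribute names, weights, and the 0.5 threshold below are illustrative assumptions.

    // Illustrative sketch: an evolved individual combining attribute similarities.
    import java.util.HashMap;
    import java.util.Map;

    interface GPNode {
        double eval(Map<String, Double> sims);
    }

    class AttributeSim implements GPNode {
        private final String attribute;                 // e.g. "title", "author" (assumed schema)
        AttributeSim(String attribute) { this.attribute = attribute; }
        public double eval(Map<String, Double> sims) {
            Double s = sims.get(attribute);
            return s == null ? 0.0 : s;                 // similarity assumed to be in [0, 1]
        }
    }

    class WeightedSum implements GPNode {
        private final GPNode left, right;
        private final double wLeft, wRight;
        WeightedSum(GPNode l, double wl, GPNode r, double wr) {
            left = l; wLeft = wl; right = r; wRight = wr;
        }
        public double eval(Map<String, Double> sims) {
            return wLeft * left.eval(sims) + wRight * right.eval(sims);
        }
    }

    public class DedupFunctionSketch {
        public static void main(String[] args) {
            // Hypothetical evolved tree: 0.7 * sim(title) + 0.3 * sim(author)
            GPNode evolved = new WeightedSum(new AttributeSim("title"), 0.7,
                                             new AttributeSim("author"), 0.3);
            Map<String, Double> pairSims = new HashMap<String, Double>();
            pairSims.put("title", 0.92);
            pairSims.put("author", 0.40);
            boolean replica = evolved.eval(pairSims) > 0.5;   // assumed decision threshold
            System.out.println("Replica? " + replica);
        }
    }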

ADVANTAGES
It efficiently maximizes the identification of record replicas while avoiding mistakes during the process. The approach is able to automatically find effective deduplication functions, even when the most suitable similarity function for each record attribute is not known in advance.

MODULES
Attribute Supersede
Attribute Assessment
Concern Heritage Measures (Reproduction, Crossover, Mutation)
Designate Decision
Spawn Gene

Attribute Supersede:
Initialize the population.
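A minimal sketch of this initialization step, assuming each individual is a randomly grown expression tree of bounded depth over the record attributes; the Node representation, attribute names, population size, and depth limit are assumptions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class InitPopulation {
        // A node is either a leaf (an attribute similarity) or an operator over two children.
        static class Node {
            String label;          // attribute name for leaves, "+" or "*" for operators
            Node left, right;      // null for leaves
            Node(String label) { this.label = label; }
        }

        static final String[] ATTRIBUTES = {"title", "author", "year"};   // assumed schema
        static final Random RAND = new Random();

        // Build a random tree up to the given depth ("grow" style initialization).
        static Node randomTree(int depth) {
            if (depth == 0 || RAND.nextBoolean()) {
                return new Node(ATTRIBUTES[RAND.nextInt(ATTRIBUTES.length)]);
            }
            Node op = new Node(RAND.nextBoolean() ? "+" : "*");
            op.left = randomTree(depth - 1);
            op.right = randomTree(depth - 1);
            return op;
        }

        public static void main(String[] args) {
            List<Node> population = new ArrayList<Node>();
            for (int i = 0; i < 50; i++) {          // assumed population size
                population.add(randomTree(3));      // assumed maximum depth
            }
            System.out.println("Initialized " + population.size() + " individuals");
        }
    }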

Attribute Assessment:
Assign a numeric rating to every individual in the population: for the given attributes, find a fitness value for each one using a machine learning approach. If the fitness value is already satisfactory, there is no need to apply the genetic operations.
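One plausible way to compute the numeric rating, sketched under the assumption that a small set of labelled record pairs is available: the fitness of an individual is taken here as its F1 score over those pairs. The LabelledPair class and the sample data are illustrative.

    import java.util.List;

    public class FitnessEvaluation {
        // Hypothetical labelled example: the candidate function's decision vs. the true label.
        static class LabelledPair {
            boolean predictedReplica;   // output of the individual being evaluated
            boolean actualReplica;      // ground truth from a training sample
            LabelledPair(boolean predicted, boolean actual) {
                predictedReplica = predicted; actualReplica = actual;
            }
        }

        // Fitness as the F1 score over the labelled pairs (assumed fitness measure).
        static double fitness(List<LabelledPair> pairs) {
            int tp = 0, fp = 0, fn = 0;
            for (LabelledPair p : pairs) {
                if (p.predictedReplica && p.actualReplica) tp++;
                else if (p.predictedReplica && !p.actualReplica) fp++;
                else if (!p.predictedReplica && p.actualReplica) fn++;
            }
            if (tp == 0) return 0.0;
            double precision = tp / (double) (tp + fp);
            double recall = tp / (double) (tp + fn);
            return 2 * precision * recall / (precision + recall);
        }

        public static void main(String[] args) {
            List<LabelledPair> sample = java.util.Arrays.asList(
                    new LabelledPair(true, true),
                    new LabelledPair(true, false),
                    new LabelledPair(false, true),
                    new LabelledPair(true, true));
            System.out.println("Fitness (F1) = " + fitness(sample));
        }
    }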

Concern Heritage Measures

Reproduction:
Reproduction copies individuals into the next generation without modifying them. This operation is used to keep the fittest individuals across generations.
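A minimal sketch of reproduction as elitism, under the assumption that each individual carries a fitness score: the fittest individuals are copied, unmodified, into the next generation. The Individual class and the fitness values are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class Reproduction {
        static class Individual {
            String tree;      // simplified textual form of the expression tree
            double fitness;
            Individual(String tree, double fitness) { this.tree = tree; this.fitness = fitness; }
        }

        // Copy the 'eliteCount' fittest individuals, unmodified, into the next generation.
        static List<Individual> reproduce(List<Individual> population, int eliteCount) {
            List<Individual> sorted = new ArrayList<Individual>(population);
            Collections.sort(sorted, new Comparator<Individual>() {
                public int compare(Individual a, Individual b) {
                    return Double.compare(b.fitness, a.fitness);   // best first
                }
            });
            return new ArrayList<Individual>(sorted.subList(0, Math.min(eliteCount, sorted.size())));
        }

        public static void main(String[] args) {
            List<Individual> pop = new ArrayList<Individual>();
            pop.add(new Individual("0.7*title + 0.3*author", 0.81));
            pop.add(new Individual("title * year", 0.42));
            pop.add(new Individual("author + year", 0.67));
            System.out.println(reproduce(pop, 2).size() + " individuals carried over unchanged");
        }
    }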

Crossover:
This operation allows genetic content to be exchanged between two partner individuals.
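A minimal subtree-crossover sketch, assuming individuals are binary expression trees; for brevity it exchanges a direct child of each parent, whereas a full implementation would pick crossover points anywhere in the trees. Names are illustrative.

    import java.util.Random;

    public class Crossover {
        static class Node {
            String label;
            Node left, right;
            Node(String label) { this.label = label; }
            Node(String label, Node left, Node right) { this.label = label; this.left = left; this.right = right; }
        }

        static final Random RAND = new Random();

        // Swap one randomly chosen child of each parent, exchanging genetic material.
        static void crossover(Node parentA, Node parentB) {
            boolean takeLeftA = RAND.nextBoolean();
            boolean takeLeftB = RAND.nextBoolean();
            Node subA = takeLeftA ? parentA.left : parentA.right;
            Node subB = takeLeftB ? parentB.left : parentB.right;
            if (subA == null || subB == null) return;   // nothing to exchange at this point
            if (takeLeftA) parentA.left = subB; else parentA.right = subB;
            if (takeLeftB) parentB.left = subA; else parentB.right = subA;
        }

        public static void main(String[] args) {
            Node a = new Node("+", new Node("title"), new Node("author"));
            Node b = new Node("*", new Node("year"), new Node("title"));
            crossover(a, b);
            System.out.println("Offspring roots: " + a.label + ", " + b.label);
        }
    }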

Mutation:
This operator is used to keep a minimum level of diversity among the individuals in the population. A subtree of the individual is selected and replaced by a randomly generated subtree.
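A minimal sketch of subtree mutation under the same assumed tree representation; for brevity the replaced subtree is a direct child of the root, and the attribute pool is an assumption.

    import java.util.Random;

    public class Mutation {
        static class Node {
            String label;
            Node left, right;
            Node(String label) { this.label = label; }
            Node(String label, Node left, Node right) { this.label = label; this.left = left; this.right = right; }
        }

        static final String[] ATTRIBUTES = {"title", "author", "year"};   // assumed schema
        static final Random RAND = new Random();

        static Node randomSubtree(int depth) {
            if (depth == 0 || RAND.nextBoolean()) {
                return new Node(ATTRIBUTES[RAND.nextInt(ATTRIBUTES.length)]);
            }
            return new Node(RAND.nextBoolean() ? "+" : "*", randomSubtree(depth - 1), randomSubtree(depth - 1));
        }

        // Replace one randomly chosen child of the root with a new random subtree.
        static void mutate(Node individual) {
            Node replacement = randomSubtree(2);
            if (RAND.nextBoolean()) individual.left = replacement;
            else individual.right = replacement;
        }

        public static void main(String[] args) {
            Node individual = new Node("+", new Node("title"), new Node("author"));
            mutate(individual);
            System.out.println("Mutated individual root: " + individual.label);
        }
    }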

Designate Decision:
Check whether the randomly generated individuals satisfy the required fitness value. If the required fitness is reached, end the process; otherwise, continue from step three (Concern Heritage Measures).
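A minimal sketch of this decision inside the evolutionary loop, assuming a fitness threshold and a maximum number of generations; the threshold, generation limit, and the fitness stub are assumptions.

    import java.util.Random;

    public class DesignateDecision {
        static final double FITNESS_THRESHOLD = 0.95;   // assumed acceptance level
        static final int MAX_GENERATIONS = 100;         // assumed safety limit
        static final Random RAND = new Random();

        // Stand-in for evaluating the best individual of the current generation.
        static double bestFitnessOfGeneration(int generation) {
            return Math.min(1.0, 0.5 + generation * 0.01 + RAND.nextDouble() * 0.05);
        }

        public static void main(String[] args) {
            for (int generation = 1; generation <= MAX_GENERATIONS; generation++) {
                double best = bestFitnessOfGeneration(generation);
                if (best >= FITNESS_THRESHOLD) {
                    System.out.println("Fitness reached at generation " + generation);
                    return;                      // end the process
                }
                // otherwise: apply reproduction, crossover, and mutation, then repeat
            }
            System.out.println("Stopped after " + MAX_GENERATIONS + " generations");
        }
    }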

Spawn Gene:
Once the required fitness is reached, the gene for each record is calculated. If another record with the same gene is already present, the new record is discarded as a duplicate.
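A minimal sketch of this gene-based duplicate removal, assuming the gene is a normalized key derived from a record's key attributes; a record whose gene has already been seen is discarded. The normalization rule and the sample data are assumptions.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SpawnGene {
        // Assumed gene: lower-cased, whitespace-stripped concatenation of key attributes.
        static String gene(String title, String author) {
            return (title + "|" + author).toLowerCase().replaceAll("\\s+", "");
        }

        public static void main(String[] args) {
            String[][] records = {
                    {"Genetic Programming for Deduplication", "A. Author"},
                    {"genetic programming for deduplication", "a. author"},   // replica, different casing
                    {"Record Linkage Survey", "B. Writer"}
            };
            Set<String> seen = new HashSet<String>();
            List<String[]> kept = new ArrayList<String[]>();
            for (String[] record : records) {
                String g = gene(record[0], record[1]);
                if (seen.add(g)) kept.add(record);   // add() is false when the gene already exists
                // a record whose gene was already present is discarded as a duplicate
            }
            System.out.println("Kept " + kept.size() + " of " + records.length + " records");
        }
    }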

Software Requirements
Operating System : Windows XP
Language         : Java
SDK              : JDK 1.7
IDE              : NetBeans 6.9.1
Database         : MySQL

Hardware Requirements
System    : Pentium IV, 2.4 GHz
Hard Disk : 160 GB
Monitor   : 15" VGA color
Mouse     : Logitech
RAM       : 2 GB

CONCLUSION
To provide consistent data, digital libraries and e-commerce brokers need to remove duplicate records. We propose a genetic programming approach to identify and handle duplicate records in the repositories. It automatically suggests deduplication functions based on the evidence present in the repositories.

FUTURE ENHANCEMENTS

Record deduplication can be further improved by using Particle Swarm Optimization (PSO), which reduces computational complexity by using simpler mathematical functions than those of genetic programming.

