
Model Question Paper Subject Code: MC0088 Subject Name: Data Mining Credits: 4 Marks: 140

Part A (One mark questions)


1. OLAP is a technology that is used to create ____________ software.
a. Image processing b. Decision support c. Data support d. Relational database

2. Data mining methods or techniques find the relations between variables or data in the given database and express these relations using
a. Set theory b. Algorithm relations c. Statistical nomenclature d. Algebraic expression

3. The analytical techniques used in data mining are often well-known _________________________ and techniques.
a. Mathematical algorithms b. Machine learning c. Statistics d. Artificial Intelligence

4. Nearest neighbor: a classification technique that classifies each record based on the records most similar to it in a ______________.
a. Statistical environment b. Genetic mutation c. Classified data d. Historical database

5. A ________________ facilitates customer relationship management since it provides a consistent view of customers and items across all lines of business, all departments and all markets.
a. Top down view b. Data warehouse c. Data warehouse view d. Bottom up approach

6. The 0-D cuboid, which holds the highest level of summarisation, is called the ___________.
a. Apex cuboid b. Base cuboid c. Hybrid OLAP (HOLAP) d. ROLAP

7. The most popular data model for a data warehouse is a _____________________________.
a. Fact constellation schema b. Fact table c. Star schema d. Multi-dimensional model

8. Fact constellation schema in SQL: define dimension branch as _____________________________________.
a. (location_key, street, city, province_or_state, country) b. (branch_key, branch_name, branch_type) c. (item_key, item_name, brand, type, supplier_type) d. sales [time, item, branch, location]

9. Converting data into knowledge and making it available throughout the organisation are the jobs of processes and applications known as ________________.
a. Business processing b. Business Intelligence c. Data mining d. Data Processing

10. Choose the Business Intelligence tool that matches the statement given below: Software that allows the user to ask questions about patterns or details in the data.
a. Query Tools b. Multi-dimensional Analysis Tools c. Data warehouse d. Data Mining Tools

11. ___________________ deals with all aspects of managing the development, implementation and operation of a data mart, including metadata management, data acquisition, etc.
a. Business Intelligence b. Data Mining c. Relational database management system (RDBMS) d. Data warehousing

12. Business intelligence, typically drawn from an ________________, is used to analyse and uncover information about past performance on an aggregate level.
a. Business firms b. Enterprise data warehouse c. Relational database management system (RDBMS) d. Data banks

13. In ___________________, the sorted values are distributed into a number of buckets, or bins.
a. Regression b. Binning method c. Redundancy d. Generalisation of the data

14. Which of the following data analysis tasks combines data from multiple sources into a coherent data store, as in data warehousing?
a. Data integration b. Data cleaning c. Data transformation d. Data reduction

15. In _______________, the new attributes are constructed and added from the given set of attributes to help the mining process.
a. Aggregation b. Normalisation c. Generalisation d. Attribute construction

16. _______________ techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
a. Data compression b. Data reduction c. Discretisation d. Binning

17. The major limitation of the Regression technique is that it only works well with ________________________.
a. Algorithm b. Data Base Management System c. Continuous quantitative data d. Prediction

18. ________________ is a method of grouping data into different groups, so that the data in each group share similar trends and patterns.
a. Clustering b. Description c. Artificial Neural Network d. Classification

19. An Artificial Neural Network (ANN) is configured for a specific application, such as ______________ or data classification, through a learning process.
a. Regression b. Prediction c. Pattern recognition d. Classification

20. The basic premise of an _________________ is to find all associations, such that the presence of one set of items in a transaction implies the other items.
a. Sequential patterns b. Association c. Clustering d. Artificial Neural Network

21. If a rule concerns associations between the presence or absence of items, it is a _____________________.
a. Multidimensional association rule b. Boolean association rule c. Quantitative association rule d. Single dimensional association rule

22. A __________________ is a frequent pattern, p, such that any proper superpattern of p is not frequent.
a. Maxpattern b. Frequent closed itemset c. Maximal frequent set d. Border set

23. Which database requires reading it completely for each pass?
a. Transactional b. Document-oriented c. Relational d. Disk resident

24. The statement "Any superset of an infrequent set is an infrequent set" is called the __________________.
a. Downward closure property b. Upward closure property c. Maximal frequent set d. Item set

25. In ______________ clustering, we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
a. Single-link b. Complete-link c. Average-link d. Single linkage

26. The ________________ approach to clustering starts out with a fixed number of clusters and allocates all records into exactly that number of clusters.
a. Complete linkage method b. Hierarchical clustering c. K-means d. Centroid distance

27. A cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called _____________.
a. Bottom-up clustering b. Top-down clustering c. Centroid-based clustering d. Exclusive clustering

28. _____________ offers clustering based on the Localisation of Anomalies (LA) algorithm.
a. Bayesia Lab b. Cviz c. IBM Intelligent miner d. PolyAnalyst

29. In data classification, each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the _________________.
a. Training samples b. Class label attribute c. Test set samples d. Training data set

30. _____________ can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have.
a. Classification b. Regression c. Prediction d. Interpretability

31. ________________ refers to the preprocessing of data in order to remove or reduce noise and the treatment of missing values.
a. Relevance analysis b. Data transformation c. Normalisation d. Data cleaning

32. ________________ involves scaling all values for a given attribute so that they fall within a small specified range.
a. Predictive accuracy b. Normalisation c. Tree pruning d. Classification

33. Which of the following searches for relevant information using characteristics of a particular domain (and possibly a user profile) to organise and interpret the discovered information?
a. Personalised web agents b. Information filtering c. Intelligent Web agents d. Web Query Systems

34. __________ is a collection of machine learning algorithms for data mining tasks.
a. Weka b. Java code c. Open source software d. SAS data quality solution

35. In client-side monitoring, inserting _________ into each Web page looks to be a promising approach but raises privacy issues in an organisation.
a. Virus b. Spywares c. Bugs d. Firewall

36. _____________ retrieves product information from a variety of vendor sites using only general information about the product domain.
a. ShopBot b. ILA c. OCCAM d. Information Manifold

37. A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category since it links the image content to the keyword sky. Which category is being described here?
a. Associations among image contents that are not related to spatial relationships b. Associations between image content and non-image features c. Associations among image contents related to spatial relationships d. Description-based retrieval systems

38. Some systems, such as ________________, support both sample-based and image feature specification queries.
a. QBIC (Query By Image Content) b. Image Excavator c. Feature descriptor d. Layout descriptor

39. _____________________ is labour-intensive if performed manually.
a. Content-based retrieval b. Description-based retrieval c. Image sample-based queries d. Colour histogram-based signature

40. ________________ are rapidly growing due to the increasing amount of information available in electronic form.
a. Information retrieval b. Document ranking methods c. Text databases d. Term table

Part B (Two mark questions)


41. Consider the following statements: 1. The aim of data mining is to extract implicit, previously known and potentially useful (or actionable) patterns from data. 2. Many years of practice show that data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post processing (understandability, summary, presentation), good understanding of problem domains and domain expertise. State True or False:
a. 1- True, 2- False b. 1- False, 2- False c. 1- False, 2- True d. 1- True, 2- True

42. Consider the following statements: 1. Data Mining is not specific to any industry; it requires intelligent technologies and the willingness to explore the possibility of hidden knowledge that resides in the data. 2. Data Mining is also referred to as knowledge development in databases (KDD). State True or False:
a. 1- True, 2- True b. 1- False, 2- True c. 1- False, 2- False d. 1- True, 2- False

43. Consider the following statements: 1. A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files and online transaction records. This is called Non-volatile. 2. A data warehouse is organised around major subjects, such as customer, supplier, product and sales. This is called Subject-oriented. State True or False:
a. 1- False, 2- True b. 1- False, 2- False c. 1- True, 2- False d. 1- True, 2- True

44. Consider the following statements: 1. An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organisations. 2. An OLTP system often spans multiple versions of a database schema, due to the evolutionary process of an organisation. State True or False:
a. 1- True, 2- True b. 1- False, 2- False c. 1- False, 2- True d. 1- True, 2- False

45. Business intelligence is used to refer to systems and technologies that provide the _____________ with the means for decision-makers to extract personalised meaningful information about their business and industry, not typically available from ____________ alone.
a. Business, internal systems b. Management, data c. Business, raw data d. Partners, raw data

46. Consider the following statements: 1. Data Mining allows users to sift through the enormous amount of information available in data warehouses. 2. Business intelligence, typically drawn from an enterprise data warehouse, is used to analyse and uncover information about past performance on an aggregate level. State True or False:
a. 1- True, 2- False b. 1- True, 2- True c. 1- False, 2- True d. 1- False, 2- False

47. Consider the following statements: 1. Knowledge engineering tools may also be used to detect the violation of known data constraints. 2. There may also be inconsistencies due to data sampling, where a given attribute can have different names in different databases. State True or False:
a. 1- True, 2- True b. 1- False, 2- False c. 1- True, 2- False d. 1- False, 2- True

48. Consider the following statements: 1. Data mining often requires data integration, where integration is the merging of data from multiple data stores. 2. Databases and data warehouses typically have metadata, that is, data about the data. Such metadata can be used to help avoid errors in schema integration. State True or False:
a. 1- True, 2- False b. 1- False, 2- True c. 1- False, 2- False d. 1- True, 2- True

49. Consider the following statements: 1. The goal of the techniques for Classification is to detect relationships or associations between specific values of categorical variables in large data sets. 2. These powerful exploratory techniques have a wide range of applications in many areas of business practice and also research, from the analysis of consumer preferences or human resource management to the history of language. State True or False:
a. 1- True, 2- True b. 1- False, 2- False c. 1- False, 2- True d. 1- True, 2- False

50. Consider the following statements: 1. Classification is a Data Mining (machine learning) technique used to predict group membership for data instances. 2. Association rules and neural networks are the only two popular classification techniques. State True or False:
a. 1- True, 2- False b. 1- False, 2- False c. 1- True, 2- True d. 1- False, 2- True

51. Consider the following statements, according to Mining Quantitative Association Rules: 1. In Equidepth binning, the interval size of each bin is the same. 2. In Equiwidth binning, each bin has approximately the same number of tuples assigned to it. State True or False:
a. 1- True, 2- False b. 1- True, 2- True c. 1- False, 2- True d. 1- False, 2- False
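
As a study aid for question 51, the following minimal sketch contrasts the two binning strategies on a small list of values. The function names `equiwidth_bins` and `equidepth_bins` and the sample data are illustrative assumptions, not part of the question paper, and the sketch assumes the values are not all identical.

```python
def equiwidth_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equidepth_bins(values, k):
    """Split the sorted values into k bins holding roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers positions [i*n//k, (i+1)*n//k) of the sorted list.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equiwidth_bins(prices, 3))   # equal interval size, uneven counts
print(equidepth_bins(prices, 3))   # equal counts, uneven interval size
```

With this data the equal-width bins hold 2, 3 and 4 values, while the equal-depth bins hold 3 values each.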

52. Consider the following statements: 1. For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. 2. Data mining systems should provide capabilities to mine association rules at individual levels of abstraction and traverse easily among different abstraction spaces. State True or False:
a. 1- True, 2- False b. 1- True, 2- True c. 1- False, 2- True d. 1- False, 2- False

53. Consider the following statements with respect to Clustering: 1. The K-means procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. 2. In K-means clustering, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. State True or False:
a. 1- True, 2- False b. 1- True, 2- True c. 1- False, 2- True d. 1- False, 2- False
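
The K-means idea in question 53 (k centroids fixed a priori, one per cluster) can be sketched as follows. This is a minimal illustration under assumptions: the function name `kmeans`, the 2-D sample points and the hand-picked initial centroids are hypothetical, and real implementations usually choose initial centroids randomly.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; k is fixed by the initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(cluster), sum(ys) / len(cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Note that the number of clusters never changes during the run, in contrast to hierarchical methods, which produce a series of partitions.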

54. Consider the following statements: 1. In single-link clustering, if the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster. 2. In complete-link clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the average of distances from any member of one cluster to any member of the other cluster. State True or False:
a. 1- True, 2- True b. 1- False, 2- False c. 1- True, 2- False d. 1- False, 2- True
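
For reference when reviewing question 54, the three common inter-cluster distances can be written in a few lines. This is a sketch under assumptions: the function names and the two sample clusters are hypothetical, and Euclidean distance (`math.dist`, Python 3.8+) is assumed as the point-to-point measure.

```python
from itertools import product
from math import dist


def single_link(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Greatest distance between any member of a and any member of b."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


# Hypothetical sample clusters.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
print(single_link(cluster_a, cluster_b))
print(complete_link(cluster_a, cluster_b))
print(average_link(cluster_a, cluster_b))
```

For any pair of clusters the average-link value lies between the single-link and complete-link values.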

55. Consider the following statements: 1. The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. 2. The individual tuples making up the training set are referred to as training data set and are randomly selected from the sample population. State True or False:
a. 1- True, 2- False b. 1- False, 2- True c. 1- True, 2- True d. 1- False, 2- False

56. Consider the following statements: 1. In data cleaning, most classification algorithms have some mechanisms for handling noisy or missing data, which can help in reducing confusion during learning. 2. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting reduced feature subset, should be less than the time that would have been spent on learning from the original set of features. State True or False:
a. 1- True, 2- False b. 1- False, 2- False c. 1- False, 2- True d. 1- True, 2- True

57. Consider the following statements: 1. Click-stream is the series of page-views that the user takes through the site. Proxy servers hide page-views and it is often impossible to map a user's complete visit. 2. A page view is a click-stream record, spanning the entire Web and is typically related to that specific site that is available. State True or False:
a. 1- True, 2- False b. 1- False, 2- True c. 1- True, 2- True d. 1- False, 2- False

58. Consider the following statements: 1. Web content mining requires creative applications of Data mining and/or Text mining techniques and also its own unique approaches. 2. Web content mining is also different from Text mining because of the semi-structured nature of the Web, while Text mining focuses on complete structured texts. State True or False:
a. 1- True, 2- True b. 1- True, 2- False c. 1- False, 2- True d. 1- False, 2- False

59. Consider the following statements: 1. The construction of a multimedia data cube will facilitate multidimensional analysis of multimedia data primarily based on visual content and the mining of multiple kinds of knowledge, including summarisation, comparison, classification, association and clustering. 2. It is easy to implement a data cube efficiently despite a large number of dimensions. State True or False:
a. 1- True, 2- True b. 1- False, 2- False c. 1- True, 2- False d. 1- False, 2- True

60. A stop list is a set of words that are deemed ______________. Stop lists may vary per _____________. a. Relevant, file set b. Rational, document set c. Irrational, file set d. Irrelevant, document set

Part C (Four mark questions)


61. Data cleaning and pre-processing comprise the following:
1. Removal of noise and outliers
2. Collecting necessary information to model or account for noise
3. Handling of missing data
4. Finding useful features to represent the data relative to the goal
State True or False:
a. All statements are true b. Statements 2, 3 & 4 are true c. Statements 1, 2 & 4 are true d. Statements 1, 2 & 3 are true

62. Match the following:
Part A
1. Artificial Neural networks
2. Decision Tree
3. Rule Induction
4. Nearest Neighbor
Part B
A. A classification technique that classifies each record based on the records most similar to it in a historical database.
B. The extraction of useful if-then rules from databases based on statistical significance.
C. Non-linear predictive models that learn through training and resemble biological neural networks in structure.
D. Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
a. 1A, 2B, 3C, 4D b. 1D, 2C, 3B, 4A c. 1B, 2C, 3D, 4A d. 1C, 2D, 3B, 4A

63. Match the following with respect to Star schema:
Part A
1. Roll up
2. Drill down
3. Slice and dice
4. Pivot
Part B
A. Also known as rotate, it is a visualisation operation that rotates the data axes in view, in order to provide an alternative presentation of the data.
B. It is performed by dimension reduction: one or more dimensions are removed from the given cube.
C. It can be realised by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
D. This operation performs a selection on one dimension of the given cube, resulting in a sub cube.
a. 1A, 2B, 3C, 4D b. 1B, 2C, 3D, 4A c. 1D, 2C, 3B, 4A d. 1B, 2A, 3D, 4C

64. Consider the following statements:
1. Business Intelligence can reveal emerging trends from which the company might profit.
2. Data Mining allows users to sift through the enormous amount of information available in data warehouses.
3. It is from this sifting process that business intelligence gems may be found.
4. Data warehousing is not an intelligence tool or framework.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 1, 3 & 4 are true c. Statements 2, 3 & 4 are true d. Statements 1, 2 & 4 are true

65. Match the following:
Part A
1. Smoothing
2. Aggregation
3. Generalisation
4. Normalisation
Part B
A. Where low level or primitive (raw) data are replaced by higher level concepts through the use of concept hierarchies.
B. Where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
C. Where summary or aggregation operations are applied to the data.
D. Which works to remove the noise from data. Such techniques include binning, clustering and regression.
a. 1A, 2B, 3C, 4D b. 1D, 2C, 3A, 4B c. 1B, 2C, 3D, 4A d. 1C, 2B, 3D, 4A

66. Consider the following statements:
1. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
2. Mining on the large data set should be more efficient yet produce the same (or almost the same) analytical results.
3. Data cubes store multidimensional aggregated information.
4. The cube created at the lowest level of abstraction is referred to as the base cuboid.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 2, 3 & 4 are true c. Statements 1, 3 & 4 are true d. All statements are true

67. Consider the following statements:
1. Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns.
2. Clustering constitutes a major class of DBMS algorithms.
3. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the table are assigned, either deterministically or probabilistically.
4. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 1, 3 & 4 are true c. Statements 1, 2 & 4 are true d. Statements 2, 3 & 4 are true

68. Consider the following statements, according to the Dynamic Itemset Counting (DIC) Algorithm:
1. The rationale behind DIC is that it works like a train running over the data, with stops at intervals M between transactions.
2. When the train reaches the end of the transaction file, it has made one pass over the data, and it starts all over again from the end for the next pass.
3. The passengers on the train are itemsets. When an itemset is on the train, we count its occurrence in the transactions that are read.
4. The four different structures defined in the DIC algorithm are: Dashed box, Dashed circle, Solid box, Solid circle.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 2, 3 & 4 are true c. Statements 1, 3 & 4 are true d. Statements 1, 2 & 4 are true

69. Consider the following statements:
1. For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space.
2. Data warehousing systems should provide capabilities to mine association rules at multiple levels of abstraction and traverse easily among different abstraction spaces.
3. A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts.
4. Data can be generalised by replacing low level concepts within the data by their higher level concepts, or ancestors, from a concept hierarchy.
State True or False:
a. Statements 1, 3 & 4 are true b. Statements 1, 2 & 4 are true c. All statements are true d. Statements 2, 3 & 4 are true

70. The four methods of clustering are:
1. K-Means
2. Hierarchical
3. Aggregative
4. Divisive
Which of the above options is false?
a. Option 1 is false b. Option 2 is false c. Option 3 is false d. Option 4 is false

71. Consider the following statements:
1. The information gain measure is used to select the test attribute at each node in the tree.
2. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node.
3. The attribute with the minimum information gain minimises the information needed to classify the samples in the resulting partitions and reflects the least randomness or impurity in these partitions.
4. The information-theoretic approach minimises the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 2, 3 & 4 are true c. Statements 1, 2 & 4 are true d. Statements 1, 3 & 4 are true
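
The attribute selection described in question 71 can be illustrated with a short entropy calculation. This is a minimal sketch under assumptions: the function names `entropy` and `information_gain`, the attribute names and the four sample rows are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training samples; "play" is the class label attribute.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)  # attribute chosen as the test at this node
print(gains, best)
```

Here splitting on `outlook` yields pure partitions, so it has the larger gain and would be selected as the test attribute, while `windy` leaves the classes evenly mixed.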

72. Consider the following statements:
1. In the prepruning approach, a tree is pruned by halting its construction early.
2. Postpruning removes branches from a fully grown tree.
3. A tree node is pruned by removing its branches.
4. If partitioning the samples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is not halted.
State True or False: a. Statements 1, 2 & 3 are true b. Statements 2, 3 & 4 are true c. Statements 1, 2 & 4 are true d. Statements 1,3 & 4 are true

73. Consider the following statements:
1. Web Usage Mining focuses on techniques that could predict the behaviour of users while they are interacting with the WWW.
2. Web usage mining, which discovers user navigation patterns from web data, tries to discover useful information from the secondary data derived from the interactions of the users while surfing on the Web.
3. Web structure mining collects the data from Web log records to discover user access patterns of web pages.
4. The insight knowledge could be utilised in personalisation, system improvement, site modification, business intelligence and usage characterisation.

State True or False: a. Statements 1, 2 & 4 are true b. Statements 1,2 & 3 are true c. Statements 2,3 & 4 are true d. Statements 1, 3 & 4 are true

74. Match the following:
Part A
1. Web mining
2. Web content mining
3. Database approach
4. Agent-based approach
Part B
A. It aims at modelling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyse it.
B. It helps to solve the problem of discovering how users are using Web sites.
C. Due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge information still presents many challenging research problems.
D. It aims at improving information finding and filtering.
a. 1B, 2C, 3A, 4D b. 1A, 2B, 3C, 4D c. 1D, 2C, 3B, 4A d. 1C, 2B, 3A, 4D

75. Consider the following statements, according to Multidimensional Analysis of Multimedia Data:
1. Each image contains two descriptors: a feature descriptor and a layout descriptor.
2. The description information encompasses fields like image file name, image URL, image type, a list of all known Web pages referring to the image, a list of keywords and a thumbnail used by the user interface for image and video browsing.
3. The layout descriptor is a set of vectors for each visual characteristic.
4. The main vectors are a color vector containing the color histogram quantised to 512 colors, an MFC vector and an MFO vector.
State True or False:
a. Statements 1, 2 & 3 are true b. Statements 1, 3 & 4 are true c. Statements 2, 3 & 4 are true d. Statements 1, 2 & 4 are true

Answer Keys

Part A (Q. No., Ans. Key)
1 b, 2 c, 3 a, 4 d, 5 b, 6 a, 7 d, 8 b, 9 b, 10 a
11 d, 12 b, 13 b, 14 a, 15 d, 16 b, 17 c, 18 a, 19 c, 20 b
21 b, 22 a, 23 d, 24 b, 25 a, 26 c, 27 b, 28 d, 29 b, 30 c
31 d, 32 b, 33 c, 34 a, 35 c, 36 a, 37 b, 38 a, 39 b, 40 c

Part B (Q. No., Ans. Key)
41 c, 42 d, 43 a, 44 d, 45 a, 46 b, 47 c, 48 d, 49 c, 50 a
51 d, 52 a, 53 a, 54 c, 55 a, 56 d, 57 a, 58 b, 59 c, 60 d

Part C (Q. No., Ans. Key)
61 d, 62 d, 63 b, 64 a, 65 b, 66 c, 67 b, 68 c, 69 a, 70 c
71 c, 72 a, 73 a, 74 a, 75 d
