
Management Information System: concepts of data, information and knowledge. Processing of data using computers.

Storage and retrieval of massive data on computers. Data Warehousing and Data Mining. Document Image Management Systems (DIMS). Phases in the Software Systems Life Cycle. Decision Support Systems, Knowledge-Based Systems and their applications in management.

Data is more than the raw material of information systems. Managers and information systems professionals have broadened the concept of data resources: they realize that data constitutes a valuable organizational resource. Data can take many forms, including traditional alphanumeric data, composed of numbers, letters and other characters that describe business transactions and other events and entities. Text data, consisting of sentences and paragraphs used in written communications; image data, such as graphic shapes and figures; and audio data, the human voice and other sounds, are also important forms of data. In its raw or unorganized form (such as alphabets, numbers, or symbols), data refers to, or represents, conditions, ideas, or objects. Data is limitless and present everywhere in the universe. The word "data" is the plural of datum, which means fact, observation, assumption or occurrence. More precisely, data are representations of facts pertaining to people, things, ideas and events. Data are represented by symbols such as letters of the alphabet, numerals or other special symbols.

The word "data" is the plural of the Latin word datum, meaning "something given" (from dare, to give); the term came into use around the mid-17th century. Data is a collection of raw facts, which may or may not be meaningful. The input to any system may be treated as data. Data is difficult to understand on its own and must be processed before it can be interpreted; it need not be in any particular order. Examples: statistics, numbers, characters, images.

The word "information" is derived from the Latin informare, which means to instruct or give form to; it came into use in the late Middle English period. Information is the outcome obtained after processing data, and it is always meaningful. The output produced after processing by a system is information, and it is easy to understand.

Information is already in an understandable form, though it may be processed further to make it still more understandable. Information should be presented in order. Examples: reports, knowledge.
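As a minimal sketch of this distinction, the following fragment processes raw, unordered sales records (data) into an ordered summary (information); the regions, products and figures are invented purely for illustration.

```python
# Raw data: unprocessed facts, meaningless in isolation (invented example figures).
raw_sales = [
    ("North", "Widget", 120), ("South", "Widget", 95),
    ("North", "Gadget", 40),  ("South", "Gadget", 75),
]

# Processing: classify and accumulate the raw facts.
totals_by_region = {}
for region, product, units in raw_sales:
    totals_by_region[region] = totals_by_region.get(region, 0) + units

# Information: an ordered, meaningful report derived from the data.
for region in sorted(totals_by_region):
    print(f"Total units sold in {region}: {totals_by_region[region]}")
```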

MANAGEMENT INFORMATION SYSTEM Definition: A Management Information System is an integrated user-machine system for providing information to support the operations, management, analysis and decision-making functions in an organization. The system utilizes computer hardware and software, manual procedures, models for analysis, planning, control and decision making, and a database. MIS provides information to users in the form of reports and output from simulations by mathematical models. The report and model output can be provided in tabular or graphic form. MIS provides a variety of information products to managers, which include three reporting alternatives: Periodic Scheduled Reports (e.g. weekly sales analysis reports, monthly financial statements), Exception Reports (periodic reports that contain information only about specific events), and Demand Reports and Responses (information supplied on demand).
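The three reporting alternatives can be sketched as simple selections over the same records; the inventory items and reorder levels below are assumptions made only for illustration.

```python
# Hedged sketch of the three MIS reporting alternatives; data is invented.
inventory = [
    {"item": "Bolts",  "on_hand": 500, "reorder_level": 200},
    {"item": "Nuts",   "on_hand": 150, "reorder_level": 200},
    {"item": "Screws", "on_hand": 90,  "reorder_level": 100},
]

def periodic_report(records):
    """Scheduled report: everything, produced on a fixed cycle (e.g. weekly)."""
    return records

def exception_report(records):
    """Exception report: only records about specific events, here low stock."""
    return [r for r in records if r["on_hand"] < r["reorder_level"]]

def demand_report(records, item):
    """Demand report/response: information supplied only when a manager asks for it."""
    return [r for r in records if r["item"] == item]

print(exception_report(inventory))      # only Nuts and Screws appear
print(demand_report(inventory, "Bolts"))
```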

Both data and information are entered from the environment. The database contains the data provided by the subsystems. The database contents are used by software that produces periodic and special reports, as well as by mathematical models that simulate various aspects of the firm's operations. The software outputs are used by the persons who are responsible for solving the firm's problems.

COMPONENTS OF MIS

Process: the net contribution of many individual processes in the MIS design is the conversion of inputs into outputs.
Inputs: sales in units by each salesman for a period; estimated sales in units of competitors; economic conditions and trends.
Outputs: sales by product; sales by salesman; sales by region, salesman and product; sales trend analysis; sales forecasts.

MIS CHARACTERISTICS: Management Oriented/Directed, Business Driven, Integrated, Common Data Flows, Heavy Planning Element, Subsystem Concept, Flexibility and Ease of Use, Database, Distributed Systems, Information as a Resource.

STRUCTURE OF MIS

Approaches: physical components; information system processing functions; decision support; levels of management activities; organizational functions.

Based on Physical Components: Hardware (e.g. CPU, monitor, keyboard, printer), Software (e.g. system and application software), Database (e.g. data stored in files), Procedures (e.g. manuals), Operating Personnel (e.g. computer operators, programmers, system analysts, the system manager), and Input & Output (e.g. printouts, reports).

Based on Processing Functions: to process transactions (e.g. recording a purchase or a sale of a product); to maintain master files (e.g. for preparing an employee's salary, required data items such as basic pay, allowances and deductions); to produce reports (e.g. specific or ad hoc reports); to process enquiries (e.g. regular or ad hoc enquiries); and to process interactive support applications (e.g. applications designed for planning, analysis and decision making).

Based on Output for Users: transaction documents or screens; preplanned reports; preplanned inquiry responses; ad hoc reports and inquiry responses; user-machine dialog results.

Based on Management Activities

Based on Organizational Functions

MIS Support for Decision Making: Structured/Programmable Decisions, Unstructured/Non-Programmable Decisions, and Semi-Structured Decisions. Structured/programmable decisions are repetitive, routine and have a definite procedure for handling them (e.g. an inventory reorder formula or rules for granting credit). Unstructured/non-programmable decisions are non-routine decisions in which the decision maker must provide judgment, evaluation and insight into the problem definition. Semi-structured decisions are decisions where only part of the problem has a clear-cut answer provided by an accepted procedure.
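A structured decision can be expressed directly as a procedure. The sketch below uses a common textbook reorder-point rule (average daily demand times lead time, plus safety stock); both the formula and the figures are standard assumptions, not taken from this text.

```python
# A structured (programmable) decision: the outcome is fully determined by a rule.
def reorder_point(avg_daily_demand, lead_time_days, safety_stock):
    return avg_daily_demand * lead_time_days + safety_stock

def should_reorder(on_hand, avg_daily_demand, lead_time_days, safety_stock):
    # No human judgment is needed; the rule alone decides.
    return on_hand <= reorder_point(avg_daily_demand, lead_time_days, safety_stock)

print(should_reorder(on_hand=120, avg_daily_demand=10, lead_time_days=7, safety_stock=30))
# reorder point = 10*7 + 30 = 100; on_hand 120 > 100, so the rule answers False
```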

EDP and MIS. EDP: these systems process mostly clerical and supervisory types of applications related to record keeping, processing of large volumes of data, and generation of authentic and accurate reports for operational management. These systems offer cost reduction by saving manpower and time. They serve as an information source for operational management and assist in operational control and planning. Application uses: payroll, inventory control, production, costing, purchase and logistics. An example: a typical EDP application for ledger accounting consists of modules for data storage of account vouchers and generation of accounting reports such as ledgers, trial balance, profit and loss account, etc. The primary objective of the application is book keeping; its motive is to ease the clerical functions and assist in operational control. EDP/MIS/DSS: EDP was first applied to the lower operational levels of the organization to automate the paperwork.

Characteristics: a focus on data, storage, processing and flows at the operational level; efficient transaction processing; scheduled and optimized computer runs; integrated files for related jobs; summary reports for management. The EDP level of activity in many firms has become an efficient facility for transaction processing. MIS: an information focus, aimed at middle managers; a structured information flow; integration of EDP by business function; inquiry and report generation with a database. When controls are incorporated into an EDP application, it is upgraded to an MIS application. DSS: focused higher in the organization, with an emphasis on the following characteristics: decision focused; aimed at top managers and executive decision makers; emphasis on flexibility, adaptability and quick response; user initiated and controlled; support for the personal decision-making styles of individual managers. Pitfalls in MIS development: the organization does not have a reliable management system; the organization has not defined its mission clearly; the organization's objectives have not been specified; management lacks interest in the MIS development process and relies solely on the MIS development specification; a communication gap exists between the MIS development team and management; the MIS development team is incompetent.

Data Processing Concept

Introduction: Each organisation, regardless of its size or purpose, generates data to keep a record of events and transactions that take place within the business. Generating and organising this data in a useful way is called data processing. In this lesson, we shall discuss various terms such as data, information, data processing and data processing system.

Objectives: After going through this lesson, you will be in a position to define the concepts of data, information and data processing, explain various data processing activities, utilise the data processing cycle, and explain data elements, records, files and databases.

Data: The word data is the plural of datum, which means fact, observation, assumption or occurrence. More precisely, data are representations of facts pertaining to people, things, ideas and events. Data are represented by symbols such as letters of the alphabet, numerals or other special symbols.

Data Processing: Data processing is the act of handling or manipulating data in some fashion. Regardless of the activities involved in it, processing tries to assign meaning to data. Thus, the ultimate goal of processing is to transform data into information. Data processing is the process through which facts and figures are collected, assigned meaning, communicated to others and retained for future use. Hence we can define data processing as a series of actions or operations that converts data into useful information. We use the term data processing system to include the resources that are used to accomplish the processing of data.

Information: Information, thus, can be defined as data that has been transformed into a meaningful and useful form for specific purposes. In some cases data may not require any processing before constituting information. Generally, however, data is not useful unless it is subjected to a process through which it is manipulated and organised and its contents analyzed and evaluated. Only then does data become information. There is no hard and fast rule for determining when data becomes information. A set of letters and numbers may be meaningful to one person, but may have no meaning to another. Information is identified and defined by its users. For example, when you purchase something in a departmental store, a number of data items are put together, such as your name, address, the articles you bought, the number of items purchased, the price, the tax and the amount you paid. Separately, these are all data items, but if you put them together, they represent information about a business transaction.

Data Processing Activities: As discussed above, data processing consists of those activities which are necessary to transform data into information. Man has, in the course of time, devised certain tools to help him in processing data. These include manual tools such as pencil and paper, mechanical tools such as filing cabinets, electromechanical tools such as adding machines and typewriters, and electronic tools such as calculators and computers. Many people immediately associate data processing with computers. As stated above, a computer is not the only tool used for data processing; it can be done without computers as well. However, computers have outperformed people at certain tasks, while for some other tasks a computer is a poor substitute for human skill and intelligence. Regardless of the type of equipment used, the various functions and activities which need to be performed for data processing can be grouped under five basic categories, as shown in Fig. 2.1.

Collection: Data originates in the form of events, transactions or observations. This data is then recorded in some usable form. Data may be initially recorded on paper source documents and then converted into a machine-usable form for processing. Alternatively, it may be recorded by a direct input device in a paperless, machine-readable form. Data collection is also termed data capture.

Conversion: Once the data is collected, it is converted from its source documents to a form that is more suitable for processing. The data is first codified by assigning identification codes. A code comprises numbers, letters, special characters, or a combination of these. For example, an employee may be allotted the code 52-53-162, his category may be recorded as A class, etc. It is useful to codify data when it requires classification. To classify means to categorize, i.e., data with similar characteristics are placed in similar categories or groups. For example, one may like to arrange accounts data according to account number or date, so that a balance sheet can easily be prepared. After classification, the data is verified or checked to ensure accuracy before processing starts. After verification, the data is transcribed from one data medium to another. For example, if data processing is done using a computer, the data may be transferred from source documents to a machine-sensible form using magnetic tape or disk.

Manipulation: Once data is collected and converted, it is ready for the manipulation function, which converts data into information. Manipulation consists of the following activities:

Sorting: This involves the arrangement of data items in a desired sequence. Usually, it is easier to work with data if it is arranged in a logical sequence. Most often, the data are arranged in alphabetical sequence. Sometimes sorting itself will transform data into information. For example, the simple act of sorting names in alphabetical order gives meaning to a telephone directory; the directory would be practically worthless without sorting. Business data processing extensively utilises the sorting technique. Virtually all the records in business files are maintained in some logical sequence. Numeric sorting is common in computer-based processing systems because it is usually faster than alphabetical sorting.

Calculating: Arithmetic manipulation of data is called calculating. Items of recorded data can be added to one another, subtracted, divided or multiplied to create new data, as shown in Fig. 2.2(a). Calculation is an integral part of data processing. For example, in calculating an employee's pay, the hours worked multiplied by the hourly wage rate gives the gross pay. Based on total earnings, income-tax deductions are computed and subtracted from the gross pay to arrive at the net pay.

Summarizing: To summarize is to condense or reduce masses of data to a more usable and concise form, as shown in Fig. 2.2(b). For example, you may summarize a lecture attended in a class by writing small notes in one or two pages. When the data involved are numbers, you summarize by counting or accumulating the totals of the data in a classification, or by selecting strategic data from the mass of data being processed. For example, the summarizing activity may provide a general manager with sales totals by major product line, the sales manager with sales totals by individual salesman as well as by product line, and a salesman with sales data by customer as well as by product line.
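A minimal sketch of these manipulation activities follows, assuming a handful of invented employee records and a flat 10% deduction (the text does not specify actual figures).

```python
# Sketch of sorting, calculating and summarizing; all figures are assumptions.
employees = [
    {"name": "Rao",   "hours": 40, "rate": 12.0},
    {"name": "Ahmed", "hours": 35, "rate": 15.0},
    {"name": "Das",   "hours": 45, "rate": 10.0},
]

# Sorting: arrange the records in a logical (alphabetical) sequence.
employees.sort(key=lambda e: e["name"])

# Calculating: hours worked multiplied by the hourly rate gives gross pay;
# deductions are subtracted to arrive at net pay.
TAX_RATE = 0.10
for e in employees:
    e["gross"] = e["hours"] * e["rate"]
    e["net"] = e["gross"] - e["gross"] * TAX_RATE

# Summarizing: condense the individual records into a single total.
total_net = sum(e["net"] for e in employees)
print(f"Total net payroll: {total_net:.2f}")
```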

Comparing: To compare data is to perform an evaluation in relation to some known measure. For example, business managers compare data to discover how well their companies are doing. They may compare current sales figures with those for last year to analyze the performance of the company in the current month.

Managing the Output Results: Once data has been captured and manipulated, the following activities may be carried out:

Storing: To store is to hold data for continued or later use. Storage is essential for any organised method of processing and re-using data. The storage mechanisms for data processing systems are file cabinets in a manual system, and electronic devices such as magnetic disks or magnetic tapes in a computer-based system. The storing activity involves storing data and information in an organised manner in order to facilitate retrieval. Of course, data should be stored only if the value of having it in future exceeds the storage cost.

Retrieving: To retrieve means to recover or find again the stored data or information. Retrieval techniques use data storage devices. Thus data, whether in file cabinets or in computers, can be recalled for further processing. Retrieval and comparison of old data gives meaning to current information.

Communication: Communication is the process of sharing information. Unless the information is made available to the users who need it, it is worthless. Thus, communication involves the transfer of data and information produced by the data processing system to the prospective users of such information or to another data processing system. As a result, reports and documents are prepared and delivered to the users. In electronic data processing, results are communicated through display units or terminals.

Reproduction: To reproduce is to copy or duplicate data or information. This reproduction activity may be done by hand or by machine.

The Data Processing Cycle: The data processing activities described above are common to all data processing systems, from manual to electronic systems. These activities can be grouped into four functional categories, viz., data input, data processing, data output and storage, constituting what is known as the data processing cycle. (i) Input: the term input refers to the activities required to record data and to make it available for processing. Input can also include the steps necessary to check, verify and validate data contents. (ii) Processing: the term processing denotes the actual data manipulation techniques, such as classifying, sorting, calculating, summarizing and comparing, that convert data into information. (iii) Output: this is a communication function which transmits the information, generated after processing of data, to the persons who need it. Sometimes output also includes a decoding activity which converts the electronically generated information into human-readable form. (iv) Storage: this involves the filing of data and information for future use. These four basic functions are performed in a logical sequence, as shown in Fig. 2.3, in all data processing systems.
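The four-stage cycle can be sketched as a small pipeline; the file name and record layout below are assumptions made only for illustration.

```python
# Hedged sketch of the input -> processing -> output -> storage cycle.
import json

def input_stage():
    # Input: record data and make it available for processing (validation omitted).
    return [{"customer": "A", "amount": 250}, {"customer": "B", "amount": 400}]

def processing_stage(records):
    # Processing: classify, calculate, summarize.
    return {"transactions": len(records), "total": sum(r["amount"] for r in records)}

def output_stage(info):
    # Output: communicate the information to the people who need it.
    print(f"{info['transactions']} transactions, total {info['total']}")

def storage_stage(info, path="daily_summary.json"):
    # Storage: file the information so that it can be retrieved later.
    with open(path, "w") as f:
        json.dump(info, f)

summary = processing_stage(input_stage())
output_stage(summary)
storage_stage(summary)
```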

Computer Processing Operations: A computer can perform only the following four operations, which enable computers to carry out the various data processing activities we have just discussed. (a) Input/output operations: a computer can accept data (input) from, and supply processed data (output) to, a wide range of input/output devices. These devices, such as keyboards, display screens and printers, make human-machine communication possible. (b) Calculation and text manipulation operations: computer circuits perform calculations on numbers; they are also capable of manipulating numerics and other symbols used in text with equal efficiency. (c) Logic/comparison operations: a computer also possesses the ability to perform logic operations. For example, if we compare two items represented by the symbols A and B, there are only three possible outcomes: A is less than B (A<B), A is equal to B (A=B), or A is greater than B (A>B). A computer can perform such comparisons and then, depending on the result, follow a predetermined path to complete its work. This ability to compare is an important property of computers. (d) Storage and retrieval operations: both data and program instructions are stored internally in a computer. Once they are stored in the internal memory, they can be called up quickly, or retrieved, for further use.

Data Processing System: The activity of data processing can be viewed as a system. According to James O'Brien, a system can be defined as a group of interrelated components that seeks the attainment of a common goal by accepting inputs and producing outputs in an organised process. For example, a production system accepts raw material as input and produces finished goods as output. Similarly, a data processing system can be viewed as a system that uses data as input and processes this data to produce information as output. There are many kinds of data processing systems. A manual data processing system is one that utilizes tools like pens and filing cabinets. A mechanical
data processing system uses devices such as typewriters, calculating machines and book-keeping machines. Finally, electronic data processing uses computers to automatically process data.

Mass Storage Basics: A mass-storage device is electronic hardware that stores information and supports a protocol for sending and retrieving the information over a hardware interface. The information can be anything that can be stored electronically: executable programs, source code, documents, images, spreadsheet numbers, database entries, data-logger output, configuration data, or other text or numeric data. Mass-storage devices typically store information in files. A file system defines how the files are organized in the storage media.

When to Use a Storage Device: Implementing a mass-storage function is a solution for systems that need to read or write moderate to large amounts of data. If the device has a Universal Serial Bus (USB) interface, any PC or other USB host can access the storage media. Generic USB mass-storage devices include the hard drives, flash drives, CD drives, and DVD drives available from any computer-hardware store. Table 1-1 lists popular device types. These devices have just one function: to provide storage space for the systems they connect to. Another type of USB mass-storage device (or storage device for short) is the special-purpose device with storage capabilities. For example, a camera can capture images and store the images in files. A data logger can collect and store sensor readings in files. A robotic device can receive files containing configuration parameters. With the addition of a USB mass-storage interface, any of these devices can use USB to exchange files with PCs and other USB hosts. Generic storage devices are readily available and inexpensive. Unless you're employed by a storage-device manufacturer, there isn't much point in designing and programming your own generic devices. But special-purpose USB storage devices are useful in many embedded systems, including one-of-a-kind projects and products manufactured in small quantities. Another option for some systems is to add USB host-controller hardware and mass-storage firmware; the embedded system can then store and read files on off-the-shelf USB storage devices.

Benefits: Adding storage-device capabilities to a system has several benefits. With a USB device controller, a system can make the contents of its storage media available to any PC or other USB host computer. File systems provide a standard way to store and access data: a PC or other USB host can format the media in a USB storage device to use the FAT16 or FAT32 file system, and when the device is connected to a PC, the operating system enables reading and writing of files. Users can access the files without having to install and learn a vendor-specific application. Storage media is readily available: flash-memory cards are convenient and have enough capacity for many applications, and some cards require only a few port pins to access. Devices that need large amounts of storage can interface to hard drives.

Computer data storage: Computer data storage, often called storage or memory, is a technology consisting of computer components and recording media used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a computer is what manipulates data by performing computations.
In practice, almost all computers use a storage hierarchy, which puts fast but expensive and small storage options close to the CPU and slower but larger and cheaper options farther away. Often the fast, volatile technologies (which lose data when powered off) are referred to as memory, while slower permanent technologies are referred to as
storage, but these terms can also be used interchangeably. In the von Neumann architecture, the CPU consists of two main parts: the control unit and the arithmetic logic unit (ALU). The former controls the flow of data between the CPU and memory; the latter performs arithmetic and logical operations on data.

Functionality: Without a significant amount of memory, a computer would merely be able to perform fixed operations and immediately output the result. It would have to be reconfigured to change its behavior. This is acceptable for devices such as desk calculators, digital signal processors, and other specialised devices. Von Neumann machines differ in having a memory in which they store their operating instructions and data. Such computers are more versatile in that they do not need to have their hardware reconfigured for each new program, but can simply be reprogrammed with new in-memory instructions; they also tend to be simpler to design, in that a relatively simple processor may keep state between successive computations to build up complex procedural results. Most modern computers are von Neumann machines.

Data organization and representation: A modern digital computer represents data using the binary numeral system. Text, numbers, pictures, audio, and nearly any other form of information can be converted into a string of bits, or binary digits, each of which has a value of 1 or 0. The most common unit of storage is the byte, equal to 8 bits. A piece of information can be handled by any computer or device whose storage space is large enough to accommodate the binary representation of the piece of information, or simply data. For example, the complete works of Shakespeare, about 1250 pages in print, can be stored in about five megabytes (40 million bits), with one byte per character. Data is encoded by assigning a bit pattern to each character, digit, or multimedia object. Many standards exist for encoding (e.g., character encodings like ASCII, image encodings like JPEG, video encodings like MPEG-4). By adding redundant bits to each encoded unit, errors in coded data can be detected and corrected on the basis of mathematical algorithms. Errors occur, with low probability, due to random bit-value flipping, physical bit fatigue (loss of the physical bit's ability to maintain a distinguishable value of 0 or 1), or errors in inter- or intra-computer communication. A random bit flip (e.g., due to random radiation) is typically corrected upon detection. A bit or a group of malfunctioning physical bits (the specific defective bit is not always known; the group definition depends on the specific storage device) is typically automatically fenced out, taken out of use by the device, and replaced with another functioning equivalent group in the device, where the corrected bit values are restored (if possible). The cyclic redundancy check (CRC) method is typically used in storage for error detection and correction. Data compression methods allow, in many cases, a string of bits to be represented by a shorter bit string (compression) and the original string to be reconstructed when needed (decompression). This makes it possible to use substantially less storage (by tens of percent) for many types of data, at the cost of more computation (compressing and decompressing when needed). The trade-off between the storage cost saved and the cost of the related computations and possible delays in data availability is analysed before deciding whether to keep certain data in a database compressed or not.
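A short sketch of these points, using Python's standard zlib module: text is encoded at one byte per character, then compressed into a shorter byte string and reconstructed. The sample text is arbitrary, and the savings depend entirely on the data.

```python
import zlib

text = "To be, or not to be, that is the question. " * 200
raw = text.encode("ascii")        # one byte per character in this encoding
packed = zlib.compress(raw)       # a shorter bit string representing the same data

print(len(raw), "bytes raw,", len(raw) * 8, "bits")
print(len(packed), "bytes compressed")
print(zlib.decompress(packed).decode("ascii") == text)  # the original is reconstructed
```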
For security reasons certain types of data (e.g., credit-card information) may be kept encrypted in storage to prevent the possibility of unauthorized information reconstruction from chunks of storage snapshots.

Hierarchy of storage: Various forms of storage can be divided according to their distance from the central processing unit. The fundamental components of a general-purpose computer are the arithmetic and logic unit, control circuitry, storage space, and input/output devices. Generally, the lower a storage is in the hierarchy, the lower its bandwidth and the greater its access latency from the CPU. This traditional division of storage into primary, secondary, tertiary and off-line storage is also guided by cost per bit. In contemporary usage, memory is usually semiconductor read-write random-access storage, typically DRAM (dynamic RAM) or other forms of fast but temporary storage. Storage consists of storage devices and their media not directly accessible by the CPU (secondary or tertiary storage), typically hard disk drives, optical disc drives, and other devices slower than RAM but non-volatile (retaining contents when powered down). Historically, memory has been called core, main memory, real storage or internal memory, while storage devices have been referred to as secondary storage, external memory or auxiliary/peripheral storage.

Primary storage: Primary storage (or main memory, or internal memory), often referred to simply as memory, is the only storage directly accessible to the CPU. The CPU continuously reads instructions stored there and executes them as required. Any data actively operated on is also stored there in a uniform manner. Historically, early computers used delay lines, Williams tubes, or rotating magnetic drums as primary storage. By 1954, those unreliable methods were mostly replaced by magnetic core memory. Core memory remained dominant until the 1970s, when advances in integrated circuit technology allowed semiconductor memory to become economically competitive. This led to modern random-access memory (RAM). It is small-sized and light, but quite expensive at the same time. (The particular types of RAM used for primary storage are also volatile, i.e. they lose the information when not powered.) As shown in the diagram, besides the main large-capacity RAM there are traditionally two more sub-layers of primary storage. Processor registers are located inside the processor. Each register typically holds a word of data (often 32 or 64 bits). CPU instructions instruct the arithmetic and logic unit to perform various calculations or other operations on this data (or with the help of it). Registers are the fastest of all forms of computer data storage. Processor cache is an intermediate stage between ultra-fast registers and much slower main memory; it is introduced solely to increase the performance of the computer. The most actively used information in main memory is simply duplicated in the cache memory, which is faster but of much smaller capacity; main memory, on the other hand, is much slower but has a much greater storage capacity than processor registers. A multi-level hierarchical cache setup is also commonly used: the primary cache is the smallest and fastest and is located inside the processor, while the secondary cache is somewhat larger and slower. Main memory is directly or indirectly connected to the central processing unit via a memory bus. It is actually two buses (not shown in the diagram): an address bus and a data bus. The CPU first sends a number through the address bus, a number called the memory address, that indicates the desired location of data; then it reads or writes the data itself using the data bus. Additionally, a memory management unit (MMU) is a small device between the CPU and RAM that recalculates the actual memory address, for example to provide an abstraction of virtual memory or other tasks. As the RAM types used for primary storage are volatile (cleared at start-up), a computer containing only such storage would not have a source to read instructions from in order to start the computer. Hence, non-volatile primary storage containing a small startup program (BIOS) is used to bootstrap the computer, that is, to read a larger program from non-volatile secondary storage into RAM and start executing it. A non-volatile technology used for this purpose is called ROM, for read-only memory (the terminology may be somewhat confusing, as most ROM types are also capable of random access). Many types of ROM are not literally read only, as updates are possible; however, updating is slow and memory must be erased in large portions before it can be re-written. Some embedded systems run programs directly from ROM (or similar), because such programs are rarely changed. Standard computers do not store non-rudimentary programs in ROM; rather, they use large capacities of secondary storage, which is non-volatile as well and not as costly.
Recently, primary storage and secondary storage in some uses refer to what was historically called, respectively, secondary storage and tertiary storage.

Secondary storage: Secondary storage (also known as external memory or auxiliary storage) differs from primary storage in that it is not directly accessible by the CPU. The computer usually uses
its input/output channels to access secondary storage and transfers the desired data using an intermediate area in primary storage. Secondary storage does not lose the data when the device is powered down; it is non-volatile. Per unit, it is typically also two orders of magnitude less expensive than primary storage. Modern computer systems typically have two orders of magnitude more secondary storage than primary storage, and data are kept there for a longer time. In modern computers, hard disk drives are usually used as secondary storage. The time taken to access a given byte of information stored on a hard disk is typically a few thousandths of a second, or milliseconds. By contrast, the time taken to access a given byte of information stored in random-access memory is measured in billionths of a second, or nanoseconds. This illustrates the significant access-time difference which distinguishes solid-state memory from rotating magnetic storage devices: hard disks are typically about a million times slower than memory. Rotating optical storage devices, such as CD and DVD drives, have even longer access times. With disk drives, once the disk read/write head reaches the proper placement and the data of interest rotates under it, subsequent data on the track are very fast to access. To reduce the seek time and rotational latency, data are transferred to and from disks in large contiguous blocks. When data reside on disk, block access that hides latency offers a way to design efficient external-memory algorithms. Sequential or block access on disks is orders of magnitude faster than random access, and many sophisticated paradigms have been developed to design efficient algorithms based upon sequential and block access. Another way to reduce the I/O bottleneck is to use multiple disks in parallel in order to increase the bandwidth between primary and secondary memory. Some other examples of secondary storage technologies are flash memory (e.g. USB flash drives or keys), floppy disks, magnetic tape, paper tape, punched cards, standalone RAM disks, and Iomega Zip drives. Secondary storage is often formatted according to a file system format, which provides the abstraction necessary to organize data into files and directories, while also providing additional information (called metadata) describing the owner of a certain file, the access time, the access permissions, and other information. Most computer operating systems use the concept of virtual memory, allowing utilization of more primary storage capacity than is physically available in the system. As primary memory fills up, the system moves the least-used chunks (pages) to secondary storage devices (a swap file or page file), retrieving them later when they are needed. The more of these retrievals from slower secondary storage are necessary, the more the overall system performance is degraded.
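Block-oriented sequential access can be sketched in a few lines; the 4 KB block size and the file name are assumptions for illustration.

```python
# Reading a file in large contiguous blocks rather than one byte at a time.
BLOCK_SIZE = 4096

def read_in_blocks(path):
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)   # one sequential block transfer
            if not block:
                break
            yield block

# Usage (assumed file name):
# total = sum(len(b) for b in read_in_blocks("some_large_file.bin"))
```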

A hard disk drive with its protective cover removed.

Tertiary storage: Tertiary storage, or tertiary memory, provides a third level of storage. Typically it involves a robotic mechanism which will mount (insert) and dismount removable mass-storage media into a storage device according to the system's demands; this data is often copied to secondary storage before use. It is primarily used for archiving rarely accessed information, since it is much slower than secondary storage (e.g. 5-60 seconds vs. 1-10 milliseconds). This is primarily useful for extraordinarily large data
stores, accessed without human operators. Typical examples include tape libraries and optical jukeboxes. When a computer needs to read information from tertiary storage, it will first consult a catalog database to determine which tape or disc contains the information. Next, the computer will instruct a robotic arm to fetch the medium and place it in a drive. When the computer has finished reading the information, the robotic arm will return the medium to its place in the library.

Off-line storage: Off-line storage is computer data storage on a medium or a device that is not under the control of a processing unit. The medium is recorded, usually in a secondary or tertiary storage device, and then physically removed or disconnected. It must be inserted or connected by a human operator before a computer can access it again. Unlike tertiary storage, it cannot be accessed without human interaction. Off-line storage is used to transfer information, since the detached medium can easily be physically transported. Additionally, in case a disaster, for example a fire, destroys the original data, a medium in a remote location will probably be unaffected, enabling disaster recovery. Off-line storage increases general information security, since it is physically inaccessible from a computer, and data confidentiality or integrity cannot be affected by computer-based attack techniques. Also, if the information stored for archival purposes is rarely accessed, off-line storage is less expensive than tertiary storage. In modern personal computers, most secondary and tertiary storage media are also used for off-line storage. Optical discs and flash memory devices are most popular, and, to a much lesser extent, removable hard disk drives. In enterprise uses, magnetic tape is predominant. Older examples are floppy disks, Zip disks, and punched cards.

A large tape library, with tape cartridges placed on shelves in the front and a robotic arm moving in the back; the visible height of the library is about 180 cm.

Characteristics of storage: Storage technologies at all levels of the storage hierarchy can be differentiated by evaluating certain core characteristics, as well as by measuring characteristics specific to a particular implementation. These core characteristics are volatility, mutability, accessibility, and addressability. For any particular implementation of any storage technology, the characteristics worth measuring are capacity and performance.

Volatility: Non-volatile memory will retain the stored information even if it is not constantly supplied with electric power. It is suitable for long-term storage of information. Volatile memory requires constant power to maintain the stored information. The fastest memory technologies of today are volatile ones (though this is not a universal rule). Since primary storage is required to be very fast, it predominantly uses volatile memory.

Dynamic random-access memory is a form of volatile memory which also requires the stored information to be periodically re-read and re-written, or refreshed, otherwise it vanishes. Static random-access memory is a form of volatile memory similar to DRAM with the exception that it never needs to be refreshed as long as power is applied (it still loses its content if power is removed). An uninterruptible power supply can be used to give a computer a brief window of time to move information from primary volatile storage into non-volatile storage before the batteries are exhausted. Some systems (e.g., the EMC Symmetrix) have integrated batteries that maintain volatile storage for several hours.

Mutability: Read/write storage, or mutable storage, allows information to be overwritten at any time. A computer without some amount of read/write storage for primary storage purposes would be useless for many tasks. Modern computers typically use read/write storage for secondary storage as well. Read-only storage retains the information stored at the time of manufacture, and write-once storage (Write Once Read Many) allows the information to be written only once at some point after manufacture. These are called immutable storage. Immutable storage is used for tertiary and off-line storage. Examples include CD-ROM and CD-R. Slow-write, fast-read storage is read/write storage which allows information to be overwritten multiple times, but with the write operation being much slower than the read operation. Examples include CD-RW and flash memory.

Accessibility: With random access, any location in storage can be accessed at any moment in approximately the same amount of time. This characteristic is well suited to primary and secondary storage. Most semiconductor memories and disk drives provide random access. With sequential access, pieces of information are accessed in a serial order, one after the other; therefore the time to access a particular piece of information depends upon which piece of information was last accessed. This characteristic is typical of off-line storage.

Addressability: With location-addressable storage, each individually accessible unit of information is selected with its numerical memory address. In modern computers, location-addressable storage is usually limited to primary storage, accessed internally by computer programs, since location-addressability is very efficient but burdensome for humans. With file-addressable storage, information is divided into files of variable length, and a particular file is selected with human-readable directory and file names. The underlying device is still location-addressable, but the operating system of a computer provides the file-system abstraction to make the operation more understandable. In modern computers, secondary, tertiary and off-line storage use file systems. With content-addressable storage, each individually accessible unit of information is selected on the basis of (part of) the contents stored there. Content-addressable storage can be implemented using software (a computer program) or hardware (a computer device), with hardware being the faster but more expensive option. Hardware content-addressable memory is often used in a computer's CPU cache.

Capacity: Raw capacity is the total amount of stored information that a storage device or medium can hold. It is expressed as a quantity of bits or bytes (e.g. 10.4 megabytes). Memory storage density is the compactness of stored information: the storage capacity of a medium divided by a unit of length, area or volume (e.g. 1.2 megabytes per square inch).
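The difference between location-addressable and content-addressable access described above can be sketched with a byte array and a hash-keyed dictionary standing in for real hardware or software implementations.

```python
import hashlib

memory = bytearray(16)            # location-addressable: select a unit by its address
memory[3] = 0x7F
print(memory[3])                  # 127

cas = {}                          # content-addressable: select a unit by its contents
def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()   # the content itself determines the key
    cas[key] = data
    return key

key = put(b"configuration block")
print(cas[key])
```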

Performance: Latency is the time it takes to access a particular location in storage. The relevant unit of measurement is typically the nanosecond for primary storage, the millisecond for secondary storage, and the second for tertiary storage. It may make sense to separate read latency and write latency, and, in the case of sequential-access storage, minimum, maximum and average latency. Throughput is the rate at which information can be read from or written to the storage. In computer data storage, throughput is usually expressed in terms of megabytes per second (MB/s), though bit rate may also be used. As with latency, read rate and write rate may need to be differentiated. Also, accessing media sequentially, as opposed to randomly, typically yields maximum throughput. (A worked example of these measures follows the survey of storage technologies below.) Granularity is the size of the largest chunk of data that can be efficiently accessed as a single unit, e.g. without introducing more latency. Reliability is the probability of spontaneous bit-value change under various conditions, or the overall failure rate.

Energy use: Storage devices that reduce fan usage and automatically shut down during inactivity, and low-power hard drives, can reduce energy consumption by 90 percent. 2.5-inch hard disk drives often consume less power than larger ones. Low-capacity solid-state drives have no moving parts and consume less power than hard disks. Also, memory may use more power than hard disks.

Fundamental storage technologies: As of 2011, the most commonly used data storage technologies are semiconductor, magnetic, and optical, while paper still sees some limited usage. Media is a common name for what actually holds the data in the storage device. Some other fundamental storage technologies have also been used in the past or are proposed for development.

Semiconductor: Semiconductor memory uses semiconductor-based integrated circuits to store information. A semiconductor memory chip may contain millions of tiny transistors or capacitors. Both volatile and non-volatile forms of semiconductor memory exist. In modern computers, primary storage almost exclusively consists of dynamic volatile semiconductor memory, or dynamic random-access memory. Since the turn of the century, a type of non-volatile semiconductor memory known as flash memory has steadily gained share as off-line storage for home computers. Non-volatile semiconductor memory is also used for secondary storage in various advanced electronic devices and specialized computers. As early as 2006, notebook and desktop computer manufacturers started using flash-based solid-state drives (SSDs) as default configuration options for secondary storage, either in addition to or instead of the more traditional HDD.

Magnetic: Magnetic storage uses different patterns of magnetization on a magnetically coated surface to store information. Magnetic storage is non-volatile. The information is accessed using one or more read/write heads, which may contain one or more recording transducers. A read/write head covers only a part of the surface, so the head or the medium or both must be moved relative to each other in order to access data. In modern computers, magnetic storage takes these forms: magnetic disk (the floppy disk, used for off-line storage, and the hard disk drive, used for secondary storage) and magnetic tape, used for tertiary and off-line storage. In early computers, magnetic storage was also used as:

primary storage, in the form of magnetic memory (core memory, core rope memory, thin-film memory), and as tertiary (e.g. NCR CRAM) or off-line storage in the form of magnetic cards. Magnetic tape was then often used for secondary storage.

Optical: Optical storage, the typical optical disc, stores information in deformities on the surface of a circular disc and reads this information by illuminating the surface with a laser diode and observing the reflection. Optical disc storage is non-volatile. The deformities may be permanent (read-only media), formed once (write-once media) or reversible (recordable or read/write media). The following forms are currently in common use: CD, CD-ROM, DVD, BD-ROM (read-only storage, used for mass distribution of digital information such as music, video and computer programs); CD-R, DVD-R, DVD+R, BD-R (write-once storage, used for tertiary and off-line storage); and CD-RW, DVD-RW, DVD+RW, DVD-RAM, BD-RE (slow-write, fast-read storage, used for tertiary and off-line storage). Ultra Density Optical, or UDO, is similar in capacity to BD-R or BD-RE and is slow-write, fast-read storage used for tertiary and off-line storage. Magneto-optical disc storage is optical disc storage where the magnetic state on a ferromagnetic surface stores the information. The information is read optically and written by combining magnetic and optical methods. Magneto-optical disc storage is non-volatile, sequential-access, slow-write, fast-read storage used for tertiary and off-line storage. 3D optical data storage has also been proposed.

Paper: Paper data storage, typically in the form of paper tape or punched cards, has long been used to store information for automatic processing, particularly before general-purpose computers existed. Information was recorded by punching holes into the paper or cardboard medium and was read mechanically (or later optically) to determine whether a particular location on the medium was solid or contained a hole. A few technologies allow people to make marks on paper that are easily read by machine; these are widely used for tabulating votes and grading standardized tests. Barcodes made it possible for any object that was to be sold or transported to have some computer-readable information securely attached to it.

Uncommon: Vacuum-tube memory: a Williams tube used a cathode ray tube, and a Selectron tube used a large vacuum tube, to store information. These primary storage devices were short-lived in the market, since the Williams tube was unreliable and the Selectron tube was expensive. Electro-acoustic memory: delay-line memory used sound waves in a substance such as mercury to store information. Delay-line memory was dynamic volatile, cycle-sequential read/write storage, and was used for primary storage. Optical tape is a medium for optical storage, generally consisting of a long and narrow strip of plastic onto which patterns can be written and from which the patterns can be read back. It shares some technologies with cinema film stock and optical discs, but is compatible with neither. The motivation behind developing this technology was the possibility of far greater storage capacities than either magnetic tape or optical discs. Phase-change memory uses different mechanical phases of phase-change material to store information in an X-Y addressable matrix, and reads the information by observing the varying electrical resistance of the material. Phase-change memory would be non-volatile, random-access read/write storage, and might be used for primary, secondary and off-line
storage. Most rewritable and many write-once optical discs already use phase-change material to store information. Holographic data storage stores information optically inside crystals or photopolymers. Holographic storage can utilize the whole volume of the storage medium, unlike optical disc storage, which is limited to a small number of surface layers. Holographic storage would be non-volatile, sequential-access, and either write-once or read/write storage. It might be used for secondary and off-line storage. Molecular memory stores information in a polymer that can hold an electric charge. Molecular memory might be especially suited for primary storage. The theoretical storage capacity of molecular memory is 10 terabits per square inch.
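Returning to the latency and throughput measures defined earlier, a short worked example: the 8 ms latency and 150 MB/s throughput below are assumed round figures, not measurements taken from this text.

```python
# Time to complete one transfer = latency + size / throughput (assumed figures).
latency_s = 0.008            # time to reach the first byte (e.g. seek + rotation on a disk)
throughput_bps = 150e6       # sustained transfer rate, bytes per second
request_bytes = 64e6         # size of one transfer

transfer_time = latency_s + request_bytes / throughput_bps
print(f"{transfer_time:.3f} s")   # 0.008 + 0.427 = 0.435 s; for a 4 KB request, latency would dominate
```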

Information systems and data storage systems: Information systems need to store information so that they can manipulate it at a later time, e.g. at month-end or year-end. It is important that an I.S. can retrieve the information it has acquired in a fast and accurate manner, or else the I.S. will not perform as expected and required. Old information can have unexpected uses (data mining). Data is structured in a manner that can keep track of individual data elements and related groupings of information. The entire structure is called the data hierarchy.

How is data structured in a data storage system? Entities: an entity is a thing that you are storing information about in a data file or database, e.g. a person, plant, animal, mineral, etc. Attributes: a characteristic or quantity that further describes a particular entity, e.g. a person's address, height, age, etc. Attributes correspond to fields in the record detailing a particular entity; e.g. colour is an attribute of the fruit and veg in the records shown earlier. At least one of these must be a key field, a unique way of identifying the entity. Relationships: details of the relationships between different entities. An Entity-Relationship model is used to provide a machine-independent, graphical view of an I.S.'s data. This model can then be used as the basis for file and database design. To create an E-R model: identify all entity types; identify all attributes for each entity and the entity identifier; identify all relationships between entities; and draw the E-R diagram.

Data Normalisation: Normalisation means optimizing table structures and removing duplicate data entries. It is accomplished by thoroughly investigating the various data types and their relationships with one another, and follows a series of normalization forms, or states. Why normalize? Improved speed, more efficient use of space, and increased data integrity (a decreased chance that data can get corrupted during maintenance). Normalisation is often performed as a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form. The four most commonly used normal forms are first (1NF), second (2NF), third (3NF) and Boyce-Codd (BCNF) normal forms. They are based on functional dependencies among the attributes of a relation. A relation can be normalized to a specific form to prevent the possible
occurrence of update anomalies. The major aim of relational database design is to group attributes into relations so as to minimize data redundancy and reduce the file storage space required by the base relations. A sad, sad database: refer to the following poor database design:

Problems: there is no need to repeatedly store the class time and professor ID, and the redundancy introduces the possibility of error (e.g. the entry Matj148 in the example). First Normal Form calls for the elimination of repeated groups of data by creating separate tables of related data: student information, class information and professor information (shown as separate tables in the original figures).

Second Normal Form calls for the elimination of redundant data. Example data in the Class Information table:

Third Normal Form eliminates all attributes (column headers) from a table that are not directly dependent upon the primary key. Here, the college and collegeLocation attributes are less dependent upon the studentID than they are on the major attribute.

Revised student table:
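The revised tables themselves appear only as figures in the original. As a rough sketch of what a fully normalised (3NF) layout might look like, the schema below uses the attribute names mentioned in the text plus assumed column types and an assumed enrolment table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- college and collegeLocation depend on major, not on studentID, so they move out.
CREATE TABLE major     (major TEXT PRIMARY KEY, college TEXT, collegeLocation TEXT);
CREATE TABLE student   (studentID TEXT PRIMARY KEY, name TEXT,
                        major TEXT REFERENCES major(major));
-- class time and professor ID are stored once per class, not once per student.
CREATE TABLE professor (professorID TEXT PRIMARY KEY, professorName TEXT);
CREATE TABLE class     (classID TEXT PRIMARY KEY, classTime TEXT,
                        professorID TEXT REFERENCES professor(professorID));
-- assumed linking table resolving the many-to-many between students and classes
CREATE TABLE enrolment (studentID TEXT REFERENCES student(studentID),
                        classID   TEXT REFERENCES class(classID),
                        PRIMARY KEY (studentID, classID));
""")
```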

Results of Data Normalisation: There are two particular effects of normalising data to 3NF. All repeating groups are removed from the file/database, and some form of identifier must be included in each separate file to allow the user or a program to match up data in different files in a useful way.

Advantages of Data Normalisation: Data duplication is avoided; e.g. a customer's address is stored in one place only, i.e. the customer file. Changes and deletions only need to happen once; e.g. if a customer's address changes, then the only thing that needs to be updated is that customer's address field in the customer file. A platform-independent data structure is created.

Data Storage Techniques: There are, broadly, two major options when storing data in an I.S.: use a Traditional File Environment (TFE) or use a Database Management System (DBMS). The choice between them is determined by many factors, e.g. cost, available expertise, suitability to the problem, etc.

Sequential/indexed sequential file access methods: In sequential file organisation, data records must be retrieved in the same sequence in which they were stored. It is the only method that can be used with magnetic tape storage. Indexed sequential file access methods store files sequentially but allow records to be accessed in any order using an index of key fields. The index consists of a list of record keys and their associated storage locations. This method is used for sequential processing of large numbers of records in batch mode. The diagrams illustrate these two methods of organising data.
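A minimal sketch of the indexed sequential idea, with invented keys: records are held in key sequence, and a separate index maps each key to its storage position.

```python
# Records stored sequentially, in key order (keys and contents are invented).
records = [
    (1001, "Customer A"), (1005, "Customer B"), (1009, "Customer C"),
]
index = {key: position for position, (key, _) in enumerate(records)}

# Sequential processing reads the file in stored order (batch style);
# indexed access jumps straight to one record via the index.
for key, data in records:
    pass
print(records[index[1005]])       # direct retrieval of a single record by key
```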

Random (direct) file access methods use a key field to access the physical location of a record. A mathematical formula, called a randomising or hash algorithm, translates the key field into the record's physical location. This method is most appropriate for applications that require records directly and rapidly for immediate online processing, e.g. an online order-processing application. The diagram illustrates the direct file access method, where 4467 is the contents of the record key, 997 is a prime number, and 479 is the address of the record.
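The division-remainder hashing described here can be reproduced directly; the key 4467, the prime 997 and the resulting address 479 come from the example above.

```python
PRIME = 997

def hash_address(record_key: int) -> int:
    return record_key % PRIME     # division-remainder (randomising) algorithm

print(hash_address(4467))         # 479, the record's physical (bucket) address

# The record can then be read or written at that slot directly, without scanning
# the file, which suits immediate online processing.
```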

Problems with the Traditional File Environment: Cost: each application must typically be created, documented and maintained separately; there are numerous data files to keep track of; and work is often duplicated (even with OO design and development tools!). Program/data dependence: particular programs create and manage particular files; each program knows the structure of the data in its own data file only, and every program has to know its data's structure explicitly. Data redundancy: multiple copies of the same information are stored in a number of different files, wasting storage space, and data confusion can occur.

What is a Data Warehouse? The data warehouse is a database with a distinctive data structure and advanced data archiving capabilities that allows relatively quick and easy execution of complex queries over large amounts of data. A classical production information system is primarily built for gathering data inputs and processing them; it allows the company to be operational and run smoothly, which mostly means data entry. The data warehouse, on the other hand, is structured so as to allow fast and easy retrieval of large amounts of data. This makes it suitable for building so-called systems for business decision support (DSS, Decision Support Systems). The data stored daily in the production system must ultimately serve the administrative structure of the company, which should be able to extract useful information from large amounts of data and use it for evaluating the results achieved, planning and decision-making. For this purpose it is necessary to ensure quick and easy access to data stored in the complex structures of production systems. The data warehouse provides just such a mode of faster and easier access to information, allowing large amounts of data to be viewed and analysed with response times measured in seconds or minutes. In building a data warehouse, the implementation has to deal with specific problems that do not appear in the construction of production (transaction-oriented) information systems. Most problems are related to the construction of a system for extracting data, that is, the periodic, automated transfer of data from the source production system to the destination data warehouse. Some of the problems encountered in the construction of warehouses are: integration of diverse data from multiple sources (multiple production systems) implemented on different platforms; rapid detection of changes in the source system; and the iterative nature of building data warehouse models, and hence the iterative nature of building the extraction software. Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses that help with performing decision-making processes and improving information resources. We can use the previous list of problems and difficulties to extract a list of key words that become distinguishing marks and essential requirements for a data warehouse process, a set of tasks that allow us to turn operational data into decision-making support information: accessibility to users not very familiar with IT and data structures; integration of data on the basis of a standard enterprise model; query flexibility to maximize the advantages obtained from the existing information; information conciseness allowing for target-oriented and effective analyses; multidimensional representation giving users an intuitive and manageable view of information; and correctness and completeness of integrated data. Data warehouses are placed right in the middle of this process and act as repositories for data; they make sure that this set of requirements can be fulfilled. Data Warehouse Architectures. The following architecture properties are essential for a data warehouse system (Kelly, 1997): Separation: analytical and transactional processing should be kept apart as much as possible.
Scalability: hardware and software architectures should be easy to upgrade as the data volume to be managed and processed, and the number of user requirements to be met, progressively increase.

Extensibility: the architecture should be able to host new applications and technologies without redesigning the whole system. Security: monitoring access is essential because of the strategic data stored in data warehouses. Administerability: data warehouse management should not be overly difficult.

Two-Layer Architecture. The requirement for separation plays a fundamental role in defining the typical architecture for a data warehouse system, as shown in Figure 1-3. Although it is typically called a two-layer architecture to highlight the separation between physically available sources and data warehouses, it actually consists of four subsequent data flow stages (Lechtenbörger, 2001): Source layer: a data warehouse system uses heterogeneous sources of data. That data is originally stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls. Data staging: the data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-called Extraction, Transformation, and Loading (ETL) tools can merge heterogeneous schemata, and extract, transform, cleanse, validate, filter, and load source data into a data warehouse (Jarke et al., 2000). Technologically speaking, this stage deals with problems that are typical for distributed information systems, such as inconsistent data management and incompatible data structures (Zhuge et al., 1996). Data warehouse layer: information is stored in one logically centralized single repository: a data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemata, and so on.

Analysis: in this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. Technologically speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly GUIs. The architectural difference between data warehouses and data marts needs to be studied more closely. The component marked as a data warehouse in Figure 1-3 is also often called the primary data warehouse or corporate data warehouse. It acts as a centralized storage system for all the data being summed up. Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of a primary data warehouse required for a specific application domain. A data mart is a subset or an aggregation of the data stored in a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users. The data marts populated from a primary data warehouse are often called dependent (and independent if there is no primary data warehouse). Although data marts are not strictly necessary, they are very useful for data warehouse systems in midsize to large enterprises because they are used as building blocks while incrementally developing data warehouses; they mark out the information required by a specific group of users to solve queries; and they can deliver better performance because they are smaller than primary data warehouses. The following list sums up the benefits of a two-layer architecture, in which a data warehouse separates sources from analysis applications (Jarke et al., 2000; Lechtenbörger, 2001): in data warehouse systems, good-quality information is always available, even when access to sources is denied temporarily for technical or organizational reasons; data warehouse analysis queries do not affect the management of transactions, the reliability of which is vital for enterprises to work properly at an operational level; data warehouses are logically structured according to the multidimensional model, while operational sources are generally based on relational or semi-structured models; a mismatch in terms of time and granularity occurs between OLTP systems, which manage current data at a maximum level of detail, and OLAP systems, which manage historical and summarized data; and data warehouses can use specific design solutions aimed at performance optimization of analysis and report applications.
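To make the data staging stage more concrete, the following sketch shows extract, transform/cleanse and load steps feeding a warehouse table. It is a hypothetical illustration only: the source names, fields and figures are invented and it does not describe any particular ETL tool.

```python
# Minimal ETL sketch: extract from two heterogeneous sources, cleanse/merge, load.
source_crm = [{"cust": "Acme", "revenue": "1200"}, {"cust": "Beta", "revenue": None}]
source_erp = [{"customer_name": "Acme", "sales": 300}]

def extract():
    return source_crm, source_erp

def transform(crm, erp):
    # Merge heterogeneous schemata into one common schema and fill gaps.
    rows = []
    for r in crm:
        rows.append({"customer": r["cust"], "amount": float(r["revenue"] or 0)})
    for r in erp:
        rows.append({"customer": r["customer_name"], "amount": float(r["sales"])})
    return rows

def load(rows, warehouse):
    # Aggregate by customer before loading into the warehouse fact table.
    for row in rows:
        warehouse[row["customer"]] = warehouse.get(row["customer"], 0.0) + row["amount"]

warehouse_facts = {}
load(transform(*extract()), warehouse_facts)
print(warehouse_facts)  # {'Acme': 1500.0, 'Beta': 0.0}
```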

What Is Data Mining? Simply stated, data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named knowledge mining from data, which is unfortunately somewhat long. Knowledge mining, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets in a great deal of raw material. Thus, such a misnomer that carries both data and mining became a popular choice. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative sequence of the following steps: 1. Data cleaning (to remove noise and inconsistent data) 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved from the database) 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance) 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns) 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
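The seven steps can be read as a simple pipeline. The sketch below is only an illustration of that flow on toy transaction data; the function bodies and data are invented for the example and do not come from any particular mining tool.

```python
# Toy walk-through of the knowledge discovery steps on a list of transactions.
raw = [
    {"item": "milk", "qty": 2},
    {"item": "milk", "qty": None},   # incomplete record, removed by cleaning
    {"item": "milk", "qty": 1},
    {"item": "bread", "qty": 1},
]

def clean(data):            # 1. data cleaning: drop noisy/incomplete records
    return [r for r in data if r["qty"] is not None]

def integrate(*sources):    # 2. data integration: combine multiple sources
    return [r for src in sources for r in src]

def select(data):           # 3. data selection: keep only task-relevant fields
    return [r["item"] for r in data]

def transform(items):       # 4. data transformation: aggregate into a summary form
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

def mine(counts):           # 5. data mining: extract patterns (here, frequent items)
    return {item: n for item, n in counts.items() if n >= 2}

def evaluate(patterns):     # 6. pattern evaluation: keep the interesting ones
    return patterns

def present(patterns):      # 7. knowledge presentation
    print("frequent items:", patterns)

present(evaluate(mine(transform(select(integrate(clean(raw)))))))
```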

Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation. We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. Therefore, in this book, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Based on this view, the architecture of a typical data mining system may have the following major components (Figure 1.5): Database, data warehouse, World Wide Web, or other information repository: this is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. Database or data warehouse server: the database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request. Knowledge base: this is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of

abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources). Data mining engine: this is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis. Pattern evaluation module: this component typically employs interestingness measures (Section 1.5) and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns. User interface: this module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
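As a tiny illustration of what a pattern evaluation module does, the sketch below filters candidate association patterns by interestingness thresholds on support and confidence. The patterns and thresholds are invented for the example; it is not the architecture's actual module.

```python
# Pattern evaluation sketch: keep only patterns whose interestingness measures
# (support and confidence here) exceed user-supplied thresholds.
candidate_patterns = [
    {"rule": "bread -> butter", "support": 0.30, "confidence": 0.75},
    {"rule": "milk -> candles", "support": 0.01, "confidence": 0.40},
]

MIN_SUPPORT = 0.10
MIN_CONFIDENCE = 0.60

def evaluate(patterns):
    return [p for p in patterns
            if p["support"] >= MIN_SUPPORT and p["confidence"] >= MIN_CONFIDENCE]

for pattern in evaluate(candidate_patterns):
    print("interesting:", pattern["rule"])
```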

Classification of Data Mining Systems. Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science (Figure 1.12). Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory,

knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:

Document Image Management Systems Document management is the conversion of paper documents into electronic images on your computer. Once on your desktop, these documents can be retrieved effortlessly in seconds. Thousands of organizations around the world use document management every day instead of paper filing systems. The reasons for this change are simple: Document Management: Prevents lost records. Saves storage space. Manages records easily.

Finds documents quickly. Makes images centrally available. Eliminates the need for file cabinets. The program saves both an image file (an actual picture of the original document) and a text file created by the program through its OCR (optical character recognition) capabilities. It is a Windows-based program that can be accessed on an internal network, an intranet and/or via the Internet. The program has full-text search and proximity searches. The files are saved in a standard folder tree similar to what you would expect to see in any common file manager program. It even has annotation features like highlighting, redacting and virtual "sticky" notes. Points to consider when planning such a system include the volume and type of documents expected to be added each year, the level and type of access desired for each document, and the desired retention period for each document. The steps necessary to introduce document management are as follows: documents are scanned into the system; the document management system stores them somewhere on a hard drive or optical disk; the documents then get indexed. When a person later wants to read a document, he or she uses the retrieval tools available in the document management system. Which documents can be read, and what actions can be performed on them, depends on the access provided by the document management system. A complete document management system comprises five elements: Input. Major advancements in scanning technology make paper document conversion fast, inexpensive and easy. A good scanner will make putting paper files into your computer easy. In addition, documents can be input using SnapShot, which enables you to use a print-to-file type feature, and drag-and-drop methods. Storage

The storage system provides long-term and reliable storage for documents. A good storage system will accommodate changing documents, growing volumes and advancing technology. Indexing. The index system creates an organized document filing system and makes future retrieval simple and efficient. A good indexing system will make existing procedures and systems more effective. Retrieval. The retrieval system uses information about the documents, including index and text, to find images stored in the system. A good retrieval system will make finding the right documents fast and easy. Document viewing should be readily available to those who need it, with the flexibility to control access to the system. A good access system will make documents viewable to authorized personnel, whether in the office, at different locations, or over the Internet. Retention/Destruction. A document management system with the records management feature will enable documents to be tied to a records retention schedule. This allows documents that have reached the end of their retention schedule to be quickly identified and destroyed. By overwriting the files a number of times, the documents are effectively shredded.
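The indexing and retrieval elements can be illustrated with a very small inverted index built over OCR text. The document names and contents below are invented for the sketch; a real document management system is of course far more elaborate.

```python
# Inverted-index sketch: map each word in the OCR text of a document to the
# documents that contain it, so full-text retrieval is a dictionary lookup.
documents = {
    "invoice_0042.tif": "invoice acme corp total 1500",
    "contract_lease.tif": "lease contract acme corp offices",
}

index = {}
for doc_id, text in documents.items():
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add(doc_id)

def search(term):
    return sorted(index.get(term.lower(), set()))

print(search("acme"))     # both documents
print(search("invoice"))  # only the invoice image
```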

The Main Function of a Document Imaging Management System

The technologies available in a document imaging management system all work towards helping it perform its main function, which is the scanning and storage of scanned images. This software is most useful when scanned images are created continuously by scanners or multifunctional devices and have to be routed and stored promptly. In such a situation this type of software takes charge of scanned images as and when they are created, and works to rectify any

scanning errors that might have occurred. This is an all-important process, as many scanned images contain errors, and these have to be corrected if such images are to be processed accurately with OCR software. In addition, a document imaging management system with the appropriate technologies can store scanned images from the time they are created until they are deleted or discarded. This ultimately means that paper need not be used in offices, so that by using a document imaging management system an office or company can reduce its carbon footprint and become greener.
When using a document imaging management system a company can reduce its carbon footprint throughout all its office locations or in specific offices. This can be achieved because such software can efficiently convert paper-based documents into scanned, text-searchable documents. As a result, no papers need to be stored or filed in cabinets, which leads to less storage space being needed and used throughout the company. When a document imaging management system is implemented company-wide, the company's overall energy bill can be reduced, sometimes resulting in several million dollars of savings per year. A document imaging

management system can bring about cost savings in many ways. In addition to saving money on energy it can even be used to reduce the money needed for paper, filing cabinets and office supplies. Most importantly this software can be used to reduce the money spent on data entry teams.

Software Systems Life Cycle. The Software System Lifecycle. A software process is a partially ordered collection of actions, carried out by one or more software engineers, software users, or other software systems in order to accomplish a (software engineering) task. The software system lifecycle is a software process by which a software system is developed, tested, installed and maintained throughout its useful history. The concept of a software lifecycle is a useful project management tool. A lifecycle consists of phases, each of which is a software process. Think of lifecycles as coarse-grain software processes; there is a lot of work on fine-grain software processes, such as fixing a bug, extending a module, testing a module, etc.

We focus here on information system development lifecycles. What is described by a lifecycle? The lifecycle describes the temporal, causal and I/O relationships between different lifecycle phases. The lifecycle concept includes the notion of feedback (returning to a previous phase) as well as moving forward to the next phase. In the past, the lifecycle concept was applied to the management of complex systems that had some sort of physical hardware as their end product, e.g., missiles, communication networks, spacecraft, etc. For hardware systems, however, there is a tangible end product that can be measured and observed; it is not as easy to measure and observe the results of information systems analysis and design. What is the SDLC? Software Development Life Cycle Defined
SDLC stands for Software Development Life Cycle. A Software Development Life Cycle is essentially a series of steps, or phases, that provide a model for the development and lifecycle management of an application or piece of software. The methodology within the SDLC process can vary across industries and organizations, but standards such as ISO/IEC 12207 represent processes that establish a lifecycle for software, and provide a mode for the development, acquisition, and configuration of software systems.

Benefits of the SDLC Process


The intent of an SDLC process is to help produce a product that is cost-efficient, effective, and of high quality. Once an application is created, the SDLC maps the proper deployment and decommissioning of the software once it becomes a legacy. The SDLC methodology usually contains the following stages: analysis (requirements and design), construction, testing, release, and maintenance (response). Veracode makes it possible to integrate automated security testing into the SDLC process through use of its cloud-based platform. The SDLC starts with the analysis and definition phases, where the purpose of the software or system should be determined, the goals of what it needs to accomplish need to be established, and a set of definite requirements can be developed.

During the software construction or development stage, the actual engineering and writing of the application is done. The software is designed and produced, while attempting to accomplish all of the requirements that were set forth within the previous stage. Next in the software development life cycle is the testing phase. Code produced during construction should be tested using static and dynamic analysis, as well as manual penetration testing, to ensure that the application is not easily exploitable by hackers, which could result in a critical security breach. The advantage of using Veracode during this stage is that by using state-of-the-art binary analysis (no source code required), the security posture of applications can be verified without requiring the use of any additional hardware, software, or personnel. Once the software is deemed secure enough for use, it can be implemented in a beta environment to test real-world usability, and then pushed to a full release, where it enters the maintenance phase. The maintenance stage allows the application to be adjusted to organizational, systemic, and utilization changes. The stages in the systems life-cycle. Most IT projects use the Systems Life-cycle approach to developing a new system. This approach consists of several distinct stages, which follow one after the other. During the development life-cycle, a team is not permitted to go back to a previous stage, as this could cause the project to over-run in terms of both cost and time. The stages in the Systems Life-Cycle are as follows: Problem identification, Feasibility Study (initial investigation), Analysis (detailed investigation), Design, Coding (software development), Testing, Conversion, Review (evaluation), and Maintenance. Note that each stage of the Systems Life-cycle has a distinct end-point, which can be shown to the customer and signed off. This helps to ensure that the final product is what the customer actually wanted!

Problem identification The problem identification is a statement of the existing problems and description of user requirements as outlined by the customer. Feasibility Study A feasibility study is an initial investigation of a problem in order to ascertain whether the proposed

system is viable, before spending too much time or money on its development. Analysis. The analysis is a detailed, fact-finding investigation of the existing system in order to ascertain its strengths and weaknesses and to produce the list of requirements for the new system. Design. Design is the production of diagrams, tables and algorithms, which show how the new system is to look and work. The design will show: how the interfaces and reports should look; the structure of and relationships between the data; the processing to be used to manipulate/transform the data; and the methods to be used for ensuring the security and validity of the data. Coding. Coding is the creation and editing of the interfaces, code and reports so they look and work as indicated in the design stage. Note that user and technical documentation will also be produced during the coding stage. Testing. Testing is the process to ensure that the system meets the requirements that were stated in the analysis and also to discover (and eliminate) any errors that might be present. Conversion. Conversion is the process of installing the new system into the customer's organisation and training the employees to use it. Review. Post-implementation review (also known as evaluation) is a critical examination of a system after it has been in operation for a period of time.

Maintenance. Maintenance is the process of making improvements to a system that is in use. The reasons for maintenance could be to fix bugs, to add new features or to make the system run quicker. SDLC Phases: Systems Investigation (identify problems or opportunities); Systems Analysis (how can we solve the problem?); Systems Design (select and plan the best solution); Systems Implementation (place the solution into effect); Systems Maintenance and Review (evaluate the results of the solution). Waterfall Model: Requirements defines the needed information, functions, behavior, performance and interfaces. Design covers data structures, software architecture, interface representations and algorithmic details. Implementation covers source code, database, user documentation and

testing. Waterfall Strengths: easy to understand and easy to use; provides structure to inexperienced staff; milestones are well understood; sets requirements stability; good for management control (plan, staff, track); works well when quality is more important than cost or schedule. Waterfall Deficiencies: all requirements must be known up front; deliverables created for each phase are considered frozen, which inhibits flexibility; can give a false impression of progress; does not reflect the problem-solving nature of software development (iterations of phases); integration is one big bang at the end; little opportunity for the customer to preview the system (until it may be too late). When to use the Waterfall Model:

Requirements are very well known; the product definition is stable; the technology is understood; it is a new version of an existing product; or it is a port of an existing product to a new platform. Systems Development Life Cycle. The systems development life cycle (SDLC) is the overall process for developing information systems, from planning and analysis through implementation and maintenance. The SDLC is the foundation for all systems development methodologies, and there are literally hundreds of different activities associated with each phase in the SDLC. Typical activities include determining budgets, gathering system requirements, and writing detailed user documentation. The activities performed during each systems development project will vary. The SDLC begins with a business need, followed by an assessment of the functions a system must have to satisfy the need, and ends when the benefits of the system no longer outweigh its maintenance costs. This is why it is referred to as a lifecycle. The SDLC comprises seven distinct phases: planning, analysis, design, development, testing, implementation, and maintenance. This section takes a detailed look at a few of the more common activities performed during the phases of the systems development life cycle, along with common issues facing software development projects (see Figure D.1 and Figure D.2). Phase 1: Planning. The planning phase involves establishing a high-level plan of the intended project and determining project goals. Planning is the first and most critical phase of any systems development effort an organization undertakes, regardless of whether the effort is to develop a system that allows customers to order products over the Internet, determine the best logistical structure for warehouses around the world, or

develop a strategic information alliance with another organization. Organizations must carefully plan the activities (and determine why they are necessary) to be successful. The three primary activities involved in the planning phase are: Identify and select the system for development. Assess project feasibility. Develop the project plan.

Phase 2: Analysis The analysis phase involves analyzing end-user business requirements and refining project goals into defined functions and operations of the intended system. A good start is essential and the organization must spend as much time, energy, and resources as necessary to perform a detailed, accurate analysis. The three primary activities involved in the analysis phase are: Gather business requirements. Create process diagrams. Perform a buy versus build analysis Phase 3: Design The design phase involves describing the desired features and operations of the system including screen layouts, business rules, process diagrams, pseudo code, and other documentation. The two primary activities involved in the design phase are: Design the IT infrastructure. Design system models

Phase 4: Development. The development phase involves taking all of the detailed design documents from the design phase and transforming them into the actual system. The two primary activities involved in the development phase are: Develop the IT infrastructure. Develop the database and programs. Phase 5: Testing. According to a report issued in June 2003 by the National Institute of Standards and Technology (NIST), defective software costs the U.S. economy an estimated $59.5 billion each year. Of that total, software users incurred 64 percent of the costs and software developers 36 percent. NIST suggests that improvements in testing could reduce this cost by about a third, or $22.5 billion, but that unfortunately testing improvements would not eliminate all software errors. The testing phase involves bringing all the project pieces together into a special testing environment to test for errors, bugs, and interoperability, in order to verify that the system meets all the business requirements defined in the analysis phase. The two primary activities involved in the testing phase are: Write the test conditions. Perform the system testing. Phase 6: Implementation. The implementation phase involves placing the system into production so users can begin to perform actual business operations with the system. The implementation phase is also referred to as delivery. The implementation phase is comprised of two activities: training and conversion. Each of these activities includes multiple tasks, such as writing detailed user documentation, determining the conversion method, and providing training for system users. How, and at what time

during the phase, these tasks occur often depends upon the conversion method selected. For example, for a plunge conversion, all training must take place prior to the conversion. Alternatively, during a parallel conversion, training can be offered at scheduled intervals as the new system is rolled out. Also, the complexity and comprehensive nature of the new system can dictate the timing and steps necessary to deliver or implement the system. The two primary activities of the implementation phase are system training and the implementation method. Phase 7: Maintenance. The maintenance phase involves performing changes, corrections, additions, and upgrades to ensure the system continues to meet the business goals. This phase continues for the life of the system because the system must change as the business evolves and its needs change, demanding constant monitoring, supporting the new system with frequent minor changes (for example, new reports or information capturing), and reviewing the system to be sure it is moving the organization toward its strategic goals. Once a system is in place, it must change as the organization changes. The three primary activities involved in the maintenance phase are: Build a help desk to support the system users. Perform system maintenance. Provide an environment to support system changes.
Stands for "Rational Unified Process." RUP is a software development process from Rational, a division of IBM. It divides the development process into four distinct phases that each involve business modeling, analysis and design, implementation, testing, and deployment. The four phases are:

1. Inception - The idea for the project is stated. The development team determines if the project is worth pursuing and what resources will be needed.

2. Elaboration - The project's architecture and required resources are further evaluated. Developers consider possible applications of the software and costs associated with the development.

3. Construction - The project is developed and completed. The software is designed, written, and tested.

4. Transition - The software is released to the public. Final adjustments or updates are made based on feedback from end users. The RUP development methodology provides a structured way for companies to envision and create software programs. Since it provides a specific plan for each step of the development process, it helps prevent resources from being wasted and reduces unexpected development costs.

RUP is based on a set of building blocks, or content elements, describing what is to be produced, the necessary skills required and the step-by-step explanation describing how specific development goals are to be achieved. The main building blocks, or content elements, are the following:

Roles (who): a Role defines a set of related skills, competencies and responsibilities. Work Products (what): a Work Product represents something resulting from a task, including all the documents and models produced while working through the process. Tasks (how): a Task describes a unit of work assigned to a Role that provides a meaningful result.

Within each iteration, the tasks are categorized into nine disciplines:

Six "engineering disciplines"

Business Modeling; Requirements; Analysis and Design; Implementation; Test; Deployment

Three supporting disciplines: Configuration and Change Management;

Project Management; Environment

Four Project Lifecycle Phases

RUP phases and disciplines.

The RUP has determined a project life cycle consisting of four phases. These phases allow the process to be presented at a high level in a similar way to how a 'waterfall'-styled project might be presented, although in essence the key to the process lies in the iterations of development that lie within all of the phases. Also, each phase has one key objective and milestone at the end that denotes the objective being accomplished. The visualization of RUP phases and disciplines over time is referred to as the RUP hump chart.

Inception Phase
The primary objective is to scope the system adequately as a basis for validating initial costing and budgets. In this phase the business case, which includes the business context, success factors (expected revenue, market recognition, etc.) and a financial forecast, is established. To complement the business case, a basic use case model, project plan, initial risk assessment and project description (the core project requirements, constraints and key features) are generated. After these are completed, the project is checked against the following criteria:

Stakeholder concurrence on scope definition and cost/schedule estimates. Requirements understanding as evidenced by the fidelity of the primary use cases. Credibility of the cost/schedule estimates, priorities, risks, and development process. Depth and breadth of any architectural prototype that was developed. Establishing a baseline by which to compare actual expenditures versus planned expenditures.

If the project does not pass this milestone, called the Lifecycle Objective Milestone, it either can be cancelled or repeated after being redesigned to better meet the criteria.

Elaboration Phase
The primary objective is to mitigate the key risk items identified by analysis up to the end of this phase. The elaboration phase is where the project starts to take shape. In this phase the problem domain analysis is made and the architecture of the project gets its basic form. The outcome of the elaboration phase is:

A use-case model in which the use-cases and the actors have been identified and most of the use-case descriptions are developed. The use-case model should be 80% complete. A description of the software architecture in a software system development process. An executable architecture that realizes architecturally significant use cases. Business case and risk list which are revised. A development plan for the overall project. Prototypes that demonstrably mitigate each identified technical risk. A preliminary user manual (optional)

This phase must pass the Lifecycle Architecture Milestone criteria answering the following questions:

Is the vision of the product stable? Is the architecture stable? Does the executable demonstration indicate that major risk elements are addressed and resolved? Is the construction phase plan sufficiently detailed and accurate? Do all stakeholders agree that the current vision can be achieved using current plan in the context of the current architecture? Is the actual vs. planned resource expenditure acceptable?

If the project cannot pass this milestone, there is still time for it to be cancelled or redesigned. However, after leaving this phase, the project transitions into a high-risk operation where changes are much more difficult and detrimental when made. The key domain analysis for the elaboration is the system architecture.

Construction Phase
The primary objective is to build the software system. In this phase, the main focus is on the development of components and other features of the system. This is the phase when the bulk of the coding takes place. In larger projects, several construction iterations may be developed in an effort to divide the use cases into manageable segments that produce demonstrable prototypes. This phase produces the first external release of the software. Its conclusion is marked by the Initial Operational Capability Milestone.

Transition Phase
The primary objective is to 'transit' the system from development into production, making it available to and understood by the end user. The activities of this phase include training the end users and maintainers and beta testing the system to validate it against the end users' expectations. The product is also checked against the quality level set in the Inception phase. If all objectives are met, the Product Release Milestone is reached and the development cycle is finished.

DECISION SUPPORT SYSTEMS A decision support system (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization and help to make decisions, which may be rapidly changing and not easily specified in advance.

DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, or business models to identify and solve problems and make decisions. Typical information that a decision support application might gather and present includes data from data sources, cubes, data warehouses, and data marts.

6.1 Taxonomy. As with the definition, there is no universally accepted taxonomy of DSS either; different authors propose different classifications. Using the relationship with the user as the criterion, Haettenschwiler differentiates passive DSS, active DSS and cooperative DSS. A passive DSS is a system that aids the process of decision making but cannot bring out explicit decision suggestions or solutions. An active DSS can bring out such decision suggestions or solutions. A cooperative DSS allows the decision maker (or its advisor) to modify, complete, or refine the decision suggestions provided by the system, before sending them back to the system for validation. Another taxonomy for DSS has been created by Daniel Power. Using the mode of assistance as the criterion, Power differentiates communication-driven DSS, data-driven DSS, document-driven DSS, knowledge-driven DSS, and model-driven DSS. A communication-driven DSS supports more than one person working on a

shared task; examples include integrated tools like Microsoft's NetMeeting or Groove. A data-driven DSS (or data-oriented DSS) emphasizes access to and manipulation of a time series of internal company data and, sometimes, external data. A document-driven DSS manages, retrieves, and manipulates unstructured information in a variety of electronic formats. A knowledge-driven DSS provides specialized problem-solving expertise stored as facts, rules, procedures, or in similar structures. A model-driven DSS emphasizes access to and manipulation of a statistical, financial, optimization, or simulation model. Model-driven DSS use data and parameters provided by users to assist decision makers in analyzing a situation; they are not necessarily data-intensive. Dicodess is an example of an open-source model-driven DSS generator.
6.2 Components of DSS. Three fundamental components of DSS architecture are: 1. the database (or knowledge base); 2. the model (i.e., the decision context and user criteria); 3. the user interface. The users themselves are also important components of the architecture. DSS components may be classified as: 1. Inputs: factors, numbers, and characteristics to analyze; 2. User knowledge and expertise: inputs requiring manual analysis by the user; 3. Outputs: transformed data from which DSS "decisions" are generated; 4. Decisions: results generated by the DSS based on user criteria.
6.3 Application. As mentioned above, there are theoretical possibilities of building such systems in any knowledge domain. One example is the clinical decision support system for medical diagnosis. Other examples include a bank loan officer verifying the credit of a loan applicant, or an engineering firm that has bids on several projects and wants to know if it can be competitive with its costs.
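Applications like the bidding example are often model-driven in Power's sense: the user supplies parameters, a small financial model transforms them, and the output supports (rather than makes) the decision. The sketch below is purely illustrative; the break-even model and all the figures are invented.

```python
# Model-driven DSS sketch: user-supplied parameters feed a simple break-even model;
# the transformed output is presented to the decision maker, who makes the call.
def break_even_units(fixed_cost, price_per_unit, variable_cost_per_unit):
    return fixed_cost / (price_per_unit - variable_cost_per_unit)

# Inputs: factors and numbers to analyse (here, a hypothetical product launch).
scenario = {"fixed_cost": 50_000, "price_per_unit": 25.0, "variable_cost_per_unit": 15.0}

# Output: transformed data from which the decision maker draws a conclusion.
units = break_even_units(**scenario)
print(f"Units to break even: {units:.0f}")  # the user judges whether this is achievable
```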

DSS is extensively used in business and management. Executive dashboards and other business performance software allow faster decision making, identification of negative trends, and better allocation of business resources. A growing area of DSS application, concepts, principles, and techniques is in agricultural production and marketing for sustainable development. For example, the DSSAT4 package, [15][16] developed through financial support of USAID during the 80's and 90's, has allowed rapid assessment of several agricultural production systems around the world to facilitate decision-making at the farm and policy levels. There are, however, many constraints to the successful adoption of DSS in agriculture. [17] DSS are also prevalent in forest management, where the long planning time frame demands specific requirements. All aspects of forest management, from log transportation and harvest scheduling to sustainability and ecosystem protection, have been addressed by modern DSSs. A comprehensive list and discussion of all available systems in forest management is being compiled under the COST action Forsys. A specific example concerns the Canadian National Railway system, which tests its equipment on a regular basis using a decision support system. A problem faced by any railroad is worn-out or defective rails, which can result in hundreds of derailments per year. Under a DSS, CN managed to decrease the incidence of derailments at the same time other companies were experiencing an increase. Benefits: improves personal efficiency; speeds up the process of decision making; increases organizational control; encourages exploration and discovery on the part of the decision maker; speeds up problem solving in an organization; facilitates interpersonal communication; promotes learning or training; generates new evidence in support of a decision; creates a competitive advantage over the competition; reveals new approaches to thinking about the problem space; helps automate managerial processes.

3.3 WHAT IS A KNOWLEDGE BASE SYSTEM?

One of the important lessons learnt in AI during the 1960s was that general-purpose problem solvers, which used a limited number of laws or axioms, were too weak to be effective in solving problems of any complexity. This realisation led to the design of what is now known as the knowledge base system: systems that depend on a rich base of knowledge to perform difficult tasks.

Edward Feigenbaum summarised this new thinking in a paper at the International Joint Conference on Artificial Intelligence (IJCAI) in 1977. He emphasised the fact that the real power of an expert system comes from the knowledge it possesses rather than the particular inference schemes and other formalisms it employs. This new view of AI systems marked the turning point in the development of more powerful problem solvers. It formed the basis for some of the new emerging expert systems being developed during the 1970s, including MYCIN, an expert system developed to diagnose infectious blood diseases. An expert system contains the knowledge of experts in a particular domain along with an inference mechanism and an explanation sub-system; it is also called a knowledge base system. Since this realisation, much of the work done in AI has been related to so-called knowledge base systems, including work in vision, learning, general problem solving, and natural language understanding. This in turn has led to more emphasis being placed on research related to knowledge representation, memory organisation, and the use and manipulation of knowledge. Knowledge base systems get their power from the expert knowledge that has been coded into facts, rules, heuristics, and procedures. The knowledge is stored in a knowledge base separate from the control and inference components. This makes it possible to add new knowledge or refine existing knowledge without recompiling the control and inference programs, which greatly simplifies the construction and maintenance of knowledge base systems. In the knowledge lies the power! This was the message learned by a few farsighted researchers at Stanford University during the late 1960s and early 1970s.

Figure 1: Components of a Knowledge-based system The proof of their message was provided in the first Knowledge base expert systems, which were shown to be more than toy problem solvers. These first systems were real world problem solvers, tackling such tasks as determining complex chemical structures given only the atomic constituents and mass spectra data from samples of the compounds and later performing medical diagnoses of infectious blood diseases. Using the analogy of a DBMS, we can define a knowledge base management system (KBMS) as a computer system used to manage and manipulate shared knowledge. A knowledge base system's manipulation facility includes a reasoning facility, usually including aspects of one or more of the following forms of reasoning: deductive, inductive, or abductive. Deductive reasoning implies that a new fact can be inferred from a given set of facts or knowledge using known rules of inference. For instance, a given proposition can be found to be true or false in light of existing knowledge in the form of other propositions believed to be either true or false. Inductive reasoning is used to prove something by first proving a base fact and then the increment step; having proved these, we can prove a generalized fact. Abductive reasoning is used in generating a hypothesis to explain observations. Like deductive reasoning, it points to possible inferences from related concepts; however, unlike deductive reasoning, the number of inferences could be more than one. The likelihood of knowing which of these inferences corresponds to the current state of the system can be gleaned from the explanations generated by the system. These explanations can facilitate choosing among these alternatives and arriving at the final conclusion. In addition to the reasoning facility, a knowledge base system may incorporate an explanation facility so that the user can verify whether reasoning used by the system is consistent and complete. The reasoning facility also offers a form of tutoring to the uninitiated user. The so-called expert systems and the associated expert system generation facilities are one form of knowledge base systems that have emerged from research labs and are being marketed commercially. Since a KBMS includes reasoning capacity, there is a clear benefit in incorporating this reasoning power in database application programs in languages such as COBOL and Pascal. Most knowledge base systems are still in the research stage. The first generation of commercial KBMSs are just beginning to emerge and integration of a KBMS with a DBMS is a current research problem. However, some headway has been made in the integration of expert systems in day-to-day database applications.

3.4 KNOWLEDGE BASE & DATABASE SYSTEM

There is no consensus on the difference between a knowledge base system and a database system. In a DBMS, the starting point is a data model to represent the data and the interrelationships between them; similarly, the starting point of a KBMS is a knowledge representation scheme. Any knowledge representation scheme should provide some mechanism to organize knowledge in appropriate hierarchies or categories, thus allowing easy access to associated concepts. In addition, since knowledge can be expressed as rules and exceptions to rules, exception-handling features must be present. Furthermore, the knowledge stored in the system must be insulated from changes in usage and in its physical or logical structure; this concept is similar to the data independence concept used in a DBMS. To date, little headway has been made in this aspect of a KBMS. A KBMS is developed to solve problems for a finite domain or portion of the real world. In developing such a system, the designer selects significant objects and the relationships among these objects. In addition to this domain-specific knowledge, general knowledge such as the concepts of up, down, far, near, cold, hot, on top of, and beside must be incorporated in the KBMS. Another type of knowledge, which we call common sense, has yet to be successfully incorporated in the KBMS. The DBMS and KBMS have similar architectures; both contain a component to model the information being managed by the system and have a subsystem to respond to queries. Both systems are used to model or represent a portion of the real world of interest to the application. A database system, in addition to storing facts in the form of data, has a limited capability of establishing associations between these data. These associations could be pre-established, as in the case of the network and hierarchical models, or established using common values of shared domains, as in the relational model. A knowledge base system exhibits similar associative capability. However, this capability of establishing associations between data, and thus a means of interpreting the information contained, is at a much higher level in a knowledge base system, ideally at the level of a knowledgeable human agent. One difference between the DBMS and KBMS that has been proposed is that the knowledge base system handles a rather small amount of knowledge, whereas a DBMS efficiently (as measured by response performance) handles large amounts of shared data. However, this distinction is fallacious, since the amount of knowledge has no known boundaries; what it really says is that existing knowledge base systems handle a very small amount of knowledge. This does not mean that at some future date we could not develop knowledge base systems to efficiently handle much larger amounts of shared knowledge. In a knowledge base system, the emphasis is placed on a robust knowledge representation scheme and extensive reasoning capability. Robust signifies that the scheme is rich in expressive power and at the same time efficient. In a DBMS, the emphasis is on efficient access and management of the data that model a portion of the real world. A knowledge base system is concerned with the meaning of information, whereas a DBMS is interested in the information contained in the data. However, these distinctions are not absolute. For our purposes, we can adopt the following informal definition of a KBMS. The important point in this definition is that we are concerned with what the system does rather than how it is done.
A knowledge base management system is a computer system that manages the knowledge in a given domain or field of interest and exhibits reasoning power to the level of a human expert in this domain. A KBMS, in addition, provides the user with an integrated language, which serves the purpose of the traditional DML of the existing DBMS and has the power of a high-level application language. A database can be viewed as a very basic knowledge base system in so far as it manages facts. It has been recognised that there should be an integration of DBMS technology with the reasoning aspect in the development of shared knowledge bases. Database technology has already addressed the problems of improving system performance, concurrent access, distribution, and friendly interfaces; these features are equally pertinent in a KBMS. There will be a continuing need for current DBMSs and their functionalities to co-exist with an integrated KBMS. However, the reasoning power of a KBMS can improve the ease of retrieval of pertinent information from a DBMS.
3.5 KNOWLEDGE REPRESENTATION SCHEMES
Knowledge is the most vital part of a Knowledge Base System or Expert System. These systems contain large amounts of knowledge to achieve high performance. A suitable Knowledge Representation scheme is necessary to represent this vast amount of knowledge and to perform inferencing over the Knowledge Base (KB). A Knowledge Representation scheme means a set of syntactic and semantic

conventions to describe various objects. The syntax provides a set of rules for combining symbols and arrangements of symbols to form expressions. Knowledge Representation is a non-trivial problem, which continues to engage some of the best minds in this field even after the successful development of many a Knowledge Base System. Some of the important issues in Knowledge Representation are the following: i. Expressive adequacy: what knowledge can and cannot be represented in a particular Knowledge Representation scheme? ii. Reasoning efficiency: how much effort is required to perform inferencing over the KB? There is generally a trade-off between expressive adequacy and reasoning efficiency. iii. Incompleteness: what can be left unsaid about a domain, and how does one perform inferencing over incomplete knowledge? iv. Real-world knowledge: how can we deal with attitudes such as beliefs, desires and intentions? Major Knowledge Representation schemes are based on production rules, frames, semantic nets and logic. Facts and rules can be represented in these Knowledge Representation schemes. Inference engines using forward chaining, backward chaining or a combination thereof are used along with these Knowledge Representation schemes to build actual Expert Systems. We will briefly describe these Knowledge Representation schemes and inferencing engines. 3.5.1 Rule Based Representation. A rule-based system is also called a production rule system. Essentially, it has three parts: working memory, rule memory (or production memory) and an interpreter. Working memory contains facts about the domain. These are in the form of triplets of object, attribute and value. These facts are modified during the process of execution, and some new facts may be added as conclusions. Production memory contains IF-THEN rules. The IF part contains a set of conditions connected by AND; each condition can contain other conditions connected by AND or OR, and each condition can evaluate to either true or false. The THEN part has a set of conclusions or actions. Conclusions may change the values of some entity or may create new facts. A rule can be fired when all the conditions in it are true. If any of the conditions is not true or is unknown, the rule cannot be fired; if it is unknown, the system will try to determine its value. Once a rule has fired, all its conclusions and actions are executed. For firing a rule, the system looks into its database. If a rule has some of its conditions satisfied, it is a candidate for further exploration. There may be more than one such rule. This conflict is resolved by some strategy, like choosing the rule which contains the maximum number of satisfied conditions, or there may be metarules, which may be domain dependent, to move the reasoning in a particular direction. Rules may be used in both forward and backward reasoning. In forward mode, the system starts with a given set of initial data and infers as much information as possible by application of the rules; the new data are then used to infer further. At any point the system may ask the user to supply more information, if the goal state has not been reached and no more rules can be applied. The system keeps checking for the goal state at each firing of rules; once the goal state has been detected, reasoning comes to an end. In backward reasoning mode, reasoning starts with the goal, and rules are selected if they have the goal on their right-hand side (RHS). To achieve the goal, the left-hand side (LHS) conditions have to be true.
These conditions become new sub-goals, and the system tries to achieve them before trying the main goal. At some point it may not be possible to establish a goal by applying rules; in this situation the system asks the user to supply the information. Note that these rules are not the IF-THEN programming constructs available in most procedural programming languages. They differ in that they are not executed sequentially: their execution depends on the state of the database, which determines the candidate rules. Another difference is that the IF part is a complex pattern and not just a Boolean expression. Rules have been used in many classical systems such as MYCIN and R1/XCON, and rule based representation is still the most frequently used Knowledge Representation scheme, because experts usually find it easier to give their knowledge in the form of rules. Further, rules can easily be used for explanations. One problem with rules is that when they grow very large in number they become difficult to maintain, because the KB is unstructured. Techniques such as contexts in MYCIN solve this problem to some extent.
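To make the split between working memory, production memory and the interpreter concrete, here is a minimal forward-chaining sketch in Python. It is an illustration only: the facts, rule names and the run() function are invented for this sketch and do not come from MYCIN, R1/XCON or any other real system.

# Minimal forward-chaining production system (illustrative sketch only).
# Working memory holds (object, attribute, value) triples; production memory
# holds IF-THEN rules whose IF part is a set of triples connected by AND.

working_memory = {
    ("patient", "temperature", "high"),
    ("patient", "infection", "bacterial"),
}

# Each rule: (name, set of condition triples, set of conclusion triples)
production_memory = [
    ("r1", {("patient", "temperature", "high")},
           {("patient", "has-fever", "yes")}),
    ("r2", {("patient", "has-fever", "yes"),
            ("patient", "infection", "bacterial")},
           {("patient", "needs-antibiotics", "yes")}),
]

def run(facts, rules):
    """Interpreter: keep firing rules whose conditions all hold until nothing new is added."""
    facts = set(facts)
    fired = True
    while fired:
        fired = False
        for name, conditions, conclusions in rules:
            if conditions <= facts and not conclusions <= facts:
                facts |= conclusions        # execute the THEN part
                fired = True
    return facts

print(run(working_memory, production_memory))
# The derived fact ("patient", "needs-antibiotics", "yes") appears in the output.

Backward chaining would instead start from a goal triple, look for rules whose conclusions contain it, and turn their conditions into sub-goals, as described above.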

3.5.2 Frame Based Representation

The concept of a frame is quite simple. When we encounter a new situation, we do not analyse it from scratch. Instead, we have a large number of structures (or records) in memory representing our experiences. We try to match the current situation with these structures and choose the most appropriate one; further details may then be added to the chosen structure so that it exactly describes the situation. A computer representation of this common knowledge is called a frame. It is convenient to create a knowledge base about situations by breaking it into modular chunks, called frames. An individual frame may be regarded as a record or structure. Each frame contains slots that identify the type of situation or specify the parameters of a particular situation.

A frame describes a class of objects, such as ROOM or BUILDING. It consists of various slots, each describing one aspect of the object. A slot may have certain conditions that must be met by its filler. A slot may also have a default value, used when the slot value is not available and cannot be obtained in any other way. An if-added procedure describes what is to be done when a slot gets a value; such information is called a facet of the slot. An example is presented below:

CHAIR
IS-A : FURNITURE
COLOUR : BROWN
MADE-OF : WOOD
LEG : 4
ARMS : default: 0
PRICE : 100

Reasoning with the knowledge stored in a frame requires choosing an appropriate frame for the given situation. Some of the ways information may be inferred are the following:
1. If certain information is missing from the current situation, it can be inferred. For example, if we have established that the given object is a room, we can infer that the room has a door.
2. Slots in a frame describe the components of a situation. If we want to build a situation, the information associated with the slots can be used to build its components.
3. If the object has an additional feature that is not part of the typical frame, it may require special attention. For example, a man with a tail is not a normal man.

3.5.3 Semantic Nets

Semantic net representation was developed for natural language understanding; it was originally designed to represent the meaning of English words, and it has been used in many expert systems too. It is used for representing declarative knowledge. In a semantic net, knowledge is represented as a set of nodes and links. A node represents an object or concept and a link represents a relationship between two objects (nodes). Any node may be linked to any number of other nodes, giving rise to a network of facts. An example is shown in Figure 2 below.
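Returning briefly to the CHAIR frame of Section 3.5.2 before the semantic net example of Figure 2: the frame can be sketched as a simple data structure. This is a rough illustration only; the FRAMES dictionary, the get_slot() helper and the MOVABLE slot on FURNITURE are invented for this sketch, and real frame systems also support facets such as if-added procedures.

# Minimal sketch of frames as nested dictionaries (illustrative only).
# A slot may carry a value or a default; IS-A links give inheritance.

FRAMES = {
    "FURNITURE": {"slots": {"MOVABLE": {"default": "yes"}}},   # hypothetical parent slot
    "CHAIR": {
        "is_a": "FURNITURE",
        "slots": {
            "COLOUR":  {"value": "BROWN"},
            "MADE-OF": {"value": "WOOD"},
            "LEG":     {"value": 4},
            "ARMS":    {"default": 0},      # used when no value is supplied
            "PRICE":   {"value": 100},
        },
    },
}

def get_slot(frame_name, slot):
    """Look up a slot value, falling back to a default, then to the IS-A parent."""
    frame = FRAMES[frame_name]
    entry = frame.get("slots", {}).get(slot)
    if entry is not None:
        return entry.get("value", entry.get("default"))
    parent = frame.get("is_a")
    return get_slot(parent, slot) if parent else None

print(get_slot("CHAIR", "ARMS"))     # 0   (default value)
print(get_slot("CHAIR", "MOVABLE"))  # yes (inherited from FURNITURE)

The lookup falls back first to a slot default and then to the IS-A parent, which is how a frame "fills in" missing details of a situation.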

Figure 2: Semantic Net

A semantic net as shown in Figure 2 cannot be represented in this pictorial form inside a computer; each pair of nodes and its link is stored separately. For example, IS-A(DOE, Department) in PROLOG represents

Figure 3: One-way link representation

The link shown in the figure is a one-way link. If we want an answer to "who is my employer?", the system has to check all the links coming into the node ME, which is not computationally efficient. Hence reverse links are also stored; in this case we add

Figure 4: Representation of a reverse link

In LISP, the basic semantic network unit may be programmed as an atom/property-list combination. The Department — DOE unit of the semantic network would be composed of "DOE" as the atom, "IS-A" as a property and "Department" as the value of that property. The value "Department" is, of course, an atom in its own right and may have a property list associated with it as well. The "IS-A" relationship indicates that one concept is a kind of the other; another link (relationship) of particular use for describing object concepts is "HAS", indicating that one concept is a part of the other. Using such relations, it is possible to represent complex sets of facts through a semantic network. Figure 5 below illustrates one possible representation of facts about an employee "AKSHAY", including:
"Akshay is a bank manager"
"Akshay works in the State Bank of India located in IGNOU Campus"
"Akshay is 26 years old"
"Akshay has blue eyes"
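In a program, a semantic net of this kind can be held as a set of (node, link, node) triples, with reverse links stored explicitly so that questions such as "who is my employer?" do not require scanning every incoming link. The sketch below paraphrases the Akshay facts illustrated in Figure 5; the relation names and the neighbours() helper are invented for this sketch, not part of any standard library.

# Minimal semantic net as labelled triples with explicit reverse links (sketch only).

triples = [
    ("AKSHAY", "IS-A", "BANK-MANAGER"),
    ("AKSHAY", "WORKS-IN", "STATE-BANK-OF-INDIA"),
    ("STATE-BANK-OF-INDIA", "LOCATED-IN", "IGNOU-CAMPUS"),
    ("AKSHAY", "AGE", "26"),
    ("AKSHAY", "HAS", "BLUE-EYES"),
]

# Store both directions so traversal is cheap either way.
links = {}
for subj, rel, obj in triples:
    links.setdefault(subj, []).append((rel, obj))
    links.setdefault(obj, []).append(("INVERSE-" + rel, subj))

def neighbours(node, relation):
    """Return the nodes reachable from 'node' over the given link label."""
    return [other for rel, other in links.get(node, []) if rel == relation]

print(neighbours("AKSHAY", "WORKS-IN"))                       # ['STATE-BANK-OF-INDIA']
print(neighbours("STATE-BANK-OF-INDIA", "INVERSE-WORKS-IN"))  # ['AKSHAY'], via the reverse link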

Figure 5: Representation of complex sets of facts through semantic nets

When we have to represent a relationship with more than two arguments, we break it down into binary relationships. For example,
SCORE (INDIA AUSTRALIA (250 150))
can be written as
participant (match-1 INDIA)
participant (match-1 AUSTRALIA)
score (match-1 (250 150))

As with any Knowledge Representation scheme, the problem-solving power comes from the ability of the program to manipulate knowledge to solve a problem. Intersection search is used to find the relationship between two objects: an activation sphere grows from each of the two nodes, the spheres eventually intersect, and the corresponding paths give the relationship. Further techniques have been developed to perform more directed search.

3.5.4 Knowledge Representation Using Logic

Traditionally, logic has been studied by philosophers and mathematicians in order to describe and understand the world around us. Today computer scientists use this tool to teach a computer about the world. Here we discuss propositional and predicate logic. We can easily represent real-world facts as logical propositions in propositional logic. In propositional logic we deal with propositions such as
It is raining. (RAINING)
It is sunny. (SUNNY)
It is windy. (WINDY)
If it is raining, then it is not sunny. (RAINING -> ~SUNNY)
Given the fact "It is raining", we can deduce that it is not sunny. But the representational power of propositional logic is quite limited. For example, suppose we have to represent "Vivek is a man" and "Anurag is a man". We may represent them in a computer as VIVEKMAN and ANURAGMAN, but from these we get no information about the similarity between Vivek and Anurag. A better representation is
MAN (VIVEK)
MAN (ANURAG)
Consider the sentence "All men are mortal". This requires quantification, for example
∀X (MAN(X) -> MORTAL(X))
The form of logic with these and certain other extra features is called predicate logic. Its basic elements are described here. Capital letters P, Q, etc. stand for predicates. A predicate has the form predicate-name(arg1, ..., argn); it can have the value true or false, and it represents a relationship among its arguments. The connectives and quantifiers are:
AND ^ (P ^ Q is true when both P and Q are true)
OR v (P v Q is true when at least one of them is true)
NOT ~ (~P is true when P is false)

IMPLIES -> (P -> Q is true unless P is true and Q is false)
∀X P(X) means P holds for all values of X.
∃X P(X) means there exists at least one value of X for which P holds, i.e. only some values need satisfy P.

Predicate logic has the following two properties:
a) Completeness: if P is a theorem of predicate logic, then it can be derived using the inference rules available in predicate logic.
b) Soundness: there is no P such that both P and NOT P are theorems.
The decidability property of propositional logic does not carry over into predicate logic. The following are some of the important inference rules available in predicate logic:
Modus Ponens: if P -> Q and P is true, then Q is true.
Modus Tollens: if P -> Q and Q is false, then P is false.
Chaining: if P v Q and (NOT P) v Q, then Q is true.
Reduce: P and NOT P reduce to { } (the empty clause).
Most AI theorem provers for clausal form use resolution as the only way of inferencing; it subsumes the above rules of inference. Resolution proves a theorem by refutation: first a normal (clausal) form is obtained, then the negation of the theorem is added to it; if this leads to a contradiction, the theorem is proved. A discussion of the detailed algorithm is beyond the scope of this handout.

Let us now explore the use of predicate logic as a way of representing knowledge by looking at a specific example:
i) Anil is a Manager.
ii) Anil is disciplined.
iii) All staff are either loyal to Anil or hate him.
iv) Everyone is loyal to someone.
The facts described by these sentences can be represented in predicate logic as follows:
1. Anil is a Manager: Manager(Anil)
2. Anil is disciplined: disciplined(Anil)
3. All staff are either loyal to Anil or hate him: ∀X (staff(X) -> loyal-to(X, Anil) v hates(X, Anil))
4. Everyone is loyal to someone: ∀X ∃Y loyal-to(X, Y)
Predicate logic is useful for representing simple English sentences as logical statements, but it creates ambiguity for complicated sentences.
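As a small illustration of resolution by refutation, the sketch below works in the propositional case only: clauses are sets of literals, the negated goal is added, and resolvents are generated until the empty clause appears. The clause encoding and the refute() function are invented for this sketch, and it omits the unification needed for full predicate logic.

# Minimal propositional resolution by refutation (illustrative sketch only).
# A clause is a frozenset of literals; a literal is a string, '~' marks negation.

def negate(lit):
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolve(c1, c2):
    """Return all resolvents of two clauses."""
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return resolvents

def refute(kb_clauses, goal_lit):
    """Prove goal_lit by adding its negation and searching for the empty clause."""
    clauses = set(kb_clauses) | {frozenset({negate(goal_lit)})}
    while True:
        new = set()
        pairs = [(a, b) for a in clauses for b in clauses if a != b]
        for a, b in pairs:
            for r in resolve(a, b):
                if not r:               # empty clause -> contradiction -> proved
                    return True
                new.add(frozenset(r))
        if new <= clauses:              # no new clauses -> cannot prove
            return False
        clauses |= new

# "If it is raining then it is not sunny" (clause {~RAINING, ~SUNNY}) and "It is raining"
# together entail "it is not sunny".
kb = {frozenset({'~RAINING', '~SUNNY'}), frozenset({'RAINING'})}
print(refute(kb, '~SUNNY'))             # expected output: True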

KNOWLEDGE BASE SYSTEM (KBS)

INTRODUCTION

Problem-solving power does not lie in smart reasoning techniques or clever search algorithms, but in domain-dependent real-world knowledge.

Real-world problems do not have well-defined solutions

Expertise is not laid down in algorithms but in domain-dependent (cause-and-effect) rules of thumb, or heuristics.

A KBS allows this knowledge to be represented in a computer and the resulting solution to be explained. A KBS is a system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.

Heuristic rather than algorithmic

Heuristics in search vs. heuristics in a KBS: general vs. domain-specific

Highly specific domain knowledge; the knowledge is separated from how it is used

KBS = knowledge-base + inference engine
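The separation of the knowledge base from the inference engine can be sketched as one generic engine reused with interchangeable rule sets. The toy rules below (medical and configuration) are made up for illustration and are not taken from MYCIN or XCON.

# Sketch of "knowledge separated from how it is used": one generic inference
# engine, two interchangeable domain knowledge bases (all rules are invented).

def infer(facts, rules):
    """Generic engine: repeatedly apply any rule whose premises are all known."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

medical_kb = [(["fever", "stiff-neck"], "suspect-meningitis")]
config_kb  = [(["cpu-ordered", "no-memory-ordered"], "add-default-memory")]

print(infer({"fever", "stiff-neck"}, medical_kb))
print(infer({"cpu-ordered", "no-memory-ordered"}, config_kb))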

MYCIN (medicine): developed in the early 1970s at Stanford by Shortliffe to assist internists in the diagnosis and treatment of infectious diseases (meningitis and bacterial septicemia). When a patient shows signs of infectious disease, cultures of blood and urine are sent to the lab (taking more than 24 hours) to determine the bacterial species.

XCON/R1 (computer): configures DEC's VAX and PDP-11 systems. DEC offers the customer a wide choice of components when purchasing computer equipment, so that the client obtains a custom-made system.

Given the customer's order, a configuration is produced, perhaps involving component replacement or addition.

DRILLING ADVISOR (industry)

Developed in 1983 by Teknowledge for an oil company to replace a human drilling advisor. Problem: drill bits becoming stuck. Difficulty: lack of subsurface information on the location and condition of the end of the drill (and scarcity of expertise); the expert examines rock pieces, mud and lubricant brought up by the drilling to determine the cause.

Human Resource Management: HRM facilitates the most effective use of employees to achieve organisational and individual goals. An HRM KBS forms part of an overall strategy (which also includes DSS and EIS). The KBS helps HRM managers make decisions by applying heuristic knowledge to unstructured and semi-structured problems (e.g. job placement and pay rises).

Typical KBS application types include the following:

(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions. e.g. diagnose reasons for engine failure

(2) Interpretation - To provide an understanding of a situation from available information. e.g. DENDRAL
(3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT
(4) Design - To develop configurations that satisfy the constraints of a design problem. e.g. XCON

(5) Planning - Both short term and long term, in areas like project management, product development or financial planning.

e.g. HRM
(6) Monitoring - To check performance and flag exceptions.

e.g. a KBS that monitors radar data and estimates the position of the space shuttle
(7) Control - To collect and evaluate evidence and govern a system's behaviour accordingly.

e.g. control a patient's treatment

(8) Instruction - To train students and correct their performance. e.g. give medical students experience in diagnosing illness
(9) Debugging - To identify and prescribe remedies for malfunctions.

e.g. identify errors in an automated teller machine network and ways to correct the errors

ADVANTAGES
Increased availability of expert knowledge where human expertise is scarce or inaccessible; can also be used to train future experts

Efficient and cost-effective
Consistency of answers
Explanation of solutions
Deal with uncertainty

LIMITATIONS
Lack of common sense
Inflexible, difficult to modify
Restricted domain of expertise
Lack of learning ability
Not always reliable
