
Data Warehouse Structure

The Data Warehouse has two main parts:

Physical store. A Microsoft SQL Server database that you can query by using SQL queries, and an OLAP database that you can use to run reports.

Logical schema. A conceptual model that maps to the data in the physical store.

The following figure shows the relationship between the physical store and the logical schema.

Physical Store

The physical store for the Data Warehouse includes one database that you can query by using SQL queries. The physical store contains all the data that you have imported from different sources. During the unpacking process, Commerce Server automatically builds the physical store for the Data Warehouse in both the SQL Server database and the OLAP database. The Data Warehouse provides the data necessary for all the Commerce Server reports available in the Analysis modules in Business Desk.

There is no need to modify the physical store for the Data Warehouse directly. If you need to extend the Data Warehouse, for example to encompass third-party data, a site developer can programmatically add the fields you need through the logical schema.

Logical Schema

The logical schema provides an understandable view of the data in the Data Warehouse and supports an efficient import process. For example, a site developer uses the logical schema to modify the location of data stored in the underlying physical tables. When a site developer writes code to add, update, or delete data in the Data Warehouse, the developer interacts with the logical schema. When Commerce Server accesses data in the Data Warehouse, it accesses the data through the logical schema. Only the site developer needs detailed knowledge of the logical schema.

A logical schema includes the following:

Class. A logical collection of data members.

Data member. A structure that stores a piece of data.

Relation. A connection between two classes in a parent-child relationship. The parent-child relationship defines the number of instances (1 or many) a child can have with a corresponding instance in the parent.

The logical schema uses classes, data members, relations, and other data structures to map data in the physical store.
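To make the class/data member/relation terminology concrete, the following is a minimal, purely illustrative Python sketch; it is not the Commerce Server API, and the Order/OrderLine names are invented for the example.

from dataclasses import dataclass, field
from typing import List

@dataclass
class OrderLine:                 # child class
    sku: str                     # data member
    quantity: int                # data member

@dataclass
class Order:                     # parent class
    order_id: str                                           # data member
    lines: List[OrderLine] = field(default_factory=list)    # relation: one parent, many children

# One parent instance related to two child instances (a 1-to-many relation).
order = Order("ORD-1", [OrderLine("SKU-42", 2), OrderLine("SKU-7", 1)])
print(len(order.lines))  # 2 children for this parent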

How to Design and Set Up a Data Warehouse

The steps in designing a data warehouse are:
1) Select the appropriate theme (the problem domain to be addressed).
2) Clearly define the fact table.
3) Identify and confirm the dimensions.
4) Choose the facts.
5) Calculate and store derived data in the fact table.
6) Round out the dimension tables.
7) Choose the duration of the database.
8) Track slowly changing dimensions.
9) Determine the priority queries and query patterns.

Setting Up Your Data Warehouse System

An Oracle data warehouse is an Oracle Database specifically configured and optimized to handle the size of data and types of queries that are intrinsic to data warehousing. This section discusses how to initially configure your data warehouse environment. It includes the following topics:
- General Steps for Setting up a Data Warehouse System
- Preparing the Environment
- Setting Up a Database for a Data Warehouse
- Accessing Oracle Warehouse Builder

General Steps for Setting up a Data Warehouse System

In these instructions, you configure an Oracle Database for use as a data warehouse. Subsequently, you install Oracle Warehouse Builder, which leverages the Oracle Database and provides graphical user interfaces for designing data management strategies. To set up a data warehouse system, complete the following steps:
1) Size and configure your hardware as described in "Preparing the Environment".
2) Install the Oracle Database software.
3) Optimize the database for use as a data warehouse as described in "Setting Up a Database for a Data Warehouse".
4) Access the Oracle Warehouse Builder software. Oracle Warehouse Builder is the data integration product that is packaged with the Standard and Enterprise editions of the Oracle Database. Follow the instructions in "Accessing Oracle Warehouse Builder".
Subsequently, you can install a demonstration to assist you in learning how to complete common data warehousing tasks using Warehouse Builder.

Preparing the Environment

The basic components for a data warehouse architecture are the same as for an online transaction processing (OLTP) system. However, due to the sheer size of the data, you have to choose quantities that balance the individual building blocks differently. The starting point for sizing a data warehouse is the throughput that you require from the system. This can be one or both of the following:

- The amount of data that is accessed by queries hitting the system at peak time, in conjunction with the acceptable response time. You may be able to use throughput numbers and experience from an existing application to estimate the required throughput.
- The amount of data that is loaded within a window of time.

In general, you need to estimate the highest throughput you need at any given point. Hardware vendors can recommend balanced configurations for a data warehousing application and can help you with the sizing. Contact your preferred hardware vendor for more details.

Balanced Hardware Configuration

A properly sized and balanced hardware configuration is required to maximize data warehouse performance. The following sections discuss some important considerations in achieving this balance:
- How Many CPUs and What Clock Speed Do I Need?
- How Much Memory Do I Need?
- How Many Disks Do I Need?
- How Do I Determine Sufficient I/O Bandwidth?

How Many CPUs and What Clock Speed Do I Need?

Central processing units (CPUs) provide the calculation capabilities in a data warehouse. You must have sufficient CPU power to perform the data warehouse operations. Parallel operations are more CPU-intensive than the equivalent serial operations would be. Use the estimated highest throughput as a guideline for the number of CPUs you need. As a rough estimate, use the following formula:

<number of CPUs> = <maximum throughput in MB/s> / 200

In other words, a CPU can sustain up to about 200 MB per second. For example, if a system requires a maximum throughput of 1200 MB per second, then the system needs <number of CPUs> = 1200/200 = 6 CPUs. A single server with 6 CPUs can service this system; a 2-node clustered system could be configured with 3 CPUs in each node.

How Much Memory Do I Need?

Memory in a data warehouse is particularly important for processing memory-intensive operations such as large sorts. Access to the data cache is less important in a data warehouse because most queries access vast amounts of data, so data warehouse memory requirements are not as critical as those of OLTP applications. The number of CPUs provides a good guideline for the amount of memory you need. Use the following simplified formula to derive the amount of memory from the number of CPUs you selected:

<amount of memory in GB> = 2 * <number of CPUs>

For example, a system with 6 CPUs needs 2 * 6 = 12 GB of memory. Most standard servers fulfill this requirement.
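As a quick aid, here is a small Python sketch of the two rules of thumb above; the 200 MB/s-per-CPU and 2 GB-per-CPU figures come from the text, and the function name is purely illustrative.

import math

def size_cpu_and_memory(max_throughput_mb_s, mb_s_per_cpu=200, gb_mem_per_cpu=2):
    """Rough data warehouse sizing from required peak throughput."""
    cpus = math.ceil(max_throughput_mb_s / mb_s_per_cpu)  # <number of CPUs> = throughput / 200
    memory_gb = gb_mem_per_cpu * cpus                     # <memory in GB> = 2 * <number of CPUs>
    return cpus, memory_gb

# Example from the text: 1200 MB/s peak throughput -> (6, 12)
print(size_cpu_and_memory(1200))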

How Many Disks Do I Need?

A common mistake in data warehouse environments is to size the storage based on the maximum capacity needed. Sizing based exclusively on storage requirements will likely create a throughput bottleneck. Instead, use the throughput you require to find out how many disk arrays you need, and use the storage provider's specifications to find out how much throughput a disk array can sustain. Note that storage providers measure in Gbit per second, while your initial throughput estimate is in MB per second. An average disk controller has a maximum throughput of 2 Gbit per second, which translates to a sustainable throughput of about (70% * 2 Gbit/s) / 8 = 180 MB/s.

Use the following formula to determine the number of disk arrays you need:

<number of disk controllers> = <throughput in MB/s> / <individual controller throughput in MB/s>

For example, our system with 1200 MB per second throughput requires at least 1200 / 180 = 7 disk arrays. Make sure you have enough physical disks to sustain the throughput you require; ask your disk vendor for the throughput numbers of the disks.

How Do I Determine Sufficient I/O Bandwidth?

The end-to-end I/O system consists of more components than just the CPUs and disks. A well-balanced I/O system must provide approximately the same bandwidth across all components in the I/O system. These components include the following:
- Host Bus Adapters (HBAs), the connectors between the server and the storage.
- Switches, in between the servers and a Storage Area Network (SAN) or Network Attached Storage (NAS).
- Ethernet adapters for network connectivity (GigE NIC or Infiniband). In a clustered environment, you need an additional private port for the interconnect between the nodes; do not include it when sizing the system for I/O throughput. The interconnect must be sized separately, taking into account factors such as internode parallel execution.
- Wires that connect the individual components.

Each of the components has to provide sufficient I/O bandwidth to ensure a well-balanced I/O system. The initial throughput you estimated and the hardware specifications from the vendors are the basis for determining the quantities of the individual components you need. Use the conversion in the following table to translate the vendors' maximum throughput numbers in bits into sustainable throughput numbers in bytes.

Table 2-1 Throughput Performance Conversion
Component        Bits          Bytes Per Second
HBA              2 Gbit        200 MB
16 Port Switch   8 * 2 Gbit    1200 MB
Fibre Channel    2 Gbit        200 MB
GigE NIC         1 Gbit        80 MB
Inf-2 Gbit       2 Gbit        160 MB
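The same arithmetic can be scripted. The 70% sustained-rate factor and the example numbers come from the text above; the function names are just for illustration.

import math

def sustainable_mb_s(rated_gbit_s, efficiency=0.70):
    """Convert a vendor's rated Gbit/s figure into an approximate sustainable MB/s."""
    return rated_gbit_s * efficiency * 1000 / 8   # e.g. 2 Gbit/s -> roughly 175-180 MB/s

def disk_arrays_needed(required_mb_s, controller_mb_s):
    """<number of disk controllers> = <throughput MB/s> / <controller throughput MB/s>"""
    return math.ceil(required_mb_s / controller_mb_s)

print(sustainable_mb_s(2))            # ~175 MB/s for a 2 Gbit/s controller
print(disk_arrays_needed(1200, 180))  # 7 disk arrays for the 1200 MB/s example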

In addition to having sufficient components to ensure sufficient I/O bandwidth, the layout of data on disk is also key to success or failure. If you configured the system for sufficient throughput across all disk arrays, but all the data a query retrieves still resides on one disk, you will still not get the required throughput, because that one disk becomes the bottleneck. To avoid such a situation, stripe data across as many disks as possible, ideally all of them. A stripe size of 256 KB to 1 MB provides a good balance between multiblock read operations and spreading data across multiple disks.

About Automatic Storage Management (ASM)

ASM is a component of Oracle Database that you can use to stripe data across disks in a disk group. ASM ensures the data is balanced across all disks. Disks can be added or removed while ASM is operational, and ASM automatically rebalances the storage across all available disks.

ASM can also be used to mirror data on the file system, to avoid loss of data in case of disk failure. The default stripe size for ASM is 1 MB; you can lower it to 128 KB. You can perform storage operations without ASM, but doing so increases the chances of making a mistake. Thus, Oracle recommends that you use ASM whenever possible.

Verifying the Data Warehouse Hardware Configuration

Before you install Oracle Database, you should verify your setup at the hardware and operating-system level. The key point to understand is that if the operating system cannot deliver the performance and throughput you need, Oracle Database will never be able to perform according to your requirements. Two tools for verifying throughput are the dd utility and Orion, an Oracle-supplied tool.

About the dd Utility

A very basic way to validate operating system throughput on UNIX or Linux systems is to use the dd utility. The dd utility reads data blocks directly from disk and, because there is almost no overhead involved, its output provides a reliable calibration. Oracle Database will reach a maximum throughput of approximately 90 percent of what the dd utility can achieve.

Example: Using the dd Utility

The most important options for using dd are the following:
bs=BYTES: read BYTES bytes at a time; use 1 MB
count=BLOCKS: copy only BLOCKS input blocks
if=FILE: read from FILE; set to your device
of=FILE: write to FILE; set to /dev/null to evaluate read performance (writing to disk would erase all existing data!)
skip=BLOCKS: skip BLOCKS BYTES-sized blocks at the start of input

To estimate the maximum throughput Oracle Database will be able to achieve, you can mimic the workload of a typical data warehouse application, which consists of large, random sequential disk access. The following dd commands perform random sequential disk access across two devices, reading a total of 2 GB. The throughput is 2 GB divided by the time it takes the commands to finish:

dd bs=1048576 count=200 if=/raw/data_1 of=/dev/null &
dd bs=1048576 count=200 skip=200 if=/raw/data_1 of=/dev/null &
dd bs=1048576 count=200 skip=400 if=/raw/data_1 of=/dev/null &
dd bs=1048576 count=200 skip=600 if=/raw/data_1 of=/dev/null &
dd bs=1048576 count=200 skip=800 if=/raw/data_1 of=/dev/null &
dd bs=1048576 count=200 if=/raw/data_2 of=/dev/null &
dd bs=1048576 count=200 skip=200 if=/raw/data_2 of=/dev/null &
dd bs=1048576 count=200 skip=400 if=/raw/data_2 of=/dev/null &
dd bs=1048576 count=200 skip=600 if=/raw/data_2 of=/dev/null &
dd bs=1048576 count=200 skip=800 if=/raw/data_2 of=/dev/null &

In your test, you should include all the storage devices that you plan to use for your database storage. When you configure a clustered environment, you should run the dd commands from every node.
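As a small helper (illustrative only, not part of the original procedure), the sustained read rate of such a run is simply the total bytes read divided by the elapsed wall-clock time; the 12-second timing below is a made-up example.

def dd_throughput_mb_s(num_commands, blocks_per_command, block_size_bytes, elapsed_seconds):
    """Sustained throughput of a parallel dd run, in MB/s."""
    total_mb = num_commands * blocks_per_command * block_size_bytes / 1_000_000
    return total_mb / elapsed_seconds

# The run above: 10 commands x 200 blocks x 1 MiB, finishing in (say) 12 seconds
print(dd_throughput_mb_s(10, 200, 1048576, 12.0))  # ~175 MB/s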

About the Orion Utility

Orion is a tool that Oracle provides to mimic an Oracle-like workload on a system in order to calibrate its throughput. Compared to the dd utility, Orion provides the following advantages:
- Orion's simulation is closer to the workload the database will produce.
- Orion enables you to perform reliable write and read simulations within one simulation.

Oracle recommends you use Orion to verify the maximum achievable throughput, even if a database has already been installed. The types of supported I/O workloads are as follows:
- small and random
- large and sequential
- large and random
- mixed workloads

For each type of workload, Orion can run tests at different levels of I/O load to measure performance metrics such as MB per second, I/O per second, and I/O latency. A data warehouse workload is typically characterized by sequential I/O throughput issued by multiple processes. You can run different I/O simulations depending upon which type of system you plan to build. Examples are the following:
- daily workloads when users or applications query the system
- the data load, when users may or may not access the system
- index and materialized view builds
- backup operations

To download the Orion software, point your browser to the following:
http://www.oracle.com/technology/software/tech/orion/index.html
Note that Orion is Beta software and unsupported.

Example: Using the Orion Utility

To invoke Orion:

$ orion -run simple -testname mytest -num_disks 8

Typical output is as follows:

Orion VERSION 10.2
Command line: -run advanced -testname orion14 -matrix point -num_large 4 -size_large 1024 -num_disks 4 -type seq -num_streamIO 8 -simulate raid0 -cache_size 0 -verbose
This maps to this test:
Test: orion14
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Sequential Streams
Number of Concurrent IOs Per Stream: 8
Force streams to separate disks: No
Simulated Array Type: RAID 0
Stripe Depth: 1024 KB
Write: 0%
Cache Size: 0 MB
Duration for each Data Point: 60 seconds
Small Columns:, 0

Large Columns:, 4
Total Data Points: 1
Name: /dev/vx/rdsk/asm_vol1_1500m Size: 1572864000
Name: /dev/vx/rdsk/asm_vol2_1500m Size: 1573912576
Name: /dev/vx/rdsk/asm_vol3_1500m Size: 1573912576
Name: /dev/vx/rdsk/asm_vol4_1500m Size: 1573912576
4 FILEs found.
Maximum Large MBPS=57.30 @ Small=0 and Large=4

In this example, the maximum throughput for this particular workload is 57.30 MB per second.

Setting Up a Database for a Data Warehouse

After you set up your environment and install the Oracle Database software, ensure that the database parameters are set correctly. Note that there are not many database parameters that have to be set. As a general guideline, avoid changing a database parameter unless you have a good reason to do so. You can use Oracle Enterprise Manager to set up your data warehouse: to view various parameter settings, navigate to the Database page, then click Server. Under Database Configuration, click Memory Parameters or All Initialization Parameters.

How Should I Set the Memory Management Parameters?

At a high level, there are two memory segments:
- Shared memory: also called the system global area (SGA), this is the memory used by the Oracle instance.
- Session-based memory: also called the program global area (PGA), this is the memory occupied by sessions in the database. It is used to perform database operations such as sorts and aggregations.

Oracle Database can automatically tune the distribution of the memory components in these two memory areas. As a result, you need to set only the following parameters:

SGA_TARGET
The SGA_TARGET parameter is the amount of memory you want to allocate for shared memory. For a data warehouse, the SGA can be relatively small compared to the total memory consumed by the PGA. To get started, assign 25% of the total memory you allow Oracle Database to use to the SGA, with a minimum SGA of 100 MB.

PGA_AGGREGATE_TARGET
The PGA_AGGREGATE_TARGET parameter is the target amount of memory that you want the total PGA across all sessions to consume. As a starting point, you can use the following formula to define the PGA_AGGREGATE_TARGET value:

PGA_AGGREGATE_TARGET = 3 * SGA_TARGET

If you do not have enough physical memory for the PGA_AGGREGATE_TARGET to fit in memory, then reduce PGA_AGGREGATE_TARGET.
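A small sketch of these starting values; the 25% share, 100 MB minimum, and 3x factor come from the text, while the function name and the 12 GB example are illustrative only.

def starting_memory_targets(total_memory_for_oracle_mb):
    """Suggested starting points for SGA_TARGET and PGA_AGGREGATE_TARGET (in MB)."""
    sga_target = max(int(total_memory_for_oracle_mb * 0.25), 100)  # 25% of memory, at least 100 MB
    pga_aggregate_target = 3 * sga_target                          # PGA_AGGREGATE_TARGET = 3 * SGA_TARGET
    return sga_target, pga_aggregate_target

# Example: 12 GB (12288 MB) given to Oracle -> SGA_TARGET=3072 MB, PGA_AGGREGATE_TARGET=9216 MB
print(starting_memory_targets(12288))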

MEMORY_TARGET and MEMORY_MAX_TARGET
The MEMORY_TARGET parameter enables you to set a target memory size, and the related initialization parameter, MEMORY_MAX_TARGET, sets a maximum target memory size. The database then tunes to the target memory size, redistributing memory as needed between the system global area (SGA) and the aggregate program global area (PGA). Because the target memory initialization parameter is dynamic, you can change the target memory size at any time without restarting the database. The maximum memory size serves as an upper limit so that you cannot accidentally set the target memory size too high. Because certain SGA components either cannot easily shrink or must remain at a minimum size, the database also prevents you from setting the target memory size too low.

Example: Setting an Initialization Parameter

You can set an initialization parameter by issuing an ALTER SYSTEM statement, as illustrated by the following:

ALTER SYSTEM SET SGA_TARGET = 1024M;

What Other Initialization Parameter Settings Are Important?

A good starting point for a data warehouse is the data warehouse template database that you can select when you run the Database Configuration Assistant (DBCA). However, any database will be acceptable as long as you take the following initialization parameters into account:

COMPATIBLE
The COMPATIBLE parameter identifies the level of compatibility that the database has with earlier releases. To benefit from the latest features, set the COMPATIBLE parameter to your database release number.

OPTIMIZER_FEATURES_ENABLE
To benefit from advanced cost-based optimizer features such as query rewrite, make sure this parameter is set to the value of the current database version.

DB_BLOCK_SIZE
The default value of 8 KB is appropriate for most data warehousing needs. If you intend to use table compression, consider a larger block size.

DB_FILE_MULTIBLOCK_READ_COUNT
The DB_FILE_MULTIBLOCK_READ_COUNT parameter enables reading several database blocks in a single operating-system read call. Because a typical workload on a data warehouse consists of many sequential I/Os, make sure you can take advantage of fewer large I/Os instead of many small I/Os. When setting this parameter, take into account the block size as well as the maximum I/O size of the operating system, and use the following formula:

DB_FILE_MULTIBLOCK_READ_COUNT * DB_BLOCK_SIZE = <maximum operating system I/O size>

Maximum operating-system I/O sizes vary between 64 KB and 1 MB.

PARALLEL_MAX_SERVERS
The PARALLEL_MAX_SERVERS parameter sets a resource limit on the maximum number of processes available for parallel execution. Parallel operations need at most twice the number of query server processes as the maximum degree of parallelism (DOP) attributed to any table in the operation. Oracle Database sets the PARALLEL_MAX_SERVERS parameter to a default value that is sufficient for most systems. The default value is as follows:

CPU_COUNT x PARALLEL_THREADS_PER_CPU x (2 if PGA_AGGREGATE_TARGET > 0; otherwise 1) x 5

This might not be enough for parallel queries on tables with higher DOP attributes. Oracle recommends that users who expect to run queries of higher DOP set PARALLEL_MAX_SERVERS as follows:

2 x DOP x NUMBER_OF_CONCURRENT_USERS
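The two formulas above are easy to script. This sketch uses hypothetical inputs (a 1 MB maximum OS I/O size, a DOP of 8, and four concurrent users) purely for illustration:

def multiblock_read_count(max_os_io_bytes, db_block_size_bytes=8192):
    """DB_FILE_MULTIBLOCK_READ_COUNT * DB_BLOCK_SIZE = <maximum OS I/O size>"""
    return max_os_io_bytes // db_block_size_bytes

def parallel_max_servers(dop, concurrent_users):
    """Recommended setting for high-DOP workloads: 2 x DOP x number of concurrent users."""
    return 2 * dop * concurrent_users

print(multiblock_read_count(1024 * 1024))  # 128 blocks for a 1 MB OS I/O size and 8 KB blocks
print(parallel_max_servers(8, 4))          # 64, matching the example that follows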

For example, setting the PARALLEL_MAX_SERVERS parameter to 64 allows you to run four parallel queries simultaneously, assuming that each query uses two slave sets with a DOP of eight for each set. If the hardware system is neither CPU-bound nor I/O-bound, then you can increase the number of concurrent parallel execution users on the system by adding more query server processes. When the system becomes CPU- or I/O-bound, however, adding more concurrent users becomes detrimental to the overall performance. Careful setting of the PARALLEL_MAX_SERVERS parameter is an effective method of restricting the number of concurrent parallel operations.

PARALLEL_ADAPTIVE_MULTI_USER
The PARALLEL_ADAPTIVE_MULTI_USER parameter, which can be TRUE or FALSE, defines whether the server uses an algorithm to dynamically determine the degree of parallelism for a particular statement depending on the current workload. To take advantage of this feature, set PARALLEL_ADAPTIVE_MULTI_USER to TRUE.

QUERY_REWRITE_ENABLED
To take advantage of query rewrite against materialized views, you must set this parameter to TRUE. This parameter defaults to TRUE.

QUERY_REWRITE_INTEGRITY
The default for the QUERY_REWRITE_INTEGRITY parameter is ENFORCED. This means that the database rewrites queries only against fully up-to-date materialized views, and only if it can rely on enabled and validated primary key, unique, and foreign key constraints. In TRUSTED mode, the optimizer trusts that the data in the materialized views is current and that the hierarchical relationships declared in dimensions and RELY constraints are correct.

STAR_TRANSFORMATION_ENABLED
To take advantage of highly optimized star transformations, make sure this parameter is set to TRUE.

Accessing Oracle Warehouse Builder

Oracle Warehouse Builder is a flexible tool that enables you to design and deploy various types of data management strategies, including traditional data warehouses. To enable Warehouse Builder, complete the following steps:
1) Ensure that you have access to either an Enterprise or a Standard Edition of Oracle Database 11g. Oracle Database 11g comes with the Warehouse Builder server components pre-installed, including a schema for the Warehouse Builder repository.
2) To use the default Warehouse Builder schema installed in Oracle Database 11g, first unlock the schema. Connect to SQL*Plus as the SYS or SYSDBA user and execute the following commands:

SQL> ALTER USER OWBSYS ACCOUNT UNLOCK;
SQL> ALTER USER OWBSYS IDENTIFIED BY owbsys_passwd;

3) Launch the Warehouse Builder Design Center. For Windows, select Start, Programs, Oracle, Warehouse Builder and then select Design Center. For UNIX, locate owb home/owb/bin/unix and then execute owbclient.sh.
4) Define a workspace and assign a user to the workspace. In the single Warehouse Builder repository, you can define multiple workspaces, with each workspace corresponding to a set of users working on related projects. For instance, you could create a workspace for each of the following environments: development, test, and production. For simplicity, create one workspace, MY_WORKSPACE, and assign a user.

In the Design Center dialog box, click Show Details and then Workspace Management. The Repository Assistant displays. Follow the prompts and accept the default settings in the Repository Assistant to create a workspace and assign a user as the workspace owner.
5) Log into the Design Center with the user name and password you created.

Installing the Oracle Warehouse Builder Demonstration

In subsequent topics, this guide uses exercises to illustrate how to consolidate data from multiple flat file sources, transform the data, and load it into a new relational target. To execute the exercises presented in this guide, download the Warehouse Builder demonstration. To facilitate your learning of the product, the demonstration provides you with flat file data and scripts that create various Warehouse Builder objects. To perform the Warehouse Builder exercises presented in this guide, complete the following steps:
1) Download the demonstration. The demonstration consists of a set of files in a zip file called owb_demo.zip, which is available at the following link:
http://www.oracle.com/technology/obe/admin/owb10gr2_gs.htm
The zip file includes a SQL script, two source files in comma-separated values format, and 19 scripts written in Tcl.
2) Edit the script owbdemoinit.tcl. This script defines and sets variables used by the other Tcl scripts. Edit the following variables to match the values in your computer environment:

set tempspace TEMP
set owbclientpwd workspace_owner
set sysuser sys
set syspwd pwd
set host hostname
set port portnumber
set service servicename
set project owb_project_name
set owbclient workspace_owner
set sourcedir drive:/newowbdemo
set indexspace USERS
set dataspace USERS
set snapspace USERS
set sqlpath drive:/oracle/11.1.0/db_1/BIN
set sid servicename

3) Execute the Tcl scripts from the Warehouse Builder scripting utility, OMB Plus. For Windows, select Start, Programs, Oracle, Warehouse Builder and then select OMB Plus. For UNIX, locate owb home/owb/bin/unix and then execute OMBPlus.sh. At the OMB+> prompt, type the following to change to the directory containing the scripts:

cd drive:\\newowbdemo\\

Execute all the Tcl scripts in the desired sequence by typing the following command:

source loadall.tcl

4) Launch the Design Center and log into it as the workspace owner, using the credentials you specified in the script owbdemoinit.tcl.

5) Verify that you successfully set up the Warehouse Builder client to follow the demonstration. In the Design Center, expand the Locations node, which is on the right side in the Connection Explorer. Expand Databases and then Oracle. The Oracle node should include the following locations:
OWB_REPOSITORY
SALES_WH_LOCATION
When you successfully install the Warehouse Builder demonstration, the Design Center displays an Oracle module named EXPENSE_WH.

5-1 Project Estimation

Introduction

The CS/10,000 project estimation tool suite allows you to generate, manage, and validate estimates of effort for a wide variety of projects. The estimation method uses proprietary AI technology to provide both task-based and effort-based estimation for your project.

Estimation

Traditionally, the accuracy of structured estimation methods such as function point analysis or COCOMO (line-of-code estimation) has been limited by two factors: (1) the number and types of parameters that can be evaluated and (2) the assignment of company-specific input values. In CS/10,000 these complex issues have been addressed by a patented expert system/neural network based AI system. The CS/10,000 Project Estimator processes information from a large number of sources to create a meaningful estimate for your project. It evaluates the project plan, project requirements, information about your working environment, and even different aspects of your corporate culture. On the basis of historical project data, the estimator also tunes input parameters to your specific environment automatically. You are shielded from the intricacies of the AI system behind a user-friendly interface that guides you every step of the way. Easy to use and highly accurate compared to other structured methods, the CS/10,000 estimation module contains all the tools that you need to effectively manage and document the entire estimation life cycle.

5-2 Risk Analysis

This rather thorny term, "risk analysis," camouflages a seemingly esoteric subject that hits virtually every American right where it hurts most: in the pocketbook! Risk analysis is the study of the chance of a damaging or life-threatening event happening in your life this very day. Fire! Flood! Theft! Earthquakes! Car accidents! Rabid T-Rex attacks! What are your chances today?

Photo: Courtesy of NGDC/NOAA

The basic ideas of risk analysis are quite simple, but the application can be quite complicated depending on the risk evaluated. But don't let that stop you! Insurance agents know all about risk analysis, since that is how they decide how much you will pay for insurance, and they are just normal people like you. To give you an idea of what's involved, let's look at a simple example: auto insurance premiums. The total cost an insurance company has to cover each year for a particular type of accident is computed as follows:

Total Cost = (Cost of each accident) x (# Accidents per year)

The number of accidents per year refers to the total number of accidents experienced by the group of people insured by the company. The actual number of accidents within a given group of people will, of course, change somewhat from year to year. So, to get a good estimate of the average yearly cost, an insurance company will simply count the number of accidents for the group over an interval of several years and then divide the total number of accidents by the number of years.

The resulting number per year is called the probability of occurrence for the particular accident. To break even, the insurance company must collect the Total Cost from the group. So your annual premium, or the cost per person in the group to cover that particular accident, is:

Premium = (Cost of each accident) x (# Accidents per year) / (# of people in group)

Your premiums will actually be slightly larger (a few percent) so that the company can make a small profit. We have considered only the cost of a single type of accident, but everyone knows that there are many different kinds of accidents, from small fender-benders, costing a few hundred dollars to repair, to large, car-totaling crashes, costing tens of thousands of dollars to settle. So a more realistic way of thinking about your premiums is to think of the total cost of accidents as a sum like: (the cost of small accidents times the number of small accidents) + (the cost of large accidents times the number of large accidents), and so on. When you take a "deductible," you are saying that you will pay for any accidents you have costing less than a certain amount. Since you pay for one type of accident, the total cost to the insurance company is less, so the premium you pay is less.

You should also realize that different kinds of accidents have different probabilities of occurrence. For instance, experience shows that minor wrecks occur more often than large ones, so the probability of a large wreck is smaller than the probability of a minor wreck. The number of accidents also depends on the type of people in a group. In general, high school students have more accidents per person than their parents, so the students' premiums are larger than their parents'.
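A minimal sketch of the premium arithmetic described above, with made-up numbers; the "few percent" profit margin is set to 3% here purely for illustration.

def annual_premium(cost_per_accident, accidents_per_year, people_in_group, profit_margin=0.03):
    """Premium = (cost of each accident) x (accidents per year) / (people in group), plus a small profit."""
    total_cost = cost_per_accident * accidents_per_year
    break_even_premium = total_cost / people_in_group
    return break_even_premium * (1 + profit_margin)

# Hypothetical example: $4,000 per accident, 500 accidents per year, 10,000 insured drivers
print(round(annual_premium(4000, 500, 10000), 2))  # ~206.0 per person per year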

Just for comparison, here are the chances of a few types of interesting events occurring in your life this year:

Event                           Chance This Year
Car stolen                      1 in 100
House catch fire                1 in 200
Die from Heart Disease          1 in 280
Die of Cancer                   1 in 500
Die in Car wreck                1 in 6,000
Die by Homicide                 1 in 10,000
Die of AIDS                     1 in 11,000
Die of Tuberculosis             1 in 200,000
Win a state lottery             1 in 1 million
Die in commercial plane crash   1 in 1 million to 10 million (depends on airline)
Killed by lightning             1 in 1.4 million
Killed by flood or tornado      1 in 2 million
Killed in Hurricane             1 in 6 million

Note that these numbers are national averages. They will be somewhat different depending on where you live, how old you are, whether or not you smoke or exercise, whether you like to golf during thunderstorms, and which airline you choose. On the other hand, there are some useful comparisons. For example, you are about 1,000 times more likely to be killed on the way to the airport than to die during the airplane flight. You are about 10,000 times more likely to die from any cause, or to have your car stolen, than to win a lottery.

Now let's apply risk analysis to natural hazards. How do you estimate the annual probability of an occurrence of a particular type of natural disaster, especially of disasters that don't happen every day? In just the same way as above: count the number of events of that type over an interval of time and divide the sum by the number of years in the interval. For example, based on geologic evidence, the 14 Cascade volcanoes have erupted 50 times in the last 4000 years. So the probability of eruption for any given volcano in the Cascades in any given year is 50/[(14)(4000)], or about 1 in a thousand (10^-3) per year. This translates into about 1 or 2 eruptions among the 14 Cascade volcanoes each century. All of these 50 eruptions were relatively small ones, even the 1980 eruption of Mount St. Helens! However, there is geologic evidence of eruptions more than 100x larger than Mount St. Helens in the Cascades. How often do these very large eruptions occur? The data are:

Volcano                         Age (yrs)   Magma Volume   Caldera Size
Crater Lake, Or (Mount Mazama)  6900        ~40 mi3        5 x 6 mi
Newberry, Or                    ~350,000    >5-10 mi3      4 x 5 mi
Newberry, Or                    <500,000    >10 mi3        4 x 5 mi
Long Valley, Ca                 700,000     ~140 mi3       15 x 20 mi

Four eruptions in about a million years. This implies a probability of eruption of about 1 in 250,000 per year. Now, how accurate is this estimate? We must be careful because the so-called "statistics of small numbers" can be very misleading. In statistics, we are looking for typical events, not unique ones. If we looked at only the last 10,000 years, we would find only one event: the Mazama eruption. If we assumed the Mazama eruption to be typical, a simple calculation gives an annual probability of eruption of 1 in 10,000 ("Aaaaaaahhh! We're all gonna' die!"). But what if the Mazama eruption was unique? A one-shot deal? Then there would be no giant eruptions in the future, and the actual probability of eruption would be zero ("No worries, Mate!"). Which is right? Unfortunately there is no way to tell whether a single event is unique or not. The best that we can do to find out is to try to find other similar events. We obtained our list of four events by expanding the time period of our search to over 200 times as long as the period considered for the smaller eruptions. Four is still not a large number for statistical analysis, but it is better than one. Not only that, but one of the eruptions, Long Valley, is not even in the Cascades, only nearby.

It may or may not be affected by the conditions driving large eruptions in the Cascades. Let's look at what we have. On the one hand, there has been more than one Mazama-scale eruption in the Cascades, so Mazama is not unique and the probability of eruption is greater than zero. On the other hand, the 1 in 10,000 probability we calculated by treating Mazama as typical would predict about 100 Mazama-scale eruptions less than a million years old in the Cascades; there are actually only (at most) four. Consequently, in the absence of additional data, we can accept the 1 in 250,000 eruption probability as approximately correct, but we must also recognize the uncertainty in that number and the need to gather additional data if possible.

One more point: if we compare the probability of small Cascade eruptions with the probability of large eruptions, we find that small eruptions are much more likely than large ones. This result is consistent with much experience in dealing with many other types of natural phenomena (earthquakes, hurricanes, tornadoes, floods, etc.): large events are much less probable than small events of the same kind, since the extreme conditions required to bring them about are unusual in themselves and usually require enormous concentrations of energy.
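The event-counting estimate used above is easy to reproduce; this small sketch uses the numbers from the text (the function name is just for illustration).

def annual_probability(event_count, years, sites=1):
    """Annual probability of occurrence = events / (sites * years of record)."""
    return event_count / (sites * years)

# Small Cascade eruptions: 50 eruptions, 14 volcanoes, 4000 years -> about 1 in 1,100 per volcano per year
print(1 / annual_probability(50, 4000, sites=14))

# Very large (Mazama-scale) eruptions: 4 events in ~1,000,000 years -> 1 in 250,000 per year
print(1 / annual_probability(4, 1_000_000))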

5-3 Managing Risk

Introduction

Every business faces risks that could present threats to its success. Risk is defined as the probability of an event and its consequences. Risk management is the practice of using processes, methods and tools for managing these risks. Risk management focuses on identifying what could go wrong, evaluating which risks should be dealt with and implementing strategies to deal with those risks. Businesses that have identified their risks will be better prepared and have a more cost-effective way of dealing with them.

This guide sets out how to identify the risks your business may face. It also looks at how to implement an effective risk management policy and programme, which can increase your business' chances of success and reduce the possibility of failure.

The risk management process

Businesses face many risks, therefore risk management should be a central part of any business' strategic management. Risk management helps you to identify and address the risks facing your business and, in doing so, increase the likelihood of successfully achieving your business's objectives. A risk management process involves:
- methodically identifying the risks surrounding your business activities
- assessing the likelihood of an event occurring
- understanding how to respond to these events
- putting systems in place to deal with the consequences
- monitoring the effectiveness of your risk management approaches and controls

As a result, the process of risk management:
- improves decision-making, planning and prioritisation
- helps you allocate capital and resources more efficiently

- allows you to anticipate what may go wrong, minimising the amount of fire-fighting you have to do or, in a worst-case scenario, preventing a disaster or serious financial loss
- significantly improves the probability that you will deliver your business plan on time and to budget

Risk management becomes even more important if your business decides to try something new, for example launching a new product or entering new markets. Competitors following you into these markets, or breakthroughs in technology which make your product redundant, are two risks you may want to consider in cases such as these.

For more information on good practice in risk management, download a guide to risk management for SMEs from the Institute of Chartered Accountants in England and Wales (ICAEW) website (PDF, 93K). Your business may also benefit from implementing a risk management standard, which could provide guidance on reducing the impact of everyday and extreme events. The British Standards Institution (BSI) has published British Standard BS 31100: 2008 Risk Management - Code of Practice, which sets out a detailed description of the risk management process and the framework required to support the process. Find information on risk management standards on the BSI website.

5-4 Return On Investment - ROI

What Does Return On Investment - ROI Mean?

A performance measure used to evaluate the efficiency of an investment or to compare the efficiency of a number of different investments. To calculate ROI, the benefit (return) of an investment is divided by the cost of the investment; the result is expressed as a percentage or a ratio. The return on investment formula:

ROI = (Gains from Investment - Cost of Investment) / Cost of Investment

In this formula, "gains from investment" refers to the proceeds obtained from selling the investment of interest.
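A quick illustration of the formula, using hypothetical numbers:

def roi(gains_from_investment, cost_of_investment):
    """ROI = (gains from investment - cost of investment) / cost of investment."""
    return (gains_from_investment - cost_of_investment) / cost_of_investment

# Hypothetical example: shares bought for $1,000 and later sold for $1,200
print(f"{roi(1200, 1000):.0%}")  # 20%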

Return on investment is a very popular metric because of its versatility and simplicity. That is, if an investment does not have a positive ROI, or if there are other opportunities with a higher ROI, then the investment should not be undertaken.

Investopedia explains Return On Investment - ROI

Keep in mind that the calculation for return on investment, and therefore the definition, can be modified to suit the situation; it all depends on what you include as returns and costs. The definition of the term in the broadest sense simply attempts to measure the profitability of an investment and, as such, there is no one "right" calculation. For example, a marketer may compare two different products by dividing the gross profit that each product has generated by its respective marketing expenses. A financial analyst, however, may compare the same two products using an entirely different ROI calculation, perhaps by dividing the net income of an investment by the total value of all resources that have been employed to make and sell the product. This flexibility has a downside, as ROI calculations can be easily manipulated to suit the user's purposes, and the result can be expressed in many different ways. When using this metric, make sure you understand what inputs are being used.

6-a What is data mining and what is the scope of data mining?

Overview

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous Innovation

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example

For example, one Midwest grocery chain used the data mining capability of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.

Data, Information, and Knowledge

Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
- operational or transactional data, such as sales, cost, inventory, payroll, and accounting
- nonoperational data, such as industry sales, forecast data, and macroeconomic data
- metadata - data about the data itself, such as logical database design or data dictionary definitions

Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies, and equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game. By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two.

Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data with application software.
- Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique (a minimal sketch follows this list).
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
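The following is a minimal, illustrative sketch of the nearest neighbor method mentioned above; the sample data and segment labels are made up and not drawn from any real system.

from collections import Counter

def knn_classify(record, history, k=3):
    """Classify a record by majority vote of the k most similar historical records."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(history, key=lambda item: distance(record, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical history: (monthly spend, visits per month) -> customer segment
history = [((120, 2), "occasional"), ((480, 9), "loyal"),
           ((95, 1), "occasional"), ((520, 11), "loyal"), ((300, 6), "loyal")]
print(knn_classify((350, 7), history))  # "loyal"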

What technological infrastructure is required?

Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million per terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:
- Size of the database: the more data being processed and maintained, the more powerful the system required.
- Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.

6-b

7-1 Data selection

Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that is not supportive of a research hypothesis) and interactive/active data selection (using collected data for monitoring activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can impact data integrity.

The primary objective of data selection is the determination of the appropriate data type, source, and instrument(s) that allow investigators to adequately answer research questions. This determination is often discipline-specific and is primarily driven by the nature of the investigation, the existing literature, and accessibility to necessary data sources. Integrity issues can arise when the decisions about which data to collect are based primarily on cost and convenience considerations rather than on the ability of the data to adequately answer the research questions. Certainly, cost and convenience are valid factors in the decision-making process. However, researchers should assess to what degree these factors might compromise the integrity of the research endeavor.

Considerations/issues in data selection

There are a number of issues that researchers should be aware of when selecting data. These include determining:
- the appropriate type and sources of data which permit investigators to adequately answer the stated research questions,
- suitable procedures in order to obtain a representative sample,
- the proper instruments to collect data.

There should be compatibility between the type/source of data and the mechanisms to collect it. It is difficult to separate the selection of the type/source of data from the instruments used to collect the data.

7-2 Data cleaning

Poor data quality is a well-known problem in data warehouses that arises for a variety of reasons, such as data entry errors and differences in data representation among data sources. For example, one source may use abbreviated state names while another source may use fully expanded state names. However, high quality data is essential for accurate data analysis. Data cleaning is the process of detecting and correcting errors and inconsistencies in data.
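As a small illustration of the state-name inconsistency mentioned above, here is a hedged sketch; the mapping, field names, and sample records are all made up.

# Normalize inconsistent state names coming from two sources before loading the warehouse.
STATE_NAMES = {"CA": "California", "NY": "New York", "TX": "Texas"}  # abbreviated -> expanded

def clean_state(value):
    """Return a fully expanded, consistently cased state name."""
    value = value.strip()
    return STATE_NAMES.get(value.upper(), value.title())

records = [{"customer": 1, "state": "CA"}, {"customer": 2, "state": "new york"}]
for row in records:
    row["state"] = clean_state(row["state"])
print(records)  # [{'customer': 1, 'state': 'California'}, {'customer': 2, 'state': 'New York'}]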

7-3 Data reduction

Data reduction is the process of minimizing the amount of data that needs to be stored in a data storage environment. Data reduction can increase storage efficiency and reduce costs. It can be achieved using several different types of technologies. The best known, data deduplication, eliminates redundant data on storage systems. Data archiving and data compression can also reduce the amount of data that needs to be stored on primary storage systems: data archiving works by moving infrequently accessed data to secondary data storage systems, while data compression reduces the size of a file by removing redundant information so that less disk space is required.

7-4 Data enrichment

Data enrichment is a value-adding process in which external data from multiple sources is added to the existing data set to enhance the quality and richness of the data. This process provides more information about the product to the customer.
