
Decision Support Systems 42 (2006) 121–143
www.elsevier.com/locate/dsw

A framework for data warehouse refresh policies


Michael V. Mannino *, Zhiping Walter
The Business School, University of Colorado at Denver and Health Sciences Center, P.O. Box 173364, Campus Box 165, Denver, CO 80217-3364, USA

Available online 10 December 2004

Abstract

In a field study to explore influences on data warehouse refresh policies, we interviewed data warehouse administrators from 13 organizations about data warehouse details and organizational background. The dominant refresh strategy reported was daily refresh during nonbusiness hours with some deviations due to operational decision making and data source availability. As a result of the study, we developed a framework consisting of short-term and long-term influences on refresh policies along with traditional information system success variables influenced by refresh policies. The framework suggests the need for research about process design, data timeliness valuation, and optimal refresh policy design.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Data warehouse; Refresh process; Data source; Refresh policy

1. Introduction

Data warehouse, a term coined by William Inmon in 1990, refers to a central data repository where data from operational databases and other sources are integrated, cleaned, and archived to support decision making. A data warehouse provides management with convenient access to large volumes of internal and external data. Because of the potential benefits, most medium to large organizations operate data warehouses. Many of these organizations have operated data warehouses for 5 years or more with
* Corresponding author. Tel.: +1 303 5566615; fax: +1 303 5565899. E-mail addresses: Michael.Mannino@cudenver.edu (M.V. Mannino), Zhiping.Walter@cudenver.edu (Z. Walter).
0167-9236/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2004.11.002

continuing development to increase the size and scope of the data warehouses. Data warehouse refreshment is a complex process comprising many tasks, such as extraction, transformation, integration, cleaning, key management, history management, and loading [7,24]. To fulfill decision support needs, data warehouses can use data from many internal and external sources. Reconciling the differences among data sources is a significant challenge especially considering that source systems typically cannot be changed. As its data sources change, a data warehouse should be refreshed in a timely manner to support decision making. As data sources change at different rates, the determination of the time and content to refresh can be a significant challenge. Because of these challenges, refreshing a data warehouse is perhaps


M.V. Mannino, Z. Walter / Decision Support Systems 42 (2006) 121–143

the most significant activity in the operation of a data warehouse. In this paper, we report an exploratory field study to reveal salient factors that influence data warehouse refresh policies. More specifically, this study addresses when organizations refresh their data warehouses and why. We interviewed 13 data warehouse administrators about data warehouse characteristics, refresh process details, data warehouse usage characteristics, organizational influences, and problems in the management of the refresh process. As part of the field study design, we define the refresh policy orientation to distinguish between user-driven refresh (more frequent with fewer data sources per refresh) and warehouse-driven refresh (less frequent with more data sources per refresh) policies. The framework derived from the field study consists of short-term and long-term influences on data warehouse refresh policies along with traditional information systems success variables influenced by refresh policies. Constraints on data sources and data warehouses are the dominant short-term factor directly affecting both information and system quality requirements as well as refresh policies. Other short-term influences on refresh policies are data warehouse characteristics (nature of data, nature of use, and type of user) and user expectations. Long-term factors influencing refresh policies are information technology investments that can relax constraints at both the source system and data warehouse levels. Organizational factors influence the motivation and ability of organizations to relax infrastructure constraints. Refresh policies influence information and system quality and, indirectly, information systems success factors (user satisfaction and perceived net benefits). The insights from the framework have a significant impact on the research agenda for the refresh process, particularly on process design, valuation of data timeliness, and optimal refresh policy.
As far as we are aware, this paper proposes the first framework about refresh policies supported by a field study. Most prior research assumes that a particular refresh policy is preferred, typically a user-driven policy. Research with this assumption emphasizes algorithms and architectures to reduce the effort or cost of on-demand refresh with incremental update of materialized views from distributed data sources. In addition, major database vendors aggressively promote products to support user-driven refresh policies. Some research has studied the complex nature of the refresh process and optimal refresh policies. This process-oriented research has not identified the factors influencing refresh policy determination, although the research has elucidated details of the refreshment process.

The remainder of this paper is organized as follows. Section 2 reviews previous research on data warehouse refresh approaches. Section 3 presents background for understanding refresh policies including refresh process background and refresh policy orientation. Section 4 presents the field study design, descriptive data, and analysis consisting of the refresh policy framework and implications for research and practice. Section 5 summarizes the paper and provides directions for future research.

2. Related work

The data warehouse refresh process has been studied intensely since the mid-1990s. The majority of research involves materialized views, which are stored and periodically refreshed. In contrast, traditional views are virtual tables that the DBMS either substitutes into queries or, in some cases, derives on demand. The materialized view update problem involves efficient refresh of distributed materialized views, typically in an incremental manner. There has been considerable research about the class of materialized views that can be incrementally updated [18], as well as algorithms for synchronizing networks of materialized views using distributed data sources [28–30,41,42]. The focus in these papers is the degree of consistency of the materialized views and efficient strategies for querying distributed data sources. These papers do not consider the complexities of the refresh process that restrict direct maintenance of materialized views from data sources. Thus, the research on the materialized view update problem is only indirectly related to the refresh policies studied in this paper. A directly related area involves the representation and control of the refresh process. Bouzeghoub et al. [7] developed a flexible workflow model to represent the refresh process. Their conceptual model provides a framework to understand the details of refresh

processes including constraints on data sources and data warehouses. We have used their conceptual model as a starting point for understanding refresh policy factors. Other researchers [38,39] have developed architectures to control the refresh process. The architecture developed in Ref. [38] is notable for its emphasis on data quality goals, specifically data timeliness. However, these conceptual representations and architectures provide neither a framework in which to understand refresh policies nor empirical evidence about refresh policies. Another closely related research area is the study of optimal refresh policies for data warehouses. Adelberg et al. [1] presented models to balance transaction deadlines with data warehouse currency. Srivastava and Rotem [36] developed a stochastic model to minimize a weighted sum of average processing cost (a system cost) and average waiting cost (a user cost). Their model did not consider data staleness costs as queries using materialized views always have current data. Segev and Fang [34] developed a stochastic model to determine time-based and query-based refresh policies. Their model used currency constraints on queries rather than minimizing data staleness costs. Zhang et al. [12,13] developed a stochastic model to determine the refresh frequency that minimizes the sum of the data staleness cost and synchronization cost, balancing data staleness against data warehouse availability. For flexibility, the refresh frequency can be computed as a time-based, a query-based, or an update-based policy. The applicability of these models to the organizations participating in this study is discussed in Section 4. Measuring the impact of data source changes on data warehouse queries is an important issue for analytical models that determine refresh policies. Berndt and Fisher [6,16] defined measures for the impact of dimension changes on summary queries that are typically used by data warehouse users.
Their analysis showed that even dimensions with a high level of change can lead to a low level of volatility in query results. Calculating dimension volatility requires knowledge of both the dimension change rate and the usage of dimensions in data warehouse queries. Thus, analytical models using data staleness costs may require extensive measurement programs. As a final related area, empirical investigations about data warehouse success factors have some

overlap with the issues studied in this paper. In surveys of data warehouse managers and data suppliers, Wixom and Watson [40] identified data quality as measured by accuracy, comprehensiveness, consistency, and completeness as an important component of data warehouse success. In a survey of one large organization, Shin [35] indicated that users had a high level of satisfaction with data currency, although in separate interviews, data currency was listed as a problem. Both works provided evidence about the importance of data quality in data warehouse success, a factor related to the importance of constraints in the refresh process. Neither work provided evidence about refresh policies and factors influencing policy choices, the main topics of this paper.
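The stochastic refresh-policy models surveyed above share a core trade-off: refreshing more often lowers data staleness but raises processing and availability costs. The sketch below is a deliberately simplified illustration of that trade-off; the linear staleness cost and fixed per-run refresh cost are our assumptions, not the actual cost structures of the cited models.

```python
import math

def total_cost(period_h: float, staleness_cost: float, refresh_cost: float) -> float:
    """Average cost per hour when refreshing every period_h hours.

    With one refresh per period, data are on average period_h / 2 hours
    stale (cost staleness_cost per hour of average lag), and each refresh
    run costs refresh_cost, amortized over the period.
    """
    return staleness_cost * period_h / 2 + refresh_cost / period_h

def optimal_period(staleness_cost: float, refresh_cost: float) -> float:
    """Period minimizing total_cost.

    Setting d/dT (c_s * T / 2 + c_r / T) = 0 gives T* = sqrt(2 * c_r / c_s).
    """
    return math.sqrt(2 * refresh_cost / staleness_cost)

# Hypothetical numbers: staleness costs $4 per hour of average lag,
# one refresh run costs $50 of processing and downtime.
t_star = optimal_period(4.0, 50.0)  # 5.0 hours between refreshes
```

Under these assumptions, the optimal period grows with the square root of the per-run refresh cost, which is one way to see why organizations with expensive, complex load processes gravitate toward less frequent, warehouse-driven policies.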

3. Refresh processes and policies

This section discusses refresh policies for data warehouses, the essential issue of this field study. Before presenting the refresh policy orientation, we provide background about the terminology and characteristics of the refresh process.

3.1. Refresh process background

Refreshing a data warehouse is a complex process that involves management of time differences between updating of data sources and updating of the related data warehouse objects (base tables, materialized views, data cubes, data marts). In Fig. 1, valid time lag is the difference between the occurrence of an event in the real world (valid time) and the storage of the event in an operational database (transaction time). Load time lag is the difference between transaction time and the storage of the event in a data warehouse (load time). For internal data sources, there may be some control over valid time lag. For external data sources, there is usually no control over valid time lag. To refresh a data warehouse, change data are collected from each data source. Change data comprise new source data (insertions) and modifications to existing source data (updates and deletions). Furthermore, change data can affect fact tables and/or dimension tables. The most common change data involve insertions of new facts. Insertions of new


Fig. 1. Overview of the data warehouse refresh process.

dimensions and modifications of dimensions are less common but still important to capture. Depending on the characteristics of the data source, change data can be described as cooperative, queryable, logged, or snapshot. Because cooperative change data involve modifications to source systems (usually triggers), they are the least common format for change data. Queryable change data are most applicable for fact tables using fields, such as order date, shipment date, and hire date, that are stored in operational data sources. Since few data sources contain timestamps for all data, queryable change data usually are augmented with other kinds of change data. Logged change data usually involve no changes to a source system because logs are readily available for most source systems. However, logs can contain large amounts of irrelevant data. Snapshots have no source system requirements, although they require a difference operation with a previous snapshot to determine change data.
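The snapshot difference operation can be made concrete with a small sketch. The table contents and key values below are hypothetical; the point is only how comparing the current snapshot with the previous one, keyed by primary key, classifies rows into insertions, updates, and deletions.

```python
def snapshot_diff(previous: dict[str, dict], current: dict[str, dict]):
    """Derive change data by differencing two snapshots keyed by primary key.

    Returns (insertions, updates, deletions): rows only in current,
    rows in both with changed values, and rows only in previous.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = {k: previous[k] for k in previous if k not in current}
    return inserts, updates, deletes

# Hypothetical customer snapshots from two consecutive extraction runs
old = {"C1": {"city": "Denver"}, "C2": {"city": "Boulder"}}
new = {"C1": {"city": "Aurora"}, "C3": {"city": "Golden"}}
ins, upd, dele = snapshot_diff(old, new)
# ins has key "C3", upd has key "C1", dele has key "C2"
```

The quadratic cost hinted at here (every row compared against the previous snapshot) is one reason snapshot change data are the fallback rather than the preferred format when logs or timestamps are available.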
Usage of change data in a refresh process is often subject to constraints [7,38] involving both source systems and data warehouses, as summarized in Table 1. Source access constraints can be due to legacy technology with restricted scalability for internal data sources or coordination problems for external data sources. Integration constraints often involve identification of common entities, such as customers and transactions, across data sources. Completeness/consistency constraints can involve time consistency of the change data or inclusion of change data from each data source for completeness. Data warehouse availability constraints often involve conflicts between online availability and warehouse loading.

Table 1
Summary of refresh constraints
Source access: restrictions on the time and frequency of extracting change data
Integration: processing restrictions that require concurrent reconciliation of change data
Completeness/consistency: loading restrictions that require loading of change data in the same refresh period
Data warehouse availability: load scheduling restrictions due to resource issues including storage capacity, online availability, and server usage

In addition to the processing details of change data and constraints, the refresh process can be understood according to the dimensions of human involvement, task structure, and task complexity [17]. On the human involvement dimension, the refresh process has no human involvement except for exception and


alert handling. For mature refresh processes, exceptions (bad source data) and alerts (process failures, such as disk space problems) should occur infrequently. On the task structure dimension, the refresh process involves high complexity [7] with many steps, complex dependencies, and parallel process execution. On the task complexity dimension, the refresh process has moderate complexity. Decisions about individual records involving cleaning, integration, and transformation are routine unless exceptions occur, in which case human intervention is necessary.

3.2. Refresh policy orientation

To classify data warehouse refresh policies, we initially examined refresh policies for distributed information services. These policies can be described according to the interaction of clients and servers. A push policy involves server-initiated delivery of information in anticipation of user needs; a common push policy is for a server to refresh periodically after either a number of local updates or a period of time, as in periodic refresh of sports results. A pull policy involves client-initiated information requests, as in traditional web browsing when a user initiates a request for a web page. Cybenko and Brewington [9] provide a precise definition of push and pull refresh policies involving the number of messages and the message size. Traditional push versus pull notions are not useful to differentiate among data warehouse refresh policies. Most data warehouses exclusively use push policies due to process complexity and emphasis on adding value through cleaning and integration. Thus, we focus on the frequency and scope of refresh to capture important variations of data warehouse refresh policies. Frequency refers to the number of times per time

interval that each refresh job is performed. Scope refers to the number of data sources and the volumes of each data source involved in a refresh job. The frequency along with the duration determines the load time lag, while the scope indicates the volume of data refreshed. Using the frequency and scope criteria, we characterize data warehouse refresh policies as user-driven or warehouse-driven (Fig. 2). A user-driven policy has more frequent refreshes with narrower scopes, leading to a smaller load time lag. This type of policy caters to specific user timeliness requirements on relevant data sources. A warehouse-driven policy has less frequent refreshes with wider scopes, leading to a larger load time lag. This type of policy emphasizes constraints and cost control. Note that an organization's refresh policy can lie anywhere on the continuum from a more user-driven policy to a more warehouse-driven policy. The trade literature has recognized the issue of refresh policy for data warehouses. Data repositories with high refresh frequency are known by various names, such as operational data stores [4,22], real-time online analytic processing [26], virtual data warehouses [21], and right-time data warehouses [8]. Each of these concepts emphasizes integrated, enterprise-wide data for operational and tactical decision making. In essence, each of these concepts is a data warehouse with more frequent refresh. In this paper, we use the notion of refresh policy orientation depicted in Fig. 2 to unify data warehouses with these other proposed concepts. In this sense, we view a data warehouse as the infrastructure that provides enterprise-wide data for decision making at any decision level.
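As an illustration of the frequency and scope criteria, the following sketch places a policy on the continuum of Fig. 2. The numeric thresholds are invented for illustration; the orientation itself is defined only qualitatively.

```python
def policy_orientation(refreshes_per_day: float, sources_per_job: int,
                       total_sources: int) -> str:
    """Rough classification of refresh policy orientation (cf. Fig. 2).

    High frequency with narrow scope leans user-driven; low frequency
    with wide scope leans warehouse-driven. Thresholds are illustrative
    assumptions, not definitions from the study.
    """
    scope = sources_per_job / total_sources  # fraction of sources per job
    if refreshes_per_day > 1 and scope < 0.5:
        return "user-driven"
    if refreshes_per_day <= 1 and scope >= 0.5:
        return "warehouse-driven"
    return "mixed"

# One nightly job refreshing all 20 sources: warehouse-driven.
nightly = policy_orientation(1, 20, 20)
# 15-minute refresh of 2 of 20 sources (roughly Organization 1's
# order-status feed): user-driven.
frequent = policy_orientation(96, 2, 20)
```

A real classification would also weigh data volumes per source and the load time lag, which this two-variable sketch ignores.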

4. Field study and analysis

This section presents a field study to elicit the factors influencing refresh policy determinations.

Fig. 2. Refresh policy orientation.


After a description of the field study design, we present the data and analysis.

4.1. Field study design

From June to October 2003, we conducted a field survey using a multiple case design [5,15] to study the refresh processes of organizations with moderate to large size data warehouses. The case study methodology is especially useful in exploratory studies, and multiple cases provide external validity for theory development. Site selection was based on theoretical sampling to provide a cross-section of industries and a diversity of data warehouse characteristics including nature of data, type of users (internal and external), type of decisions supported (operational, tactical, and strategic), and technology level. After completing some initial interviews, we identified additional organizations that would provide more diversity in the site selection criteria. Table 2 summarizes the final selection of organizations that consented to participate

in the study. Due to the terms in our consent agreement, organizations are not identified in Table 2. We interviewed data warehouse administrators at each organization listed in Table 2. All interviewees held management positions one to three levels beneath the CIO level. Typical job titles were development manager for data warehouse, director of business intelligence, and manager of database administration. The interview format was semistructured with both planned questions (Table 3) and ad hoc questions often resulting from planned questions. Each interview involved 1 to 3 h of interview time with an initial interview from 1 to 2 h and a follow-up interview if necessary. Most interviews were conducted in person by at least one investigator. To encourage participation, recording equipment was not used. Instead, interviews were recorded using detailed field notes and documents provided by interviewees. During a typical interview, an organization's representative would respond to questions as well as provide documents about the organization's data warehouse

Table 2
Summary of organizations (industry; organization size; data warehouse scope; year established)
1. Retail services (online and store); 1300 stores; online and point-of-sale transactions, order processing; 2000
2. Health care services (equipment); 12,700 employees, $1.7 billion revenue; treatment tracking, general ledger, inventory, billing, security profiles; 2000
3. Telecommunications (wholesale); 6000 employees, $3 billion revenue; order processing, network traffic; 1998
4. Manufacturing (building supplies); 9000 employees, $2 billion revenue; order processing, finance; 1994
5. Telecommunications (retail); 50,000 employees, $19.7 billion revenue; customer contacts, order processing; 1994
6. Financial services (credit card); 29,000 employees, $8 billion revenue; financial institutions using debit and credit cards; 1990 to 2002
7. Engineering services; 10,500 employees, $2.4 billion revenue; financial management, human resources; 1998
8. Health care (hospital); 3000 employees, $350 million revenue; financial management; 1993
9. Manufacturing (aerospace and defense); 125,000 employees, $26 billion revenue; data marts for programs (engineering, procurement, and manufacturing); 1992 to 2003
10. Financial services (insurance); 6800 employees, $3.0 billion revenue; health insurance claims, policies, and members; 1999
11. Wholesale services (office supplies); 12,800 employees, $5 billion revenue; order and warehouse management, procurement, human resources, finance; 1997
12. Financial services (credit card); $31 billion revenue; credit card purchases, bills, and payments; 2002
13. Manufacturing (beverages); 5700 employees, $4 billion revenue; marketing, finance, operations, and distribution; 1996

Table 3
Summary of interview areas
Organizational representative: title, responsibilities, organization position relative to CIO, data warehouse experience
Data warehouse characteristics: development history, data warehouse architecture, physical size, schema size, scope of data warehouse, source systems, change data volumes
Refresh process: change data characteristics, refresh constraints, refresh costs, refresh schedule
Data warehouse usage: user characteristics, query characteristics (level of detail and planning), sensitivity to data timeliness
Organization issues: source system technology, source system control, funding mechanism, senior management involvement, investments affecting refresh process, problems involving refresh process

and refresh process. Some of those questions required specific answers, but most were open-ended questions. The interview format evolved as the study progressed. Follow-up interviews were conducted to probe areas that became relevant after later interviews.

4.2. Descriptive data

In the initial part of each interview, we tried to understand the boundary between the data warehouse under discussion and associated application systems. Most interviewees provided a diagram to show the scope of the data warehouse and the boundary between application systems and the data warehouse. In addition, we confirmed that each data warehouse had traditional data warehouse characteristics: subject-oriented, integrated, time-variant, and nonvolatile. We then asked about characteristics of the data warehouses to provide additional context for later question areas. The field study involved data warehouses of varying sizes as summarized in Table 4. Appendix A provides complete details about the size characteristics of data warehouses in each organization. Taken collectively, the four areas in Table 4 provide reasonable measures of data warehouse size and complexity. The units for the size areas are not uniform because organizations could not provide standardized data. Since our sampling objective emphasized diversity of organizations and data warehouse characteristics, some data warehouses were modest in size and scope. Of the data warehouses of more modest size and scope, Organization 6 provided customer-oriented data marts, Organization 9 provided program-oriented data marts, and Organization 8 was the smallest organization studied.

Table 4
Summary of data warehouse size characteristics (characteristic: measures; data range; typical values)
Schema size: number of tables (dimension and fact) and star schemas; 32 to 340 tables, 33 to 100 star schemas; 100 tables
Physical size: number of bytes and rows; 15 GB to 150 TB, 500 to 600 million rows; 1.5 TB
Source data: number of source systems, internal data sources, and external data sources; 3 to 70 source systems, 3 to 70 internal data sources, 2 to 600 external data sources; 20 source systems
Change data volumes: number of records per time interval; thousands of rows per day to 200 million rows per day; millions of rows per day

4.2.1. Refresh policies reported

After obtaining background about the data warehouses, we probed for details about the refresh policies. The most common policy reported was daily refresh during nonbusiness hours. This policy was used by all but two organizations (Organizations 1 and 3) to refresh most data sources, while a different refresh frequency was used for a few data sources as noted in


Table 5. In terms of refresh policy orientation, Organizations 4, 7, 8, 9, 11, 12, and 13 had a full warehouse-driven policy, Organizations 2, 5, 6, and 10 had a moderate warehouse-driven policy, Organization 1 had a moderate user-driven policy, and Organization 3 had a full user-driven policy. This distribution of reported policies is consistent with a recent informal survey [2]. Organization 3 had the most user-driven refresh policy in our study. Organization 3 made a design decision to use the data warehouse as the repository of network device data in place of building a separate source system. In addition, the modern infrastructure of its source systems and data warehouse supported frequent refresh. Organization 3 considered its data warehouse as the primary reporting tool for the entire organization. Source systems had a minor role in report generation and ad hoc queries. In contrast to Organization 3's decision to use the data warehouse as the primary reporting tool, Organization 1 had a more restricted user-driven policy driven by the inability of order procurement systems to support operational queries about order status and inventory status. It was more cost-effective to provide frequent refresh than to develop middleware to communicate among its order procurement systems.

4.2.2. Refresh process constraints reported

To provide more background about refresh policies, we directed questions about refresh processes, especially constraints affecting refresh processes. Most organizations noted a variety of constraints in their refresh processes as summarized in Table 6.
Table 5
Summary of refresh policy deviations (data sources with non-daily refresh frequencies)
Organization 1: 15 min for order status; 3 h for inventory status
Organization 2: security profiles (15 min) and contact information (2/day)
Organization 3: 5- to 10-min refresh for call logs; some billing data refreshed every 5 min; most dimension data refreshed hourly
Organization 4: monthly refresh for some plant-level data
Organization 5: refresh several times per day possible for two source systems
Organization 6: 10-min refresh for one data source
Organization 8: daily refresh with previous day's changes (24- to 48-h load time lag)
Organization 9: on-demand refresh possible for major business events
Organization 10: continuous refresh possible with middleware source data collection and routing
Organization 12: daily refreshes but load time lag is 18 to 36 h
Organization 13: weekly refreshes for distributor data with 5- to 12-day load time lag

Appendix A provides supporting details about the refresh process in each organization. Interviewee comments indicated that source access constraints and data warehouse availability constraints had the most impact on refresh processes. Integration constraints, although prevalent in our study, had less impact on setting refresh policies.

4.2.3. Data timeliness requirements reported

To provide a perspective on the demand for timely data, we inquired about data timeliness requirements. For most organizations in our study, data warehouse users required 1-day load time lag as evidenced by service level agreements and interviewee comments. Service level agreements were drawn between an IT organization and data warehouse users, both internal and external. Table 7 summarizes exceptions to daily timeliness requirements. Appendix A provides supporting details about data warehouse usage in each organization. Data timeliness requirements were dominated by daily reporting needs for updated business metrics. For some organizations with more frequent timeliness needs, the requirements might not match the refresh frequency. For example, Organization 1 had 15-min refreshes for some data sources, but the requested data timeliness was only several times per day. Organization 1 provided refresh as frequently as possible to satisfy peak demands for data timeliness.

4.2.4. Problems and investments reported

In the last part of the interviews, we asked about problems and investments that affected the refresh process. Table 8 summarizes the comments provided

Table 6
Summary of refresh constraints (constraint: number of organizations; comments)
Source access constraints: 10; typical constraint involves legacy systems several generations behind; constraints in some cases are due to one-generation-old technology (e.g., no row locking)
Integration constraints: 7; typical constraint involves different parts of a transaction managed by different source systems
Completeness/consistency constraints: 4; typical constraint involves time consistency
Data warehouse availability constraints: 8; typically the load step of refresh must occur when the data warehouse is offline

by the interviewees with supporting details in Appendix A. The most significant and frequent problem was managing the complexity of the refresh process. A majority of interviewees reported difficulties with the time duration as well as conflicts with source system processing and backups. Organizations made major investments to manage refresh process complexity including hardware, extraction-transformation-loading (ETL) software, and optimization work. Accommodating new data sources was the next most frequently mentioned problem. Interviewees cited the difficulty of changing a refresh process to accommodate new data sources and integrating a data warehouse with an ERP system. The other problems were minor in comparison to the first two problems in terms of the amount of investment.

4.3. Data analysis and refresh policy framework

We performed an iterative process to analyze the data. Shortly after each interview, we summarized and organized our resulting notes and documents. After summarization, we compared notes and identified key concepts while consulting the literature to provide background as needed. A preliminary list of categories
was identified based on the first interview notes. Each of the subsequent interview notes was used to expand and refine the initial list of categories. As new concepts emerged, we reviewed notes from earlier interviews and conducted additional data collection, if necessary. In the process of developing a category list, the interview notes were reviewed again to identify relationships among categories. Data supporting each category were compared across cases to focus on the causes of and barriers to specific refresh policies. As a result of this analysis, it became clear that an organization's refresh policy was driven by system success goals. Our analysis identified user satisfaction and perceived net benefits as important success goals of data warehouses. These two measures have been used in recent studies of data warehouse success [35,40], as well as in various studies of information systems success [3,14,19,32]. Using multiple measures is consistent with the multidimensional nature of information systems success [10,11,33]. User satisfaction in existing studies is measured mostly by asking respondents if their needs were satisfied for various aspects of a system, such as output quality, user interface, information quality, and

Table 7
Summary of exceptions to daily timeliness requirements
Organization 1: multiple refreshes per day for order and inventory status; increased data timeliness demands during peak retail seasons
Organization 2: daily timeliness except for contact information (2/day) and user profiles (2/h)
Organization 3: many times per day for network status; multiple times per day for order status
Organization 6: no stated business reason for data refreshed every 10 min
Organization 8: users accept 48 h timeliness
Organization 12: service level agreements for 2-day load time lag

130 Table 8 Summary of problems Problem

M.V. Mannino, Z. Walter / Decision Support Systems 42 (2006) 121143

Number of organizations 9 8 4 2

Reduce refresh processing times Accommodate significant amounts of new data Increase refresh frequency Consolidate resources (servers, disks, and software licenses)

system quality [3,14,23]. No organization in our study had formal surveys of user satisfaction. To ensure user satisfaction, most organizations created user committees that met regularly with data warehouse administrators. As a result, service level agreements were either drawn up or modified to ensure satisfaction of information and system quality needs. Hence, our study indicated that refresh policies were devised to ensure that resulting information and system quality would bring user satisfaction. In our framework, we capture this relationship as refresh policy indirectly affecting user satisfaction. Perceived net benefits [32,40] was expressed by many organizations in statements that emphasized the change in which business processes and decision making were affected by a data warehouse. For example, Organization 3 stated that they could resolve customer problems faster because the data warehouse provided hourly reports tracking order status across parts of the order procurement process. After justifying using both user satisfaction and perceived net benefits, we then identified related categories resulting in the structure shown in Fig. 3. In the information systems success literature, system

quality and information quality are considered determinants of user satisfaction and perceived net benefits [10,11,40]. In our model, system and information quality directly influences the success factors, and refresh policy directly influences system and information quality. Other practices that influence the quality factors are outside the scope of this study. The framework distinguishes between information and system quality requirements and quality levels obtained. Constraints directly influence both the system and data quality requirements, as well as the refresh policy. The general label, data warehouse characteristics, covers nature of data, nature of data warehouse use, and type of user. Organizational variables include the categories competitive environment, operational benefits, profit motive, and source system control. The relationship structure in Fig. 3 is divided between short-term and long-term effects. IT investments can influence constraints and thereby refresh policies in the long term. User expectations, data warehouse characteristics, and constraints influence requirements and refresh policies in the short term. In our interviews, we saw evidence of both short-term

Fig. 3. Factors and relationships in the refresh policy framework.

M.V. Mannino, Z. Walter / Decision Support Systems 42 (2006) 121143

131

and long-term influences on refresh policies. In the subsequent subsections, we discuss each of the determinant categories and relationships among categories. 4.3.1. Information and system quality Information and system quality have been linked to information systems success in many studies [10,11]. One of the information quality attributes, data timeliness was especially significant as reflected by interviewee comments. The refresh policy directly influenced data timeliness by definition. Most interviewees indicated that the data warehouse users were quite satisfied with the current daily timeliness of data, consistent with satisfaction about timeliness reported in Ref. [25]. For most organizations, the content of the data warehouse was more of a concern than data timeliness. For example, in Organization 4, users were happy to have data that were previously unavailable. In Organization 13, invoice data were unavailable before development of the data warehouse. Users would like more current invoice data, but they realized that incentives were not sufficient to motivate suppliers to provide data more frequently than weekly. Some interviewees commented that refresh policies also affected data consistency. For example, Organization 12 stated that one reason for daily refresh was instability when data were refreshed more often. Some users could become confused if the same query returned different results if executed multiple times on the same day. System quality measures the quality of the system itself, such as system flexibility, response time, reliability, utilization, and availability [10,20,27,37]. In our study, two system quality attributes were indicated by interviewees as important to user satisfaction: system availability and response time. 
Our interviews suggested that refresh policy directly affected these two system quality measures: a daily refresh policy with nightly reload and weekend maintenance ensured that the system was available during regular business hours and that response time was not compromised by refresh operations. A more frequent or user-driven refresh policy would likely slow system response time and decrease system availability. As evidence of the impact of refresh policy on system quality, service level agreements in six organizations stipulated that the data warehouse be available and offer quick response time by the beginning of business hours. These agreements effectively restricted refresh to nonbusiness hours. In the organizations interviewed, if a refresh policy resulted in unacceptable system quality (slow response or lapses of access), system requirements would not have been met, compelling data warehouse administrators to change the refresh policy.

4.3.2. Information and system quality requirements

In most organizations in our study, the most visible part of information and system quality requirements was captured in service level agreements. Although no organization had conducted formal surveys about data quality and information quality requirements, interviewees indicated that periodic meetings with user groups were held to review or revise existing service agreements as well as to elicit new requirements. Some of these meetings were held through a steering committee in which a top-level official might be involved. Each organization's refresh policy satisfied its service level agreement and, occasionally, informal requirements communicated by users. Thus, information and system quality requirements directly determined refresh policy. As the interviewee from Organization 9 stated, "if users want something, data warehouse staff will respond to their needs." In the previous section, we established that refresh policies directly affect information and system quality. Organizations, when faced with specified user information and system requirements, may adopt other information systems practices, in addition to devising appropriate refresh policies, to ensure these requirements are satisfied. Therefore, information and system quality requirements can indirectly affect information and system quality through refresh policies, as well as through other information systems practices.
However, these other practices are outside the scope of this paper. Information and system quality requirements influenced each other because of the conflicts between data timeliness and system availability and response time. Most organizations indicated that more frequent refresh during business hours would negatively impact system availability and response times for data warehouse queries. Interviewee responses were consistent with refresh process models that emphasized tradeoffs between system and information quality [7,38].

4.3.3. Constraints

As noted in the presentation of the descriptive data, constraints played a fundamental role in determining user requirements for information quality, the tradeoff between information quality and system quality, and the daily refresh policy used at least in part by all organizations in our study. Constraints, if not known to users, were made known to them by data warehouse staff during their meetings; hence, system quality and information quality requirements were adjusted based on these constraints. The time-consuming nature of the refresh process itself was a dominant constraint, setting the information and system quality requirements in service level agreements and thereby shaping an organization's refresh policy. Constraints not only influenced refresh policies indirectly through information quality and system quality requirements, but also influenced refresh policies directly. For example, two organizations with the same data timeliness requirements may have slightly different refresh policies: one organization may refresh more often than specified in its service level agreement, while the other may refresh only as often as specified in the agreement. Organizations 1 and 3 provided more frequent refresh than required by their user requirements, as documented in Table 7. A modern technology level with fewer constraints allowed both organizations to provide more frequent refreshes. Other organizations with warehouse-driven refresh policies determined how to satisfy constraints to implement a refresh policy. Thus, constraints also directly determined refresh policies. Our data do not indicate the importance of fixed costs of refresh that are independent of refresh volumes. Although some analytical models of the refresh process [13] emphasize fixed costs, fixed costs were not a significant factor in the refresh policy decision of any studied organization.
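The refresh-window constraint discussed above lends itself to a simple feasibility check. The sketch below is illustrative only; the task durations and the nightly window are invented rather than taken from any studied organization:

```python
from datetime import datetime, timedelta

def fits_refresh_window(task_durations_min, window_start, window_end):
    """Check whether sequential refresh tasks fit a nonbusiness-hours window."""
    window = window_end - window_start
    total = timedelta(minutes=sum(task_durations_min))
    return total <= window

# Hypothetical nightly window: 10 p.m. to 6 a.m. (8 hours).
start = datetime(2006, 1, 9, 22, 0)
end = datetime(2006, 1, 10, 6, 0)

# Invented durations (minutes) for extract, clean, integrate, load, index rebuild.
tasks = [90, 45, 60, 120, 75]  # total 390 min = 6.5 h

print(fits_refresh_window(tasks, start, end))  # True: 6.5 h fits the 8-h window
```

In practice, dependencies among extraction, transformation, and loading tasks, and contention with source system backups, would make the check considerably more involved.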
Most organizations indicated index destruction/creation and DBA monitoring activities as fixed costs. However, the decision to drop and re-create indexes was a result of the daily refresh policy rather than a cause of it. Smaller but more frequent data warehouse loads would make index maintenance preferable to index re-creation. DBA monitoring activities were passive, as DBAs had other duties during the on-call periods in most organizations. Typical DBA responsibilities were to respond to alarms raised during the refresh process, such as an out-of-space condition. For Organization 6, monitoring activities were active because the data warehouse had to reconcile exactly with the source systems; both DBA and application personnel were involved in the reconciliation process. Even for Organization 6, source access constraints were more significant in determining refresh policy than fixed monitoring costs.

4.3.4. Data warehouse characteristics

According to our interviewees, data warehouse characteristics can influence information quality and system quality requirements. These variables include nature of data, nature of data warehouse use, and type of user.

4.3.4.1. Nature of data. Some data were processed by users only weekly or monthly. As a result, users of these data normally required the corresponding data sources to be refreshed weekly or monthly even when users of other data required daily refresh. For example, in Organization 8, general ledger and employee data were refreshed on a monthly schedule at the time of the monthly close of books. In Organization 4, accounts payable data were processed in a nightly batch; however, loading was only done monthly. For Organization 13, distributor sales data were refreshed weekly. The interviewee mentioned that the sales and marketing divisions would have liked timelier data while the finance division did not care.

4.3.4.2. Nature of data warehouse use. While it was common for a data warehouse to support strategic decisions of a company, some companies used a data warehouse to support operational decisions. The data warehouses in these organizations served both as a data warehouse supporting strategic decision making and as an operational data store supporting operational decisions.
Consequently, users of these data might require more frequent refresh to support operational decision making. For example, Organizations 1 and 3 provided more frequent refresh to support operational decision making, as documented in Table 5. These organizations considered it more cost-effective to provide frequent refresh than to develop middleware and technology enhancements to support operational reporting at the source system level. For Organization 9, the refresh process could be performed on demand to support significant business events.

4.3.4.3. Type of users. External users may perceive a need for timelier data depending on their cost sensitivity and other sources of data. For example, external users of Organization 6's data warehouses did not require more than daily timeliness. These users paid a fee for data warehouse access based partly on data timeliness. In addition, internal systems provided operational reporting; hence, these external users did not rely solely on the data warehouses for all of their data needs. In contrast, Organization 10 may be forced to increase data timeliness to hourly instead of daily refresh because its external stakeholders were not particularly sensitive to the cost of increased data timeliness. In some organizations, all three variables may be present to influence the requirements for information quality and system quality, and indirectly, the refresh policy. Organization 3 had a user-driven refresh policy for two types of data: network monitoring and billing data. The user-driven policy for these data was due to the fact that the data warehouse was used as the primary reporting tool for both strategic and operational decision making; source systems had a minor role in report generation and ad hoc queries for operational decision making. In addition, the refresh policies in Organization 3 supported larger customers who received reports generated by its data warehouse.

4.3.5. User expectations

User expectations about the competitive environment, industry standards, and work requirements influenced user requirements for information quality and system quality.
Organization 3 was partly motivated in its refresh policies by the needs of its large customers who received reports generated by its data warehouse. Organization 12 increased data timeliness to daily refresh less than 2 years ago after a sizable investment to meet the increasing demand for daily timeliness from its mostly external data warehouse users. A critical reason for this decision was to match industry practices in order to maintain its customer base. The data warehouse administrator in Organization 10, reporting directly to the CIO, viewed real-time refreshing as the next industry standard in 5 years' time, motivated by healthcare reform. In contrast, an important reason for not increasing data timeliness beyond daily refresh involved expectations about work processes. Some interviewees responded that more frequent refresh would cause confusion because a query executed before and after the refresh during the same business day might return different results.

4.3.6. Long-term changes in refresh processes

In the short run, an organization must comply with its refresh constraints. In the long run, additional investment (Table 9) can relax refresh constraints by extracting change data more frequently, using refresh processes with fewer dependencies, and loading data with less impact on data warehouse availability. The investment areas in Table 9 have been deployed by organizations in our study to alleviate or relax constraints on the refresh process. For example, Organizations 1 and 12 used technology (partitioning and replication) to alleviate data warehouse availability restrictions. Organization 3 invested in technology to maintain online data warehouse availability, although some data were not available during the loading process. To make investments in areas that affect refresh policy, an organization must perceive significant benefits. Table 10 summarizes organizational issues that can influence investment choices. The direct benefit of improved data timeliness may not be enough to justify investment, and it may not be easily quantified. However, investments that affect refresh policies usually have quantifiable benefits for operational purposes, so that improved data warehouse timeliness is a positive side effect.
For example, Organizations 5 and 10 justified significant investments in middleware for reliable and timely capture of transaction data on the basis of benefits to operational systems; refresh policies might be changed in the future to use the timelier change data. Operational benefits also are realized when the data warehouse provides a cost-effective way to compensate for source system deficiencies. Other factors that can influence the ability to make these investments are control of source systems, profit motive in operating a data warehouse, and the competitive environment.

Table 9. Summary of impact of IT investment areas on refresh constraints

  Investment area           Effect on refresh constraints
  Source system technology  Modern DBMS technology supports triggers and online query so that source data can be extracted more frequently
  Middleware                Middleware supports capture of source data independent of processing by source systems, allowing more frequent capture of change data
  Replication               Replication technology supports loading of new data with small impact on data warehouse online availability
  Standardized interfaces   Standardized interfaces can reduce the need for data integration steps in the refresh process

4.3.7. Refresh policy

In Section 4.2, we reported that most organizations in our study adopted a warehouse-driven as opposed to a user-driven refresh policy. It may seem that refresh policy was driven by satisfying constraints rather than by ensuring user satisfaction; one may hence perceive a contradiction between the warehouse-driven policy reported by interviewees and the satisfaction of user needs indicated by our framework. However, user satisfaction for a data warehouse is a measure of the degree to which users perceive their information and system quality needs are met. In the 11 organizations with a warehouse-driven policy for most data, data warehouse users decided that daily refresh satisfied their needs, according to the interviewees. For users of those data warehouses, better information or system quality would not be desirable because improvements might require undesirable cost sharing. However, there were subgroups of users in 7 of the 11 data warehouses that required more user-driven refresh policies. The data warehouse administrators in those organizations adopted user-driven policies for a small portion of data sources. When asked whether or not the data warehouse group planned a more user-driven refresh policy for the rest of the data sources,
the common response was "we do not see any need for it." Hence, refresh policy is driven by both constraints and requirements.

Table 10. Summary of organizational issues that influence IT investment ability

  Organizational issue                        Effect on investment areas
  Primary benefit for operational purposes    Justify investment primarily for impact on source systems rather than on data warehouse refresh policy
  Control of source systems                   Ability to make investments on source systems
  Profit motive for data warehouse operation  Provide a more direct way to value the data timeliness impact on data warehouse users

4.4. Implications for research

The framework described in Section 4.3 requires rigorous empirical testing, building on the many years of empirical testing in the information systems success literature [11]. Empirical testing has been performed for the success of data warehouse projects [35,40] but not with a focus on the refresh policy. Testing of the refresh policy framework should identify the direction and significance of the relationships. A novel aspect of the refresh policy framework is the long-term influences on refresh policies. Testing the significance of these factors and linking them to refresh policy decisions may involve methodology issues not addressed in previous research on information systems success. The framework as a whole underscores the importance of user involvement in ensuring data warehouse system success. Research into other higher-level factors that are important to data warehouse system success can yield results with important practical implications. In addition to direct testing of the framework, other empirical research related to data warehouse characteristics, usage trends, and user satisfaction is important. In our study, we identified three data warehouse characteristics that influenced data warehouse refresh policy. Further research is needed to study how organizations decide, or should decide, the nature of data warehouse use: to what extent should a data warehouse also support operational decision making? What are the influencing factors in such a decision? In addition, type of user and type of data warehouse use are related issues. For example, if an external user would be better served with access to operational systems, development of the data warehouse can perhaps take a back seat. The issue of cost sharing and management also arises. In all organizations interviewed, the data warehouse systems and administration have been structured as a cost center, whether the users are mainly external or internal. Research on novel pricing and cost sharing structures for a data warehouse system can be valuable to guide organizations in best utilizing their resources. Another potential area of research is to investigate trends in user expectations and user requirements for information and system quality. As reported by interviewees, most users were happy with monthly access to certain source systems. Now, 24-h timeliness is the norm in most of these organizations, and decision makers in some industries are eyeing continuous refresh in 5 to 10 years' time. Understanding industry trends would be important for organizations to stay competitive in providing critical information in a timely manner for decision making. Although interviewees highlighted the importance of satisfying user requirements, and all interviewed organizations had created steering committees so that users and data warehouse administrators would meet regularly about requirements, none of the organizations conducted formal surveys of user satisfaction. This omission points to a need for the development of a user satisfaction instrument specific to data warehouses.
This instrument is needed for the empirical testing of the framework. Such a measurement instrument can also be marketed to practitioners, who can use it to monitor user needs and alert them to potential problems. In addition to empirical research, the results of this study have important implications for research in process design, data timeliness measures in investment decisions, and data warehouse efficiency, as well as for rigorous empirical testing of this framework. The complexities of the refresh process reported by many organizations in our study suggest the need for research about process design. This type of research could help organizations explore alternative designs, simulate performance, and consider resource implications in refresh process design decisions. Since none of the organizations reported tools that could help with refresh process design, additional research and development may be helpful. The need for investments to relax constraints in the refresh process suggests the need for research about data timeliness in IT investment decisions. Data timeliness can be a factor in investment decisions affecting operational decision making. The value of data timeliness for tactical and strategic decision making should be evaluated as an additional benefit from IT investments. An important way to measure data timeliness effects may be efficiency models for data warehouse performance. Data timeliness should be recognized as an important output of data warehouses, and the efficiency of the refresh process in generating timely data for decision making could be evaluated with such models. However, the difficulty of collecting a reference set of data from peer organizations could hinder the development of efficiency models. Our study also suggests changes to research about analytical models for refresh policies. This line of research should emphasize satisfaction of user requirements subject to constraints rather than fixed cost reduction. In every organization that we studied, the refresh processes and associated policies were geared towards satisfying user requirements for information and system quality within the confines of existing technology. In addition, none of the organizations interviewed considered measurement of data staleness costs in determining refresh policies. Thus, we agree with Ref. [34] about the use of query currency constraints in place of minimizing data staleness costs. The model developed by Dey et al. [13] is more limited because it minimizes data staleness costs and data warehouse availability costs rather than treating them as constraints. Data staleness costs, when used, should be based on operational costs and benefits rather than intrinsic data timeliness measures, because of the acceptance difficulties noted in our study and the measurement difficulties noted in Refs. [6,16].
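The contrast drawn here, satisfying query currency constraints rather than minimizing staleness costs, can be sketched as a small feasibility computation. The source names and currency limits below are invented for illustration, and the worst-case-age assumption (warehouse data can be as old as one refresh period) is a deliberate simplification:

```python
def min_refresh_frequency(currency_limit_h, candidate_periods_h=(1, 4, 8, 24, 168)):
    """Pick the longest refresh period whose worst-case data age
    stays within a user's query currency constraint.

    Worst-case age is approximated by the refresh period itself: a change
    made just after a refresh waits one full period before loading.
    """
    feasible = [p for p in candidate_periods_h if p <= currency_limit_h]
    if not feasible:
        raise ValueError("no candidate period satisfies the currency constraint")
    return max(feasible)

# Invented currency constraints (hours) for three hypothetical sources.
constraints = {"order_status": 1, "general_ledger": 48, "web_logs": 24}
policy = {src: min_refresh_frequency(limit) for src, limit in constraints.items()}
print(policy)  # {'order_status': 1, 'general_ledger': 24, 'web_logs': 24}
```

A constraint-based formulation like this refreshes no more often than users require, in contrast to a cost-minimization formulation that would need a defensible staleness cost function for every source.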

4.5. Implications for practice

The major practical finding is that most organizations in our study have warehouse-driven refresh policies, with refresh processes that satisfy constraints for source system access, data warehouse availability, integration, and consistency. These policies provide enough data timeliness, without sacrificing system availability and response time, to support satisfactory levels of both user satisfaction and perceived net benefits. For most organizations, data timeliness requirements for the previous day's business activity were satisfied by refreshment during nonbusiness hours. For organizations with needs for timelier data, there can be a tradeoff between providing the data through operational systems versus the data warehouse. To ensure a high level of perceived net benefits, response time and system availability should not be sacrificed. Organizations with modest requests for timelier data may find refresh process improvements to be more cost-effective than source system revisions, with only a slight impact on perceived net benefits. For organizations that want to transform their data warehouse for large amounts of operational reporting, perceived net benefits may be compromised without extensive IT infrastructure changes, possibly involving middleware, third-party application software, and DBMS vendor offerings. We believe, based on our interviews, that most organizations will not be able to justify such extreme user-driven refresh policies without significant benefits for operational systems. A second important finding involves optimization of the refresh process. For organizations with large refresh volumes during nonbusiness hours, there may be potential for significant improvements through increased capacity, outsourced processing, and process redesign. These process improvements can reduce refresh processing times and eliminate conflicts with backups and source system processing. In some cases, cost reductions may be achieved by reducing capacity to save on licensing costs without adversely affecting refresh process times. In some organizations with more user-driven policies, we observed that refresh policies did not closely match data timeliness requirements. Refresh policies were set to provide data as timely as possible to meet possible future timeliness requirements. In these cases, there may be cost savings in more closely aligning refresh policies with current data timeliness requirements.

5. Concluding remarks

We conducted a field study with a multiple-case design to understand the factors influencing refresh policy decisions for data warehouses. We interviewed data warehouse administrators from 13 organizations about data warehouse characteristics, refresh process details, data warehouse usage characteristics, organizational influences, and problems in the management of the refresh process. As part of the field study design, we defined the refresh policy orientation as measuring the frequency and scope of refresh processes. The interviews showed that daily refresh during nonbusiness hours was the most common policy. Significant deviations from this policy were due to source system deficiencies and the competitive environment. Based on the field study results, we developed a framework consisting of short-term and long-term influences on refresh policies along with information systems success measures influenced by refresh policy choices. This study provides a foundation for future work on the data warehouse refresh process. We are interested in quantitative modeling of the refresh process and in efficiency studies of data warehouses. For quantitative modeling, we are studying variations of the resource-constrained project scheduling problem to balance refresh processing times against resource consumption, using process redesign transformations as suggested in Ref. [31]. For efficiency studies, we are studying the use of data timeliness as an important output of the refresh process. A major challenge of efficiency studies is collecting sufficient data, due to the lack of publicly available data.

Acknowledgements Partial support for Michael Mannino was provided by a faculty research award from the Business School, University of Colorado at Denver.

Appendix A. Supporting interview details


Table A.1. Listing of data warehouse size characteristics Organization 1 2 Schema size 12 fact tables, 20 dimension tables 54 star schemas Physical size 600 million rows in largest fact table 500 million rows Source systems 8 source systems, 60 data sources 21 source systems, all internal data sources, external data sources in near future 50 source systems; many data sources from network devices 22 source systems 70 source systems; a few external data sources 4 to 20 data sources per data warehouse Change volumes 40 to 50 million rows per day 10 million rows per day

100 star schemas

3 to 6 TB

200 million rows per day

4 5

200 fact tables 120 tables

1.5 TB 5 TB

10 GB/day Millions of rows per day Thousands of rows to millions of rows per day. 1,500,000 rows per pay 65,000 rows per day 1000 to 100,000 rows per day 200,000 records per day

2 to 6 constellation schemas per data warehouse

50 GB to 150 TB

40 fact tables, 300 dimension tables 150 data files

250 GB

8 source systems

30 GB

9 10

40 tables to 130 tables 208 tables

2 to 15 GB 2 TB

11

80 to 100 tables; 33 cubes 80 tables

1.7 TB

12

500 GB

13

175 tables

350 to 450 GB

6 internal systems; 4 external data sources One data source to 3 data sources 3 major source systems; more than 100 data sources 8 internal systems; 2 external data sources 70 to 80 external data sources 8 internal systems, 600 external data sources

8 GB/day

1 to 2 GB per day Millions of rows per week


Table A.2. Listing of refresh process details

Organization | Change data | Constraints | Refresh policy | Refresh orientation
1 | Queryable, Web logs, snapshot | Source access constraints, integration constraints | 15 min for inventory and order status, 3 h for order activity, daily for others | 2 (more user-driven); refresh driven by order processing and inventory status requirements
2 | Cooperative dominates | Source access constraints; small refresh window | Most refreshes are daily during nonbusiness hours. Security profiles (15 min) and contact information (2/day) | 4 (more warehouse-driven); refresh driven by source system constraints
3 | Mix of cooperative and device logs | Integration constraints for device logs; network device capacity constraints | 5 to 10 min for network data; 1 h for most dimension data; 5 min for some billing data | 1 (user-driven); frequent update driven by need to monitor network status and billing
4 | General ledger logs, snapshots of order processing, some queryable | Source access constraints (nonbusiness hour extraction), integration (finance data) | Most refresh occurs daily during nonbusiness hours | 5 (warehouse-driven)
5 | Snapshots (80%), cooperative (20%) | Source access constraints, integration constraints | Most refresh occurs daily during nonbusiness hours. Refresh several times per day for two source systems | 4 (warehouse-driven) dominated by service level agreements about DW availability and source system access
6 | Mostly snapshots | Many source access constraints; one integration constraint; DW capacity constraint on some processing | 10 min refresh for one data source; daily refreshes for most data sources | 4 (warehouse-driven) dominated by source system constraints
7 | Queryable only | Source access (weekly refresh), consistency, integration, and DW capacity constraints | Morning refresh: dimension data; evening refresh: dimension and fact data; weekly and quarterly refreshes | 4 (warehouse-driven) dominated by DW capacity constraints (refresh times)
8 | Queryable only | Source access, DW availability constraints | Daily refresh but changes are 24- to 48-h old | 5 (warehouse-driven) dominated by source access and DW availability constraints
9 | Logs (special purpose and transaction) | DW availability and consistency constraints | Most refreshes are daily during nonbusiness hours. One data warehouse supports on demand refresh for major business events | 4 (warehouse-driven) dominated by data warehouse availability and consistency constraints
10 | Snapshot and cooperative | DW availability, source system, and consistency constraints | Daily refreshes during nonbusiness hours; infrastructure for frequent change data capture | 4 (warehouse-driven) dominated by source access and DW availability constraints
11 | Cooperative and some snapshot | DW availability, integration, and consistency constraints | Daily refreshes during nonbusiness hours | 5 (warehouse-driven) dominated by DW availability
12 | Mostly queryable, some snapshot | Source access and integration constraints | Daily refreshes but data is 18- to 36-h old | 5 (warehouse-driven) dominated by source access constraints
13 | Mix of cooperative and queryable | Source access, DW availability, and consistency constraints | Daily refreshes for internal data. Weekly refreshes (up to 10-day lag) for external data | 5 (warehouse-driven) dominated by source access constraints on external data

Table A.3. Listing of data warehouse usage environment

Organization | User population | Currency constraints | Output level | Planning level
1 | 40 to 50 active users | Multiple refreshes per day for order and inventory status | Summary (99%), detail (1%); detail reports may be executed many times per day | 50% (ad hoc), 50% (planned)
2 | All employees but 10 active users | Daily timeliness except for contact information and profiles | Summary (75%), detail (10%) | Mostly planned
3 | Network performance and audit groups; large customers (indirect users) | Multiple times per day for network status | Summary (60%), detail (40%) | Mostly planned (200 reports)
4 | 500 users; 50 active users | Service level agreement for daily refresh currency | Summary (75%), detail (25%) | Mostly planned
5 | 25 active users; hundreds of less active users | Daily refresh currency for business metrics | Summary (95%), detail (5%) | 25% (planned), 75% (ad hoc)
6 | Mostly external users | Daily refresh for most customers; business reason for 10-min refresh is unclear | Summary (80%), detail (20%) | Varies by DW: some DWs have mostly repetitive queries while others have mostly ad hoc
7 | 2000 users; 500 active users | 24-h refresh for fact data | Summary (60%), detail (40%) | Replicated data marts for ad hoc and repetitive queries
8 | 60 users; 10 active users | Users accept 48-h timeliness | Summary (70%), detail (30%) | 50% (planned), 50% (ad hoc)
9 | 2500 to 7000 users; 300 to 1500 active users | Service level agreements for 24-h timeliness | Varies by data warehouse; some mostly detail, others mostly summary | Mostly planned because of query complexity; ad hoc queries typically become planned
10 | 300 total users; 100 active users | Service level agreements for 24-h timeliness | Summary (75%), detail (25%) | 2/3 (ad hoc), 1/3 (planned)
11 | 3000 licensed users; 500 active users | Service level agreement for 24-h timeliness | Summary (33%), detail (67%) | 50% (ad hoc), 50% (planned)
12 | 800 registered users; 400 active users | Service level agreements for data timeliness (2 days) and DW availability | Summary (90%), detail (10%) | 60% (ad hoc), 40% (planned)
13 | 2000 to 3000 users; distributors are indirect users | Service level agreements for DW availability, data timeliness, and query performance | Summary (95%), detail (5%) | 30% (ad hoc), 70% (planned)


Table A.4. Listing of technology and management environments

Organization | Source system technology | Transaction transparency | Management support | Financial management
1 | Moderate level: legacy technology for POS systems; ERP systems for order processing and warehouse management | None for POS system; order status transparency for online orders | Moderate level: director of business intelligence works with business groups | Cost center
2 | Moderate level: previous generation DBMS technology | Push for more transparency of government paid care | High level: senior management initiated outside audit | Cost center
3 | High level: relatively new source systems and technology | Transparency push for major customers | High level: senior management push for DW investment | Cost center
4 | Low level: legacy technology and source systems | Low level | Moderate level: active CIO support | Cost center
5 | Moderate level: mixed level of technology in source systems | Low level: some usage of XML | Moderate level of senior management support | Cost center with service level agreements
6 | Low level: mostly legacy technology for source systems | Low level | High level: strong senior management support for strong and weak performing DWs | Profit center for each warehouse (external usage)
7 | Moderate level: ERP systems for major source systems (moderately adaptable) | Low level of transparency | Moderate level: change control board | Cost center
8 | Low level: legacy systems and technology | Moderate level: some transparency to third party payers | Moderate level: monthly stakeholder meetings | Cost center
9 | Moderate level: legacy systems and technology are being replaced by ERP systems | Moderate level: push for web enabled and XML formats | Low level: data warehouse staff works closely with customers | Cost center
10 | Moderate level: legacy systems with modern middleware | High level: strong push for HIPAA standardization | High level: steering committee, direct reporting to the CIO from business services manager | Cost center
11 | Near high level: most systems use at least client-server technology | Moderate level: transparency of Web transactions | High level: DW manager reports to VP under CIO | Cost center
12 | Varying levels of source system technology; lots of batch processing of transaction data | High level: transaction level format for 10 years | High level: CIO involvement with data timeliness improvement | Cost center but part of franchise payment; may charge for DW enhancements
13 | Middleware provides efficient and reliable extraction | High level: government regulations | High level: promotion of DW manager position and DW place in strategic plan | Cost center

Table A.5. Listing of problems and investments

Organization | Problems (recent and pending) | Investments affecting refresh process
1 | Previous problem with startup time of ETL tool | Significant investment in optimization process to overcome refresh process duration
2 | Short refresh window (current): difficult to complete daily refresh | New DBMS technology; new data warehouse integrated with transaction databases; add external data sources
3 | Merge data from new businesses, cater to data needs of large customers; use of checkpoints to manage rejected records | Capacity planning to reduce the server, disk, and software licensing costs
4 | Shorten refresh window; more frequent refresh of plant level data | Addition of plant level data with hourly refreshment; adoption of ERP for U.S. operations: may require redevelopment of the data warehouse
5 | DW users perform integration; reduce load times to alleviate conflict with backups | Perform more integration in the ETL process; move to cooperative change data; address some requests for timelier data
6 | Refresh window is a major problem with one DW | Reduce change data load so that refresh window is shortened
7 | Long load times and complex refresh process design | Consolidation of ERP databases
8 | Capacity increase through outsourcing | Consolidated bills for same day visits; integrating third party billing software
9 | Movement from home-grown ETL to third-party ETL tools; server consolidation | Integrate data warehouses across programs
10 | None mentioned | Continuous refresh of standardized (HIPAA) transactions; integration of claims systems
11 | Refresh process time was a large problem | Adding new source systems and external data
12 | Week to month load time lag (previous problem) | Ongoing development to reduce load time lag to 2 days or less
13 | Computations of aggregate measures require 14 h | Integration of SAP business warehouse with existing warehouse

References

[1] B. Adelberg, H. Garcia-Molina, B. Kao, Applying update streams in a soft real-time database system, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, ACM SIGMOD Record 24 (2) (1995 (May)) 245–256.
[2] L. Agosta, Data warehousing lessons learned: data warehousing refresh rates, DM Review, June 2003, online at http://www.dmreview.com/article_sub.cfm?articleId=6834, last accessed on Oct. 25th, 2004.
[3] J. Bailey, S. Pearson, Development of a tool for measuring and analyzing computer user satisfaction, Management Science 29 (5) (1983 (May)) 530–545.
[4] C. Baragoin, M. Marini, C. Morgan, O. Mueller, A. Perkins, Y. Kiho, M. Persaud, Building the Operational Data Store on DB2 UDB Using IBM Data Replication, WebSphere MQ Family, and DB2 Warehouse Manager, IBM Redbook, IBM, San Jose, CA, 2001, http://www.redbooks.ibm.com/redbooks/pdfs/sg246513.pdf, last accessed on Oct. 24th, 2004.
[5] I. Benbasat, D. Goldstein, M. Mead, The case research strategy in studies of information systems, MIS Quarterly 11 (3) (1987 (Sep.)) 368–386.
[6] D. Berndt, J. Fisher, Understanding dimension volatility in data warehouses, Proceedings of the Sixth INFORMS Conference on Information Systems and Technology (CIST-2001), Miami, Florida, Nov. 2001.
[7] M. Bouzeghoub, F. Fabret, M. Matulovic-Broque, Modeling data warehouse refreshment process as a workflow application, in: S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou (Eds.), CEUR-WS, Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'99 at CAiSE*99), Heidelberg, Germany, June 1999, pp. 6.1–6.12, online at http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/paper6.pdf, last accessed on Oct. 24th, 2004.
[8] M. Connor, A practical view of real-time warehousing, Business Intelligence Journal 8 (2) (2003 (Spring)) 10–18.
[9] G. Cybenko, B. Brewington, The foundations of information push and pull, in: D. O'Leary (Ed.), Mathematics of Information, Institute for Mathematics and Applications Proceedings, Springer-Verlag, 1998, http://actcomm.dartmouth.edu/papers/cybenko:push.pdf, last accessed on Oct. 24th, 2004.
[10] W. Delone, E. McLean, Information systems success: the quest for the dependent variable, Information Systems Research 3 (1) (1992 (Mar.)) 65–90.
[11] W. Delone, E. McLean, The DeLone and McLean model of information systems success: a ten-year update, Journal of Management Information Systems 19 (4) (2003 (Spring)) 9–30.
[12] D. Dey, Z. Zhang, An optimal policy for data warehouse synchronization, Proceedings of the Eleventh Workshop on Information Technologies and Systems (WITS 2001), New Orleans, Dec. 2001.
[13] D. Dey, Z. Zhang, P. De, Optimal synchronization policies for data warehouses, INFORMS Journal on Computing (in press).
[14] W. Doll, G. Torkzadeh, Developing a multidimensional measure of systems use in an organizational context, Information and Management 33 (4) (1998 (Mar.)) 171–185.
[15] K. Eisenhardt, Building theories from case study research, Academy of Management Review 14 (4) (1989 (Oct.)) 532–550.
[16] J. Fisher, D. Berndt, Creating false memories: temporal reconstruction errors in data warehouses, Proceedings of the Eleventh Workshop on Information Technologies and Systems (WITS 2001), New Orleans, Dec. 2001.
[17] D. Georgakopoulos, M. Hornick, A. Sheth, An overview of workflow management: from process modeling to workflow automation infrastructure, Distributed and Parallel Databases 3 (2) (1995 (Apr.)) 119–153.
[18] A. Gupta, I. Mumick, Maintenance of materialized views: problems, techniques, and applications, IEEE Data Engineering Bulletin 18 (2) (1995 (Jun.)) 3–18.
[19] B. Haley, H. Watson, D. Goodhue, The benefits of data warehousing at Whirlpool, in: M. Khosrow-Pour (Ed.), Annals of Cases on Information Technology Applications and Management in Organizations, vol. 1, Idea Group Publishing, Hershey, PA, 1999, pp. 14–25.
[20] S. Hamilton, N. Scott, Evaluating information system effectiveness, Part I: comparing evaluation approaches, MIS Quarterly 5 (3) (1981 (Sep.)) 55–70.
[21] P. Holland, Virtues of a virtual data warehouse, Datamation, 2000 (March 22), http://itmanagement.earthweb.com/datbus/article.php/621401, last accessed on Oct. 24th, 2004.
[22] W. Inmon, Building the Operational Data Store, 2nd ed., Wiley, New York, 1999.
[23] B. Ives, M.H. Olson, J.J. Baroudi, The measurement of user information satisfaction, Communications of the ACM 26 (10) (1983 (Oct.)) 785–793.
[24] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Fundamentals of Data Warehouses, Springer, Berlin, 2000.
[25] B. Kahn, D. Strong, R. Wang, Information quality benchmarks: product and service performance, Communications of the ACM 45 (4) (2002 (Apr.)) 184–192.
[26] D. Kennedy, The reality of real-time OLAP, MSDN, Microsoft, 2003 (February), http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/sql_real-timeolap.asp, last accessed on Oct. 24th, 2004.
[27] C. Kriebel, A. Raviv, An economics approach to modeling the productivity of computer systems, Management Science 26 (3) (1980 (Mar.)) 297–311.
[28] W.J. Labio, R. Yerneni, H. Garcia-Molina, Shrinking the warehouse update window, Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA, Jun. 1999), ACM SIGMOD Record 28 (2), 383–394.
[29] L. Liang, N. Wang, W. Orlowska, Making multiple views self-maintainable in a data warehouse, Data and Knowledge Engineering 30 (2) (1999 (Jun.)) 121–134.
[30] G. Moro, C. Sartori, Incremental maintenance of multi-source views, Proceedings of the 12th Australian Database Conference (ADC 2001), IEEE, 2001, pp. 13–20.
[31] J. Rummel, Z. Walter, R. Dewan, A. Seidmann, Activity consolidation to improve responsiveness, European Journal of Operational Research 161 (3) (2005 (Mar.)) 683–703.
[32] P. Seddon, A respecification and extension of the DeLone and McLean model of IS success, Information Systems Research 8 (3) (1997 (Sep.)) 240–253.
[33] P. Seddon, D. Staples, R. Patnayakoni, M. Bowtell, The dimensions of information systems success, Communications of the Association for Information Systems 2, Article 20 (1999 (Nov.)) 1–40.
[34] A. Segev, F. Fang, Optimal update policies for distributed materialized views, Management Science 37 (7) (1991 (Jul.)) 851–870.
[35] B. Shin, An exploratory investigation of system success factors in data warehousing, Journal of the Association for Information Systems 4, Article 6 (2003 (Aug.)) 141–170.
[36] J. Srivastava, D. Rotem, Analytical modeling of materialized view maintenance, Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Austin, TX), 1988 (Mar.), pp. 126–134.
[37] B. Swanson, Management information systems: appreciation and involvement, Management Science 21 (2) (1974 (Oct.)) 178–188.
[38] D. Theodoratos, M. Bouzeghoub, Data currency quality satisfaction in the design of a data warehouse, International Journal of Cooperative Information Systems 10 (3) (2001 (Sep.)) 299–326.
[39] A. Vavouras, S. Gatziu, K. Dittrich, Modeling and executing the data warehouse refreshment process, Proceedings of the 1999 International Symposium on Database Applications in Non-Traditional Environments (DANTE 1999) (Kyoto, Japan), 1999 (Nov.), pp. 66–73.
[40] B. Wixom, H. Watson, An empirical investigation of the factors affecting data warehousing success, MIS Quarterly 25 (1) (2001 (Mar.)) 17–41.
[41] X. Zhang, E. Rundensteiner, Integrating the maintenance and synchronization of data warehouses using a cooperative framework, Information Systems 27 (4) (2002 (Jun.)) 219–243.
[42] Y. Zhuge, H. Garcia-Molina, J. Wiener, Consistency algorithms for multi-source warehouse view maintenance, Journal of Distributed and Parallel Databases 6 (1) (1998 (Jan.)) 7–40.

Dr. Mannino is currently an associate professor of Information Systems at the Business School, University of Colorado at Denver and Health Sciences. Previously, he was a faculty member in the Computer and Information Sciences Department at the University of Florida, the Department of Management Science and Information Systems at the University of Texas at Austin, and the Management Science Department at the University of Washington. Dr. Mannino teaches and conducts research in database management, software engineering, and knowledge representation. His articles have appeared in journals affiliated with the ACM, IEEE, and INFORMS, including CACM, ACM Computing Surveys, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Software Engineering, INFORMS Journal on Computing, and Information Systems Research, with multiple articles in most. His articles have also appeared in Journal of Management Information Systems, MIS Quarterly, Information and Management, and other journals. He is the author of the textbook Database Design, Application Development, and Administration, published by Irwin McGraw-Hill.


Zhiping Walter received the PhD degree in Business Administration, specializing in Management Information Systems, from the Simon School of Business, University of Rochester. She is currently an Assistant Professor of Management Information Systems at the Business School, University of Colorado at Denver and Health Sciences. Dr. Walter's research interests are in the areas of economics of information systems, Internet marketing, and information technology in healthcare. Her articles have appeared in Decision Support Systems, Communications of the ACM, International Journal of Electronic Commerce, European Journal of Operational Research, International Journal of Healthcare Technology Management, and Technology Analysis and Strategic Management.
