The wiser, gentler data warehouse
In the past few years, data warehousing methodologies have matured. Everything we've learned is here in a comprehensive guide to get your data warehouse project off the ground and to help prevent a nose dive
Building a data warehouse is, of course, anything but a simple procedure. Observing the lessons learned by those companies leading the way can at least make the process a bit easier for the rest of us. Here we point out crucial questions you should be asking yourself and take you through the steps involved in pre-engineering, engineering, and implementing a successful data warehouse. (5,900 words)
So, you've decided to build a data warehouse. According to market research reports, you're not alone. Some $5 billion dollars will flow into the data warehouse market annually by the year 2000, according to IDC -- and that's for packaged software products alone.
That's fairly impressive growth, considering that data warehousing was classified as an emerging technology only two years ago. But early projects were characterized by enormous costs, epic development cycles, performance and scalability problems, and nebulous return on investments (ROIs). Data warehouse experts have learned a lot over the years.
Vendors -- database management systems (DBMS) makers, especially -- have revamped their wares to better serve this demanding application. Even more importantly, the industry as a whole has developed "best practices" guidelines that shorten the development cycle while improving the overall quality and financial payback of these systems. In fact, from May 27 to June 1, The Data Warehousing Institute is conducting the Best Practices & Implementation Conference in Chicago. (See Resources below.)
Every data warehouse development cycle is different. However, generally speaking, any method has three broad categories: pre-engineering, engineering, and deployment.
Step one: Pre-engineering
Pre-engineering is by far the most critical part of the entire data warehousing process. It is the time to define the goals of the application and to make other decisions (which will ultimately affect every other decision made throughout the process). This includes deciding whether the warehouse should be built at all.
"If you don't have good analysis, if you don't get proper user involvement, do a good job limiting the scope, get agreement on the scope, do a good job giving users what they need, know where you're going to get the data from -- if you drop the ball on any one of those, you'll have a failure of a project," says Dan Dampier, president of Access Data Consulting Corp., a data warehousing system integrator in Englewood, CO.
In fact, each phase of the pre-engineering process should be put to the "go/no-go" test. Phases consist of identifying a high ROI initial application, taking stock of what skills and supplies are needed, and defining a budget and timeline. An excellent list of white papers covering many of the issues involved in pre-engineering can be found on the Data Warehouse Institute site. (See Resources.)
Identify an initial application
Before anyone writes a single line of code, you must clearly define the expected performance of the data warehouse. This can be more difficult than it seems, experts say, because a project so wide in scope touches many areas of the company. Misunderstand an area -- or worse, leave it out -- and the project will be ultimately damaged.
The first step toward identifying the right application is to secure executive buy-in. Better yet, the project should be championed from the highest executive level, given the expense of even the smallest data mart in both man-hour and purchased resources.
"Typically we hear from the IS manager who's saying, 'Users are beating down my door saying I need one. I need one.' We stop and ask, 'Is your business ready? Do you have executive sponsorship?'" cautions Jane Griffin, president and CEO of Systems Techniques Inc., a data warehouse system integrator in Atlanta.
If the project does not have executive sponsorship, Griffin advises that it be put on hold until the authoritative nod is given. Once executive backing has been granted, however, the next step is to determine what the company's business objectives are and how the data warehouse will help to achieve them.
For instance, is the company attempting to combat attrition? Increase sales revenue? Increase net profit? Prepare for deregulation? Reduce manufacturing expenses? Manage raw materials purchasing better? Understand its customer base better? Which of these items is the most pressing for the business today?
The executive sponsor should be able to supply these answers. If management cannot name the company's objectives, the project becomes a no-go. A data warehouse is a tool for reaching business objectives, not creating them. A warehouse must have clearly-defined business goals if its success is to be measured and realized.
"We try to link an initial data warehouse to the corporate strategies. Most people, of course, want to increase market share and reduce cost. However, how to go about that varies from company to company or industry to industry," explains Griffin. "A power company may have two major initiatives, power supply management and marketing. While a data warehouse on power-supply management would be easily justified, because of the deregulation of that industry, the marketing application would go first."
Different industries suffer different market share/cost control problems. In the cellular industry, the cost to get a customer is considerably higher than the cost to keep one, given the enticements that the industry must use to hook customers. So an application that identifies causes of customer attrition and/or moving to competitors would have the biggest payback.
Develop list of performance criteria for the application
Once the broad parameters of the application are identified (e.g., deciding that an application should help reduce customer losses), the "how to" must be addressed. Such a warehouse may attempt to reduce attrition by identifying the risk factors that cause people to leave and then identifying specific customers at risk. Or it can identify the characteristics of customers that remain loyal and then attempt to find ways to attract more of these customers.
In order to define best what the application should do, developers should get feedback from representatives across the enterprise.
Assemble a team
Executive opinions are only a fraction of the business picture, end users say. Equally important is an intimate understanding of the needs and work habits of the users who will be using or inputting into the system. For instance, when MCI decided that none of its numerous data warehouses and marts throughout its massive organization was allowing it to target consumers more effectively, it developed an application-specific data mart for that purpose. The Revenue Planning Group is the financial department responsible for marketing and analyzing of all consumer products and would be the ones to use the application. The mart was created by a former member of the group who was going to use the application but who had defected to IS.
"You've got to think through the application. Find yourself a main user -- one who is really educated about the needs of the users -- then steal him or hire him. For us, I used to work in that department. I know what the people needed. Do not just send an IT group out on its own," counsels Chris Courim, manager of information systems development for MCI in Denver.
One good idea at this early point of pre-engineering is to assemble a "prototyping" team composed of representatives of all strata of the user population -- from data entry clerks to upper management. In a single brainstorming session, this team should toss out every conceivable expectation they have of the new system, Access Data's Dampier advises.
"One of the techniques I use is to get a representative of each group -- line workers, middle managers, upper executives -- together in a room and tell them, `We can't talk about Sybase or Oracle, Sun or HP, or tools.' When people talk about computers, they tend to disagree a lot. When you can get them to stop talking about technology and get them to talk about the business needs, people rarely disagree," says Dampier.
What should the application do?
After brainstorming, Dampier asks the group to come to a consensus of the top five business goals the application must accomplish first. This way, the developers can proceed with the project confident that they have full support and know what people need. The team members also become advocates of the project from the start, feeling that they contributed to it, and then gain a better understanding of the needs of other departments.
Most warehouse experts use similar group interview methods. "In group sessions, we do follow-up with the lower levels of employees. Often we hear them say, `We didn't know that's important. We can do it that way.' Get that cooperation in the front end and get some agreement," says Systems Techniques' Griffin.
How quickly should it be done?
Once the group has decided what the application should do, the next question to answer is how quickly it should be done. Are these reports needed hourly, daily, monthly, quarterly? Can any of them be canned and produced overnight or will they all be adhoc? How current must the data be? Does the data need to be refreshed hourly, nightly, monthly? How much historical data is necessary for the application to be valid? One year's worth of history? More? Can this history accumulate from the start date of the warehouse or must the warehouse be populated initially with existing historical data?
These answers will be critical in determining product choices later, because they will allow the developers to set required performance criteria. "When it comes to performance, the number one challenge for data warehouse developers is if users can't get answers in a reasonable amount of time," asserts Steve O'Brien, director of product marketing for Red Brick Systems of Los Gatos, CA. "The decision of which data to use first will guide a lot of the other purchasing decisions and the criteria is set by the application. Financial applications often need to load every night, and users want reports to generate in under a minute."
Such stipulations will later dictate which database to use because various databases have different strengths and weaknesses. Some of the more traditional online transaction processing (OLTP) databases that have been refitted to include data warehousing features may face the bottleneck at I/O, meaning that data might require a longer time to load, for instance. Others may face the bottleneck at CPU, meaning that adhoc reports may take longer to generate.
Again, it is critical to determine what the users really need. If reports must be generated in less than a minute because they typically use this data while speaking on the phone to customers, that's a true need. If a report could take one minute or one hour to generate without affecting the user's work day, that's a different situation altogether.
How many users will it support initially?
Try to ascertain how many users plan on using the application. Will a single secretary perform the queries for all upper management? Will every salesperson be running adhoc queries all day? This type of feedback is essential when the database architect is trying to design for scalability.
How many users will it support ultimately?
Also consider how many users will ultimately use the system. Will it initially support only the sales department but eventually accounting as well? Will key people be in charge of using the system or will everyone in a department eventually be trained?
Identify source systems
Once the performance of the application has been defined, the next crucial step is to identify where the data will come from and if the data even exists at all. For instance, say the company has decided that it wants to quantify customer profitability. The sales managers and other members of the brainstorming team have determined that profitability includes length of time a customer has been patronizing the company, how many calls to customer service the customer has made, and how many dollars per year the customer spends. The warehouse developer tries to identify the source systems for this information and discovers that some of it isn't being captured in any system. In such a case, the project has reached a no-go.
"At pre-engineering, we analyze the maturity of the organization," says Shannon Platz, the method developer for Oracle's data warehouse method in Redwood City, CA. "From a technical perspective are they prepared to build and also to use a data warehouse? We perform a gap analysis. We define what they want, find out if the data on what they want is there. If it is, wonderful -- we move forward. If it's not, we revise our scope, or if the gap is a real show stopper, we wait until we can get our act together. We don't want to move forward if the organization is not ready."
How much data will it store?
Along with identifying where the data is stored, the organization must also decide on how much data to use. "One of the biggest challenges is determining the value of how much detail you keep. Wrestling with just the sheer quantity of data kills your productivity. There's something to be said for a lot of data, but it's a balancing act -- details versus productivity," says MCI's Courim.
Currently, MCI's data mart contains about 60 gigabytes of data. Because it uses Sybase IQ as its DBMS, the amount of raw data is actually in the neighborhood of 150 gigabytes, says Courim. IQ is a mart-specific DBMS that uses a number of methods to reduce disk volume requirements and I/O, one of the traditional bottlenecks of warehouse applications. It stores data as bit maps which are compressed and pre-indexed at load time. This way, IQ does away with building additional indexes, a technique used by OLTP databases to improve I/O performance, but one that also increases the volume of data that must be stored, Sybase officials claim.
Even with the 60 percent reduction of the original data achieved by IQ, MCI still adds about one gigabyte of compressed data a month to its mart. "When we rebuild a database, it's out of commission for two to three weeks, meaning users can't be productive with it," says Courim.
Required technical skills
According to Dampier, the laundry list of skills necessary for building a warehouse includes knowledge of:
Political skills: ownership of data, work habits, cultural habits
Data warehouses are political animals, plain and simple. Any implementation plan should take into consideration some of the political issues that may plague development such as perceived ownership of data. Working with data centers is a prime example. In the past, mainframe programmers had exclusive access to mainframe data. If anyone wanted to pose a query against a mainframe application, that person requested the data center people to perform the task. No one else accessed data directly. Data center personnel may feel that their dominion is being threatened by a warehouse application that anyone may use. So, if the warehouse will rely on data from the data center, it is important to include a member of the center on the development team. This will foster political allies, rather than rivalries.
The work cultures of those using the system should also be taken into consideration. For instance, maybe all members of the sales team routinely come into the office in the morning for staff meetings, perform their administrative tasks during that time, and head out for the day to meet with clients. If the warehouse is still loading during those morning hours and cannot be used, it won't be a very successful application.
Determine a budget
Finally, when all else is completed in the pre-engineering stage, fiscal budgets and timelines can be created. An early budget is an essential guide that will assist in determining the choices for the application. For instance, if the company has a budget of $500,000 for the entire warehouse project, it obviously can't afford the latest, greatest Sun StarFire box, which is priced starting at $879,000. This may mean that it must use less historical data than it originally wanted, or it must limit the functions of the applications so that the warehouse won't require as many data sources. The company may also structure the application as a simple mart, and not as the start of an enterprise warehouse.
Of course, a budget is likely to be modified as warehouse developers become educated on what can and can't be done with the amount of money a company is willing to spend.
One method of determining the budget is to do a cost/benefit analysis of the project, says Systems Techniques' Griffin, who routinely performs this task for her clients. "You need to understand the value of the information for the organization and understand the benefit it provides. From there, you can implement a plan and budget and calculate ROI."
In general, expect the initial warehouse project to cost somewhere between $250,000 and $500,000 with a pilot program often landing at about $50,000. If these costs are simply not feasible, Griffin says that data marts can be had for $50,000 or less, total.
For other companies, the straight bottom line returns are not the primary motive for building a warehouse. Instead, shifts in their industries make a warehouse mandatory. This is often the case for industries that are facing deregulation or competition for the first time, such as cable companies, power companies, and telecommunications companies.
For instance, in the last two years the cable industry has changed so radically, that Continental Cablevision Inc. was compelled to create a massive and exhaustive customer information database no matter what the cost.
"It's almost like a capital asset. It's awfully hard to wrap numbers around it. There's been a radical shift for us across our markets. Suddenly I'm looking at the customer in a whole new way," explains Robert Strickland, senior vice president of information systems for the Boston cable company. "We found that we knew most about our customers as dwellings and not as customers. We've been very `engineering centric.' Customer care was the billing system."
Step two: engineering
With the application well defined, data warehouse developers are able to begin making technical decisions, among them, which hardware and operating system to choose. Sun/Solaris has several advantages over competitors like Digital Alpha/OpenVMS and Silicon Graphics/IRIS. While those competitors boast the improved performance of 64-bit operating systems, Solaris offers price/performance, scalability, and unequaled partnerships with data warehouse tool vendors.
"I can't put my finger on any specific disadvantages of using Solaris," claims Peter Wagner, director of systems and networks at HCIA Inc., a Baltimore-based company that uses data warehouses and marts to analyze healthcare industry information for its clients. "We are looking forward to getting 64-bit, but that's quite a bit down the road. One advantage is that Solaris is usually one of the first ports tool vendors make."
Sun executives say that performance gains from competitive 64-bit systems are negligible for data warehousing applications because its high-end boxes have been designed specifically to reduce the I/O bottleneck, a typical problem area for data warehouses. "Unlike OLTP, which is short and bursty, a warehouse query sometimes joins terabyte tables. The SMTP architecture on Sun, particularly our `Gigaplane-XB' is designed to manage this," says Donna Rubin, data warehousing technology manager for Sun Microsystems Computer Corp. in King of Prussia, PA. Gigaplane is a high performance crossbar switch interconnect, which allows customers to load balance increasing amounts of memory, disk, and processors, even when using them at the same time while supporting up to 20,000 users.
In addition, scalability of Sun boxes is also a plus. For example, on the high end, Sun recently announced its StarFire box, which is aimed to compete with mainframes in power and capacity. This box ships with 16 processors and can scale up to 64 CPUs, with entry-level at a price tag of $870,000. It includes the Gigaplane-XB interconnect technology. Of course, Sun systems capable of sustaining small to medium-sized marts, but also offering tremendous scalability, can be had for a quarter of that price. Moreover, because of the immense popularity of Solaris in general, a larger pool of expertise with the operating system exists within all segments of the computing industry, including warehousing.
"We find that people seem to be choosing Sun more, not just for raw performance but for price performance. You can start out with two to four SPARC processors and plan for success by leaving headroom for more. With the new Starfire, you can scale up to 64 processors. There's also hugely more amounts of skill level on Sun than with DEC's Unix. Sun has good partnerships, and SunSoft has more utilities," observes Red Brick's O'Brien.
Decide on an architecture: Warehouse or mart
With the application fairly defined and the data sources confirmed, the next decision is between a full-fledged warehouse or a mart. Today, the generally accepted practice is to develop marts first and then tie them together to make a warehouse rather than to begin with an all-out, enterprisewide warehouse. This allows applications to be built incrementally, leveraging the knowledge, experience, and, sometimes, the spare hardware capacity obtained from existing systems.
Prepackaged or customized?
Another decision is whether or not to use a prepackaged "quick-start" solution or to completely custom-develop the warehouse from scratch. While the price for quick-start solutions may be tempting, data warehouse experts caution users to read the fine print carefully. Prices range from about $25,000 to $125,000 for the DBMS and other selected tools, such as front-end or extraction tools.
However, quick-start programs can be misleading for a number of reasons:
Still, to determine if quick-start programs are right for your organization, compare vendors' offerings on a number of fronts: How is the data mapped, modeled, and loaded? Exactly what products are included in the package? What consulting services are included with the purchase, and what are available for a fee? How many independent consultants are authorized and trained on the product, and how many are in your area?
If a quick-start program is not for you, the next step is in deciding which database to use. Companies such as Red Brick, Informix, and Sybase each have data mart- or warehouse-specific products that warrant a look. Oracle is also a popular data warehouse solution for Solaris and includes numerous warehouse-specific features. Several good guides to choosing between them exist on the Red Brick Web site. (See Resources below.)
Determine vendors for other steps
In addition to the database, developers will also need to select tool vendors for the following aspects of the project:
Create a data model
The three steps to creating a data model:
Analyze/cleanse legacy data
With nearly any project, mart or warehouse, the rub will be both the extraction and the cleansing of data. Continental Cablevision is an extreme example. Extraction proved to be such an enormous undertaking that even now, two years after the initial project began, the developers still haven't found an ideal solution, in part because the consolidation of that industry keeps throwing new systems into the customer database kettle.
"We haven't overcome the issue at all. But we are dealing with it in our standards committee. Thinking about how to standardize is exacerbated by all of these trades and acquisitions," says Continental Cablevision's Strickland. "We've acquired several thousand customers. People built their careers around maintaining knowledge and keeping it to themselves. Those 10 codes that Phyllis knew, now that Phyllis is gone, are difficult to decipher."
In fact, most companies that populate the initial repository with legacy data will face a similar problem. However, by implementing a few policies early on, these problems might pose a minimal delay on the project.
Create data uniformity
Chances are, various systems will have different ways of entering data (e.g. last name first, two letter or four letter codes for states, or various city codes). Examine all data sources then decide on a uniform structure. Map which changes need to be made to which data source.
Create a plan for dealing with anomalies. Who will look at them? How often? And what is to be done with those entries while the status of the data is pending?
Identify a "gold" source
When the question of data accuracy becomes an issue, choose a "gold" source that will be used as the baseline. For instance, the gold source for customer names and addresses might be the billing system, rather than customer service or marketing. It may be necessary to spot check the database for accuracy to ensure quality of that gold system.
Developers should also tell users how data quality has been established and invite them to comment on this issue. Confidence in the quality of the data is essential for the project to be a success, experts say. "When dealing with data it's not just the data acquisition, it's also the integrity of the data that affects the warehouse. If you don't have a way to check back to see if the data is OK when it comes from the source system, you'll have problems," warns Oracle's Platz. "Also, you've got to tell the users how the data got from point A to point B. Otherwise users will distrust it and feel like, `Am I going to make a major decision based on this warehouse report? No way!'"
Analyze typical data entry mistakes and train users not to make them
Obviously, the cleaner the data is when it goes into the system, the cleaner it will be when it comes out. If possible, try to record or analyze the kinds of "dirt" or errors that are most common. Are they typos in names, or wrong codes? Send the results of this analysis back to the departments who own the source systems so they can train their people better. Maybe data entry would get cleaner if that source application used a GUI with pull-down menus. Such solutions could be examined with an analysis of the dirt.
Two sources that provide a more detailed look at cleansing problems and solutions are:
While modeling the data and developing extraction methods and policies, the time is right to decide on several other policies of importance such as how often to reload records.
What types of information should be cataloged in the metadata? Will the data be transfigured upon loading (such as stripping out transaction data) or will all cleansed data remain intact?
With the data mapped and the database chosen, it's time to estimate how much capacity is needed for the initial application. Unfortunately, no rule of thumb exists, so an estimate has to be made by trial and error. Take a sample extraction. See how big it is. Estimate how many more will be used. Add in the DBMS overhead, such as indexing tables, if any. Or subtract the savings of DBMS compression techniques. The result will be a place to start. Hardware vendors will try to convince warehouse developers to err on the "up to 20 percent too much" side and with some reasonably acceptable logic chances are that the data warehouse will be larger than expected and grow faster than expected.
Such was the case for HCIA. The warehouse for this company is actually the engine for much of its products and services. The company provides standard reports to its clients on various issues relating to the healthcare industry, including business performance and treatments. It also offers custom reports. For example, a hospital might come to HCIA and want analysis of a specific illness comparing the success of treatments versus the cost of treatment for itself and its four major competitors.
"One of our biggest technical challenges was anticipating the size. We wish we had done a better job of anticipating how big our warehouse could get. We have grown it beyond the capabilities of the technology," says HCIA's Wagner.
Specifically, data is dispersed throughout multiple data warehousing repositories, the largest of which is about 500 gigabytes. In all, about three terabytes of data is spread on nine Sun servers.
The warehouse is actually designed in a traditional relational or distributed schema, rather than today's more popular star or snowflake schema. This architecture was chosen because the warehouse was originally based on Informix version 4, which was limited to 100 gigabytes in any given database, says Wagner. HCIA is currently moving the warehouse to Informix version 7, with most of the upgrade completed at this time. Informix 7 doesn't have the same capacity limitations of earlier versions, so HCIA is going to stick with this schema, with no plans for a redesign.
However, consultants warn not to overbuy at the onset. Sun is extremely modular. More hardware can always be purchased, but excess cannot be sent back. Other issues in dealing with such huge volumes of data exist, such as archiving and sizing disk space to optimize performance, not just account for capacity.
"Make sure you spend a considerable amount of time figuring out how to lay out the data on the physical devices. A lot of very new technologies have come out within the last year that are better for dealing with large amounts of data," advises Wagner. "All of them have the potential for speeding up the application or tying it down. Make sure you spread your data across as many physical drives as possible in order to speed up I/O. Don't fill up one drive and move on to the next." Technologies such as Sun's RSM Array 2000 RAID drives may be a wise initial investment with the servers, Wagner believes.
Step three: At last! Implementation
The final step in a data warehouse is to bring the application to the users. Ideally, a prototype should be available for hands-on experimentation within three to four months after beginning the project, says both System Techniques' Jane Griffin and Oracle's Shannon Platz. That's about the time when the project's executive sponsors start feeling antsy and want to see some progress.
Prototypes should be offered with the instruction to pound on them, not protect them. Let users find out what's wrong, and what's right. Offer prototypes early and often. One good way to start is to test for the look and feel of the front end by putting together a quick GUI interface and allowing users to comment on it. This typically doesn't take much time to create and may provide some invaluable insight into what users were truly expecting (versus what they said they were expecting during pre-engineering meetings).
"Some of [the development process] is trial and error, but for the most part it's planning. Are the drives balanced? Any CPUs underutilized? Are the queries the ones you expected? That's the art form," says Access Data's Dampier. "Data warehouses don't start off by knowing all the answers. For instance, say you discover that all customers that have a certain salary have a certain attribute. Your users will then start running queries to find out what that attribute has to do with that salary." Such queries from queries can't be planned for explicitly, but a well-balanced design will allow the warehouse to be flexible enough to accommodate them.
In addition prototypes should, at some point, be examined for the following attributes:
Create a baseline model
Use the results of the prototypes to create a baseline performance model. This might include performance statistics such as CPU usage and I/O response times. A number of tools that perform modeling and capacity planning functions for Solaris exist. Among them is BGS Systems' BEST/1. To locate more tools, or learn more about the process of modeling and capacity planning, contact the Institute for Computer Capacity Management (ICCM). (See Resources.)
Training occurs at the prototyping stage and beyond. One good technique is to include the departments "power users" in the prototyping committee. These are the people that others in the department tend to turn to for immediate help. By engaging these people as part of the development team, the warehouse help desk might be relieved of some of its calls.
At the same time, training should be geared toward a mix of users. The executive branch will probably run different types of queries than those run by the sales department. Also, this is the time to introduce online documentation and avenues where users can freely report bugs and offer suggestions for improvements.
Finally, don't suppose that no news is good news. Zero feedback probably means that people aren't using the system, rather than that they're happy with it. Part of the initial budget and timeline should include periodic interviews with the pre-engineering team and/or the prototyping team. Developers can then be sure that the system is responding to the needs of users as the company grows.
Also, at three months, six months, and then semi-annually thereafter, the performance of the system should be modeled again and compared to previous models. Developers can watch how performance develops as users grow comfortable with the system and incorporate it more tightly into their work culture.
About the author
Julie Bort is a freelance writer and author in Dillon, CO. Her most recent book is Building an Extranet (John Wiley & Sons). Reach Julie at firstname.lastname@example.org.
If you have technical problems with this magazine, contact email@example.com