Dimensional Modeling Essay

Introduction

Business intelligence is the key achromous in today’s competitive world of business and Data Warehousing the approach for achieving this level of intelligence about your business from your business. For years, data management people believed that there was only one real, persistent level of data – the operational level. All other data, while accepted, was derivable from this level. This is not true as there are several levels of data within an organization.

The reason stems not from information technology (IT), but from business. Classically, there are three major levels of management and decision making within an organization: operational, tactical and strategic (figure 1). While these levels feed one another, they are essentially distinct. Operational data deals with day-to- day operations. Tactical data deals with medium-term decisions. Strategic data deals with long- term decisions. Decision making changes as one goes from level to level. At the operational level, decisions are structured. This means they are based on rules. (A credit card charge may not exceed the customer’s credit limit.) At the tactical level, decisions are semi-structured. (Did we meet our branch quota for new loans this week?) Strategic decisions are unstructured. (Should a bank lower its minimum balances to retain more customers and acquire more new customers?)

Levels of Analysis (figure 1)

Corresponding to each of these decision levels are three levels of data. These levels of data also are separate and distinct – again, one feeding the other. Not all strategic data can be derived from operational data. In an organization, there are at least four different kinds of data, including: internally owned, externally acquired, self-reported and modeled. External data, such as competitive data, is obviously acquired from outside agencies. Modeled data is data that is mathematically created (e.g., data created by analyzing customer characteristics and market demographics, and producing such measures as market segments). External and modeled data do not exist in the operational environment. Strategic data is usually comprised of internal and external data.

Levels of Analysis

There are many levels of reporting and analysis that range from fairly structured to quite unstructured (figure

2) Levels of Reporting and Analysis Characteristics of Analytical Data (figure 2)

Analytical data has its own characteristics. It is management-oriented, historical, query-oriented and integrated.

Management-oriented.

Analytical data focuses on management measures. To do this, it often uses different grains of data, such as transactions, periodic snapshots and summaries. Management data is often rolled-up to different levels. Management also requires some cross-functional information. External data is often used to supplement internal data.

Historical.

Management needs several years of historical data to reveal trends. This allows year-to-year comparison and reveals patterns over time. Therefore, changes in facts and dimensions need to be kept over time. A common historical requirement is to allow restated and non-restated versions of the data. This allows management to pose changes and test the effect of the changes on the business. It allows them to restate the past in terms of the present or the present in terms of the past.

Query-oriented.

Analytical data is not used for transaction processing and maintenance but for reporting and different forms of analysis, such as mining. It is mostly read-only. It is not necessarily totally read-only, because some data could be changed as new information is discovered. The analytical environment needs to be able to support a wide range of query types, such as ad hoc, prescribed ad hoc and standardized. The warehouse can be queried directly or used to supply extracts.

Integrated.

Proper analysis and reporting requires data from multiple relevant sources. These can be internal, external, self- reported and even modeled data sources. The data must be reconciled to ensure its quality. This means that it must be cleansed to produce data of good quality. The integration of disparate data is one of the main challenges in the analytical environment.

Data Modeling

Modeling is the process of creating a representation of real or abstract objects and data modeling is the act of exploring data-oriented structures. There are several modeling themes a brief depiction on some related to our report are as follows to get the ball rolling in the right direction.

Logical Modeling

A logical model is a representation of a business problem, without regard to implementation, technology and organizational structure. The purpose of a logical model is to represent the business requirement completely, correctly and concisely.

A constraint of this model is that all redundancy is removed in order to focus purely on the business requirements and rules. A logical model does not presuppose the granularity of the data; this means that a logical model does not require that the data be only at a very detailed operational level. A logical model is not implemented; instead, it is converted to a physical model against which optimizations are performed, and this is implemented.

E-R Modeling

Entity-relationship (ER) modeling is a powerful technique for designing transaction processing systems in relational environments.

An ER model is a logical and graphical representation of the information needs of an organization. There are three main processes involved in ER modeling: classifying, characterizing and interrelating. The objects of interest to the organization are grouped into mutually exclusive classes called entities. These classes are assigned characteristics that describe them, called attributes. An important attribute is the identifier or key. Finally, one class or entity can be associated with another via connections called relationships.

Physical Model

A physical model is the specification of what is implemented. Physical models should be optimized, efficient, build able and robust.

The conversion to a logical model depends on many factors especially the size and complexity of the data, the complexity of the queries and the number of users.

The conversion from logical to physical models can be simple or complex, depending on the requirements. Logical model undergoes several transformations as it progresses from a purely logical model to a physically implemented model (figure 3). Following is a brief review of some of the possible transformations. We cover three forms of optimizations or trade-offs: safe, aggressive and technical. A trade-off is the emphasis of one feature, which becomes an advantage, against another feature, which then becomes a disadvantage.

Progression from Logical Model to Physically Implemented Model (figure 3)

Dimensional Modeling

The dimensional modeling is a logical design technique which represents a set of business measurements in a standard frame work. The schemas generated from dimensional modeling are symmetric in nature assists in structure and symmetric querying and reporting.

Dimensional model is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic “star-like” structure is often called a star join.

A dimensional model (figure 4) is a form of analytical design (or physical model) in which data is pre-classified as a fact or dimension. The purpose of a dimensional model is to improve performance by matching the data structure to the queries. People use the data by writing queries such as, “Give this period’s total sales volume and revenue by product, business unit and package.” Access can be inside-out or outside-in. When access occurs from dimension to fact, it is outside-in. “Give me total sales in volume and revenue for product XYZ in the NE region for the last period, compared to the same period last year” is outside-in. Access can also be inside-out, in which case the query analyzes the facts and then retrieves the appropriate dimensions. For example, “Give me the characteristics of term life policies for which the insured amount is less than $10,000” is inside-out.

Dimensional Model Example (figure 4)

A particular form of a dimensional model is the star schema, which consists of a central fact table containing measures, surrounded by one perimeter of descriptors, called dimensions. In a star schema, if a dimension is complex or leveled, it is compressed or flattened into a single dimension. For example, if Product consists of Product, Brand and Category, Brand and Category are compressed into Product. This compression causes some redundancy, but can sometimes improve performance. In a star schema, related complex dimensions are de-normalized to the extent that they are flattened into a single dimension.

Another version of the dimensional model is the snowflake. In a snowflake model, complex dimensions are re-normalized. A snowflake model is a model in which a given dimension has relationships to other levels of the same dimension. In the snowflake, the different levels or hierarchies of a dimension are kept separate. Model, Product, Brand and Category (figure 5) would be maintained as three separate but related tables.

Snowflake Model (figure 5)

De-normalization and the Dimensional Model

The fact that both the star and snowflake are physical schemas can be illustrated by examining a sample model. A dimensional model typically (not always) uses de-normalization (figure 6)

Dimensional Modeling Using De-normalization (figure 6)

1. It violates 3NF in dimensions by collapsing higher-level dimensions into the lowest level as in Brand and Category.

2. It violates 2NF in facts by collapsing common fact data from Order Header into the transaction, such as Order Date.

3. It often violates Boyce-Codd Normal Form (BCNF) by recording redundant relationships, such as the relationships both from Customer and Customer Demographics to Booked Order.

4. However, it supports changing dimensions by preserving 1NF in Customer and Customer Demographics.

This systematic degree and pattern of de-normalization is reserved for physical models. The important conclusion from this (and other discussions in this article) is that the star schema and snowflake schema are not logical models but physical models.

Fact?

A problem with the dimensional model is that data must be pre-classified as a fact or a dimension. In addition, only dimensions are permitted to have indices. While this sometimes works, it just as often doesn’t. This author prefers the view that data is data and that it is a fact or a dimension relative to the query within which it is used, not in and of itself. For example, we have Delivery Items and Order Items in our data warehouse, each as its own star schema. In many ways, this works fine – until we need to relate Delivery Items and Order Items. We do this when we conduct exception reporting. (What Orders did not result in Deliveries?) For this, we need to relate Delivery Items and Order Items. Order Items, which was a fact table, is now effectively a dimension in the exception query. We are told we should not do this; this is a fact-to-fact join. However, in most accounts, we have not had any difficulty doing fact-to-fact joins.

Dimensional Evaluation

Dimensional modeling has strengths and weaknesses. There are three ways to improve performance: use better hardware, use better software and optimize the data. Dimensional modeling uses the third method. The primary justification for dimensional modeling is to improve performance by compromising the data to compensate for the inefficiency of technology. A secondary purpose is to provide a consistent base for analysis. Dimensional modeling comes with a price and with restrictions. There are times and places where dimensional modeling is appropriate and will work and other times and places where it is inappropriate and will actually interfere with the goals of a warehouse.

Applicability of the Dimensional Model

The dimensional model is very appropriate to certain components of the data warehouse environment. Specifically, it is appropriate for most forms of data marts. Imbedded data marts, which are usually stored aggregates, are inherently dimensional in structure. Certain structures within the enterprise data warehouse (EDW) are also amenable to the dimensional model. These are driven by query needs. For example, if querying would regularly require joining data across many tables and rolling that up as well, a pre-joined dimensional structure would be useful.

Where do the various models fit into the overall data warehouse environment? The dimensional model is appropriate: for most forms of data marts, except where data mining, case reasoning and the like are involved; when the platform, even for the central data warehouse, cannot support a more normalized model; and for aggregates and summaries, which are inherently dimensional in structure.

The normalized model is more appropriate: for the central data warehouse; when the platform is capable of sustaining it; when the data is of general purpose; and where pure ad hoc must be supported.

Technology has a significant effect on the choice of data model. One must admit that a dimensional model is a conscious compromise to improve query performance. Suppose that technology was not an issue. Suppose that the technology was so fast it could do a 27- table deep join and still give satisfactory performance. This is a real example as related by Al Messerli, formerly CIO of 3M, who performed it successfully on the Teradata platform. However, if one does not have such technology, then care must be taken. While I still believe the central DW model need not be a fully dimensional model, it should not be a fully normalized model. All models should be optimized based on the requirements of the business.

Dimensional Vs E-R Model

Dimensional modeling is a top-down design process. First you identify the main business processes that act as the sources of the fact tables, and then you populate the fact tables with numeric, additive facts. You describe each fact record by as many business dimensions as you can identify. The resulting fact table records consist entirely of key values that have many-to-many relationships with one other, together with numeric data representing measurements of each dimension. Overall, the storage of the fact table records is quite efficient. The dimension tables represent the biggest departure from the usual ER and normalization techniques. It is important that the dimension tables remain as flat, single-level tables without being further normalized. E-R is good for reporting and point queries while dimensional modeling is good for ad-hoc query analysis. Many times, this translates to an E-R based data warehouse and a dimensional data mart layer. Dimensional imposes some rules on the modeling, but results in a model that has the access methods inherent by virtue of the relationships. Users are also better able to relate to the ‘see measure by dimensional value(s)’ paradigm than ‘anything goes’.

Steps for Designing Dimensional Model

1. First identify the business process.

In this step you will determine what is your business process that your data warehouse represents. This process will be the source of your metrics or measurements.

2. Next Identify the Grain

You will determine what does one row of fact table mean.

3. Next Identify the Dimensions

Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much as possible and confirm to your grain.

4. Finally Identify the facts

In this step you will identify what are your measurements (or metrics or facts). The facts should be numeric and should confirm to the grain defines in step 2.

DM For HR System

Most of us operate in a rapidly changing, competitive business environment .We need to better understand our employees’ demographics, skills, earnings and performance in order to maximize their impact.

Dimensional models for such a domain may yield many interesting facts. In most dimensional models the facts presented are quantified and additive in nature however in HR many of the facts aren’t additive and most are even numbers yet they present interesting employee trends.

Employee Change Transaction

Suppose and HR department of a large enterprise having 100,000 employees and each employee has a detailed HR profile with at least 100 attributes including date of hire, grade, salary ,review dates, review outcome ,vacation entitlements and many others .

The highest priority business requirement is to track and analyze these employee transaction events accurately .The following dimensional model addresses this mentioned requirement and is the initial draft of employee change transaction (figure 7).

Employee change transaction (figure 7).

In the above schema the transaction date dimension is the exact date of the employee transaction and the transaction type dimension refers to a variety of transaction that caused the creation of the particular row such as promotion or change of address.

We envision the type 2 ( preservation of history )slowly changing dimension of technique for tracking of changed profile attributes in the employee dimension .We also create a new type 2 row in the employee dimension that represents the employee profile as a result of the transaction event. The row continues to accurately describe the employee until the next employee transaction occurs.

HR Periodic Snap Shot

The prior schema discussed does not contains any facts in the fact table but the employee transactional table contains textual values which is the primary constrain of query and reports.

HR Periodic Snap Shot (figure 8 ).

In addition to profiling the employee base in HR we also nee to report summary statuses of the employee base on regular monthly basis. Management is interested in crunching numbers like number of employees, total salary paid during a month, vacation days taken and many more .The periodic snapshot dimensional model stratifies the bulk of managements inquires regarding monthly employee statistics. All the facts are additive across all the dimensions except the fact labeled as Balance .These balance are semi additives and must be averaged across the time dimension after adding across the other dimensions.

DM For Inventory System

The business process we are interested in analyzing now is the inventory systems. Optimized inventory levels in the stores can have a major impact on productivity and profitability. Making sure the right product is in the right store at the right time minimized the out of stock problem. The Store level periodic snapshot (figure 9) represents a dimensional model which caters information like daily inventory by product at each individual store.

Store level periodic snapshot (figure 9)

This is the simplest view presenting facts like inventory at hand by using quantifiers as quantity in and quantity out. Even a schema as simple as this can be very use full .Numerous insight can be derived if the inventory levels are measured frequently form many products in many storage location. If we are analyzing the in store inventory levels this schema can be used to balance the store inventory on periodic bases.

We can further indulge in the inventory systems by evolving entities from our transactional systems like vendors and brand.

Periodic inventory snap shot (figure 10).

The dimensional model of Periodic inventory snap shot (figure 10) is another perspective for analyzing the inventory system. The model answers queries like the number of products returned on each inventory indent. By this the management can rate there vendors and brand on the basis of consistency in delivery the right quality of product

Conclusion

The classical transaction database is not able to do analytical processing, because:

1. “Transactional databases contain only raw data, and thus, the processing speed will be considerably slower.

2. Transactional databases are not designed for queries, reports and analyses…

3. Transactional databases are inconsistent in the way that they represent information.”

In order to surpass the above allegations a new environment was needed to be developed which was named as “Data Warehouse” and for this new environment we needed a representation system which would assist in silent translation from our existing transactional system to the warehouse environment and hence “Dimensional Modeling” came to existence.

Presented in the article are some interesting facts on the traditional transactional system and data warehouse/data mart. Followed by a dimensional implementation of some of the business process of an organization and identifying the issues the new model addressed.

Data warehouses are specially designed to handle different types of queries-queries based on statistical analysis. Until there are better definitions of what and how to process the volumes of available data within a company in a meaningful and reliable way, any company considering implementing a data warehouse or data mart would have to persist with the current techniques to cater there needs.

References

* Data warehousing Fundamentals By Paulraj Ponnaih

* Building a Data-Warehouse(2nd Edition) by Inwon

* Data warehousing Strategies, Technology and Techniques By Hammergren

* http://freedatawarehouse.com/tutorials/dmtutorial/

* http://www.donmeyer.com/

* http://searchcrm.techtarget.com/

* http://www.dbmsmag.com/