Bottom-Up Approach

The advantages of this approach are:

  • Faster and easier implementation of manageable pieces
  • Favorable return on investment and proof of concept
  • Less risk of failure
  • Inherently incremental; can schedule important data marts first
  • Allows project team to learn and grow

The disadvantages are:

  • Each data mart has its own narrow view of data
  • Permeates redundant data in every data mart
  • Perpetuates inconsistent and irreconcilable data
  • Proliferates unmanageable interfaces

In this bottom-up approach, you build your departmental data marts one by one. You would set a priority scheme to determine which data marts you must build first. The most severe drawback of this approach is data fragmentation. Each independent data mart will be blind to the overall requirements of the entire organization.

Courtesy: Data Warehouse Fundamentals – Paulraj Ponniah

Top-Down Approach

The advantages of this approach are:

  • A truly corporate effort, an enterprise view of data
  • Inherently architected—not a union of disparate data marts
  • Single, central storage of data about the content
  • Centralized rules and control
  • May see quick results if implemented with iterations

The disadvantages are:

  • Takes longer to build even with an iterative method
  • High exposure/risk to failure
  • Needs high level of cross-functional skills
  • High outlay without proof of concept

This is the big-picture approach in which you build the overall, big, enterprise-wide data warehouse. Here you do not have a collection of fragmented islands of information. The data warehouse is large and integrated. This approach, however, would take longer to build and has a high risk of failure. If you do not have experienced professionals on your team, this approach could be dangerous. Also, it will be difficult to sell this approach to senior management and sponsors. They are not likely to see results soon enough.

Courtesy: Data Warehouse Fundamentals – Paulraj Ponniah

Cleaning

Although this step is where most of the transformations take place, we'll usually rely on the ETL tool's built-in features. During this step, data cleaning, data quality checks, and data integration are performed. This means that individual columns are checked for valid data (e.g., ranges for numeric values) and business rules are enforced. Data from the different sources is also integrated.
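
As a minimal sketch of what such column checks and business rules amount to (the field names, valid ranges, and the rule below are hypothetical examples; in practice this logic would live in the ETL tool):

def clean_row(row):
    """Validate one source row; return the row plus any quality errors."""
    errors = []

    # Range check on a numeric column.
    age = row.get("patient_age")
    if age is None or not (0 <= age <= 120):
        errors.append(f"patient_age out of range: {age!r}")

    # A simple business rule: discharge cannot precede admission
    # (ISO date strings compare correctly as text).
    if row.get("discharge_date") and row.get("admission_date"):
        if row["discharge_date"] < row["admission_date"]:
            errors.append("discharge_date precedes admission_date")

    return row, errors

rows = [
    {"patient_age": 34, "admission_date": "2011-03-01", "discharge_date": "2011-03-04"},
    {"patient_age": -1, "admission_date": "2011-03-02", "discharge_date": "2011-03-01"},
]

clean_rows, rejects = [], []
for row in rows:
    _, errors = clean_row(row)
    (rejects if errors else clean_rows).append(row)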

Dimension Tables

Dimension tables are used to describe, filter, and aggregate the data that is collected for a specific activity or event. In practice, they can be thought of as the "by" words used when defining the information content of a report; for example, number of encounters by month by provider. Dimensions have relatively few rows compared to the fact tables, but many descriptive textual attributes. They often contain hierarchies (for example, location, province, country) and are highly denormalized, storing even multiple hierarchies in a single table. A dimension has a primary key, which is usually a meaningless, unique integer called a surrogate key. These keys are handled by the ETL process and are used to join dimensions to fact tables. In addition, a dimension has one or more fields that serve as the natural key of the table and are based on fields that come from the source system. When we track changes in a dimension, there is a one-to-many relationship between a natural key and its surrogate keys. Finally, a dimension contains many (hopefully) descriptive attributes.
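
To make the structure concrete, here is an illustrative sketch of a single dimension row; the provider attributes and the denormalized location hierarchy are hypothetical examples:

from dataclasses import dataclass

@dataclass
class ProviderDim:
    provider_key: int    # surrogate key: meaningless integer assigned by the ETL
    provider_id: str     # natural key: identifier from the source system
    provider_name: str   # descriptive attribute
    specialty: str       # descriptive attribute
    location: str        # denormalized hierarchy stored in one table:
    province: str        #   location -> province -> country
    country: str

# A fact table stores provider_key, so a report such as "number of
# encounters by month by provider" joins facts to this dimension on
# the surrogate key.
row = ProviderDim(1, "PRV-0042", "A. Example", "Cardiology",
                  "Springfield", "Central Province", "Examplestan")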

Delivering Dimension Tables

At this step the dimension tables are loaded with the new and changed data. Dimension tables consist of a primary key (a meaningless integer key, i.e., a surrogate key), which is used to join to fact tables; a natural key, which is based on fields of the source system and doesn't change over time or across systems (e.g., for a patient dimension we use the patient identifier as the natural key, not an internal patient id); and descriptive attributes.
We also need to handle changes to dimension attributes. There are three approaches (the first two are sketched below):
1. Overwrite the existing dimension record (type 1 slowly changing dimension). In this case we use update-else-insert (merge) functionality.
2. Create a new dimension record with a new surrogate key but the same natural key (type 2 slowly changing dimension). If a dimension contains attributes handled this way, we add two fields: a row effective datetime (the exact datetime of the change) and a row end datetime (the exact datetime of the next change). This way, we know the interval during which a record was valid.
3. Create two columns for each attribute whose changes we want to track (type 3 slowly changing dimension): the first column contains the most recent value and the second contains the old value.
For the moment, Type 1 is used.
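
A minimal sketch of how type 1 and type 2 loading might look, assuming a simple in-memory dimension keyed by natural key (the dimension structure and field names here are hypothetical, not a specific tool's API):

from datetime import datetime
from itertools import count

dimension = {}             # natural_key -> list of row versions
surrogate_keys = count(1)  # sketch of a surrogate-key generator

def load_type1(natural_key, attributes):
    """Type 1: update else insert (merge); the old value is overwritten."""
    versions = dimension.get(natural_key)
    if versions:
        versions[-1].update(attributes)  # overwrite the current record
    else:
        dimension[natural_key] = [{"surrogate_key": next(surrogate_keys),
                                   "natural_key": natural_key,
                                   **attributes}]

def load_type2(natural_key, attributes, now=None):
    """Type 2: close the current record and insert a new version."""
    now = now or datetime.now()
    versions = dimension.setdefault(natural_key, [])
    if versions:
        versions[-1]["row_end_datetime"] = now  # end the old version
    versions.append({"surrogate_key": next(surrogate_keys),  # new surrogate key
                     "natural_key": natural_key,             # same natural key
                     "row_effective_datetime": now,
                     "row_end_datetime": None,
                     **attributes})
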
Other Issues:

  • Generating surrogate keys (see the counter in the sketch above)
  • Date Dimension (sketched below)
  • Dimensional Roles
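
Of these, the date dimension is the most mechanical: it is typically a pre-generated calendar table with one row per day. A brief sketch, assuming a yyyymmdd-style integer key (a common convention, not a requirement):

from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one date-dimension row per calendar day."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # yyyymmdd integer key
            "full_date": day,
            "year": day.year,
            "month": day.month,
            "month_name": day.strftime("%B"),
            "day_of_week": day.strftime("%A"),
        })
        day += timedelta(days=1)
    return rows

calendar = build_date_dimension(date(2011, 1, 1), date(2011, 12, 31))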

Degenerate Dimensions

We sometimes want to store identifiers coming from the source systems' transaction tables in the data warehouse, for purposes of grouping rows, tracking where the data came from, or using them as primary keys in fact tables. We don't create a separate dimension for these attributes, as it would contain only one field; instead, we store them directly on the fact table and call them degenerate dimensions (for example, observation id or encounter id).
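
An illustrative fact row under this scheme (the field names are hypothetical), with encounter_id carried as a degenerate dimension:

# The *_key fields are surrogate keys that join to dimension tables,
# while encounter_id is a degenerate dimension: a transaction identifier
# from the source system stored directly on the fact table.
fact_encounter = {
    "date_key": 20110315,         # joins to the date dimension
    "patient_key": 1042,          # joins to the patient dimension
    "provider_key": 7,            # joins to the provider dimension
    "encounter_id": "ENC-98765",  # degenerate dimension: no separate table
    "encounter_count": 1,         # additive measure
}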
