Cleaning

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Although this step is where the most transformations are needed, usually, we`ll mostly use the ETL tool features. During this step, data cleaning, data quality checks and data integration are performed. This means that individual columns are checked for valid data (e.g. ranges for numeric values) and business rules are enforced. Also, data from different sources are integrated.

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Dimension Tables

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Dimension Tables

Dimension tables are used to describe, filter and aggregate the data that is collected for a specific activity or event. Practically they can be considered as the by words when defining the information content of a report. For example number of encounters by month by provider. Dimensions have relatively few rows compared to the fact tables, but have many descriptive textual attributes. Many times they contain hierarchies, for example location, province, country and are highly denormalized, storing even multiple hierarchies in a single table. A dimension has a primary key, that is usually a meaningless, unique integer, called a surrogate key. These keys are handled by the ETL process and are used to join dimensions to fact tables. In addition, a dimension has one or more fields that serve as the natural key of the table and are based on fields that come from the source system. When we track changes in a dimension there is a one-to-many relationship between a natural key and a surrogate key. Finally, a dimension contains many (hopefully) descriptive attributes.

Delivering Dimension Tables

At this step the dimension tables are loaded with the new and changed data. Dimension tables consist of a primary key (which is a meaningless integer key – a surrogate key), which is used to join to fact tables, a natural key which is based on fields of the source system, that doesn`t change over time or across systems (e.g. for a patient dimension we use the patient identifier for natural key, not a patient id) and descriptive attributes.
We also need to handle changes to dimension attributes. There are three approaches:
1. Overwrite the existing dimension record (type 1 slowly changing dimension)
In this case we use an update else insert (merging) functionality.
2. A new dimension record is created with a new surrogate key, but the same natural key (type 2 slowly changing dimensions).
If a dimension contains attributes that are handled as above, we`ll add two fields: a row effective datetime (exact datetime of change) and a row end datetime (exact datetime of next change). This way, we know the interval a record was valid.
3. Two columns are created for each attribute that we want to track changes. The first column contains the most recent value and the second column contains the old value.
For the moment Type 1 is used.
Other Issues:

  • Generating surrogate keys
  • Date Dimension
  • Dimensional Roles
What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Degenerate Dimensions

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

We sometimes want to store identifiers coming from source systems transaction tables in the data warehouse, for purposes of grouping rows, tracking where the data come from or using them as primary keys in fact tables. We don`t create a separate dimension for these attributes as they would contain only one field, so we store them directly on the fact table and call them degenerate dimensions (for example observation id, encounter id).

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Sub-Dimensions

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

A subdimension is a secondary dimension table linked to a dimension table. Although we try to have all the related attributes of a dimension in a single table, there are some sets of attributes that are used again and again in many dimensions. The same way that dimensions are used as an entry point in fact tables, subdimensions are used as an entry point in dimensions. The most usual subdimension is the date dimension, for example a last encounter date in a patient dimension.

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Fact Tables

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful 

Fact Tables

Fact tables store the measurements of the processes/activities/work flows/events of the system we are trying to model. There is one row in the fact table for each measurement. The fields of the fact table that hold the result of the measurements are called facts (or measures). Dimensions (and their attributes) describe these measurements and contain the context surrounding them. This is known as the grain of the fact table (e.g. a test to determine the temperature of a patient on certain day at a certain location). A fact table contains foreign keys that point to the dimension tables, have one or more numerical fields (the facts) and may contain one or more degenerate dimensions. The primary key of these tables is defined by a subset of the foreign keys.

Delivering Fact Tables Steps

  • Structure
  • Referential Integrity
  • Surrogate Key Pipeline
  • Preparing the fact tables for loading
  • Incremental Loading
  • Aggregations
  • OLAP Cubes

 

What do you think of this post?
  • Awesome 
  • Interesting 
  • Useful