Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
581 views
in Technique[技术] by (71.8m points)

data modeling - Star schema, normalized dimensions, denormalized hierarchy level keys

Given the following star schema tables.

  • fact, two dimensions, two measures.

#   geog_abb  time_date amount     value
#1:       AL 2013-03-26  55.57 9113.3898
#2:       CO 2011-06-28  19.25 9846.6468
#3:       MI 2012-05-15  94.87 4762.5398
#4:       SC 2013-01-22  29.84  649.7681
#5:       ND 2014-12-03  37.05 6419.0224
  • geography dimension, single hierarchy, 3 levels in hierarchy.

#   geog_abb  geog_name geog_division_name geog_region_name
#1:       AK     Alaska            Pacific             West
#2:       AL    Alabama East South Central            South
#3:       AR   Arkansas West South Central            South
#4:       AZ    Arizona           Mountain             West
#5:       CA California            Pacific             West
  • time dimension, two hierarchies, 4 levels in each.

#    time_date time_weekday time_week time_month time_month_name time_quarter time_quarter_name time_year
#1: 2010-01-01       Friday         1          1         January            1                Q1      2010
#2: 2010-01-02     Saturday         1          1         January            1                Q1      2010
#3: 2010-01-03       Sunday         1          1         January            1                Q1      2010
#4: 2010-01-04       Monday         1          1         January            1                Q1      2010
#5: 2010-01-05      Tuesday         1          1         January            1                Q1      2010

Examples is stripped of surrogate keys to improve readability. In results there are levels in hierarchy without other attributes, just don't bother that, they are still levels in hierarchy.


In star schema expressed as:

         GEOGRAPHY (all fields)
        /
       /
   FACT
       
        
         TIME (all fields)

In snowflake schema expressed as:

                    geog_region_name
                   /
                  geog_division_name
                 /
                geog_abb (+ geog_name)
               /
              /
          FACT
              
               
                time_date
                   |
hierarchies:       |
        weekly    /     monthly
                 /    
                /     
   time_weekday         time_month (+ time_month_name)
            |             |
            |             |
        time_week     time_quarter (+ time_quarter_name)
            |             |
            |             |
        time_year      time_year

How would you call the following schema

Does it have any specific name? Starflake? :)

        |>-- geog_region_name
        |
        |>-- geog_division_name
        |
        |>-- geog_abb (+ geog_name)
        |
        |
        geography base
       /
      /
  FACT
      
       
        time base
        |
        |
        |>-- time_date
        |
        |>-- time_weekday
        |
        |>-- time_week
        |
        |>-- time_month (+ time_month_name)
        |
        |>-- time_quarter (+ time_quarter_name)
        |
        |>-- time_year

It basically has a dimension base table storing identities of every level of every hierarchy within a dimension. No need for recursive walk through snowflake's levels, potentially less joins. Data still well normalized, only keys are denormalized into base table. All levels from all hierarchies tied to lowest grain key of a dimension in dimension base.
Additionally having a dimension base table allows to handle time variant attributes/temporal queries just in that table, at the granularity of a hierarchy level.

Here is the tabular representation.

Still on natural keys!

  • fact

#    geog_abb  time_date amount     value
# 1:       AK 2010-01-01 154.43 12395.472
# 2:       AK 2010-01-02  88.89  6257.639
# 3:       AK 2010-01-03  81.74  7193.075
# 4:       AK 2010-01-04 165.87 11150.619
# 5:       AK 2010-01-05   8.75  6953.055
  • time dimension base

#     time_date time_year time_quarter time_month time_week time_weekday
# 1: 2010-01-01      2010            1          1         1       Friday
# 2: 2010-01-02      2010            1          1         1     Saturday
# 3: 2010-01-03      2010            1          1         1       Sunday
# 4: 2010-01-04      2010            1          1         1       Monday
# 5: 2010-01-05      2010            1          1         1      Tuesday
  • time dimension normalization to hierarchy levels

#    time_year
# 1:      2010
# 2:      2011
# 3:      2012
# 4:      2013
# 5:      2014

#    time_quarter time_quarter_name
# 1:            1                Q1
# 2:            2                Q2
# 3:            3                Q3
# 4:            4                Q4

#    time_month time_month_name
# 1:          1         January
# 2:          2        February
# 3:          3           March
# 4:          4           April
# 5:          5             May

#    time_week
# 1:         1
# 2:         2
# 3:         3
# 4:         4
# 5:         5

#    time_weekday
# 1:       Friday
# 2:       Monday
# 3:     Saturday
# 4:       Sunday
# 5:     Thursday

#     time_date time_week time_weekday time_year
# 1: 2010-01-01         1       Friday      2010
# 2: 2010-01-02         1     Saturday      2010
# 3: 2010-01-03         1       Sunday      2010
# 4: 2010-01-04         1       Monday      2010
# 5: 2010-01-05         1      Tuesday      2010
  • geography dimension base

#    geog_abb geog_region_name geog_division_name
# 1:       AK             West            Pacific
# 2:       AL            South East South Central
# 3:       AR            South West South Central

# 4:       AZ             West           Mountain
# 5:       CA             West            Pacific
  • geography dimension normalization to hierarchy levels

#    geog_region_name
# 1:    North Central
# 2:        Northeast
# 3:            South
# 4:             West

#    geog_division_name
# 1: East North Central
# 2: East South Central
# 3:    Middle Atlantic
# 4:           Mountain
# 5:        New England

#    geog_abb  geog_name geog_division_name geog_region_name
# 1:       AK     Alaska            Pacific             West
# 2:       AL    Alabama East South Central            South
# 3:       AR   Arkansas West South Central            South
# 4:       AZ    Arizona           Mountain             West
# 5:       CA California            Pacific             West

Dimension base could store also primary key's attributes, this would de-duplicate dimension's lowest level but will be less consistent (time_date levels from both hierarchies would fit into time dimension base tables).


What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity.
Does it have any name? It is being use? If not why?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

You are building a snowflake schema with shortcuts.

It's used and BI tools can easily use the shortcuts.

You can also have shortcuts from a parent level of a dimension to a fact table at child level for that dimension. It works, you can skip a join, but you need to store an additional column in the fact table.

The only concern is about data integrity, if a parent-child relationship changes you need to update not only the child table, but also all other tables where this relationship is stored.

It's not a big deal if you generate every time your dimension table from your normalize data, but you need to be careful, even more if you store a parent ID in the fact table.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...