Star, Snowflake, Fact Constellation

n The entity-relationship model is commonly used in the design of relational databases.

n A data warehouse, however, requires a concise, subject- oriented schema that facilitates online data analysis.

n The most popular data model for a data warehouse is a multidimensional model, which can exist in the form:

Copyright By cscodehelp代写 加微信 cscodehelp

n Star schema

n Snowflake schema

n Fact constellation schema

Star Schema

n Star schema is the most common modeling paradigm for data warehouse.

n Star schema consists of:

n A large central table (fact table) containing the bulk of the

n A set of smaller attendant tables (dimension tables), one for each dimension

n The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

n For more details: https://en.wikipedia.org/wiki/Star_schema

Example of Star Schema

day day_of_the_week month

quarter year

Sales Fact Table

branch_key

location_key

units_sold

dollars_sold

location_key street

city state_or_province country

item_key item_name brand

type supplier_type

branch_key branch_name branch_type

Facts/Measures

Star Schema

n Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes.

n For example, the location dimension table contains the attribute set {location key, street, city, state_or_province, country}.

n This constraint may introduce some redundancy.

n For example, ¡°Urbana¡± and ¡°Chicago¡± are both cities in the state

of Illinois, USA.

n Entries for such cities in the location dimension table will create redundancy related to the attributes state_or_province and country: ¡°…, Urbana, IL, USA¡± and ¡°…, Chicago, IL, USA¡±.

n Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).

Snowflake Schema

n The snowflake schema is a variant of the star schema model, where some dimension tables are further split into additional tables.

n The resulting schema graph forms a shape similar to a snowflake.

n The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be split to reduce redundancies.

n Such a table is easy to maintain and saves storage space. n More details: https://en.wikipedia.org/wiki/Snowflake_schema

Example of Snowflake Schema

day day_of_the_week month

quarter year

Sales Fact Table

branch_key

location_key

units_sold

dollars_sold

branch_key branch_name branch_type

Facts/Measures

item_key item_name brand

type supplier_key

location_key street city_key

supplier_key supplier_type

city state_or_province country

Snowflake Schema

n However, this type of space saving is not important in comparison to the typical magnitude of the fact table.

n Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.

n Consequently, the system performance may be adversely impacted.

n Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.

Fact Constellation

n Complex applications may require multiple fact tables to share dimension tables.

n This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.

Example of Fact Constellation

day day_of_the_week month

quarter year

Shipping Fact Table

item_key item_name brand

type supplier_type

shipper_key

from_location

to_location

dollars_cost

units_shipped

Sales Fact Table

branch_key

location_key

units_sold

dollars_sold

branch_key branch_name branch_type

location_key street

city province_or_state country

Facts/Measures

shipper_key shipper_name location_key shipper_type

Fact Constellation

n In data warehousing, there is a distinction between a data warehouse and a data mart.

n A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.

n For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects.

Fact Constellation

n A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.

n For data marts, the star or snowflake schema is commonly used, since both are geared toward modeling a single subject.

n Of course, as mentioned previously, star schema is more popular and efficient.

all region

Germany …

North_America Canada … Mexico

Dimensions: Schema Hierarchy vs. Set- grouping Hierarchy

all Europe …

A concept hierarchy that is a total (or partial) ordering of the attributes in a database schema is called a schema hierarchy.

city Frankfurt … office

Vancouver … Toronto L. Chan … M. Wind

Dimensions: Schema Hierarchy vs. Set- grouping Hierarchy

Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.

Facts/Measures: Three Categories

n Facts/Measures can be organized into three categories: distributive, algebraic, and holistic. The categorization is based on the kind of aggregate functions used.

n Distributive Measure:

n A distributive measure corresponds to a distributive aggregate

n An aggregate function is distributive if it can be computed in a distributed manner as follows.

n Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values.

n If the result derived by applying the function to these n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner

n E.g., count(), sum(), min(), max()

Facts/Measures: Three Categories

n Algebraic Measure:

n An algebraic measure corresponds to an algebraic aggregate

n An aggregate function is algebraic if it can be computed using an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function.

n E.g., avg()=sum()/count()

n Note that both sum() and count() are distributive aggregate

functions.

n Holistic Measure: There does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation.

n E.g., median()

Facts/Measures: Three Categories

n Most large data cube applications require efficient computation of distributive and algebraic measures.

n Many efficient techniques for computation of distributive and algebraic measures exist.

n In contrast, it is difficult to compute holistic measures efficiently.

n Efficient techniques to approximate the computation of some holistic measures, however, do exist.

Typical OLAP Operations

n Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

n Drill down (roll down): reverse of roll-up

n from higher level summary to lower level summary or

detailed data, or introducing new dimensions

n Slice: perform a selection on one dimension of the given cube, resulting in a subcube.

n Dice: define a subcube by performing a selection on two or more dimensions

n Pivot (rotate): a visualization operation that rotates the data axes in view to provide an alternative data presentation

Typical OLAP Operations

n Other operations

n drill across: executes queries involving (i.e., across)

more than one fact table

n drill through: uses relational SQL facilities to drill through the bottom level of a data cube down to its back-end relational tables

A Star-net Query Model

n The querying of multidimensional databases can be based on a star-net model.

n The star-net model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.

n Each abstraction level in the hierarchy is called a footprint.

n The footprints represent the granularities that can be used by OLAP operations such as drill-down and roll-up.

A Star-net Query Model

Chapter 4: Data Warehousing and On-line Analytical Processing

n Data Warehouse: Basic Concepts

n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage

n Data Warehouse Implementation

Design of Data Warehouse: A Business Analysis Framework

n To design an effective data warehouse, we need to understand/analyze business needs and construct a business analysis framework.

n The construction of a data warehouse, a large and complex information system, can be viewed as the construction of a large and complex building, for which the owner, architect, and builder have different views.

n These views are combined to form a complex framework that represents the top-down, business-driven, or owner¡¯s perspective, as well as the bottom-up, builder- driven, or implementer¡¯s view of the information system.

Design of Data Warehouse: A Business Analysis Framework

n Four views related to the design of a data warehouse

n Top-down view: allows selection of the relevant information necessary for the data warehouse. This view matches current and future business needs.

n Data source view: exposes the information being captured,

stored, and managed by operational systems. Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship model

n Data warehouse view: consists of fact tables and dimension tables

n Business query view: the data perspective in the data warehouse from the end-user¡¯s viewpoint

Data Warehouse Design Process

n Three different approaches

n Top-down: Starts with overall design and planning (mature) n Bottom-up: Starts with experiments and prototypes (rapid) n Combination of both

n From software engineering point of view

n Waterfall: structured and systematic analysis at each step

before proceeding to the next

n https://en.wikipedia.org/wiki/Waterfall_model

n Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around

n https://en.wikipedia.org/wiki/Spiral_model

Data Warehouse Design Process

n Typically, data warehouse design process includes the following steps:

n Choose a business process to model, e.g., orders, invoices n Choose the grain (atomic level of data) of the business

n E.g., individual transactions, individual daily snapshots

n Choose the dimensions that will apply to each fact table record

n Choose facts/measures that will populate each fact table record

Data Warehouse Development: A Recommended Approach

Multi-Tier Data Warehouse

Distributed Data Marts

Enterprise Data Warehouse

Model refinement Model refinement

Define a high-level corporate data model

Data Warehouse Usage

n Three kinds of data warehouse applications n Information processing

n supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

n Analytical processing

n multidimensional analysis of data warehouse data

n supports basic OLAP operations, slice-dice, drilling, pivoting

n Data mining

n knowledge discovery

n supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)

n Online Analytical Mining (OLAM) = OLAP + Data Mining n OLAM is becoming popular because:

n High quality of data in data warehouses

n DW contains integrated, consistent, cleaned data

n Available information processing structure surrounding data warehouses

n Web accessing, service facilities, reporting and OLAP tools n OLAP-based exploratory data analysis

n Mining with drilling, dicing, pivoting, etc. n On-line selection of data mining functions

n Integration and swapping of multiple mining functions, algorithms, and tasks

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com