Data Mining: Concepts and Techniques

— Chapter 5 —

Qiang (Chan) Ye Faculty of Computer Science Dalhousie University University

Copyright By cscodehelp代写 加微信 cscodehelp

Chapter 5: Data Cube Technology

n Data Cube Computation: Preliminary Concepts n Data Cube Computation Methods

Data Cube Computation

Although the data cube concept was originally intended for OLAP, it is also useful for data mining.

Multidimensional data mining is an approach to data mining that integrates OLAP-based data analysis with knowledge discovery techniques.

It is also known as exploratory multidimensional data mining or online analytical mining (OLAM).

In this chapter, we study methods for data cube computation and methods for multidimensional data analysis.

Cube Materialization

As mentioned before, a data cube refers to a lattice of cuboids. 4

Cube Materialization

Here is the figure corresponding to the 3-D cuboid involving time, item and location.

Cube Materialization

A cell in the base cuboid is a base cell.

A cell from a non-base cuboid is an aggregate cell.

An aggregate cell aggregates over one or more dimensions, where each aggregated dimension is indicated by a * in the cell notation.

Suppose we have an n-dimensional data cube: Leta=(a1,a2,…,an,):measurebeacellfromoneofthe

cuboids making up the data cube.

We say that a is an m-dimensional cell (i.e., from an m- dimensional cuboid) if exactly m (m

n This not only saves processing time and disk space, but also leads to a more focused analysis.

n The cells that cannot pass the threshold are likely to be too trivial to warrant further analysis.

n Such partially-materialized cubes are known as iceberg cubes.

n The minimum threshold is called the minimum support threshold, or minimum support (min_sup), for short.

Iceberg Cube, Closed Cube & Cube Shell

n An iceberg cube can be specified with an query:

compute cube sales_iceberg as

select month, city, customer_group, count(*) from salesInfo

cube by month, city, customer_group

having count(*) >= min_support

§ Computing only the cuboid cells whose measure satisfies the iceberg condition

§ Only a small portion of cells may be “above the water’’ in a sparse cube

iceberg condition

Iceberg Cube, Closed Cube & Cube Shell

n An intuitive approach to computing an iceberg cube would be to first compute the full cube and then prune the cells that do not satisfy the iceberg condition.

n However, this is still prohibitively expensive.

n An efficient approach is to compute only the iceberg cube

directly without computing the full cube.

n Sections 5.2.2 and 5.2.3 discuss the methods for efficient iceberg cube computation.

Iceberg Cube, Closed Cube & Cube Shell

n With iceberg cube, we could still end up with a large number of uninteresting cells to compute.

n For example, there are only 2 base cells for a data cube of 100 dimensions, denoted as {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}, where each has a cell count of 10 (i.e. measure value=10).

n How many distinct aggregate cells will the iceberg cube have if the iceberg condition is “measure >= 10”?

n There are (2101 – 6) distinct aggregate cells (details in next slide), suchas(a1,a2,a3 …,a99,*):10and(a1,a2,*,a4 …,a100):10, but most of them do not contain much new information.

n If we ignore all the aggregate cells that can be obtained by replacing some constants with * while keeping the same measure value, there are only three distinct cells left: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10, (a1, a2, * . . . , *):20}.

Iceberg Cube, Closed Cube & Cube Shell

n In total, there are 2100 + 2100 = 2101 distinct cells.

n There are 100 dimensions, therefore, there are 2100 combinations of

dimensions.

n (a1, a2, a3 . . . , a100):10 leads to 2100 distinct cells. n (a1, a2, b3, . . . , b100):10 leads to 2100 distinct cells. n The total number is 2100 + 2100 = 2101.

n Among the distinct cells, there are 2 base cells and 4 duplicate aggregate cells: n The following two cells are the base cells themselves: (a1, a2, a3 . . . ,

a100):10, (a1, a2, b3, . . . , b100):10

n (a1, a2, a3 . . . , a100):10 leads to the following four aggregate cells: (a1, a2, * . . . , *):10, (*, a2, * . . . , *):10, (a1, *, * . . . , *):10, (*, *, * . . . , *):10.

n (a1,a2,b3,…,b100):10leadstothesameaggregatecells:(a1,a2,*…, *):10, (*, a2, * . . . , *):10, (a1, *, * . . . , *):10, (*, *, * . . . , *):10.

n Therefore, there are 2101 – 2 – 4 = 2101 – 6 distinct aggregate cells.

Iceberg Cube, Closed Cube & Cube Shell

n Closed cell: To systematically compress a data cube, we need to introduce the concept of closed cell. A cell, c, is a closed cell if:

n There does not exist a cell, d, so that:

n d is a specialization (descendant) of cell c (i.e., where d is obtained by

replacing * in c with a non-* value)

n In addition, d has the same measure value as the measure value of c.

n Closed cube: A closed cube is a data cube consisting of only closed cells.

n For instance, the cells (i.e. three cells) derived in the preceding example are three closed cells of the data cube whose base cells are(a1,a2,a3 …,a100):10and(a1,a2,b3,…,b100):10.

n The closed cells form the lattice of a closed cube as shown in Figure 5.2 (which is in next slide).

n Other non-closed cells can be derived from their corresponding closed cells in this lattice.

Iceberg Cube, Closed Cube & Cube Shell

Iceberg Cube, Closed Cube & Cube Shell

n Another strategy for partial materialization is to precompute only the cuboids involving a small number of dimensions, such as three to five.

n These cuboids form a cube shell for the corresponding data cube.

n Queries on additional combinations of the dimensions will have to be computed on-the-fly.

n For example, we could compute all cuboids with three dimensions or less in an n-dimensional data cube, resulting in a cube shell of size 3.

n This, however, can still result in a large number of cuboids to compute, particularly when n is large.

Iceberg Cube, Closed Cube & Cube Shell

n Alternatively, we can choose to precompute only portions or fragments of the cube shell that interest us.

n Section 5.2.4 discusses a method for computing shell fragments and explores how they can be used for efficient OLAP query processing.

General Strategies for Data Cube Computation

n In general, there are two basic data structures used for storing cuboids.

n The implementation of relational OLAP (ROLAP) uses relational tables.

n Multidimensional arrays are used in multidimensional OLAP (MOLAP).

n Although ROLAP and MOLAP may involve different cube computation techniques, some optimization “tricks” can be shared among the different data representations.

General Strategies for Data Cube Computation

n Optimization Technique 1: Sorting, hashing, and grouping. Sorting, hashing, and grouping operations should be applied to the dimension attributes to reorder and cluster related tuples.

n Optimization Technique 2: Simultaneous aggregation and caching of intermediate results. In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates, rather than from the base fact table. Moreover, simultaneous aggregation from cached intermediate computation results may lead to the reduction of expensive disk input/output (I/O) operations.

General Strategies for Data Cube Computation

n Optimization Technique 3: Aggregation from the smallest child when there exist multiple child cuboids.

n When there exist multiple child cuboids, it is usually more efficient to compute the desired parent (i.e., more generalized) cuboid from the smallest, previously computed child cuboid.

n To compute a sales cuboid, Cbranch, when there exist two previously computed cuboids, C{branch,year} and C{branch,item}, for example, it is obviously more efficient to compute Cbranch from the C{branch,year} than from the latter if there are many more distinct items than distinct years.

General Strategies for Data Cube Computation

n Optimization Technique 4: The Apriori pruning method can be explored to compute iceberg cubes efficiently.

n The Apriori property, in the context of data cubes, states as follows: If a given cell does not satisfy minimum support, then no descendant of the cell (i.e., more specialized cell) will satisfy minimum support either.

n This property can be used to substantially reduce the computation of iceberg cubes.

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com