Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
- Data Quality
- Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary
Data Reduction Strategies
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
- Why data reduction? A database/data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set.
- Data reduction strategies
  - Dimensionality reduction (e.g., remove unimportant attributes)
    - Wavelet transforms
    - Principal Components Analysis (PCA)
    - Attribute subset selection
  - Numerosity reduction (some simply call it "data reduction")
    - Regression and log-linear models, histograms, clustering, sampling
  - Data cube aggregation
  - Data compression
Data Reduction 1: Dimensionality Reduction – Attribute Subset Selection
- Attribute subset selection: reduces the data set size by removing irrelevant or redundant attributes (or dimensions)
- Redundant attributes
  - Duplicate much or all of the information contained in one or more other attributes
  - E.g., the purchase price of a product and the amount of sales tax paid
- Irrelevant attributes
  - Contain no information that is useful for the data mining task at hand
  - E.g., students' IDs are often irrelevant to the task of predicting students' GPAs
Data Reduction 1: Dimensionality Reduction – Attribute Subset Selection
- Goal: find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the distribution obtained using all attributes.
- For n attributes, there are 2^n possible attribute subsets.
- Exhaustive search: searching exhaustively for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase.
- Heuristic methods: these explore a reduced search space and are commonly used for attribute subset selection.
  - They are typically greedy: while searching through attribute space, they always make what looks to be the best choice at the time.
Heuristic Search in Attribute Selection
- Typical heuristic attribute selection methods:
  - Stepwise forward selection: starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.
  - Stepwise backward elimination: starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
    - Note that determining the best/worst attribute is not easy; this will be discussed later.
  - Combined forward selection and backward elimination: the two methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
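The greedy forward-selection loop above can be sketched in a few lines of Python. The `score` function here is a hypothetical stand-in for whatever attribute-quality measure is chosen (e.g., information gain or a wrapper classifier's accuracy); determining a good score is the hard part noted above.

```python
# Sketch of stepwise forward selection. `score` is a caller-supplied,
# hypothetical function that rates an attribute subset; higher is better.
def forward_selection(attributes, score, k):
    """Greedily pick k attributes, adding the best remaining one each step."""
    selected = []
    remaining = list(attributes)
    for _ in range(k):
        # Choose the attribute whose addition maximizes the subset score.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts the score least.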
Heuristic Search in Attribute Selection
- Decision tree induction: decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
  - Decision tree induction constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
  - At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
- When decision tree induction is used for attribute subset selection, a tree is constructed from the given data.
  - All attributes that do not appear in the tree are assumed to be irrelevant.
  - The set of attributes appearing in the tree forms the reduced subset of attributes.
Data Reduction 2: Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods
  - Assume the data fit some model; estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - E.g., regression
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling, ...
Parametric Data Reduction: Regression
- Linear regression
  - Data are modeled to fit a straight line: Y = wX + b
  - Two regression coefficients, w and b, specify the line and are estimated from the data at hand
  - Often solved by applying the least-squares method to the known values of Y1, Y2, ..., X1, X2, ...
  - Once the model is available, the model is kept and the original data are discarded.
- Multiple regression
  - Allows a response variable Y to be modeled as a linear function of two or more predictor variables: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
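A minimal sketch of the least-squares fit for simple linear regression, using the closed-form estimates for w and b on plain Python lists. After fitting, only the two coefficients need to be stored; the data can be discarded.

```python
# Least-squares fit of Y = w*X + b in pure Python (a sketch of the
# parametric-reduction idea: keep only w and b, discard the data).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n          # mean of X
    my = sum(ys) / n          # mean of Y
    # Closed-form least-squares estimates: w = cov(X, Y) / var(X).
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b
```

Multiple regression generalizes this to a vector of coefficients, typically solved with linear algebra rather than these scalar formulas.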
Histograms
- Histograms can be used to approximate data distributions.
- Histograms were introduced in Section 2.2.3.
- A histogram for an attribute, X, partitions the data distribution of X into disjoint subsets, referred to as buckets or bins.
- Approximation at different levels results in different levels of data reduction.
- For example: a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
  - A basic histogram for these data uses one bucket per distinct price (figure omitted).
  - To further reduce the data, it is common to have each bucket denote a continuous value range, e.g., each bucket representing a different $10 range for price.
  - In this scenario, other statistical quantities, such as the bucket mean, could be used to help approximate the data distribution.
How are the buckets determined and the attribute values partitioned?
- There are several partitioning rules, including the following:
  - Equal-width: the width of each bucket range is uniform (e.g., a width of $10 for each bucket in the example above).
  - Equal-depth (or equal-frequency): the buckets are created so that the frequency of each bucket is roughly constant (i.e., each bucket contains roughly the same number of contiguous data samples).
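The two partitioning rules can be sketched on the price list from the example above; `equal_width_counts` and `equal_depth_buckets` are hypothetical helper names, not textbook functions.

```python
# Equal-width vs. equal-depth bucketing of the AllElectronics price list.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25,
          25, 25, 28, 28, 30, 30, 30]

def equal_width_counts(values, nbins):
    """Count values per bucket when the value range is split evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        # Clamp the maximum value into the last bucket.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    return counts

def equal_depth_buckets(values, nbins):
    """Split the sorted values into buckets of (roughly) equal frequency."""
    values = sorted(values)
    size = len(values) // nbins
    return [values[i * size:(i + 1) * size] for i in range(nbins)]
```

With three equal-width buckets the 52 prices land unevenly (13, 25, and 14 values), while four equal-depth buckets hold 13 values each, which illustrates why equal-depth handles skew better.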
Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- The original data are discarded
- Can be very effective if the data are clustered, but not if the data are "smeared"
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in detail in Chapter 10
Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data
- Key principle: choose a representative subset of the data
- Simple random sampling may perform very poorly in the presence of skew
Types of Sampling
Suppose that a large data set, D , contains N tuples.
- Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to be sampled.
- Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
- Cluster sample: if the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of s clusters can be obtained, where s < M.
  - For example, tuples in a database are usually retrieved a page at a time, so each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
- Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed.
  - For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
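SRSWOR and SRSWR map directly onto the standard library's `random.sample` and `random.choices`. A minimal sketch (the seed is fixed only to make the sketch repeatable):

```python
import random

# SRSWOR vs. SRSWR on a stand-in data set of N = 100 "tuples".
data = list(range(100))
rng = random.Random(42)          # fixed seed for repeatability

srswor = rng.sample(data, 10)    # without replacement: 10 distinct tuples
srswr = rng.choices(data, k=10)  # with replacement: duplicates possible
```

A stratified sample could be sketched the same way by calling `sample` once per stratum instead of once on the whole data set.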
Data Reduction 3: Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless, but only limited manipulation is possible
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Data Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values
- Attribute/feature construction: new attributes constructed from the given ones to help the mining process (e.g., "area" could be generated from "width" and "height")
- Aggregation: summarization (e.g., daily sales => monthly totals)
- Normalization: scaled to fall within a specified range
- Discretization (for numeric data): raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
- Concept hierarchy generation (for numeric and nominal data): attributes such as street can be generalized to higher-level concepts, like city.
Normalization
- Min-max normalization: to [new_minA, new_maxA]
  v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA
- Ex. Suppose that the income range $12,000 to $98,000 is normalized to [0.0, 1.0]. Then $73,600 is mapped to:
  (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
- Min-max normalization preserves the relationships among the original data values.
- It will encounter an "out-of-bounds" error if a future input for normalization falls outside of the original data range.
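The min-max mapping can be sketched as a one-line function; plugging in the income example above reproduces the 0.716 figure.

```python
# Sketch of min-max normalization: linearly rescale v from [lo, hi]
# to [new_lo, new_hi].
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

# $73,600 in the $12,000-$98,000 income range maps to about 0.716.
```

A production version would also need to decide how to handle future values outside [lo, hi], per the out-of-bounds caveat above.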
- Z-score normalization (μ: mean, σ: standard deviation of attribute A):
  v' = (v - μ) / σ
- Ex. Let μ = 54,000 and σ = 16,000; then $73,600 is mapped to:
  (73,600 - 54,000) / 16,000 = 1.225
- This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
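The z-score mapping is equally small; with the example's μ = 54,000 and σ = 16,000, the $73,600 value maps to exactly 1.225.

```python
# Sketch of z-score normalization: map v to its signed distance from
# the mean, measured in standard deviations.
def z_score(v, mu, sigma):
    return (v - mu) / sigma
```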
- Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A.
  - The number of decimal places moved depends on the maximum absolute value of A.
  - A value, v, is normalized to v' by computing:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Discretization
- Three types of attributes
  - Nominal — values from an unordered set, e.g., color, profession
  - Ordinal — values from an ordered set, e.g., military or academic rank
  - Numeric — real numbers, e.g., integer or real values
- Discretization (for numeric data): divide the range of a continuous numeric attribute into intervals
  - Options:
    - Supervised (i.e., the class label is utilized) vs. unsupervised (no class label)
    - Split (top-down) vs. merge (bottom-up)
  - Interval labels can then be used to replace actual data values
  - Reduces data size
  - Can be performed recursively on an attribute
  - Prepares data for further analysis, e.g., classification
Data Discretization Methods
- Typical methods (all can be applied recursively):
  - Binning (discussed in Section 3.2.2): top-down split, unsupervised
    - Note that we learned how to use equal-frequency binning to remove noise; binning can also be based on equal-width bins.
  - Clustering analysis (unsupervised, top-down split or bottom-up merge)
  - Decision tree analysis (supervised, top-down split)
  - Correlation (e.g., χ2) analysis (supervised, bottom-up merge)
Discretization by Binning
- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N.
  - The most straightforward method, but outliers may dominate the presentation
  - Skewed data are not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
Discretization by Clustering
- K-means clustering can lead to better results than equal-width or equal-depth binning: data in the same bin are more similar to each other.
Discretization by Classification & Correlation Analysis
- Decision tree analysis: top-down, recursive split; details in Chapter 7
  - The generated tree is used to discretize the data
- Correlation analysis: bottom-up merge
  - Chi-merge: an example of a correlation-based discretization method
    - Initially, each distinct value of a numeric attribute A is considered to be one interval.
    - χ2 tests are performed for every pair of adjacent intervals.
    - Adjacent intervals with the least χ2 values are merged, because low χ2 values for a pair indicate similar class distributions.
    - Merging proceeds recursively until a predefined stopping condition is met.
    - Details of chi-merge can be found here: https://www.youtube.com/watch?v=hEeIpeJeqAU
Concept Hierarchy Generation for Nominal Data
- A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse (data warehouses will be discussed in Chapter 4)
- Concept hierarchies facilitate drilling and rolling in data warehouses, to view data at multiple granularities
- Concept hierarchy formation: recursively reduce the data by replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
- Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
- Concept hierarchies can be formed automatically for both numeric and nominal data; this chapter focuses on concept hierarchy generation for nominal data.
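Looking back at chi-merge: the pairwise χ2 statistic it computes for two adjacent intervals can be sketched as follows. This is a minimal pure-Python version operating on a 2 × k contingency table of class counts; the interval bookkeeping and stopping condition are omitted.

```python
# Sketch of the chi-square test chi-merge applies to a pair of adjacent
# intervals. counts[i][c] = number of samples of class c in interval i.
def chi2_adjacent(counts):
    rows = [sum(r) for r in counts]        # per-interval totals
    cols = [sum(c) for c in zip(*counts)]  # per-class totals
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

Two intervals with identical class distributions score 0, so they are the first candidates to be merged; intervals with disjoint class distributions score high and stay separate.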
Concept Hierarchy Generation
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - Concept hierarchies for nominal attributes typically involve a group of attributes.
  - A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
  - For example, a relational database may contain the following group of attributes: street < city < province < country
- Specification of a portion of a hierarchy for a set of values by explicit data grouping
  - This is essentially the manual definition of a portion of a concept hierarchy.
  - In a large database, it is often unrealistic to define an entire concept hierarchy manually.
  - Instead, a user can specify explicit groupings for a small portion of intermediate-level data.
  - For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as "Alberta < prairies_Canada" and "Saskatchewan < prairies_Canada".
- Specification of a set of attributes, but not of their partial ordering
  - A user may specify that a set of attributes forms a concept hierarchy, but omit their partial ordering.
  - The system can then try to generate the attribute ordering automatically, so as to construct a meaningful concept hierarchy.
  - Higher-level concepts generally cover several subordinate lower-level concepts. Consequently, an attribute defining a high concept level (e.g., country) will usually contain fewer distinct values than an attribute defining a lower concept level (e.g., street).
  - Based on this observation, a concept hierarchy can be generated automatically according to the number of distinct values per attribute in the given attribute set.
  - The attribute with the most distinct values is placed at the lowest hierarchy level; the fewer distinct values an attribute has, the higher it sits in the generated hierarchy.
- Example: the AllElectronics database has a set of location attributes (street, country, province, city), but does not specify a hierarchical ordering among them.
  - A concept hierarchy for location can be generated automatically:
    - Step 1: Sort the attributes in ascending order of the number of distinct values: country (15), province or state (365), city (3567), street (674,339).
    - Step 2: Generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last at the bottom: country (15 distinct values) > province (365) > city (3567) > street (674,339).
    - Step 3: The user can examine the generated hierarchy and, when necessary, modify it to reflect the desired semantic relationships among the attributes.
- Specification of only a partial set of attributes
  - Sometimes a user can be careless when defining a hierarchy, or have only a vague idea of what should be included, and may therefore include only a small subset of the relevant attributes.
  - For example, instead of including all the hierarchically relevant attributes for location, the user may have specified only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together.
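The automatic ordering by distinct-value counts can be sketched in one line, using the AllElectronics counts from the example above.

```python
# Sketch: order attributes into a concept hierarchy by their number of
# distinct values (fewest distinct values = top level).
distinct_counts = {
    "street": 674_339,
    "country": 15,
    "province": 365,
    "city": 3_567,
}

def hierarchy_top_down(counts):
    """Return attributes ordered from the top level to the bottom level."""
    return sorted(counts, key=counts.get)
```

As Step 3 above notes, the generated ordering is only a heuristic and may need manual correction (e.g., a `year < month < day` hierarchy would be inverted by this rule, since year has the fewest distinct values but days are not "inside" months in count order alone).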
Summary
- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing values, noisy values
- Data integration from multiple sources:
  - Entity identification problem
  - Removing redundancies
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Discretization
  - Concept hierarchy generation