Data Management
- Data components
- Data objects
- Attributes
- Values
Description
- Types
- Tabular
- Categorical: Qualitative
- Nominal
- Ordinal
- Numerical: Quantitative
- Discrete
                - Continuous
- Text
- Time Series
- Image
- Network
- A statistic (e.g., the sample mean x̄) computed from the sample is an estimate of a parameter (e.g., the population mean μ) of the population.
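A minimal sketch of the statistic/parameter distinction, assuming a simulated Gaussian population so the parameter is actually known:

```python
import random
import statistics

random.seed(0)

# Simulated population with assumed parameters (mean 50, stdev 10).
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = statistics.mean(population)  # population parameter

# A statistic computed on a random sample estimates that parameter.
sample = random.sample(population, 1_000)
x_bar = statistics.mean(sample)   # sample statistic

print(f"parameter mu ≈ {mu:.2f}, statistic x̄ ≈ {x_bar:.2f}")
```

With a sample of 1,000, the standard error of x̄ is roughly 10/√1000 ≈ 0.32, so the estimate lands close to the parameter.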
Pre-processing
- Replace missing values
    - Substitute missing values with a dummy value or the mean
- Substitute missing values with the most frequent values
- Reduce Data
- Attribute Selection (select the most useful attributes/variables)
- Remove outliers
- Record sampling (randomly, or by defined rules)
- Create new features from existing ones
- Discretize data
- Data normalization
- min-max scaling (normalization)
    - Standardization (z-scores; less distorted by outliers than min-max scaling)
- Correlation/Covariance analysis
Reduction
- “Curse of Dimensionality”: the number of training points needed grows exponentially with the dimensionality.
- High dimensionality causes sparsity, yet good models need training data that covers as much of the space as possible.
- Numerosity Reduction
    - Simple random sampling
    - Use stratified sampling for sparse datasets.
- Dimensionality Reduction
- Feature selection (Heuristic search)
- Remove redundant attributes
- Remove irrelevant attributes
- Methods
            - Best single attribute under the independence assumption
- Forward step-wise selection (addition)
- Backward step-wise selection (elimination)
- (Latent) Feature extraction
        - Principal Component Analysis (PCA): project onto the top eigenvectors of the covariance matrix.
        - Singular Value Decomposition (SVD): low-rank approximation of the data matrix.
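PCA on 2-D data can be worked out by hand, since the eigen-decomposition of a 2x2 symmetric covariance matrix has a closed form. A sketch with an assumed toy dataset:

```python
import math
import statistics

# Toy 2-D dataset (assumed values) with strong correlation between x and y.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# Center the data, then form the 2x2 covariance matrix [[a, b], [b, c]].
mx, my = statistics.mean(xs), statistics.mean(ys)
cx = [x - mx for x in xs]
cy = [y - my for y in ys]
n = len(xs)
a = sum(u * u for u in cx) / (n - 1)              # var(x)
b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)  # cov(x, y)
c = sum(v * v for v in cy) / (n - 1)              # var(y)

# Eigenvalues of a 2x2 symmetric matrix in closed form.
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# First principal component = unit eigenvector of the larger eigenvalue.
vx, vy = lam1 - c, b
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Project onto PC1: a 1-D representation preserving most of the variance.
projected = [u * pc1[0] + v * pc1[1] for u, v in zip(cx, cy)]
print(f"explained variance: {lam1 / (lam1 + lam2):.1%}")
```

The eigenvalue ratio tells you how much variance the single retained component explains; on correlated data like this it is well above 90%.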
Visualization
See visualization.
Validation
- Data Quality
- Accuracy
- Completeness
- Consistency
    - Timeliness
- Frequent Tests
- Null Test
- Distribution Test
- Volume Test
- Uniqueness Test
- Correlation Analysis
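Some of the frequent tests above can be sketched as simple checks over a table of records; the table, column names, and thresholds here are all assumptions:

```python
# Toy table of records (all names and values are assumptions).
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "b@example.com", "age": 29},
    {"id": 3, "email": None,            "age": 41},
]

def null_test(rows, column):
    """Fraction of records where `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def volume_test(rows, expected, tolerance=0.1):
    """Row count within `tolerance` of the expected volume."""
    return abs(len(rows) - expected) <= expected * tolerance

def uniqueness_test(rows, column):
    """No duplicate values in a column that should be a key."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

print(null_test(rows, "email"))       # fraction of missing emails
print(volume_test(rows, expected=3))  # True
print(uniqueness_test(rows, "id"))    # True
```

In practice these checks run against each new batch of data, and a failed test blocks the load or raises an alert rather than silently propagating bad records.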