Data Management

  • Data components
    • Data objects
    • Attributes
    • Values

Description

  • Types
    • Tabular
      • Categorical: Qualitative
        • Nominal
        • Ordinal
      • Numerical: Quantitative
        • Discrete
        • Continuous
    • Text
    • Time Series
    • Image
    • Network
  • A statistic (e.g., the sample mean) computed from the sample is an estimate of a parameter (e.g., the population mean) of the population.
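
A minimal sketch of the statistic-vs-parameter idea: the population and its size here are made up for illustration, and the sample mean is used as the estimating statistic.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 values drawn around a known center.
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = sum(population) / len(population)    # population parameter

# A statistic computed on a random sample estimates that parameter.
sample = random.sample(population, 1_000)
x_bar = sum(sample) / len(sample)         # sample statistic

print(round(mu, 2), round(x_bar, 2))      # the two values should be close
```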

Pre-processing

  • Replace missing values
    • Substitute missing values with a dummy value or the mean
    • Substitute missing values with the most frequent value
  • Reduce Data
    • Attribute Selection (select the most useful attributes/variables)
    • Remove outliers
    • Record sampling (randomly, or by defined rules)
  • Create new features from existing ones
  • Discretize data
  • Data normalization
    • Min-max scaling (normalization): rescale values into a fixed range, typically [0, 1]
    • Standardization: rescale to zero mean and unit variance (z-scores)
  • Correlation/Covariance analysis
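
A sketch of three of the steps above on a toy column, using only the standard library; the values and the choice of mean imputation are illustrative assumptions.

```python
import statistics

# Toy column with one missing value (None).
raw = [12.0, 15.0, None, 20.0, 18.0]

# 1. Replace missing values with the mean of the observed values.
observed = [v for v in raw if v is not None]
mean = statistics.mean(observed)
filled = [v if v is not None else mean for v in raw]

# 2. Min-max scaling: rescale into [0, 1].
lo, hi = min(filled), max(filled)
minmax = [(v - lo) / (hi - lo) for v in filled]

# 3. Standardization: zero mean, unit variance (z-scores).
mu, sigma = statistics.mean(filled), statistics.pstdev(filled)
zscores = [(v - mu) / sigma for v in filled]
```

Note that min-max scaling is sensitive to outliers (a single extreme value compresses everything else toward one end), whereas standardization spreads values by how many standard deviations they sit from the mean.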

Reduction

  • “Curse of Dimensionality”: the number of training points needed grows exponentially with the number of dimensions.
  • High dimensionality causes sparsity, while good models need to cover as many regions as possible.
  • Numerosity Reduction
    • Simple random sampling
    • Stratified sampling for sparse or imbalanced datasets
  • Dimensionality Reduction
    • Feature selection (Heuristic search)
      • Remove redundant attributes
      • Remove irrelevant attributes
      • Methods
        • Best single attribute under an independence assumption
        • Forward step-wise selection (addition)
        • Backward step-wise selection (elimination)
    • (Latent) Feature extraction
      • Principal Component Analysis (PCA): project the data onto the eigenvectors of its covariance matrix with the largest eigenvalues.
      • Singular Value Decomposition (SVD): factorize the data matrix; keeping only the top singular values/vectors reduces the number of attributes.
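
A minimal PCA-via-SVD sketch: the toy data, its dominant direction, and the choice of keeping one component are all illustrative assumptions. Centering the data and taking the SVD yields the principal axes as the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 3-D that mostly vary along one direction.
direction = np.array([1.0, 2.0, 0.5])
X = rng.normal(size=(200, 1)) * direction + rng.normal(scale=0.1, size=(200, 3))

# Center, then factorize; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top k principal components (3-D -> 1-D here).
k = 1
X_reduced = Xc @ Vt[:k].T

# Fraction of total variance captured by each component.
explained_variance_ratio = (S**2) / (S**2).sum()
```

Because the toy data varies almost entirely along one axis, the first component should capture nearly all of the variance, which is exactly the situation where dimensionality reduction loses little information.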

Visualization

See visualization.

Validation

  • Data Quality
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
  • Frequent Tests
    • Null Test
    • Distribution Test
    • Volume Test
    • Uniqueness Test
    • Correlation Analysis
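
Minimal sketches of three of the tests above on a toy table; the column names and thresholds are illustrative assumptions, not a fixed standard.

```python
# Toy table: a list of row dicts with one missing 'age'.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 41},
    {"id": 3, "age": None},
]

# Null test: at most 40% of 'age' values may be missing (threshold assumed).
null_ratio = sum(r["age"] is None for r in rows) / len(rows)
null_ok = null_ratio <= 0.40

# Uniqueness test: 'id' must be unique across all rows.
ids = [r["id"] for r in rows]
unique_ok = len(ids) == len(set(ids))

# Volume test: the table must contain at least a minimum number of rows.
volume_ok = len(rows) >= 3
```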