Data Management

  • Data components
    • Data objects
    • Attributes
    • Values

Description

  • Types
    • Tabular
      • Categorical: Qualitative
        • Nominal
        • Ordinal
      • Numerical: Quantitative
        • Discrete
        • Continuous
    • Text
    • Time Series
    • Image
    • Network
  • A statistic (e.g., the sample mean) computed from the sample is an estimate of a parameter (e.g., the population mean) of the population.
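
A minimal sketch of the statistic-vs-parameter idea: the population and its size here are made up for illustration, and the sample mean is used as the estimating statistic.

```python
import random

random.seed(0)

# Hypothetical population: 100,000 values drawn around a known center.
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = sum(population) / len(population)    # population parameter

# A statistic computed on a random sample estimates that parameter.
sample = random.sample(population, 1_000)
x_bar = sum(sample) / len(sample)         # sample statistic

print(round(mu, 2), round(x_bar, 2))      # the two values should be close
```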

Pre-processing

  • Replace missing values
    • Substitute missing values with a dummy value or the mean
    • Substitute missing values with the most frequent value
  • Reduce Data
    • Attribute Selection (select the most useful attributes/variables)
    • Remove outliers
    • Record sampling (randomly, or by defined rules)
  • Create new features from existing ones
  • Discretize data
  • Data normalization
    • Min-max scaling (normalization): rescale values into a fixed range, typically [0, 1]
    • Standardization: rescale to zero mean and unit variance (z-scores)
  • Correlation/Covariance analysis
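
A sketch of three of the steps above on a toy column, using only the standard library; the values and the choice of mean imputation are illustrative assumptions.

```python
import statistics

# Toy column with one missing value (None).
raw = [12.0, 15.0, None, 20.0, 18.0]

# 1. Replace missing values with the mean of the observed values.
observed = [v for v in raw if v is not None]
mean = statistics.mean(observed)
filled = [v if v is not None else mean for v in raw]

# 2. Min-max scaling: rescale into [0, 1].
lo, hi = min(filled), max(filled)
minmax = [(v - lo) / (hi - lo) for v in filled]

# 3. Standardization: zero mean, unit variance (z-scores).
mu, sigma = statistics.mean(filled), statistics.pstdev(filled)
zscores = [(v - mu) / sigma for v in filled]
```

Note that min-max scaling is sensitive to outliers (a single extreme value compresses everything else toward one end), whereas standardization spreads values by how many standard deviations they sit from the mean.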

Reduction

  • “Curse of Dimensionality”: the number of training points needed grows exponentially with the number of dimensions.
  • High dimensionality causes sparsity, while good models need to cover as many regions as possible.
  • Numerosity Reduction
    • Simple random sampling
    • Stratified sampling for sparse or imbalanced datasets
  • Dimensionality Reduction
    • Feature selection (Heuristic search)
      • Remove redundant attributes
      • Remove irrelevant attributes
      • Methods
        • Best single attribute under an independence assumption
        • Forward step-wise selection (addition)
        • Backward step-wise selection (elimination)
    • (Latent) Feature extraction
      • Principal Component Analysis (PCA): project the data onto the eigenvectors of its covariance matrix with the largest eigenvalues.
      • Singular Value Decomposition (SVD): factorize the data matrix; keeping only the top singular values/vectors reduces the number of attributes.
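
A minimal PCA-via-SVD sketch: the toy data, its dominant direction, and the choice of keeping one component are all illustrative assumptions. Centering the data and taking the SVD yields the principal axes as the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 3-D that mostly vary along one direction.
direction = np.array([1.0, 2.0, 0.5])
X = rng.normal(size=(200, 1)) * direction + rng.normal(scale=0.1, size=(200, 3))

# Center, then factorize; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top k principal components (3-D -> 1-D here).
k = 1
X_reduced = Xc @ Vt[:k].T

# Fraction of total variance captured by each component.
explained_variance_ratio = (S**2) / (S**2).sum()
```

Because the toy data varies almost entirely along one axis, the first component should capture nearly all of the variance, which is exactly the situation where dimensionality reduction loses little information.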

Visualization

See visualization.

Validation

  • Data Quality
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
  • Frequent Tests
    • Null Test
    • Distribution Test
    • Volume Test
    • Uniqueness Test
    • Correlation Analysis
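
Minimal sketches of three of the tests above on a toy table; the column names and thresholds are illustrative assumptions, not a fixed standard.

```python
# Toy table: a list of row dicts with one missing 'age'.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 41},
    {"id": 3, "age": None},
]

# Null test: at most 40% of 'age' values may be missing (threshold assumed).
null_ratio = sum(r["age"] is None for r in rows) / len(rows)
null_ok = null_ratio <= 0.40

# Uniqueness test: 'id' must be unique across all rows.
ids = [r["id"] for r in rows]
unique_ok = len(ids) == len(set(ids))

# Volume test: the table must contain at least a minimum number of rows.
volume_ok = len(rows) >= 3
```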