Columnar Data Structures

Benefits of Columnar Storage

  • Efficiency: a query reads only the columns it references and skips the rest, so scans and aggregations are fast and throughput is high.
  • Compression and encoding: each column holds a single data type, so type-specific encodings and compression schemes work well, which reduces storage space.
  • Potentially lower cost: some cloud services charge by the amount of data stored or scanned per query.
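
To make the encoding point concrete, here is a minimal Python sketch. The sample records, the column names, and the run_length_encode helper are all hypothetical and chosen for illustration; run-length encoding is just one simple example of an encoding that benefits from a column holding a single type of data, not a description of any particular file format.

```python
from itertools import groupby

# Hypothetical sample data in a record-oriented (row) layout.
rows = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "US", "amount": 20},
    {"user": "c", "country": "DE", "amount": 5},
]

# Column-oriented layout: one homogeneous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# The aggregation touches only the "amount" column.
total = sum(columns["amount"])

# Repeated neighbouring values in one column compress into short runs.
encoded = run_length_encode(columns["country"])
```

Here total is 35 and encoded is [("US", 2), ("DE", 1)]. In the row layout, the repeated country values are separated by other fields, so the same runs would not be adjacent.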

This approach works best for queries that read only a few columns from a large table. Parquet can read only the columns a query needs, which greatly reduces I/O.
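
The I/O saving can be approximated with a small Python sketch that uses only the standard library. It is a simplified model and not Parquet's actual on-disk format: the table, the fixed 8-byte field widths, and the function names are assumptions made for illustration.

```python
import struct

# Hypothetical table of (id, price, quantity) records.
records = [(1, 9.5, 3), (2, 4.0, 1), (3, 12.25, 7)]

# Little-endian int64, float64, int64 with no padding: 24 bytes per record.
ROW_FORMAT = "<qdq"

# Row-oriented: the fields of each record are stored next to each other.
row_bytes = b"".join(struct.pack(ROW_FORMAT, *record) for record in records)

# Column-oriented: all values of one column are stored contiguously.
ids, prices, quantities = zip(*records)
column_bytes = {
    "id": struct.pack(f"<{len(ids)}q", *ids),
    "price": struct.pack(f"<{len(prices)}d", *prices),
    "quantity": struct.pack(f"<{len(quantities)}q", *quantities),
}


def sum_prices_row_layout(data):
    """Sum prices from the row layout; every byte has to be scanned."""
    total = 0.0
    for _id, price, _quantity in struct.iter_unpack(ROW_FORMAT, data):
        total += price
    return total, len(data)


def sum_prices_column_layout(columns):
    """Sum prices from the column layout; only the price bytes are scanned."""
    data = columns["price"]
    values = struct.unpack(f"<{len(data) // 8}d", data)
    return sum(values), len(data)
```

Both functions return the same sum, 25.75, but the row layout scans 72 bytes while the column layout scans 24. The gap grows with the number of columns the query does not need.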

A figure from the Dremel paper illustrating the difference between record-oriented and column-oriented storage.