The Optimized Row Columnar (ORC) file format is one of the most effective file formats in Hive for improving query performance and saving storage space. Its columnar layout compresses very efficiently, which results in smaller disk reads.
As specified in the documentation, the ORC file format for data storage is recommended for the following reasons:
• Efficient compression:
Data is stored as columns and compressed, which leads to smaller disk reads. The columnar format is also ideal for vectorization optimizations in Tez.
• Fast reads:
ORC has a built-in index, min/max values, and other aggregates that allow entire stripes to be skipped during reads. In addition, predicate pushdown pushes filters into reads so that only a minimal number of rows are read, and Bloom filters further reduce the number of rows that are returned.
• Proven in large-scale deployments:
Facebook uses the ORC file format for a 300+ PB deployment.
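As a rough sketch of how the read-side features above are usually switched on, the session settings below enable vectorized execution and predicate pushdown before running a filtered query. The property names are standard Hive settings, but their defaults vary by Hive version, and the table and column names (customers, name, state) are hypothetical and used only for illustration.

```sql
-- Process rows in batches instead of one at a time (works with ORC data).
SET hive.vectorized.execution.enabled = true;

-- Push filter predicates down toward the table scan.
SET hive.optimize.ppd = true;

-- Let the ORC reader use its index and min/max statistics to skip data.
SET hive.optimize.index.filter = true;

-- Hypothetical query: only stripes whose statistics can match 'CA' need to be read.
SELECT name FROM customers WHERE state = 'CA';
```

Whether stripes are actually skipped depends on how the data is laid out. A filter on a column whose values are clustered or sorted tends to benefit the most.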
Below is an example of specifying the ORC file format when creating a table.
-- Column names are hypothetical placeholders
CREATE TABLE Customers (
  customer_id INT, name STRING
) STORED AS ORC;
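Compression and Bloom filters can also be configured per table through ORC table properties. The following is a minimal sketch, assuming a hypothetical customers_orc table with made-up columns. The property keys themselves (orc.compress, orc.create.index, orc.bloom.filter.columns) are documented ORC table properties in Hive.

```sql
-- Hypothetical table and columns, for illustration only.
-- orc.compress accepts NONE, ZLIB, or SNAPPY.
-- orc.bloom.filter.columns takes a comma-separated list of column names.
CREATE TABLE customers_orc (
  customer_id INT,
  name STRING,
  state STRING
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.create.index" = "true",
  "orc.bloom.filter.columns" = "state"
);
```

SNAPPY generally trades some compression ratio for faster reads and writes, while ZLIB produces smaller files. Which one suits a workload is something to measure rather than assume.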
Hope you like this post.
Stay in touch for more interesting updates.