IndexR: The fastest open source storage format for big data

The IndexR project includes its own storage format, a realtime ingestion module, a data management system, and plugins for working with other systems such as Hive and Drill. The storage format in IndexR is the fastest open source format on earth: it scans 2~4 times faster than Parquet, and random access speed improves up to 10x once the optional outer index is added. It suits a wide variety of workloads, such as online and offline big data analytics and fast reads over huge datasets.

IndexR was first developed by Sunteng Tech. It currently serves a huge analytics system backing many important businesses such as BiddingX, providing realtime, online analytics over the tens of billions of events generated every day.

Since being open sourced, IndexR has attracted attention from many companies, including those in the advertising, AI, and e-commerce domains, as well as government and logistics organizations with extremely large datasets and strict data quality requirements. Many of them have successfully deployed IndexR in production.

Open source link: shunfei/indexr

Architecture

IndexR works closely with the other citizens of the Hadoop ecosystem. Here is a typical system built around IndexR:

  • IndexR provides fast data access for the query engine; it speeds up IO and improves the analytic ability of the whole system.
  • IndexR can pull the data stream from Kafka, pack the events into segments, and upload them to HDFS. Data can be queried as soon as it arrives, without any delay.
  • It solves the latency problem of online analytics and combines realtime data with historical data. No more lambda architecture.
  • The data is stored on HDFS, so different tools can analyze the same data.
  • It takes full advantage of the open source Hadoop ecosystem.

Why IndexR

  • Real indices inside the storage format

    IndexR has three levels of indices by design: the Rough Set Index, the Inner Index, and the optional Outer Index.

    Most formats on Hadoop, like ORC or Parquet, have no real index; filtering depends on partitioning or simple statistics such as MIN and MAX. That is not a big problem for offline computation, but for online workloads, faster is better: the less data we load, the faster the query.

    Traditional indices like the B+ tree or the inverted index map a key directly to a row, which suits OLTP scenarios that only need to fetch a few rows. But in OLAP they can produce too many random accesses, because one query may need to scan tens of millions of rows. Besides, traditional indices are too expensive for large datasets: they consume too much RAM and IO, and processing them also costs CPU cycles.

    IndexR uses a different strategy: a multi-level index design, where each level narrows the dataset down to a smaller size. Suppose we need to index all the roads in a country. The traditional way builds one flat map, {"cityId_roadId" : "roadData"}, while IndexR builds a two-level map, i.e. {"cityId" : {"roadId" : "roadData"}}. IndexR groups data into packs, each containing a fixed number of rows.
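    A minimal Java sketch of the two-level idea (the class and method names here are illustrative, not IndexR's actual API):

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the two-level lookup idea, not IndexR's real code.
    public class TwoLevelIndex {
        // Traditional flat index: one huge map over every road in the country.
        private final Map<String, String> flatIndex = new HashMap<>(); // "cityId_roadId" -> roadData

        // IndexR-style two-level index: first narrow down to one city,
        // then search only within that much smaller map.
        private final Map<String, Map<String, String>> twoLevel = new HashMap<>();

        public void put(String cityId, String roadId, String roadData) {
            flatIndex.put(cityId + "_" + roadId, roadData);
            twoLevel.computeIfAbsent(cityId, k -> new HashMap<>()).put(roadId, roadData);
        }

        public String lookup(String cityId, String roadId) {
            // Level 1 eliminates every other city; level 2 searches only its roads.
            Map<String, String> city = twoLevel.get(cityId);
            return city == null ? null : city.get(roadId);
        }
    }
    ```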

    • Rough Set Index - This idea originally comes from Infobright. Like the well-known Bloom filter, it filters out irrelevant packs. It is extremely lightweight and super fast.
    • Inner Index - The index inside each pack. It can filter out irrelevant rows without decompressing the pack.
    • Outer Index - Optional, and not added by default. The outer index uses inverted indices and bitmaps. It is very convenient for applying a variety of filters, and the bitmaps support direct AND and OR operations. We have done a lot of optimization on the inverted index to avoid the random accesses and the huge RAM requirements it usually brings, and we also speed up the merge operation for range conditions, e.g. >= or BETWEEN.

    When a query with a filter arrives, IndexR first applies the rough set index, then the outer index. The surviving packs are loaded into memory for the final, accurate filtering by the inner index, as sketched below. This guarantees efficiency for both large scans and small fetches.
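    A rough Java sketch of that pipeline (all names are hypothetical, a simplification of what the real code base does):

    ```java
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Hypothetical sketch of the three-stage filtering, not IndexR's real classes.
    public class FilterPipeline {

        interface Predicate {}

        interface Pack {
            boolean roughSetMayMatch(Predicate p); // level 1: pack-granularity rough set index
            BitSet outerIndexRows(Predicate p);    // level 2: optional inverted index + bitmaps
            BitSet innerIndexRows(Predicate p);    // level 3: exact per-row check, pack in memory
        }

        static List<BitSet> filter(List<Pack> packs, Predicate p, boolean hasOuterIndex) {
            List<BitSet> survivors = new ArrayList<>();
            for (Pack pack : packs) {
                // 1. Rough set index: cheaply drop packs that cannot contain a match.
                if (!pack.roughSetMayMatch(p)) continue;

                // 2. Outer index: bitmap results of sub-conditions can be ANDed/ORed
                //    directly, so a whole pack may be skipped before it is loaded.
                if (hasOuterIndex && pack.outerIndexRows(p).isEmpty()) continue;

                // 3. Inner index: load the surviving pack and do the exact filtering.
                survivors.add(pack.innerIndexRows(p));
            }
            return survivors;
        }
    }
    ```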

  • Two storage modes

    • vlt mode - The default mode, suitable for most scenarios. It is super fast, offering 2~4x the scan speed of Parquet, and with the help of the indices actual queries really fly. By default it takes only about 75% of the size of Parquet.
    • basic mode - Offers an extremely high compression ratio, up to 10:1, normally taking only 1/3 of the size of Parquet, yet it still scans faster than many other storage formats. It is suitable for storing very large historical data at minimal cost while keeping it quickly queryable at any time.
  • Streaming pipeline, realtime analytics

    Realtime analytics is currently difficult in the Hadoop ecosystem. Storm and Spark Streaming are pre-calculation systems, which do not work for ad-hoc scenarios. Druid and Kudu are also not suitable, because they have dependencies outside the Hadoop ecosystem; we faced many issues while trying to deploy and integrate them into our system, and their performance is not good enough, especially for aggregation over large datasets.

    IndexR supports realtime data ingestion, such as pulling events from Kafka, and the data can be analyzed immediately after it arrives; the sketch below shows the idea. It is deeply integrated with the Hadoop ecosystem. For OLAP workloads, IndexR also supports realtime and offline pre-aggregation, which greatly reduces query times.
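    Conceptually, a realtime node runs a loop like this Java sketch, built on the standard kafka-clients consumer (the two segment helpers are placeholders, not IndexR's real ingestion API):

    ```java
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Simplified ingestion loop: pull events from Kafka, append them to an
    // in-memory segment that is queryable immediately, and periodically
    // seal the segment and upload it to HDFS.
    public class RealtimeIngestLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "indexr-ingest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Placeholder: parse the event and append it to the current
                        // in-memory segment; queries see the row from this point on.
                        appendToRealtimeSegment(record.value());
                    }
                    // Placeholder: when the segment is full or a time window closes,
                    // seal it, upload it to HDFS, and start a new one.
                    maybeRollSegmentToHdfs();
                }
            }
        }

        static void appendToRealtimeSegment(String event) { /* hypothetical */ }
        static void maybeRollSegmentToHdfs() { /* hypothetical */ }
    }
    ```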

  • Super fast

    IndexR uses highly optimized data encoding algorithms; its speed can match some in-memory databases. The data structures are laid out in a vectorized-processing-friendly way, and all data is stored outside the JVM heap, i.e. in direct memory, as pictured in the sketch below. We try to minimize CPU and RAM cost, and to avoid GC pauses and virtual method calls.
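    The off-heap, vectorization-friendly layout can be pictured with plain NIO direct buffers (a deliberately simplified sketch, not IndexR's actual column classes):

    ```java
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // A long column held in direct (off-heap) memory: the values never touch
    // the JVM heap, so they add no GC pressure, and they sit contiguously,
    // which is exactly the layout a tight vectorized scan loop wants.
    public class OffHeapLongColumn {
        private final ByteBuffer buf;
        private int count;

        public OffHeapLongColumn(int capacity) {
            this.buf = ByteBuffer.allocateDirect(capacity * Long.BYTES)
                                 .order(ByteOrder.nativeOrder());
        }

        public void add(long v) {
            buf.putLong(count++ * Long.BYTES, v);
        }

        // A simple scan: one tight loop over contiguous memory,
        // no per-value object dereferences, no virtual calls.
        public long sum() {
            long s = 0;
            for (int i = 0; i < count; i++) {
                s += buf.getLong(i * Long.BYTES);
            }
            return s;
        }
    }
    ```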

    IndexR is based on Hadoop, meaning the files are stored on HDFS and can take advantage of the rich features and tools of the Hadoop ecosystem. We have done a lot of optimization of read operations on HDFS files, and we spread the query operators across the local data nodes.

    Here is a speed benchmark against Parquet, on the TPC-H dataset, using the same Drill cluster.

    The biggest table, lineitem, contains 0.6 billion rows. We used 5 nodes; each node: [Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz] x 2, 64G RAM (DrillBit actually uses ~12G), 7200RPM SATA HDD.

    • Simple queries (histogram not reproduced here)

    • TPC-H standard SQL (histogram not reproduced here)

  • Resource efficiency

    IndexR's data structures are carefully designed to avoid wasting RAM. We avoid Java object and virtual call overhead, following a so-called "Code C in Java" style: modules communicate through plain data structs rather than interfaces, as illustrated below. The result is that IndexR works very fast with the extra indices while using about the same memory as Parquet, and it avoids situations like CarbonData's, which needs a large amount of memory to build its inverted index.
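    A small illustration of that style (hypothetical names; the point is the shape of the code, not IndexR's real classes):

    ```java
    // "Code C in Java": modules exchange plain data structs instead of
    // hiding every value behind an interface with virtual getters.

    // Data-struct style: flat primitive fields, no per-field call overhead.
    class PackStats {
        long minValue;
        long maxValue;
        int rowCount;
    }

    // The style avoided on hot paths: each access is a virtual call that
    // the JIT may fail to devirtualize, and each row costs an extra object.
    interface RowView {
        long getValue(int columnIndex);
    }
    ```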

  • In combination with other tools, IndexR is a perfect base for building a next-generation data warehouse on Hadoop.

Classic Usage

IndexR has been adopted by many different teams since it was open sourced. Here is how they use it.

  • Parquet replacement. IndexR is much faster and is indexed, which speeds up the whole system.

  • Druid replacement, for realtime OLAP analytics. IndexR provides all the features Druid does, and it supports full SQL through Drill. It will not throw away events the way Druid can. It is much simpler to maintain, with better query performance and much higher hardware efficiency. Teams love it for making their lives easier.

  • Building an OLAP system that supports large datasets.

  • Storing extremely large historical data (TBs), with much more generated every day.

  • Moving statistical queries out of MySQL, Oracle, or ES into an IndexR system.

The Future

IndexR released version 0.5.0 a few days ago. We will keep pushing it forward together with our developers and users. IndexR aims to be the best format on Hadoop for fast analytics. We hope it attracts more and more attention and usage, and that more developers join us.

