Apache Kudu is a new Open Source data engine developed by Cloudera, and while the project is relatively new, it shows great promise. This blog post is a short overview of what I like about Apache Kudu, and in the future I’ll write in more detail about specific capabilities.
Throughout my career I have specialized in databases and data technology, and I have seen phenomenal progress and evolution in the space: from standalone single-user databases, to the innovation of the RDBMS, to distributed databases, and now to a plethora of Big Data engines.
What do I like most about Apache Kudu?
Apache Kudu is the first Big Data engine that closely resembles the capabilities of a traditional relational store, but with exceptional performance, data volume and data distribution capabilities. To be clear, Apache Kudu is a data (storage) engine; it does not itself support SQL directly. Rather, it exposes an API for key-value semantics, similar to standalone engines like LevelDB and RocksDB. The low-level API is easy to use, but most implementations will use a Query Engine such as Apache Impala, Apache Spark, or Apache Drill as the interface to the engine. While the Apache Kudu API resembles the semantics of a standalone engine, that is where the similarity ends: Apache Kudu offers a full-fledged distributed columnar engine, supporting a number of key features that fill a void in the Big Data space.
When used in combination with a capable Query Engine, Apache Kudu gives you the familiar capabilities we have become accustomed to in traditional enterprise databases.
If Apache Kudu is yet another key-value store, why should I care?
To be sure, there are many key-value stores available, including Cassandra, HBase, and several others (really these two examples go well beyond a simple key-value store, but I mention them as part of the general category). These key-value engines are extremely fast for reliable writes, but in my opinion queries have always been a challenge. They suffer from what I call the scatter-gather methodology: data is scattered all over the cluster (typically based on a hash of the key), and when you want to query the data back, the items must be “gathered” together into a sensible result. The gather step often means table-scans, unless you are able to express your query in terms of specific keys. Typical operations like searches, sorts, and grouping all take a lot of resources and can be inefficient.
Of course Cassandra and HBase have flexible Column Families, which can help a lot, but you have to be very clever in designing your database to make it do what you want. Purely random Data Science type queries are not always the easiest with these engines. Vendors such as DataStax have invested heavily in indexing capabilities and query languages to improve this, but suffice it to say that these layers are needed to compensate for the limitations of the initial underlying design. Just so there is no misunderstanding, tremendous work has been invested in these engines to make these capabilities better, but the fundamental design favored writes at the expense of reads.
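To make the scatter-gather idea concrete, here is a minimal Python sketch of a toy hash-distributed store. It is only an illustration of the pattern, not how Cassandra or HBase are actually implemented; the node count and the names `node_for`, `put`, `get` and `range_query` are all invented for this example.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Pick a node from a stable hash of the key (toy placement rule)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

# Each "node" is just a dict in this toy model.
nodes = [dict() for _ in range(NUM_NODES)]

def put(key: str, value: int) -> None:
    """Write goes straight to the single node that owns the key."""
    nodes[node_for(key)][key] = value

def get(key: str):
    """Point lookup: exactly one node is consulted."""
    return nodes[node_for(key)].get(key)

def range_query(lo: str, hi: str):
    """Key-range query: neighbouring keys live on different nodes,
    so every node is scanned and the results are merged."""
    gathered = []
    for node in nodes:                  # scatter: ask every node
        for k, v in node.items():       # full scan on each node
            if lo <= k <= hi:
                gathered.append((k, v))
    return sorted(gathered)             # gather: merge and sort

for i in range(100):
    put(f"sensor-{i:03d}", i)
```

A call like `get("sensor-042")` touches one node, while `range_query("sensor-010", "sensor-012")` has to scan all four nodes to return three rows. That asymmetry is the problem I am describing.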
How does Apache Kudu handle queries?
Recall that earlier I said that Apache Kudu is a Columnar data engine, meaning that it organizes your data into columns rather than the traditional row structure of an RDBMS. This column structure is very similar to how data is organized in files that use the Apache Parquet data format. With a column structure you can scan specific columns very quickly. This is like a legacy table-scan but typically much faster because only a given column needs to be queried — and the column data can be compressed or otherwise optimized for fast searching.
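As a rough illustration of why a column layout helps, here is a small Python sketch that stores the same three records both ways. It is a toy model with made-up data and helper names, not Apache Kudu's on-disk format; the run-length encoding at the end is just one simple example of how repetitive column data can compress.

```python
from itertools import groupby

# Row layout: one record per entry, all fields stored together.
rows = [
    {"id": 1, "city": "Austin", "temp": 31.5},
    {"id": 2, "city": "Boston", "temp": 22.0},
    {"id": 3, "city": "Austin", "temp": 33.0},
]

# Column layout: one list per field, values aligned by position.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def avg_temp_rows(records):
    """Row store: every full record is visited to read one field."""
    return sum(r["temp"] for r in records) / len(records)

def avg_temp_columns(cols):
    """Column store: only the 'temp' column is read."""
    temps = cols["temp"]
    return sum(temps) / len(temps)

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

# Sorted city values form long runs, so they encode compactly.
encoded_cities = run_length_encode(sorted(columns["city"]))
```

Both averaging functions return the same answer, but the columnar version never looks at `id` or `city`. On a wide table with many columns, that difference is where the speedup comes from.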
But Apache Kudu combines several capabilities into its fundamental architecture, giving you a much wider variety of capabilities in a single package. These capabilities are the very core of Apache Kudu, rather than layers on top of the engine added as an afterthought.
Due to its intelligent ground-up design Apache Kudu supports multiple query types, giving you these capabilities:
Look up any specific value by its key.
Look up a range of keys, sorted in key-order.
Perform any arbitrary query across any number of columns.
This means that if you know the key or key-range, you can get very fast query results for specific data items. And if you don’t have specifics, you can still perform arbitrary random queries via the built-in Columnar engine.
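The three query types can be sketched in a few lines of Python over a toy table kept in key order. This is not the Apache Kudu client API; the table contents and the names `lookup`, `key_range` and `scan` are hypothetical and only meant to show how the access paths differ.

```python
import bisect

# Toy table: primary keys kept sorted, as a key-ordered store would keep them.
keys = [10, 20, 30, 40, 50]
rows = {
    10: {"host": "a", "cpu": 0.20},
    20: {"host": "b", "cpu": 0.95},
    30: {"host": "a", "cpu": 0.40},
    40: {"host": "c", "cpu": 0.91},
    50: {"host": "b", "cpu": 0.10},
}

def lookup(key):
    """Point lookup: binary search for one key."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[key]
    return None

def key_range(lo, hi):
    """Range lookup for lo <= key < hi; results come back in key order."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return [(k, rows[k]) for k in keys[start:end]]

def scan(predicate):
    """Arbitrary query: no key is known, so every row is tested."""
    return [(k, rows[k]) for k in keys if predicate(rows[k])]

busy = scan(lambda r: r["cpu"] > 0.9)
```

The first two functions only touch the part of the table they need. The third visits everything, which is where a columnar layout earns its keep.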
Is Apache Kudu a read-only data engine?
Most Columnar engines are designed for read-only capabilities, with either bulk writes to create the entire dataset (as in a traditional Data Warehouse), or less efficient updates or additions to the dataset. Generally these engines make a trade-off in favor of read-efficiency at the expense of write-efficiency and flexibility.
Apache Kudu is a live data engine that supports concurrent high-speed writes, while queries are in progress. That means you don’t need to build careful staging of your data into your data engine — you can just write data as it comes in, and run concurrent queries on that same data. This capability is what makes Apache Kudu the closest Big Data engine to a traditional database, because you can just read and write as your application dictates. This is a great capability for real-time analytics requirements, for applications like IoT (Internet of Things) streaming data, system information, or other fast-changing data.
What about data distribution and reliability?
Apache Kudu is a fully distributed database with built-in reliability using the Raft Consensus Algorithm. This means that you can configure Apache Kudu to make “n” copies (usually 3) of your data, so that your data is safe and your implementation is fully reliable. This includes built-in failover, rebalancing, and a lot of other details I’ll cover in the future.
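The reason “n” is usually an odd number like 3 comes down to simple majority arithmetic: in Raft, a write is committed once a majority of replicas has acknowledged it. The Python sketch below shows only that arithmetic. It leaves out leader election, log replication and everything else in the protocol, and the function names are my own.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)

def is_committed(acks: int, replicas: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= majority(replicas)

# Replica count -> how many replica failures can be survived.
survivable = {n: tolerated_failures(n) for n in (1, 3, 4, 5)}
```

With 3 copies you can lose one replica and keep going, and with 5 you can lose two. Going from 3 to 4 copies buys nothing, because a majority of 4 is 3 and you can still only lose one.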
Just as importantly, Apache Kudu supports data distribution using a two-tier sharding (partitioning) mechanism, by a hash of the key and/or by key-range.
While this deserves a blog post in itself, suffice it to say that this mechanism is very flexible, allowing you to tune your data distribution to your specific needs. Think of the scatter-gather issue I mentioned above; using Apache Kudu’s two-tier sharding you can minimize or in some cases eliminate this problem for your application (depending on your requirements of course).
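Here is a minimal Python sketch of the two-tier idea, assuming one tier hashes a key column and the other splits a second key column into ranges. The bucket count, the range boundaries and the `host`/`year` columns are invented for illustration, and the code models only which shards a query would have to touch, not how Apache Kudu actually stores or routes data.

```python
import bisect
import zlib

HASH_BUCKETS = 4                     # hypothetical number of hash buckets
RANGE_BOUNDS = [2016, 2017, 2018]    # hypothetical range split points (years)
NUM_RANGES = len(RANGE_BOUNDS) + 1   # ranges: <2016, 2016, 2017, >=2018

def hash_bucket(host: str) -> int:
    """Tier 1: a stable hash of the host column picks the bucket."""
    return zlib.crc32(host.encode("utf-8")) % HASH_BUCKETS

def range_index(year: int) -> int:
    """Tier 2: the year column picks the range partition."""
    return bisect.bisect_right(RANGE_BOUNDS, year)

def tablet_for(host: str, year: int):
    """A row lands in exactly one (bucket, range) shard."""
    return (hash_bucket(host), range_index(year))

def tablets_for_query(host=None, year=None):
    """Shards a query must touch, given which key parts it constrains."""
    buckets = [hash_bucket(host)] if host is not None else range(HASH_BUCKETS)
    ranges = [range_index(year)] if year is not None else range(NUM_RANGES)
    return [(b, r) for b in buckets for r in ranges]
```

In this model a query that pins both `host` and `year` touches 1 of the 16 shards, a query that pins only one of them touches 4, and only a query with no key information has to visit all 16. That pruning is what lets you shrink the gather step.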
I’ve only covered a few high-level capabilities of Apache Kudu in this brief blog post, but I hope it pointed out some of the innovations and capabilities in this engine that I find very exciting. I believe Apache Kudu will be a real winner in the Big Data space, and I have seen its popularity continue to rise since its first introduction over a year ago. In future blog posts I’ll cover each of Apache Kudu’s main capabilities and architecture in more detail.
P.S. This week Apache Kudu went GA, which is great news. You can read about it here: http://finance.yahoo.com/news/cloudera-announces-general-availability-apache-120000890.html