Introduction — Apache Druid TopN SQL Queries
By definition, Top-N SQL queries return the n rows with the smallest or largest values of a given column or combination of columns (e.g. the top 5 best-selling e-books in the US in 2020). This query type is designed to limit the number of results returned from an ordered set of rows in SQL.
Apache Druid (a high-performance real-time analytics database) has built-in support for such queries, using approximation methods (see the TopN section of the Druid documentation).
Recently, I was trying to understand how to improve the accuracy of TopN queries in Apache Druid. I started by doing an online search about what top-N queries are in general, but I could not get a clear answer, as the implementation varies across databases.
I thought of sharing my understanding of how Druid handles TopN queries efficiently.
Why TopN?
Before we dig deep into TopN, let’s try to understand why TopN is required in the first place.
To demonstrate that, I’ll use a Wikipedia dataset loaded into Druid (for example, by following the quickstart tutorial in the Druid documentation), and a BI tool (Imply Pivot in this case) to explore the data.
As you can see in the screenshot below, this BI tool allows you to list the channels based on the number of events for each one. By clicking the “Number of Events” column title, you can re-order the list in descending order of the number of events (rather than ascending).
Try to imagine how this BI tool fires the SQL query to achieve this. One option could be:
SELECT channel, COUNT(*) AS "Number of Events"
FROM "Wikipedia"
GROUP BY 1
ORDER BY 2 DESC
In the same way, if the data needs to be sorted in an ascending manner, the query will likely be:
SELECT channel, COUNT(*) AS "Number of Events"
FROM "Wikipedia"
GROUP BY 1
ORDER BY 2 ASC
If the cardinality (i.e. the number of distinct values) of the “channel” column is very high, we will see a long list in the report. However, the user interface can display only a limited number of rows (say, 80 rows).
So to retrieve only the first 80 rows that match the query, we may need to rewrite the query to something different. One approach would be:
SELECT channel, COUNT(*) AS "Number of Events"
FROM "Wikipedia"
GROUP BY channel
ORDER BY 2 ASC
LIMIT 80
This will ensure that only 80 rows of data will be sent to the BI tool from the database.
Beyond this general SQL structure, different databases offer their own syntax for retrieving the top n rows.
For example, PostgreSQL has a FETCH clause to retrieve the first n rows.
SELECT channel, COUNT(*) AS "Number of Events"
FROM "Wikipedia"
GROUP BY channel
ORDER BY "Number of Events" ASC
FETCH FIRST 80 ROWS ONLY;
In a single-node database architecture, this is easier to understand, as all the data resides on the same node.
In the case of distributed, multi-node architecture, where the data is distributed across multiple nodes, this implementation will be more like a map and reduce operation. In the “map” phase, data is retrieved from each node (after sorting), and passed to the “reduce” phase. The “reduce” phase will apply a merge sort on the data from all the nodes and then take only the top n rows.
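To make this concrete, here is a minimal Python sketch of that exact approach (a simplification, not how any particular database implements it): every node ships all of its per-channel counts to the reduce phase, which merges them before ordering and limiting.

from collections import Counter

def exact_topn(nodes, n):
    # "map": each node aggregates all of its rows into per-channel counts
    per_node_counts = [Counter(rows) for rows in nodes]
    # "reduce": merge every count, then sort and keep only the top n
    merged = Counter()
    for counts in per_node_counts:
        merged.update(counts)
    return merged.most_common(n)

# Example: each node holds a list of channel values, one per event
nodes = [["#en", "#en", "#sv"], ["#ar", "#en"], ["#sv", "#sv", "#ar"]]
print(exact_topn(nodes, 2))  # [('#en', 3), ('#sv', 3)]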
However, grouping the entire dataset and then sorting it is a costly operation. For some use cases, especially BI use cases, it is highly desirable to execute the query as fast as possible, even at the expense of accuracy.
This led the designers of Druid to think of a separate type of query, which can execute fast and with fewer resources.
Fast TopN queries — the approximation approach
The Druid approach to achieving fast TopN queries is to avoid retrieving all the data from each node. Instead, it retrieves only the top K rows from each node (the specific K value can be chosen based on the tolerance for inaccuracy, so one would choose a higher K if more accurate results are required).
In Apache Druid, the default value of K is 1000. If you want to raise it for a particular query, use the minTopNThreshold query context parameter (the cluster-wide default can be changed with the Broker property druid.query.topN.minTopNThreshold).
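As a hedged illustration, the snippet below issues a Druid SQL query over the HTTP API and raises the threshold for that single query via the minTopNThreshold context parameter; the host, port, and threshold value are assumptions you would adapt to your own cluster.

import requests  # assumes the requests library is installed

payload = {
    "query": """
        SELECT channel, COUNT(*) AS "Number of Events"
        FROM "Wikipedia"
        GROUP BY channel
        ORDER BY 2 DESC
        LIMIT 80
    """,
    # Raise K from the default of 1000 for this query only
    "context": {"minTopNThreshold": 5000},
}

# Assumes a Druid router or Broker listening on localhost:8888
response = requests.post("http://localhost:8888/druid/v2/sql", json=payload)
print(response.json())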
Once the top K rows from each node are retrieved in the “map” phase, the “reduce” phase works the same way as before, i.e. it applies a merge sort on all the data and takes only the top n rows.
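Here is a minimal Python sketch of that approximate map/reduce flow (again, a simplification rather than Druid’s actual code): each node contributes only its local top K pairs, and the reduce phase merges those partial lists and keeps the global top n.

from collections import Counter

def map_phase(rows, k):
    # Each node aggregates its own rows and keeps only its local top K
    return Counter(rows).most_common(k)

def reduce_phase(per_node_topk, n):
    # Merge the per-node top K lists, then sort and keep the global top n
    merged = Counter()
    for topk in per_node_topk:
        for value, count in topk:
            merged[value] += count
    return merged.most_common(n)

def approximate_topn(nodes, n, k):
    return reduce_phase([map_phase(rows, k) for rows in nodes], n)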
In this approach, we won’t get 100% accuracy, but the result should be close enough, and the query is much faster (since there is much less data to process).
I mentioned that only the top K rows from each node will be retrieved in the “map” phase and sent to the “reduce” phase, so to improve the accuracy of the final result, you should choose a higher value for K.
A very common heuristic is: if you need the top 100 results, choose a higher value for K, e.g. 10 times the value of N, i.e. K = 10 * 100 = 1000. Some users choose K such that the number of nodes is factored in, e.g. K = number of nodes * N (for a 20-node cluster and a Top 100 query, K = 20 * 100 = 2000).
What is the worst case?
In the previous section, I mentioned this is an approximation approach, so it’s not 100% accurate.
Let’s discuss the worst case, i.e. when the approximate result is noticeably wrong. How can this happen?
It can happen if there is skew in the data distribution across the nodes. To explain the worst-case scenario, let’s look at the example below:
In this example, we consider N as 2 and K as 3, i.e. we take the top 3 results from each node for the final “Top 2” calculation.
In node 1 and node 2, the total number of events (“Count”) for channel #sv is 2, but in node 3, it is 100. Since the #sv channel isn’t part of the top 3 results for node 1 and node 2, its values are not selected in the “map” phase. In node 3, because of data skewness, there are many more raw events with channel #sv, so the total number of events for this channel on that node is high, and #sv is in the top 3 results for that node. After the “reduce” phase, #sv ends up within the top 2. However, the reported “Count” of 100 is in fact wrong (the true total is 104), because, due to the approximation approach, the rows for #sv from node 1 and node 2 were never sent to the “reduce” phase.
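The same effect can be reproduced with toy numbers. In the Python snippet below, only the #sv counts of 2, 2 and 100 come from the example above; the counts for the other channels are hypothetical, chosen so that #sv misses the local top 3 on nodes 1 and 2.

from collections import Counter

# Hypothetical per-node aggregates; only the #sv counts match the example
node1 = {"#en": 50, "#ar": 30, "#sh": 20, "#sv": 2}
node2 = {"#en": 40, "#ar": 35, "#sh": 25, "#sv": 2}
node3 = {"#sv": 100, "#en": 15, "#ar": 5, "#sh": 1}
nodes = [node1, node2, node3]
n, k = 2, 3

# "map": each node keeps only its local top K channels
per_node_topk = [Counter(node).most_common(k) for node in nodes]

# "reduce": merge the partial lists and keep the top N
merged = Counter()
for topk in per_node_topk:
    for channel, count in topk:
        merged[channel] += count

print(merged.most_common(n))  # #sv makes the top 2, but with a count of 100
print(sum(node.get("#sv", 0) for node in nodes))  # the true total is 104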
These approximations tend to perform worst when the rows for the same dimension values are spread across many nodes, because each node reports only its local top results and a value’s per-node contributions can be dropped before the merge.
What is the best case?
The best case is to avoid the worst-case :).
There are a few recommended alternatives:
Option 1: Make sure the data is distributed evenly across all the nodes. But what if the metric is on some other column, like Revenue in a Sales table? We cannot ensure that the revenue generated from the sales of an item is evenly distributed across the nodes in the cluster.
Option 2: If option 1 is not feasible, the next option is to keep all the rows belonging to a given dimension value together on one node. In Apache Druid, you can do this by hash-partitioning the data on the dimension used in the GROUP BY query; see the Druid documentation on hash partitioning and single-dim partitioning, and the sketch after the figure below.
A pictorial representation of what happens when data is partitioned using hash partitioning or single-dim partitioning:
Before partitioning, the data was distributed randomly across the nodes. During partitioning, similar data is moved to the same node: in the example, all the data points with channel #en landed on Node 1, data points with channels #ar and #sh landed on Node 2, and data points with channel #sv landed on Node 3.
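For reference, here is a hedged sketch of the two partitioning options, shown as Python dicts although in practice they are JSON fragments inside a native batch ingestion task’s tuningConfig; field names and row targets should be checked against the Druid documentation for your version.

# Hash partitioning on the "channel" dimension, so that rows for the same
# channel are co-located in the same segments
hashed_partitions_spec = {
    "type": "hashed",
    "partitionDimensions": ["channel"],
    "targetRowsPerSegment": 5000000,  # illustrative value
}

# The single-dimension alternative: range partitioning on "channel"
single_dim_partitions_spec = {
    "type": "single_dim",
    "partitionDimension": "channel",
    "targetRowsPerSegment": 5000000,  # illustrative value
}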
Option 3: Oftentimes, different dimensions need to be grouped while slicing and dicing the data, so partitioning on a single dimension, as suggested in option 2, is not a good solution.
One of the other alternatives is to increase the value of K as a factor of the dimension’s (i.e. column’s) cardinality. The factor depends on the trade-off between accuracy and how many system resources can be dedicated to the query.
This option is complicated, and to be honest, it’s hard to give a good guideline for choosing a value of K that balances resource efficiency against accuracy.
Summary
In this article, we discussed the need for Top N queries, why Druid’s TopN is resource-efficient, and its trade-off against accuracy. I briefly covered an example of how TopN can lose accuracy in the worst-case scenario, along with some approaches to improve the accuracy.
Call to Action:
By now you might be convinced that fixing the TopN approximation result is not an easy task. If we assume that the data distribution pattern in a given data set remains constant, it is still possible to find an acceptable value for K.
If you are good with statistics, please connect with me; I am happy to collaborate.