In web applications, it is a very common task to sort some set of items according to the user-selected criteria and return only the first or N-th page of the sorted result. The page size can be much less than the total number of items, hence it is typically not reasonable to sort the entire set and crop a one page; it’s much more efficient to extract this page on the fly, running through the initial unsorted set. Sorting with priority queue is well know solution for this problem. In this post I present analysis of priority queue sorting for page-oriented use cases.

Let us assume that sorting unit iterates over the unsorted list of items and maintains a sorted page (a queue) of the selected maximums. If the unit meets an item which is greater than the minimal element in the queue then it removes the minimal element from the page and inserts the current item into the queue. This logic is illustrated below:

Let we have `N` randomly permuted items and we need to extract `p` maximums. If we have set of `k` items `x`_{1}`, ...,x`_{k} and `x`_{k} is p-th largest element in this list then we say that `rank(x`_{k}`)=p`. In other words `rank(x`_{k}`)` is a position of `x`_{k }in a list after sorting. Let us consider the following case: we scanned `k-1 > p` items and the page contains `p `largest items among `x`_{1}`, ..., x`_{k-1}. In this case, next item `x`_{k} will be inserted in the page with the following probability (assuming flat distribution of all possible permutations):

To see this note that we have `k!` possible permutations of k items and k-th item will be inserted if it is a largest element (`(k-1)!` permutations) or second largest element (`(k-1)!` permutations), … or p-th largest element (`(k-1)!` permutations) . Now we can estimate an average number of item comparisons for page extraction. This number consist of two components. The first one is an initial effort of scanning of first p of N elements – the list that represents the extracted page is initially empty, so we simply add first p items into it and sort them – this requires about p*log p comparisons. The second component is an effort of scanning of the remaining items. It requires N-p-1 comparisons with the page minimal element (“1 + Pr{..” term in the formula below) and insertion into the page with complexity log p, but only in the cases when item has a higher rank than a minimal item in the currently extracted page. Combining all this into one formula and reducing it we obtain that total complexity is about p log p ln N + N:

At the same time direct sorting of all items is estimated in `N*logN` comparisons. For example, if we select first 48 items (second page of size 24) from 5000 items then simple sorting will take `~61500` comparisons and page-oriented sorting will take `~6470` comparisons.

The plots below depict complexity (number of comparisons) of page-aware sorting and sorting of entire item set for different values of p and N.

elegant) it’s something which I’ve never seen analyzed. btw, rank(x>p) or rank(x)>p?

Sure, rank(x)>p, fixed. Thank you.

Did you consider nth_element for sorting+pagination ?

It’s supposed to have a better complexity., Assuming nth_element works O(n).

Call nth_element twice, at upper and lower boundaries of the page. Then fully sort the p elements of the page.

complexity: O(N) + O(N) + O(p.log p) ~ O(N)