How to Sort Number Strings Numerically

In a recent talk, I told a community of mathematicians that nearly arbitrary data can *easily* be stored in a JSON-based document store. I should've known better...

Recently, I gave a talk about ArangoDB in front of a community of mathematicians. I advertised that nearly arbitrary data can *easily* be stored in a JSON-based document store. The moment I had uttered the word "easily," one of them asked about long integers. And if a mathematician says "long integer," they do not mean 64-bit, but "properly long." He actually wanted to store orders of finite groups. I said one should use a JSON UTF-8 string for this, but I should have seen the next question coming: he then wanted a sorted index that would actually sort the documents by the numerical value stored in the string. But most databases (and ArangoDB is no exception here) compare UTF-8 strings lexicographically, that is, in dictionary order.
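
To see the problem concretely, here is a small JavaScript sketch. The sample values are made up for illustration; the point is only that plain string comparison misorders decimal strings of different lengths:

```javascript
// Sample values chosen only for illustration.
const values = ['9', '10', '100', '25'];

// The default sort compares strings lexicographically (by UTF-16 code unit,
// which agrees with UTF-8 byte order for ASCII characters).
const lexicographic = [...values].sort();

// Numeric order, using BigInt so that arbitrarily long integers compare correctly.
const numeric = [...values].sort((a, b) => {
  const x = BigInt(a), y = BigInt(b);
  return x < y ? -1 : x > y ? 1 : 0;
});

console.log(lexicographic); // [ '10', '100', '25', '9' ]
console.log(numeric);       // [ '9', '10', '25', '100' ]
```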

And Thus, Today's Problem

How can one store long integral values as UTF-8 strings such that the usual UTF-8 byte-wise lexicographical comparison actually sorts the values by their numerical value?

So the real title of this post should have been, "An order-preserving embedding of the set of integers into the set of lexicographically sorted UTF-8 strings." But who would have clicked on that? After all, this is a blog, not a mathematical journal.

This problem has a surprisingly simple solution, at least for the set of positive numbers: To store the number N, simply store the letter x repeated N times. The larger N is, the longer the string gets. And since every shorter string is a prefix of every longer one, lexicographic ordering puts the longer string later.
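
As a quick sketch of this unary idea (the helper name `unary` is made up for this example):

```javascript
// Hypothetical helper: encode a positive integer n as n copies of 'x'.
const unary = (n) => 'x'.repeat(n);

const unarySorted = [10, 9, 100].map(unary).sort();

// A string that is a proper prefix of another one sorts first,
// so the lengths come out in numerical order.
console.log(unarySorted.map((s) => s.length)); // [ 9, 10, 100 ]
```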

This answer is exactly as mathematically correct as it is totally useless, because who would like to waste nearly a megabyte of data for the number 1,000,000? We have to do better!

If we restrict ourselves to numbers with at most k digits, we can just pad with zeroes on the left and store 1,000,000 as, say:

00000000000000000000000000000000000000000001000000
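
A minimal sketch of the padding approach, assuming a fixed width that is chosen up front (both `WIDTH` and `padNumber` are names invented for this example):

```javascript
// Assumed maximum number of digits; anything longer cannot be stored.
const WIDTH = 50;

// Hypothetical helper: left-pad a non-negative decimal string with zeroes.
function padNumber(s) {
  if (s.length > WIDTH) {
    throw new RangeError(`number has more than ${WIDTH} digits`);
  }
  return s.padStart(WIDTH, '0');
}

// Equal-length strings of digits sort lexicographically in numerical order.
const padded = ['1000000', '42', '999'].map(padNumber).sort();
console.log(padded.map((s) => BigInt(s).toString()));
// [ '42', '999', '1000000' ]
```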

But this is also wasteful, and surely once we decide about the maximal number of digits, somebody will come along and want to store a larger number. So what we really would like is a scheme with the following properties:

  • Uses not much more space than the usual notation needs anyway
  • Works for arbitrarily long numbers
  • Can easily be converted to and from the usual decimal notation
  • Is easy to explain and to comprehend

This post suggests such an encoding.

Let's Do the Non-Negative Numbers First

For numbers with up to 92 digits, we simply put the number of digits of the number first (as one character, using Unicode code points 34 to 125, adding 33 to the number of digits to get the code point), followed by a space, and then the actual number in the usual notation.

Here are a few examples:

" 0              is the number 0
" 1              is the number 1
# 42             is the number 42
. 1623463626463  is the number 1623463626463

Why does this work and preserve the numerical ordering? Clearly, a "shorter" positive number is smaller than a longer one (in the usual notation without leading zeroes). But in this case, the lexicographical ordering of the strings compares the number of digits correctly in the first character (letters have higher Unicode code points than digits). For two numbers with the same number of digits, the initial character and space are the same, and then the lexicographical comparison works for the actual numbers. Note that transformation between formats is easy because the actual notation is contained verbatim after the space character, which makes it human readable at the same time.
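
The length-prefix idea for numbers of up to 92 digits can be sketched as follows. The function name `encodeShort` is a placeholder for this illustration and only covers this first case:

```javascript
// Sketch only: handles non-negative decimal strings with 1 to 92 digits.
function encodeShort(s) {
  if (s.length < 1 || s.length > 92) {
    throw new RangeError('expected 1 to 92 digits');
  }
  // Code points 34..125 stand for digit counts 1..92.
  return String.fromCodePoint(33 + s.length) + ' ' + s;
}

const shortEncoded = ['1623463626463', '42', '1', '0'].map(encodeShort).sort();
console.log(shortEncoded);
// [ '" 0', '" 1', '# 42', '. 1623463626463' ]
```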

This takes us up to 10^92 - 1, which is clearly a large number but not enough for mathematicians.

For larger numbers (Y >= 10^92), we use the following trick.

So far, we have encoded the number of digits of our number using the first character. Now we simply put a tilde character ~ to indicate something larger (the Unicode code point for tilde is 126 and thus larger than any of the code points we have used above), then put the number of digits in the above notation (without the space), followed by a space and then the actual number.

That is:

"~cX Y"

...where c is the character for the number of digits of X in the above notation, 92 < X < 10^92, and Y has exactly X digits, the first of which is not 0. This works because of a very similar argument as above: All these numbers are larger than 10^92 - 1 and all the strings are lexicographically larger than the ones without a tilde. If two such numbers Y and Y' have a different number of digits, then the one with fewer digits is smaller. But then the corresponding string "~cX" is lexicographically smaller than "~c'X'" because of our previous arguments. If Y and Y' have the same number of digits, then their "~cX" parts are the same and the lexicographical ordering sorts the strings correctly according to the numerical values.

This covers the range 10^92 <= Y < 10^(10^92 - 1), which is good for all practical cases because no computer in this universe can store a string with 10^92 - 1 characters, anyway.
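
As a sketch of the one-tilde case (the name `encodeOneTilde` and the 93-digit sample value are made up for illustration):

```javascript
// Sketch only: handles decimal strings with more than 92 digits,
// as long as the digit count itself has at most 92 digits.
function encodeOneTilde(y) {
  if (y.length <= 92) {
    throw new RangeError('use the short form for up to 92 digits');
  }
  const x = String(y.length);                    // X, the digit count of Y
  const c = String.fromCodePoint(33 + x.length); // c encodes the digit count of X
  return '~' + c + x + ' ' + y;
}

const y93 = '1' + '0'.repeat(92);                // 10^92, the smallest 93-digit number
const largest92 = '} ' + '9'.repeat(92);         // short form of 10^92 - 1

console.log(encodeOneTilde(y93).slice(0, 6));    // ~#93 1
console.log(largest92 < encodeOneTilde(y93));    // true
```

The second line of output reflects that `}` (code point 125) sorts before `~` (code point 126), so every short-form string sorts before every one-tilde string.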

Therefore, our problem is practically solved. However, I told you in the beginning that this audience consisted of mathematicians. If they say "for all integers," they mean it.

So How Would We Do Even Larger Numbers?

Actually, the case of exactly one tilde character is just a special case. In general, we store a string of this form:

"~~~~~~ca[0]a[1]a[2]... a[n]"

...which consists of n tilde characters (for some positive integer n) followed by one character c and then n+1 decimal numbers a[0], a[1], a[2], ..., a[n], with a space before the last one. a[0] is encoded in the above notation with c and is less than 10^92, a[1] has exactly a[0] digits, a[2] has exactly a[1] digits, and so on until a[n] has exactly a[n-1] digits.

We claim that this gives an order-preserving embedding of the arbitrarily large positive integers into the set of all UTF-8 strings. The details of the proof are left to the reader, but the main argument about the ordering works as above. Compare the representations of two numbers a and b. If a and b have the same number of digits, then the two strings have equally many tilde characters and equally many numbers, and all but the last numbers are pairwise identical, so the lexicographical comparison of a[n] and b[n] works correctly. If they have a different number of digits, then the one with fewer digits is the smaller one, and our previous argument shows that the lexicographical comparison works correctly.
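
The general form can be sketched as below. The function name `encodeGeneral` and its `limit` parameter are additions of this sketch, not part of the scheme itself: `limit` is the largest digit count that a single length character may stand for (92 in the scheme described above), and lowering it lets the multi-tilde case be exercised with short inputs.

```javascript
// Sketch of the general form: n tildes, c, a[0] ... a[n-1], a space, a[n].
// `limit` is an assumption of this example (92 in the scheme above).
function encodeGeneral(s, limit = 92) {
  const parts = [s];
  // Keep prepending the digit count of the front element until it is short enough.
  while (parts[0].length > limit) {
    parts.unshift(String(parts[0].length));
  }
  const tildes = '~'.repeat(parts.length - 1);          // n tildes for n + 1 numbers
  const c = String.fromCodePoint(33 + parts[0].length); // digit count of a[0]
  const last = parts.pop();                             // a[n], the number itself
  return tildes + c + parts.join('') + ' ' + last;
}

// With limit = 1, even small numbers need several tildes.
const samples = ['12345678901', '5', '9999999999', '100', '42'];
const generalSorted = samples.map((s) => encodeGeneral(s, 1)).sort();
console.log(generalSorted.join('\n'));
// " 5
// ~"2 42
// ~"3 100
// ~~"210 9999999999
// ~~"211 12345678901
```

With the default `limit` of 92, the same function produces `# 42` for 42 and a string starting with `~#93 ` for any 93-digit number.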

This is now a mathematically satisfying solution of the problem. But stop! We forgot about the negative numbers. Fortunately, we left out `!`, the character with Unicode code point 33, which is smaller than any initial character we have used above. So we can simply prepend a `!` sign and follow with the encoding of the absolute value, and have all negative numbers sort earlier than all non-negative numbers. However, there is a slight problem: the string `!" 6` compares smaller than `!" 7`, and this contradicts -7 being smaller than -6.

Therefore, we have to find a way to invert the ordering of the encodings we used for the positive numbers. Fortunately, this is easy. Since we only used the characters from code points 33 to 126, we can simply translate everything by keeping the space and reversing the order. That is, code points 33 and 126 are interchanged, 34 and 125, and so on, until 79 and 80. Then the lexicographic ordering will be the other way around. That is:

`!} m` stands for -2 (since `" 2` is the encoding of +2)
`!| jl` stands for -53 (since `# 53` is the encoding of +53)
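
The mirroring step can be sketched like this. The helper name `mirror` and the sample values are invented for this illustration, and the input is assumed to contain only code points 32 to 126:

```javascript
// Swap code points 33 <-> 126, 34 <-> 125, ..., 79 <-> 80; keep spaces.
function mirror(s) {
  return Array.from(s, (ch) =>
    ch === ' ' ? ' ' : String.fromCodePoint(159 - ch.codePointAt(0))
  ).join('');
}

console.log('!' + mirror('" 2'));  // !} m
console.log('!' + mirror('# 53')); // !| jl

// Mirrored strings sort in reverse, so -7 now comes before -6,
// and both sort before the non-negative encodings.
const mixed = ['" 6', '!' + mirror('" 6'), '!' + mirror('" 7'), '" 0'].sort();
console.log(mixed.join('\n'));
// !} h
// !} i
// " 0
// " 6
```

Applying `mirror` twice gives back the original string, which is why the same translation can be used for decoding.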

This concludes the mathematical solution. For the sake of completeness, here are some JavaScript functions that do the encoding and decoding.

Note that we omitted the rare cases of strings with more than 10^(10^92 - 1) - 1 digits:

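    // Encode a non-negative decimal string s. pad is ' ' at the top level
    // and '' when encoding a digit count after a tilde.
    function encodeNonNegative(s, pad) {
      let l = s.length;
      if (l <= 92) { return String.fromCodePoint(33 + l) + pad + s; }
      return '~' + encodeNonNegative(l.toString(10), '') + ' ' + s;
    }

    // Mirror code points 33..126 (c becomes 159 - c) and keep spaces,
    // which reverses the lexicographic order.
    function translate(s) {
      let r = [];
      for (let i = 0; i < s.length; ++i) {
        let c = s.charCodeAt(i);
        r.push(c === 32 ? ' ' : String.fromCodePoint(159 - c));
      }
      return r.join('');
    }

    // Encode a decimal string with an optional leading '-'.
    function encodeLong(s) {
      if (s[0] !== '-') { return encodeNonNegative(s, ' '); }
      let p = encodeNonNegative(s.slice(1), ' ');
      return '!' + translate(p);  // '!' sorts before every non-negative encoding
    }

    // The number itself follows the last space.
    function decodeNonNegative(s) {
      return s.slice(s.lastIndexOf(' ') + 1);
    }

    // Encoded negative numbers start with '!'; the result uses a leading '-'.
    function decodeLong(s) {
      if (s[0] !== '!') { return decodeNonNegative(s); }
      let p = decodeNonNegative(translate(s.slice(1)));
      return '-' + p;
    }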

The problem is solved! We can embed any integer into a UTF-8 string and preserve the proper order.

Published at DZone with permission of Max Neunhoeffer, DZone MVB. See the original article here.
