Character Encodings and S3 Distributed Workloads
Character Encodings and S3 Distributed Workloads
Here we take a look at the mathematical complexitites of base62 encoding with S3 in response to a four-year old question.
Join the DZone community and get the full member experience.Join For Free
Quite some time ago at an AWS Meetup in 2014 I gave a talk (sadly, I can't find the slides) about how we used S3 and built tooling to interact with it based on our understanding of how the internals of the platform worked, and best practices as advised by AWS themselves. One large aspect of the tooling we built was a custom base62 encoding using the math/big package from the Go standard library.
The general idea is that we had a filename generated randomly based on a character set of
[a-zA-Z0-9] and 12 characters in length. Since S3 is not an indexed data store, you can only enumerate objects serially (unless you know what you are looking for). By treating the keyspace as a range of numbers, you can divide up the set of numbers by a number of parallel workers, then convert each range of numbers back into filenames within the character set above. The process by which we did this used the big number package as mentioned above, and some functions to convert from a filename to a number, number to a filename, and partitioning the number space.
One question during the meetup was: can you give custom encodings to the encoding/base64 package (in this case just dropping two unnecessary characters), and potentially eliminate a lot of custom code? Why not? At the time I didn't have a good answer, and since then I have had to re-implement the same kind of program from scratch and figure it out a bit more. In the process, I would also openly admit that I didn't completely understand how Base64 worked in the first place, which was an enlightening discovery.
So let's talk first about Base64. I think I first encountered it when emails moved to MIME-encoded rich content in the 90s, and since then my mental model has more or less consisted of treating it as a human-readable encoding of binary data. Of course, the name itself has semantic meaning — numbers of some other base are encoded into numbers in base 64 — 64 different unique values per digit which conveniently map to upper- and lower-case letters, numbers 0-9, the
/ characters. There is also a padding character of
= which I'll get to in a moment.
I guess I would naively say we are converting from base2 to base64, but maybe it's better to say we are converting from base256 to base64 — after all, we are treating the data as characters, and one 8-bit character can represent 256 unique values. Therefore we are going from a representation that takes up 8 bits per "slot" to something that takes up 6 bits per "slot", which is quite important.
Here's an example of a string in regular ASCII encoding:
Character: H e l l o Integer value: 72 101 108 108 111 Binary value: 01001000 01100101 01101100 01101100 01101111
Now, immediately you can see that the leading bit is always 0, as the standard US ASCII table only uses 7 bits. There is an optimization there that can be made (and in some protocols it is) but that's out of scope for this discussion. The other thing that can be said is that the printable set of characters in the ASCII table only really starts with a value of 32. If you leave out punctuation characters, upper-case letters start at 65 and numerals start at 48. The reason this is important is because encodings like base64 shift all of the values back down to 0 —
A in this case. The largest value of 63 is represented by
/. So everything has a value that can be represented by only 6 bits. Let's have a look now at the above string as though it represented a Base64 encoded value:
Character: H e l l o Base64 value: 000111 011110 100101 100101 101110
Great, now we have saved two bits from each character. But of course, computers still use 8 bits for everything, so we need to then convert this back into groups of 8 bits, and see what it represents:
Character: H e l l o Base64 value: 000111 011110 100101 100101 101110 8 bit groups: 00011101 11101001 01100101 101110(00)
Uh-oh, we are two bits short. This is where this would be "padded" with some extra bits which would be represented by an
= character in an encoded string — representing any number of bits to pad the encoded string to a full 8 bits at the end. The above sequence of binary data from a readable Base64 encoding would not be printable, so it's not that interesting for this example.
What about if we took the printable string and encoded it into Base64?
Character: H e l l o Integer value: 72 101 108 108 111 Binary value: 01001000 01100101 01101100 01101100 01101111 6-bit groups: 010010 000110 010101 101100 011011 000110 1111(00) Base64 chars: S G V s b G 8 =
As you can see, we came up short by two bits and had to pad with
= at the end. The
1111 bits have two zeros added, which translates to
111100 which is converted to
8 in the Base64 lookup table, and the
= is added to indicate that trailing zeros are just padding and were not included in the original input. Even without the padding indicator, it is possible to deduce this, as converting back would give us a number of bits not divisible evenly by 8.
Why Doesn't This Work for S3 Keyspace?
So, it seems like we should be able to just use this encoding scheme for our S3 filenames, with a slightly modified encoding (even if it only has 62 characters, we'll still need 6 bits). It should be unimportant how to convert the filenames into a number, as long as they remain consistently sortable in the same way that S3 does, and we can partition the number space.
Well, at least with the Go
base64 package, it will always represent encoded strings as 6 bits so your encoding string needs to have 64 values. I created a playground example to demonstrate what goes wrong if we try to use a restricted set of characters (base36 or
[a-z0-9]) and convert some strings. You need to set the remaining characters in the encoding string to some value that won't be encountered in your encoded data (they will never be used). In fact, in the latest version of this tool we've created, our keyspace is only base36 and this is exactly what has been implemented (at least for some sets of objects). Regardless of whether there are 36 or 62 unique values, you still need 6 bits to represent them.
In principle, we have solved the part of the process that relates to conversion from an object key into a number, but actually, we've ended up with a sequence of 6-bit numbers that have to be joined into one larger number. For a 12-character object filename, we have 72 bits — just greater than the largest number we can represent natively with a
uint64. Taking the last 8 bits we could create (at least in Go) a
big.Int and multiply it by these left-over bits, although there's a complication because
big.NewInt() can only be initialized with a
int64 which reduces our maximum representable number by half (and minus one). Now we have to do a bit of fiddling around with the extra one or two bits, maybe splitting the number up further to make it digestible by
math.Big. Possible, but I don't believe using
encoding/base64 has appreciably simplified the problem for us — to answer the original question posed back in 2014.
An Alternative Approach
Or perhaps, the alternative approach, since it was the one I already explained and used in the talk in 2014, and have used again this year. Rather than going down the path of standardized decoding and more tricky math operations, this customizes the decoding and takes a more general approach to the math operations. You can take a look here for an example of how it looks for base36 conversion.
Effectively we just treat each character as its own encoded version of a number, and build up the numeric representation one "column" of magnitude at a time. You will probably notice that this means we're quite wasteful with number space and end up generating a far larger number than necessary, but in this case, I am happy to delegate that complexity to the arbitrary-precision math library and leave our task in application code somewhat simpler. We shift our way along the input, generate one gigantic
big.Int number and then use more math to partition the number space, and finally convert back to strings for use with S3 requests. The S3 aspect is taken care of via marker parameters, and the individual workers in the application itself need to check against their partition end marker when receiving object keys from S3, to ensure they don't fall off the end of their allocated keyspace.
So there you have it — there are several ways to skin this cat by partitioning the keyspace. To the person who asked this question back in 2014, yes, it probably could have been done with
encoding/base64 and I didn't spend any time thinking about the complications at the time, but now that I have, I am happy we went the way we did!
Thanks for reading and if you have any comments or feedback, please let me know on Twitter.
Published at DZone with permission of Oliver Hookins , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.