Over a million developers have joined DZone.
Gold Partner

Splitting the Atom is Hard, Splitting Strings is Even Harder

· Java Zone

I always seem to underestimate how much readers of this blog enjoy a good coding challenge. A few days ago, one of my coworkers was tripped by a line of code and I thought I'd share his confusion with people following me on Twitter, which turned out to be a fitting medium for this micro challenge:

Pop quiz: "abc.def".split(".").length should return...?

I really didn't think much of it but I got much more responses than I anticipated. A lot of them were incorrect (hey, if I throw a challenge, there has to be a trick), but I still want to congratulate everyone for playing the game and answering without testing the code first. That's the spirit!

A few people saw the trap, and interestingly, they pointed out that I must have made a mistake in the question instead of just giving the correct answer (“You must have meant split("\\.")”).

As hinted above, the trick here is that the parameter to java.lang.String#split is a regular expression, not a string. Since the dot matches all the characters in the given string and that this method cannot return any character that matches the separator, it returns an empty string array, so the answer is "0".

This didn't stop a few people from insisting that the answer is "2", which made me realize that the code snippet above happens to be valid Ruby code, a pretty unlikely coincidence. So to all of you who answered 2 because you thought this was Ruby, you were right. To the others, you are still wrong, sorry :-)

The bottom line is that this method is terribly designed. First of all, the string confusion is pretty common (I've been tricked by it more than a few times myself). A much better design would be to have two overloaded methods, one that takes a String (a real one) and one that takes a regular expression (a class that, unfortunately, doesn't exist in Java, so what you really want is a java.util.regex.Pattern).

API clarity is not the only benefit of this approach, there is also the performance aspect.

With the current signature, each call to split() causes the regular expression to be recompiled, as the source sadly confirms:

public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
Since it's not uncommon to have such parsing code in a loop, this can become a significant performance bottleneck. On the other hand, if we had an overloaded version of split() that accepts a Pattern, it would be possible to precompile this pattern outside the loop.

Interestingly, that's how Ruby implements split():

If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.

If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters.

But there is a catch: since Ruby is dynamically typed (which means it doesn't support overloading either), the determination of the class of the parameter has to be done at runtime, and even though this particular method is implemented in C, there is still an unavoidable performance toll to be paid, which is unfortunate for such a core method.

The conclusion of this little drill is that if you are going to split strings repeatedly, you are better off compiling the regular expression yourself and use Pattern#split instead of String#split.

From http://beust.com/weblog


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}