Parser combinators is such a cool concept, very nice tool in your toolkit if you are working on external DSLs. I've been playing with them a little bit recently. Combining different parsers using higher order functions is fun, especially if you are using Scala.

Parser combinators are provided as a library in Scala over the core language. Let's use an example to walk through the details ...

Problem: HTTP's accept header provides a way for the client to request what content-type it prefers. For this exercise let's just parse the header into a list of AcceptHeader objects. Sorting the list based on the quality factor (or q value) is trivial and not related with this discussion, so skipping.

According to the HTTP specification the grammar for the Accept header isĀ  --

Accept = "Accept" ":" #( media-range [ accept-params ] )
media-range = ( "*/*"
| ( type "/" "*" )
| ( type "/" subtype )
) *( ";" parameter )
accept-params = ";" "q" "=" qvalue *( accept-extension )
accept-extension = ";" token [ "=" ( token | quoted-string )

Example of such a header is: application/html;q=0.8, text/*, text/xml, application/json;q=0.9


The goal now is to parse the header value into a list of AcceptHeader objects, where an AcceptHeader is defined as a case class:

case class AcceptHeader(mediaType: String, mediaSubType: String, qualityFactor: Float)

See below for a possible approach on parsing the accept header using the combinator technique:

import util.parsing.combinator._

object AcceptHeaderParser extends JavaTokenParsers {
lazy val accept: Parser[List[AcceptHeader]] = rep1sep(acceptEntry, ",")
lazy val acceptEntry: Parser[AcceptHeader] = (mediaType <~ "/") ~ mediaSubType ~ opt(qualityFactor) ^^ {
case t ~ st ~ Some(q) => AcceptHeader(t, st, q.toFloat)
case t ~ st ~ None => AcceptHeader(t, st, 1.0F)
lazy val wordRegex = """[\w+\-*]*""".r
lazy val mediaType = wordRegex
lazy val mediaSubType = wordRegex
lazy val qualityFactor = ";" ~> "q" ~> "=" ~> floatingPointNumber

def parse(input: String): List[AcceptHeader] = parseAll(accept, input).getOrElse(Nil)

Note: I did not implement accept-extension defined in the specification's grammar in this example.

Now let's look at various aspects of the code:
[click on the image to enlarge]

  • First look at the lazy val acceptEntry:
    • mediaType < ~ "/" indicates that result of parsing slash ("/") is irrelevant and only carry forward the result on the left (that's of the mediaType).
    • ~ is a method in the Parsers trait that stands for sequential combinator.
    • opt method stands for optional value for quality factor (q).
    • ^^ is a method in the Parsers trait -- it has a parser on the left and a function on it's right (that's doing some case matching, in this case). If the parsing effort on the left is successful it applies the function on the right to that parse result.
  • The subsequent lines expand and define each of the parsers defined in acceptEntry
    • regex is defined for media type and subtype allowing for alpha-numeric, hyphen (-) and asterisk(*) values
    • For qualityFactor: ";" ~> "q" ~> "=" ~> floatingPointNumber -- ignores all the parsed results on the left as we are only interested in knowing what the value of q is, which is defined as a floatingPointNumber
  • Now jump back to the first line which says accept is rep1sep(acceptEntry, ","). rep1sep is a method in the Parsers trait. We are saying that the accept entry will repeat one or more times, and each entry is separated by a comma (",)

You may test the functionality via

object AcceptHeaderTest {
def main(args: Array[String]) {
println(AcceptHeaderParser.parseAndOrder(""" application/html;q=0.8, text/*, text/xml, application/json;q=0.9 """))

Output: List(AcceptHeader(application,html,0.8), AcceptHeader(text,*,1.0), AcceptHeader(text,xml,1.0), AcceptHeader(application,json,0.9))

We just scratched the surface here. Debasish Ghosh's DSLs in Action dedicated a chapter for parser combinators, which helped me quite a bit in furthering my understanding. (Highly recommend Ghosh's book if you are contemplating about implementing DSLs).