Parsing in Python: Tools and Libraries (Part 5)

In the last article, we wrapped up our examination of CFG parsers in Python. Now, it's time to get started with PEG parsers.

Check out Part 4 here!

PEG

After examining the CFG parsers, it's time to see the PEG parsers available for Python.

Arpeggio

Arpeggio is a recursive descent parser with backtracking and memoization (AKA packrat parser). Arpeggio grammars are based on PEG formalism.
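To make the terms concrete, here is a minimal sketch of a packrat-style recognizer in plain Python. It is not Arpeggio's implementation; the toy grammar (expr <- term '+' expr / term, term <- [0-9]+) and every name in it are assumptions chosen for illustration. It shows backtracking on a failed alternative and a memo table keyed on rule and position.

```python
def parse_sum(text):
    """Recognize  expr <- term '+' expr / term ;  term <- [0-9]+ ."""
    memo = {}            # (rule name, position) -> end position, or None on failure
    calls = {"term": 0}  # counts real (non-cached) evaluations of term

    def memoized(rule):
        def wrapper(pos):
            key = (rule.__name__, pos)
            if key not in memo:
                memo[key] = rule(pos)
            return memo[key]
        return wrapper

    @memoized
    def term(pos):
        calls["term"] += 1
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None

    @memoized
    def expr(pos):
        # First alternative: term '+' expr
        end = term(pos)
        if end is not None and end < len(text) and text[end] == "+":
            rest = expr(end + 1)
            if rest is not None:
                return rest
        # Backtrack to pos and try the second alternative: term.
        # The memo table answers this call without re-scanning the digits.
        return term(pos)

    end = expr(0)
    return end == len(text), calls["term"]
```

For the input "12", the first alternative fails after reading the number, the parser backtracks, and the second alternative reuses the cached result, so term is evaluated only once.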

The documentation defines Arpeggio as a parser interpreter, since parsers are generated dynamically from a grammar. In any case, it does not work any differently from many other Python parser generators. A peculiarity of Arpeggio is that you can define a grammar either in a textual PEG format or with Python expressions. There are in fact two dialects of the textual PEG format: one with a cleaner, Python-like syntax and one with the traditional PEG syntax.

Arpeggio generates a simple parse tree, but it supports the use of a visitor. The visitor can also include a second action that is performed after all the tree nodes have been processed. This is used for post-processing; for instance, it can be used to resolve symbol references.
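The idea of a second action can be sketched without any library. The code below is an illustration only, not Arpeggio's API: the Node class, the visit_ and second_ method prefixes as implemented here, and the sample tree are all assumptions. It shows why a second pass helps when a symbol is referenced before it is defined.

```python
class Node:
    """A bare-bones parse tree node: a rule name, a text value, and children."""
    def __init__(self, rule, value="", children=()):
        self.rule = rule
        self.value = value
        self.children = list(children)


class TwoPassVisitor:
    """Calls visit_<rule> bottom-up, then second_<rule> on every stored result."""
    def __init__(self):
        self._pending = []  # (rule name, result) pairs awaiting the second pass

    def visit(self, node):
        children = [self.visit(child) for child in node.children]
        handler = getattr(self, "visit_" + node.rule, None)
        result = handler(node, children) if handler else children
        self._pending.append((node.rule, result))
        return result

    def run(self, root):
        result = self.visit(root)
        for rule, value in self._pending:
            second = getattr(self, "second_" + rule, None)
            if second:
                second(value)
        return result


class SymbolVisitor(TwoPassVisitor):
    def __init__(self):
        super().__init__()
        self.symbols = {}

    def visit_definition(self, node, children):
        self.symbols[node.value] = {"name": node.value}
        return self.symbols[node.value]

    def visit_reference(self, node, children):
        # The target may not be defined yet, so keep only the name for now.
        return {"refers_to": node.value, "target": None}

    def second_reference(self, ref):
        # Every definition has been seen by now, so the lookup is safe.
        ref["target"] = self.symbols[ref["refers_to"]]


tree = Node("program", children=[
    Node("reference", "x"),    # used before it is defined
    Node("definition", "x"),
])
result = SymbolVisitor().run(tree)
```

After the run, the reference in result[0] points at the definition object in result[1], even though the reference came first in the input.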

An Arpeggio grammar defined with either a PEG notation or the Python one is usually quite readable. The following example uses Python notation.

# partial example from the documentation
from arpeggio import ZeroOrMore, OneOrMore, EOF, ParserPython
from arpeggio import RegExMatch as _

def record():                   return field, ZeroOrMore(",", field)
def field():                    return [quoted_field, field_content]
def quoted_field():             return '"', field_content_quoted, '"'
def field_content():            return _(r'([^,\n])+')
def field_content_quoted():     return _(r'(("")|([^"]))+')
def csvfile():                  return OneOrMore([record, '\n']), EOF

[..]

def main(debug=False):
    # First we will make a parser - an instance of the CSV parser model.
    # Parser model is given in the form of python constructs therefore we
    # are using ParserPython class.
    # Skipping of whitespace will be done only for tabs and spaces. Newlines
    # have semantics in csv files. They are used to separate records.
    parser = ParserPython(csvfile, ws='\t ', debug=debug)

[..]

There are a couple of options for debugging: verbose and informative output, and the generation of DOT files of the parser. The DOT files can be used to create a visualization of the parser, but you will have to call Graphviz yourself. The documentation is comprehensive and well-organized.
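Calling Graphviz yourself can be as simple as invoking its dot command on a generated file. The helper below is a small sketch using only the standard library; the function names and the file name in the usage note are hypothetical, and it assumes the dot executable from Graphviz is installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def dot_command(dot_path, fmt="png"):
    """Build the Graphviz command that renders dot_path next to the source file."""
    output = Path(dot_path).with_suffix("." + fmt)
    return ["dot", "-T" + fmt, str(dot_path), "-o", str(output)]


def render(dot_path, fmt="png"):
    """Run Graphviz if dot is on PATH; return the output path, or None otherwise."""
    if shutil.which("dot") is None:
        return None  # Graphviz is not installed, nothing to do
    command = dot_command(dot_path, fmt)
    subprocess.run(command, check=True)
    return command[-1]
```

With a DOT file named, say, csv_parser_model.dot (an assumed name), render("csv_parser_model.dot") would write csv_parser_model.png to the same directory.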

Arpeggio is the foundation of textX, a more advanced tool for creating DSLs. textX is made by the same developer who created Arpeggio, and it is inspired by the more famous Xtext.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python, and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using action annotations to run custom code in the parser. In practical terms, you just write the name of a function next to a rule and then implement that function in your source code.

A Canopy grammar with actions:

// the actions are prepended by %
grammar Maps
  map     <-  "{" string ":" value "}" %make_map
  string  <-  "'" [^']* "'" %make_string
  value   <-  list / number
  list    <-  "[" value ("," value)* "]" %make_list
  number  <-  [0-9]+ %make_number

The Python file containing the action code:

class Actions(object):
    def make_map(self, input, start, end, elements):
        return {elements[1]: elements[3]}

   [..]
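To see how such actions combine, the sketch below calls two of them by hand, standing in for the generated parser. Only the (input, start, end, elements) signature and make_map come from the example above; the SketchActions name, the body of make_number, and the hand-built elements list are assumptions for illustration.

```python
class SketchActions(object):
    def make_map(self, input, start, end, elements):
        # elements follows the rule:  "{" string ":" value "}"
        return {elements[1]: elements[3]}

    def make_number(self, input, start, end, elements):
        # Hypothetical action: convert the matched slice of the input to an int.
        return int(input[start:end], 10)


text = "{'a':42}"
actions = SketchActions()

# Simulate the parser calling the action for the number rule on "42".
number = actions.make_number(text, 5, 7, [])

# Simulate the map rule; the string and value slots already hold action results.
result = actions.make_map(text, 0, len(text), ["{", "a", ":", number, "}"])
```

Here result is the dictionary built from the already-processed string and number, which mirrors how a value returned by one action is handed to the action of the enclosing rule.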

Parsimonious

Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure Python — and the most usable. It’s based on parsing expression grammars (PEGs), which means you feed it a simplified sort of EBNF notation.

Parsimonious is a no-nonsense tool designed for speed and low RAM usage. It is also a no-documentation tool; there are not even complete examples. The short README file explains the basics and redirects you to the docstrings for more specific information.

In any case, Parsimonious works well and allows you to dynamically create a grammar defined in a file or a string. You can also define a visitor to traverse and transform the parse tree. So, if you are already familiar with the PEG format, you do not need to learn anything else to use it to its fullest.

A Parsimonious grammar is readable like any other PEG grammar.

# example from the documentation
from parsimonious.grammar import Grammar

my_grammar = Grammar(r"""
    styled_text = bold_text / italic_text
    bold_text   = "((" text "))"
    italic_text = "''" text "''"
    text        = ~"[A-Z 0-9]*"i
    """)

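For readers new to PEGs, the grammar above can be read as an ordered choice: try bold_text first and fall back to italic_text only if that fails. The following sketch mimics that behavior with the standard re module. It is not how Parsimonious works internally, and the helper names are assumptions; it also accepts a matching prefix rather than requiring the whole input to match.

```python
import re

# text = ~"[A-Z 0-9]*"i  (case-insensitive letters, digits, and spaces)
TEXT = re.compile(r"[A-Z 0-9]*", re.IGNORECASE)


def delimited(source, pos, opener, closer):
    """Match opener, text, closer in sequence; return the end position or None."""
    if not source.startswith(opener, pos):
        return None
    pos += len(opener)
    pos = TEXT.match(source, pos).end()  # zero or more characters, never fails
    if not source.startswith(closer, pos):
        return None
    return pos + len(closer)


def styled_text(source):
    """styled_text = bold_text / italic_text, tried strictly in that order."""
    alternatives = (
        ("bold_text", "((", "))"),
        ("italic_text", "''", "''"),
    )
    for rule, opener, closer in alternatives:
        end = delimited(source, 0, opener, closer)
        if end is not None:
            return rule, source[:end]  # the first alternative that matches wins
    return None
```

Calling styled_text("((bold stuff))") selects the bold_text alternative, while an input such as "plain" matches neither alternative and yields None.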
Topics:
big data, python, parsing, tutorial

Published at DZone with permission of
