
Parsing in Python: Tools and Libraries (Part 6)


We're nearing the end of this 8-part series! This time, we'll finish looking at PEG parsers, diving into pyPEG, TatSu, and Waxeye.


Check out Part 5 here!

PEG

pyPEG

pyPEG is a plain and simple intrinsic parser interpreter framework for Python versions 2.7 and 3.x.

pyPEG is a framework to parse and compose text, which means that you define a grammar in a syntax as powerful as PEG, but you do it in Python code. And then, you use this grammar to parse and/or compose a text based on that grammar. Obviously, if you compose a text, you have to provide the data yourself. In this case, it works as a template system.

The syntax of a pyPEG grammar is on the verbose side; frankly, it is too verbose to be productive if you just want a simple parser. But it is a cool library if you want to parse and manipulate a document in a specific format. For instance, you could use it to transform documentation from one format to another.

# from the documentation
from pypeg2 import *

class Type(Keyword):
    grammar = Enum(K("int"), K("long"))

class Parameter:
    grammar = attr("typing", Type), name()

class Parameters(Namespace):
    grammar = optional(csl(Parameter))

class Instruction(str):
    grammar = word, ";"

block = "{", maybe_some(Instruction), "}"

class Function(List):
    grammar = attr("typing", Type), name(), \
        "(", attr("parms", Parameters), ")", block

f = parse("int f(int a, long b) { do_this; do_that; }", Function)

pyPEG does not produce a standard tree, but a structure based on the defined grammar. Look at what happens for the previous example.

# execute the example
>>> f.name
Symbol('f')
>>> f.typing
Symbol('int')
>>> f.parms["b"].typing
Symbol('long')
>>> f[0]
'do_this'
>>> f[1]
'do_that'
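
pyPEG can also go in the other direction: its compose() function renders an object back into text using the same grammar, which is what makes the template-system usage possible. A minimal sketch, reusing the parsed object from above (the exact whitespace of the output may differ):

# compose() renders an object back to text, using the same grammar
from pypeg2 import compose

print(compose(f))
# expected output along the lines of: int f(int a, long b) { do_this; do_that; }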

TatSu

TatSu is a tool that takes grammars in a variation of EBNF as input and outputs memoizing (Packrat) PEG parsers in Python.

TatSu is the successor of Grako (for "grammar compiler"), another parser generator tool, and it has a good level of compatibility with it. It can create a parser dynamically from a grammar or compile the grammar into a Python module.

TatSu generates PEG parsers, but grammars are defined in a variant of EBNF — though the order of rules matters, as is usual for PEG grammars. So, it is actually a sort of cross between the two. This variant includes support for dealing with associativity and simplifying the generated tree or model (more on that later). Support for left-recursive rules is present but experimental.

calc.ebnf:

// TatSu example grammar from the tutorial
@@grammar::CALC

start
    =
    expression $
    ;

expression
    =
    | expression '+' term
    | expression '-' term
    | term
    ;

term
    =
    | term '*' factor
    | term '/' factor
    | factor
    ;

factor
    =
    | '(' expression ')'
    | number
    ;

number
    =
    /\d+/
    ;
TatSu grammars cannot include actions; semantic actions are instead defined in a separate Python class. Alternatively, you can annotate the grammar if you want to use an object model in place of semantic actions. An object model is a way to separate the parsing process from the entity that is parsed. In practical terms, instead of doing something when a certain rule is matched, you do something when a certain object is defined. This object may be defined by more than one rule.

The following extract defines an object Multiply that corresponds to the rule multiplication.

multiplication::Multiply
    =
    left:factor op:'*' ~ right:term
    ;

The object model can then be used with what TatSu calls a walker (essentially, a visitor of the model).

import tatsu
from tatsu.walkers import NodeWalker

class CalcWalker(NodeWalker):
    def walk_object(self, node):
        return node

    def walk__add(self, node):
        return self.walk(node.left) + self.walk(node.right)

    def walk__subtract(self, node):
        return self.walk(node.left) - self.walk(node.right)

    def walk__multiply(self, node):
        return self.walk(node.left) * self.walk(node.right)

    def walk__divide(self, node):
        return self.walk(node.left) / self.walk(node.right)

def parse_and_walk_model():
    grammar = open('grammars/calc_model.ebnf').read()

    # asmodel=True makes parse() return the object model instead of a plain AST
    parser = tatsu.compile(grammar, asmodel=True)
    model = parser.parse('3 + 5 * ( 10 - 20 )')

    print('# WALKER RESULT IS:')
    print(CalcWalker().walk(model))
    print()

parse_and_walk_model()

The same object model can also be used for code generation; for instance, to transform one format into another. But for that, you cannot reuse the walker; you have to define a template class for each object.
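
As a sketch of what that looks like, adapted from the TatSu documentation (each renderer class name must match a class of the object model; Add and Multiply here correspond to the annotated rules):

import sys
from tatsu.codegen import ModelRenderer, CodeGenerator

THIS_MODULE = sys.modules[__name__]

class PostfixCodeGenerator(CodeGenerator):
    def __init__(self):
        # renderers are looked up by node class name in the given modules
        super().__init__(modules=[THIS_MODULE])

class Add(ModelRenderer):
    # render an Add node in postfix notation: left right +
    template = '{left} {right} +'

class Multiply(ModelRenderer):
    template = '{left} {right} *'

# usage: postfix = PostfixCodeGenerator().render(model)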

TatSu also provides a tool to translate ANTLR grammars, detailed trace output, and a graphical representation of the tree using pygraphviz. An ANTLR grammar may have to be manually adapted to respect PEG constraints.

The documentation is complete: it shows all the features, provides examples, and even includes a basic introduction to parsing concepts, like what an AST is.

Waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, JavaScript, Python, Ruby, and Scheme.

Waxeye can facilitate the creation of an AST by letting you define nodes in the grammar that will not be included in the generated tree. That is quite useful, but a drawback of Waxeye is that it only generates an AST: there is no way to automatically execute an action when you match a node. You have to traverse the tree and execute what you need manually (see the sketch after the grammar below).

One positive side effect of this limitation is that grammars are easy to read and clean. They are also independent of any target language.

Calc.waxeye:

// from the manual

calc  <- ws sum

sum   <- prod *([+-] ws prod)

prod  <- unary *([*/] ws unary)

unary <= '-' ws unary
       | :'(' ws sum :')' ws
       | num

num   <- +[0-9] ?('.' +[0-9]) ws

ws    <: *[ \t\n\r]
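
To evaluate an expression, you then walk the generated tree yourself. The following is a hypothetical sketch, assuming the parser generated by Waxeye lives in a module named parser and that AST nodes expose type and children attributes, as in the manual's calculator example (check the generated code for the exact API):

from parser import Parser  # the module generated by Waxeye (name is an assumption)

def dump(node, depth=0):
    # character leaves are plain strings; inner nodes carry a type and children
    if isinstance(node, str):
        print('  ' * depth + repr(node))
    else:
        print('  ' * depth + str(node.type))
        for child in node.children:
            dump(child, depth + 1)

ast = Parser().parse('2 + 3 * 4')
dump(ast)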

A particular feature of Waxeye is that it provides some help to compose different grammars together, which facilitates modularity. For instance, you could create a common grammar for identifiers, which are usually similar in many languages.

Waxeye has great documentation in the form of a manual that explains basic concepts and how to use the tool with all the languages it supports. There are also a few example grammars.
