Parsing in Python: Tools and Libraries (Part 8)

In the last part of this series, check out some miscellaneous Python libraries and tools related to parsing, like CPython, Reparse, and Construct.

Check out Part 7 here!

Python Libraries Related to Parsing

Python also offers some other libraries and tools related to parsing.

Parsing Python Inside Python

There is one special case that could be managed in a more specific way: you want to parse Python code in Python. When it comes to Python, the best choice is to rely on your own Python interpreter.

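The standard reference implementation of Python, known as CPython, includes a few modules to access its internals for parsing: tokenize, parser, and ast. Note that the parser module was deprecated in Python 3.9 and removed in Python 3.10, so on current versions only tokenize and ast are available. You may also be able to use the parser in the PyPy interpreter.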
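
As a minimal sketch of what these modules offer, the snippet below parses a one-line program with ast and then lists its first tokens with tokenize. The sample source line and the variable names are made up for illustration.

```python
import ast
import io
import tokenize

source = "total = price * (1 + tax)\n"

# ast.parse turns source text into an abstract syntax tree.
tree = ast.parse(source)

# Collect every variable name that appears in the tree, sorted alphabetically.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})
print(names)

# tokenize works at a lower level and yields the individual tokens.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(tokens[:3])
```

The ast module is usually the more convenient of the two when you care about the structure of the code, while tokenize is useful when you need the exact tokens, comments included.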

Parsing With Regular Expressions and the Like

Usually, you resort to parsing libraries and tools when regular expressions are not enough. However, there is a good library for Python that can extend the life and usefulness of regular expressions or use elements of similar complexity.

That library is Reparse, which its documentation describes this way:

Regular Expression-based parsers for extracting data from natural languages [..] This library basically just gives you a way to combine Regular Expressions together and hook them up to some callback functions in Python.

Nonetheless, Reparse is a simple tool that can be quite useful in certain scenarios. The author himself says that it is much simpler and has fewer features than PyParsing or Parboiled.

The basic idea is that you define regular expressions, the patterns that combine them, and the functions that are called when an expression or pattern is found. You must define the functions in Python, but expressions and patterns can be defined in YAML, JSON, or Python.

In this example from the documentation, expressions and patterns are defined in YAML:

Color:
    Basic Color:
        Expression: (Red|Orange|Yellow|Green|Blue|Violet|Brown|Black)
        Matches: Orange | Green
        Non-Matches: White
        Groups:
          - Color

Time:
    Basic Time:
        Expression: ([0-9]|[1][0-2]) \s? (am|pm)
        Matches: 8am | 3 pm
        Non-Matches: 8a | 8:00 am | 13pm
        Groups:
          - Hour
          - AMPM

Fields like Matches are there for humans, but can be used for testing by Reparse.
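
To show how such fields can serve as test cases, here is a small sketch that uses only the standard re module, not Reparse itself. It assumes that whitespace inside the expression is there for readability only, as in re.VERBOSE mode, and that a sample has to match the expression as a whole. The is_match helper is a hypothetical name.

```python
import re

# Expression and sample fields copied from the "Basic Time" entry above.
expression = r"([0-9]|[1][0-2]) \s? (am|pm)"
matches = ["8am", "3 pm"]
non_matches = ["8a", "8:00 am", "13pm"]

# Assumption: spaces in the expression are insignificant (re.VERBOSE).
time_regex = re.compile(expression, re.VERBOSE)


def is_match(text):
    """Return True when the whole string matches the expression."""
    return time_regex.fullmatch(text) is not None


results = {text: is_match(text) for text in matches + non_matches}
print(results)
```

Using fullmatch rather than search matters here: a plain search would find "3pm" inside "13pm" and wrongly report a match.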

BasicColorTime:
  Order: 1
  Pattern: |
    <Color> \s? at \s? <Time>
  # Angle brackets denote expression groups
  # Multiple expressions in one group are combined together

An example function in Python for the pattern:

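from datetime import time

def color_time(Color=None, Time=None):
    # Color and Time are indexed as sequences of captured groups
    Color, Hour, Period = Color[0], int(Time[0]), Time[1]
    # Convert to 24-hour time; 12pm stays 12 instead of overflowing to 24
    if Period == 'pm' and Hour != 12:
        Hour += 12
    Time = time(hour=Hour)

    return Color, Time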

functions = {
   'BasicColorTime' : color_time,
}

The file that puts everything together:

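from reparse_functions import functions
import reparse

# Directory that holds the YAML files. The empty string is an assumption
# (current working directory); adjust it to your project layout.
path = ""

colortime_parser = reparse.parser(
    parser_type=reparse.basic_parser,
    expressions_yaml_path=path + "expressions.yaml",
    patterns_yaml_path=path + "patterns.yaml",
    functions=functions
)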

print(colortime_parser("~ ~ ~ go to the store ~ buy green at 11pm! ~ ~"))
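
The underlying idea (named expressions, a pattern that combines them, and a callback) can also be sketched with nothing but the standard re module. This is not how Reparse is implemented; build_pattern and parse are hypothetical helpers, and the use of re.VERBOSE and re.IGNORECASE is an assumption made so that "green" matches the Color expression.

```python
import re
from datetime import time

# Named expressions, mirroring the YAML definitions above.
expressions = {
    "Color": r"(Red|Orange|Yellow|Green|Blue|Violet|Brown|Black)",
    "Time": r"([0-9]|[1][0-2]) \s? (am|pm)",
}


def build_pattern(template):
    """Replace each <Name> placeholder with the expression of that name."""
    return re.sub(r"<(\w+)>", lambda m: expressions[m.group(1)], template)


def color_time(color, hour, period):
    """Callback: convert the captured groups into Python objects."""
    hour = int(hour) % 12 + (12 if period.lower() == "pm" else 0)
    return color.capitalize(), time(hour=hour)


# Assumption: whitespace is insignificant and matching ignores case.
pattern = re.compile(
    build_pattern(r"<Color> \s? at \s? <Time>"),
    re.VERBOSE | re.IGNORECASE,
)


def parse(text):
    """Run the callback on every occurrence of the pattern in the text."""
    return [color_time(*m.groups()) for m in pattern.finditer(text)]


print(parse("~ ~ ~ go to the store ~ buy green at 11pm! ~ ~"))
```

A sketch like this quickly becomes hard to maintain as the number of expressions grows, which is the gap a library such as Reparse aims to fill.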

Parsing Binary Data: Construct

Instead of writing imperative code to parse a piece of data, you declaratively define a data structure that describes your data. As this data structure is not code, you can use it in one direction to parse data into Pythonic objects, and in the other direction, to build objects into binary data.

And that is it: Construct. You could parse binary data even with some parser generators (e.g., ANTLR), but Construct makes it much easier. It is a sort of DSL combined with a parser combinator to parse binary formats. It gives you a set of fields to manage binary data. Apart from the obvious ones (e.g., float, string, and bytes), there are a few specialized ones that manage sequences of fields (sequence), groups of them (struct), and a few conditional statements.

It also provides functions to adapt or validate (test) the data and to debug any problems you find.

As you can see in the following example, it is quite easy to use:

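# from the documentation (written for an older Construct release;
# newer releases rename fields, e.g. ULInt16 became Int16ul)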

gif_logical_screen = Struct("logical_screen",
    ULInt16("width"),
    ULInt16("height"),
    [..]
    If(lambda ctx: ctx["flags"]["global_color_table"],
        Array(lambda ctx: 2**(ctx["flags"]["global_color_table_bpp"] + 1),
            Struct("palette",
                ULInt8("R"),
                ULInt8("G"),
                ULInt8("B")
            )
        )
    )
)

gif_header = Struct("gif_header",
    Const("signature", b"GIF"),
    Const("version", b"89a"),
)

[..]

gif_file = Struct("gif_file",
    gif_header,
    gif_logical_screen,
    [..]
)

if __name__ == "__main__":
    # read the whole file as bytes; the with block closes it automatically
    with open("../../../tests/sample.gif", "rb") as f:
        s = f.read()
    # if you want to build the file, you just have to provide the data
    # to the build() function
    print(gif_file.parse(s))

There is a good amount of documentation and even many example grammars for different kinds of formats, such as filesystems or graphics files.
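
To make the "one description, two directions" idea concrete without installing anything, here is a rough sketch that uses the standard struct module instead of Construct. It covers only the fixed 13-byte start of a GIF file (signature, version, and logical screen descriptor). The GifHeader name, the helper functions, and the sample bytes are all made up for illustration.

```python
import struct
from collections import namedtuple

# Declarative description: field names plus one struct format string.
# "<" = little-endian, "3s" = 3 raw bytes, "H" = uint16, "B" = uint8.
GifHeader = namedtuple(
    "GifHeader",
    "signature version width height flags background aspect",
)
GIF_HEADER = struct.Struct("<3s3sHHBBB")


def parse_header(data):
    """Parse the first 13 bytes of a GIF file into a GifHeader."""
    return GifHeader._make(GIF_HEADER.unpack_from(data))


def build_header(header):
    """Build the 13 header bytes back from a GifHeader."""
    return GIF_HEADER.pack(*header)


# Hand-written sample bytes (not a complete GIF file): a 3x2 image header.
sample = b"GIF89a" + bytes([3, 0, 2, 0, 0x80, 0, 0])
header = parse_header(sample)
print(header)
print(build_header(header) == sample)
```

The struct module stops being enough as soon as the layout depends on the data, as with the optional palette in the example above, and that is where the conditional and repeating fields of Construct become useful.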

Summary

Every programming language has its own community with its own peculiarities. These differences remain even when we compare the same interests across languages. For instance, when we compare parser tools, we can see that Java and Python developers live in different worlds.

The parsing tools and libraries for Python, for the most part, use very readable grammars and are simple to use. But the most interesting thing is that they cover a very wide spectrum of competence and use cases. There seems to be an uninterrupted line of tools available from regular expressions, passing through Reparse to end with TatSu and ANTLR.

Sometimes, this can be confusing if you are a parsing expert coming from a different language: few Python parser generators actually generate a parser, and most of them interpret the grammar at runtime. On the other hand, with Python, you can find the perfect library or tool for your needs. To help you with that, we hope that this comparison has been useful.
