Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Cool Python Module: html2text

DZone's Guide to

Cool Python Module: html2text

· Web Dev Zone ·
Free Resource

Learn how error monitoring with Sentry closes the gap between the product team and your customers. With Sentry, you can focus on what you do best: building and scaling software that makes your users’ lives better.

Parts of the Fiesta API (as well as some new features we’re rolling out for Fiesta itself) rely on the ability to automatically generate clean text (markdown) versions of incoming messages. Our parser tends to prefer using the text version of a message if possible, as it’s generally easier to parse than the HTML version. That said, sometimes a message only contains an HTML version - we need a way to generate our canonical markdown representation from that alone. Enter html2text.

html2text is a great little Python module by Aaron Swartz that takes HTML as input and generates a markdown version as output. Here’s an example:

>>> import html2text
>>> print html2text.html2text('''<html><body><p>Hello World</p>
<ul><li>Here's one thing</li>
<li>And here's another!</li></ul></body></html>''')
Hello World

  * Here's one thing
  * And here's another!

The output is nicely formatted markdown text, exactly what we were looking for. The only problem we’ve noticed is that the module has some trouble dealing with malformed HTML. Our approach has been to run things through BeautifulSoup first, which tends to do a great job even with crappy markup.

 

Source: http://blog.fiesta.cc/post/14575090271/cool-python-module-html2text

What’s the best way to boost the efficiency of your product team and ship with confidence? Check out this ebook to learn how Sentry's real-time error monitoring helps developers stay in their workflow to fix bugs before the user even knows there’s a problem.

Topics:

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}