
Python Html2txt



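import re

p = re.compile('(<p.*?>)|(<tr.*?>)', re.I)  # opening <p> and <tr> tags
t = re.compile('<td.*?>', re.I)             # opening <td> tags
comm = re.compile('<!--.*?-->', re.M)       # HTML comments
tags = re.compile('<.*?>', re.M)            # any remaining tag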

def html2txt(s, hint = 'entity', code = 'ISO-8859-1'):
    """Convert the html to raw txt
    - suppress all return
    - <p>, <tr> to return
    - <td> to tab

    Need the following regex:
    p = re.compile('(<p.*?>)|(<tr.*?>)', re.I)
    t = re.compile('<td.*?>', re.I)
    comm = re.compile('<!--.*?-->', re.M)
    tags = re.compile('<.*?>', re.M)

    version 0.0.1 20020930
    """
    s = s.replace('\n', '')   # remove returns; TODO: time this against split/filter/join
    s = p.sub('\n', s)        # replace p and tr by \n
    s = t.sub('\t', s)        # replace td by \t
    s = comm.sub('', s)       # remove comments
    s = tags.sub('', s)       # remove all remaining tags
    s = re.sub(' +', ' ', s)  # collapse running spaces; ' +' leaves the \n and \t alone
    # handling of entities: not implemented, hint and code are unused
    result = s
    return result
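
The entity-handling step in html2txt is only a placeholder, and the snippet does not say how the hint and code arguments were meant to drive it. One possible way to fill that step, assuming Python 3 and the standard-library html.unescape, is sketched below. The helper name decode_entities is hypothetical and is not part of the original snippet.

```python
import html


def decode_entities(text):
    """Replace HTML character references (e.g. &amp;, &#233;) with their characters.

    Hypothetical helper: the original snippet leaves entity handling empty,
    so this is only one way the step could be done.
    """
    return html.unescape(text)


if __name__ == '__main__':
    # Named and numeric references are both decoded.
    print(decode_entities('a &lt; b &amp; c'))
    print(decode_entities('caf&eacute; th&#233;'))
```

If used, a call like this would run on the text after all tags have been removed, so that a decoded "<" or ">" is not mistaken for markup by the tag-stripping regex.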
