
Building a Directory Structure Index in Python


I’m working through the examples in “Natural Language Processing with Python” (read my review) and found that my corpus is large enough to require some performance tuning.

If you have a large enough directory structure, it becomes difficult to traverse with os.walk: in a long-running script, any failure means starting the traversal from scratch. This is a common issue in larger systems, which typically manage file listings through a relational database and obfuscate the underlying directory storage in some way.
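One partial mitigation is os.walk's onerror callback: by default os.walk silently skips directories it cannot read, but a callback lets a long-running walk record failures and keep going. A minimal sketch on Python 3, where count_files_logged is my own name and not something from this article:

```python
import os

def count_files_logged(top):
    """Walk top, counting files; record unreadable directories rather
    than silently skipping them, so a long run can keep going."""
    skipped = []
    count = 0
    # onerror receives the OSError raised when a directory listing fails
    for root, dirs, files in os.walk(top, onerror=skipped.append):
        count += len(files)
    for err in skipped:
        print("skipped:", err)
    return count
```

This doesn't make the walk resumable, but it does stop one unreadable directory from aborting an hours-long traversal.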

In this environment, it takes about an hour for Windows to count the files, and Python seems to take longer.
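If raw counting is the bottleneck, one possible speed-up on Python 3.5+ is os.scandir, which can usually answer file-type queries from the directory entry itself rather than issuing a separate stat() per file. A sketch, where count_files is a hypothetical helper rather than anything from this article:

```python
import os

def count_files(path):
    """Recursively count regular files under path using os.scandir."""
    total = 0
    for entry in os.scandir(path):
        # is_dir()/is_file() typically avoid an extra stat() call,
        # answering from the directory entry's cached type
        if entry.is_dir(follow_symlinks=False):
            total += count_files(entry.path)
        elif entry.is_file(follow_symlinks=False):
            total += 1
    return total
```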

It’s worth generating a list of the files in advance: the index records which PDF and HTML documents exist, and which of them already have a text extract (see a Node.js approach here). This supports a few uses, including generating the missing text renditions.

import os
import datetime

print(datetime.datetime.now())

# One index file per document type, plus one for existing text renditions
pdf_idx = open('pdfs.idx', 'w')
rendition_idx = open('txts.idx', 'w')
html_idx = open('htmls.idx', 'w')
xml_idx = open('xmls.idx', 'w')

for root, dirs, files in os.walk('.'):
    for f in files:
        path = root + os.sep + f
        if f.endswith(".pdf"):
            # Only index a text rendition if it already exists on disk
            rend = path + ".textrendition.txt"
            try:
                with open(rend):
                    rendition_idx.write(rend + "\n")
            except IOError:
                pass
            pdf_idx.write(path + "\n")
        if f.endswith(".html") or f.endswith(".htm"):
            html_idx.write(path + "\n")
        if f.endswith(".xml"):
            xml_idx.write(path + "\n")

rendition_idx.close()
pdf_idx.close()
html_idx.close()
xml_idx.close()

print(datetime.datetime.now())

The resulting index is quite useful: you can read the file quickly and select a random subset as training and test data for NLP algorithms. The same thing could be done by storing all the names in a database, but a flat file is probably the simplest and fastest option for my current needs.

>>> rendition_idx = open('txts1.idx', 'r')
>>> files = [f[:-1] for f in rendition_idx]  # strip trailing newlines
>>> rendition_idx.close()
>>> len(files)
124559
>>> files[:4]
['./00/00/gov.uscourts.rid.6064/gov.uscourts.rid.6064.20.0.pdf.textrendition.txt',
 './00/01/gov.uscourts.cacd.547806/gov.uscourts.cacd.547806.6.0.pdf.textrendition.txt',
 './00/01/gov.uscourts.oknd.31699/gov.uscourts.oknd.31699.21.0.pdf.textrendition.txt',
 './00/01/gov.uscourts.paed.406890/gov.uscourts.paed.406890.19.0.pdf.textrendition.txt']
>>> import random
>>> random.shuffle(files)
>>> files[:4]
['./16/63/gov.uscourts.ded.48575/gov.uscourts.ded.48575.1.0.pdf.textrendition.txt',
 './09/51/gov.uscourts.casd.273674/gov.uscourts.casd.273674.1.0.pdf.textrendition.txt',
 './09/aa/gov.uscourts.hid.14739/gov.uscourts.hid.14739.73.0.pdf.textrendition.txt',
 './09/57/gov.uscourts.casd.268361/gov.uscourts.casd.268361.4.0.pdf.textrendition.txt']
>>> contents = [open(f[2:]).read() for f in files[:100]]  # f[2:] drops the leading './'
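To turn the shuffled index into fixed training and test sets, one option is a seeded split so the partition is reproducible across runs. A minimal sketch, where train_test_split is my own helper rather than part of the article:

```python
import random

def train_test_split(names, test_fraction=0.1, seed=42):
    """Shuffle a copy of names with a fixed seed, then slice off the
    first test_fraction as the test set; the rest is training data."""
    names = list(names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * test_fraction)
    return names[cut:], names[:cut]
```

A fixed seed keeps the split stable between sessions, so test documents are never accidentally reused for training on a later run.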



Published at DZone with permission of
