Python from scratch- RegEx

by Hod Benbinyamin · Mar. 23, 12

Five weeks ago I published my first post about learning Python from scratch, and since then I have also published Part 2 and Part 3.
To study Python online I am using Google's Python Class site, and I follow it step by step.
Each class takes me around 2 hours to read and understand, and another 4-5 hours to do the practice code. I do not spend these 6-7 hours in a row but usually spread them over one week. This is not because I am lazy, but because I study at night instead of sleeping, when my boys are actually asleep and not ill.

 
Last night I learned Regular Expressions (RegEx) and boy! It was interesting to see how you can pull a piece of text out of a large text file.
For those of you who don't know me (or haven't read my profile yet), I work at a security company as a QA manager. Our product needs to detect text files leaked out of the perimeter of an organization. The detection code uses RegEx in Python, which is a good reason for me to improve my RegEx skills and, of course, my Python.
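
As a small illustration of pulling text out of a larger text, here is a minimal sketch. The sample string and the helper name find_years are made up for this example; they are not taken from the class or from any real detection code.

```python
import re

# Hypothetical sample text; a long string read from a file would work the same way.
SAMPLE_TEXT = "Report for 2006. Previous reports: 2004 and 1998. Contact: QA team."


def find_years(text):
    """Return every standalone 4-digit number found in text, in order of appearance."""
    # \b is a word boundary, \d{4} is exactly four digits.
    return re.findall(r'\b\d{4}\b', text)


years = find_years(SAMPLE_TEXT)
print(years)
```

re.findall returns a list with every match, which is handy when you do not know in advance how many times the pattern appears.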

 
The class itself is confusing. I mean, you know what RegEx is and you certainly understand why someone would use it, but it is damn hard to remember the patterns (and so far they are only the basics), and I also found some conventions that seem illogical to me:
-           For some reason \w is a word character and \s is a whitespace character, while in my mind \s should stand for string and \w for whitespace. Bizarre.
-           I would expect repetition to act like a wildcard, but '+' is not a wildcard: it only repeats whatever comes before it one or more times.
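
To make these basics easier to remember, here is a short sketch of what \w, \s and + do. The sample string is made up for illustration. The closest thing to a wildcard is the dot, which matches any single character except a newline, while + only says "one or more of the item before me".

```python
import re

text = "baby names 2006"

# \w matches a single "word" character: a letter, a digit or an underscore.
words = re.findall(r'\w+', text)

# \s matches a single whitespace character (space, tab, newline, ...).
spaces = re.findall(r'\s', text)

# '+' repeats the item before it one or more times; here the item is \d (a digit).
digits = re.findall(r'\d+', text)

# '.' is the wildcard-like character; '.+' means "one or more of any character".
after_baby = re.search(r'baby.+', text).group()

print(words)
print(spaces)
print(digits)
print(after_baby)
```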

 
Yet, this tool is very powerful, and I envy people who master it, like our research guys who write this RegEx detection code.

 
When I felt ready I moved on to the exercise (linked here), which is long to read, but once you see an example of the required output it becomes easier to understand.

 
So here is my code. I am proud of it, and although it looks short it did not take a short time at all, and I had some tough moments along the way:
import re
import sys


def find_year(filename):
  # Return a list with every year found in a '>Popularity in YYYY<' heading
  file1 = open(filename, 'rU')
  match = re.findall(r'>Popularity in (\d+)<', file1.read())
  file1.close()
  return match


def find_name_and_rank(filename):
  # Return a list of (rank, boy name, girl name) tuples, one per table row
  file2 = open(filename, 'rU')
  # (\d+) captures the whole rank, which may have several digits
  temp_tuple_name_and_rank = re.findall(
      r'<tr align="right"><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>',
      file2.read())
  file2.close()
  return temp_tuple_name_and_rank


def convert_tuple_to_unisex_list(tuple_name_and_rank):
  dict_men = {}
  dict_women = {}
  # Convert the tuples into two dictionaries mapping name -> rank
  for item_tuple in tuple_name_and_rank:
    dict_men[item_tuple[1]] = item_tuple[0]
  for item_tuple in tuple_name_and_rank:
    dict_women[item_tuple[2]] = item_tuple[0]
  # Convert the dictionaries into lists of (name, rank) pairs
  list_men = list(dict_men.items())
  list_women = list(dict_women.items())
  # Merge both lists into one list of men and women, sorted by name
  unisex_list = list_men
  unisex_list.extend(list_women)
  unisex_list.sort()
  return unisex_list


def extract_names(filename):
  """
  Given a file name for baby.html, returns a list of (name, rank) tuples
  sorted alphabetically by name, e.g. [('Aaliyah', '91'), ('Aaron', '57'), ...].
  The year is extracted separately by find_year().
  """
  # Look for each name and its rank and collect them as tuples
  tuple_name_and_rank = find_name_and_rank(filename)

  # Convert the tuples into one sorted list
  list_sorted = convert_tuple_to_unisex_list(tuple_name_and_rank)

  return list_sorted


def main():
  args = sys.argv[1:]
  if not args:
    print 'usage: [--summaryfile] file [file ...]'
    sys.exit(1)
  summary = False
  if args[0] == '--summaryfile':
    summary = True
    del args[0]
  # For the first filename, get the names, then either print the text output
  # (summary is False) or append it to result.txt (summary is True)
  if not summary:
    the_year_is = find_year(args[0])
    print '\n', the_year_is,
    list_sorted = extract_names(args[0])
    print list_sorted
  else:
    list_sorted = extract_names(args[0])
    file_output = open('result.txt', 'a')
    print >> file_output, find_year(args[0])
    print >> file_output, list_sorted
    file_output.close()


if __name__ == '__main__':
  main()
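
One detail in patterns like these is easy to get wrong: square brackets define a character class that matches exactly one character, so a + placed inside the brackets is just a literal plus sign, not a repetition. The sketch below compares the two forms on a made-up table row (the row is an assumption modeled on the pattern in find_name_and_rank, not copied from baby.html).

```python
import re

# Hypothetical row shaped like the rows that find_name_and_rank looks for.
row = '<tr align="right"><td>57</td><td>Aaron</td><td>Mia</td>'

# [\d+] is a character class: one character that is a digit or a literal '+'.
single_char = re.findall(r'<td>([\d+])</td>', row)

# \d+ is a digit repeated one or more times, so it captures the whole rank.
whole_rank = re.findall(r'<td>(\d+)</td>', row)

print(single_char)
print(whole_rank)
```

With a two-digit rank the character-class version finds nothing, while the repeated-digit version returns the full rank.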

I have also attached Google's solution, which is different from mine. It is OK to be different, but what I did take from their solution is how to handle 'invalid data' or 'unexpected data', unlike my assumption that the data is always valid and fits:

import re
import sys

def extract_names(filename):
  """
  Given a file name for baby.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
  """
  # +++your code here+++
  # LAB(begin solution)
  # The list [year, name_and_rank, name_and_rank, ...] we'll eventually return.
  names = []
  # Open and read the file.
  f = open(filename, 'rU')
  text = f.read()
  # Could process the file line-by-line, but regex on the whole text
  # at once is even easier.
  # Get the year.
  year_match = re.search(r'Popularity\sin\s(\d\d\d\d)', text)
  if not year_match:
    # We didn't find a year, so we'll exit with an error message.
    sys.stderr.write('Couldn\'t find the year!\n')
    sys.exit(1)
  year = year_match.group(1)
  names.append(year)
  # Extract all the data tuples with a findall()
  # each tuple is: (rank, boy-name, girl-name)
  tuples = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
  #print tuples
  # Store data into a dict using each name as a key and that
  # name's rank number as the value.
  # (if the name is already in there, don't add it, since
  # this new rank will be bigger than the previous rank).
  names_to_rank = {}
  for rank_tuple in tuples:
    (rank, boyname, girlname) = rank_tuple # unpack the tuple into 3 vars
    if boyname not in names_to_rank:
      names_to_rank[boyname] = rank
    if girlname not in names_to_rank:
      names_to_rank[girlname] = rank
  # You can also write:
  # for rank, boyname, girlname in tuples:
  # ...
  # To unpack the tuples inside a for-loop.
  # Get the names, sorted in the right order
  sorted_names = sorted(names_to_rank.keys())
  # Build up result list, one element per line
  for name in sorted_names:
    names.append(name + " " + names_to_rank[name])
  return names
  # LAB(replace solution)
  # return
  # LAB(end solution)
def main():
  # This command-line parsing code is provided.
  # Make a list of command line arguments, omitting the [0] element
  # which is the script itself.
  args = sys.argv[1:]
  if not args:
    print 'usage: [--summaryfile] file [file ...]'
    sys.exit(1)
  # Notice the summary flag and remove it from args if it is present.
  summary = False
  if args[0] == '--summaryfile':
    summary = True
    del args[0]
  # +++your code here+++
  # For each filename, get the names, then either print the text output
  # or write it to a summary file
  # LAB(begin solution)
  for filename in args:
    names = extract_names(filename)
    # Make text out of the whole list
    text = '\n'.join(names)
    if summary:
      outf = open(filename + '.summary', 'w')
      outf.write(text + '\n')
      outf.close()
    else:
      print text
  # LAB(end solution)
if __name__ == '__main__':
  main()
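
The two ideas in this solution that deal with unexpected data, checking that a match was found before using it and keeping the first rank seen when a name appears twice, can be tried on a tiny inline sample. This is only a sketch: the HTML snippet and the function name parse_names are made up for illustration, and the function returns None instead of exiting so that it is easy to experiment with.

```python
import re

# Hypothetical snippet shaped like baby.html; it is not the real file.
SAMPLE_HTML = """
<h3 align="center">Popularity in 2006</h3>
<tr align="right"><td>1</td><td>Jacob</td><td>Emily</td>
<tr align="right"><td>2</td><td>Michael</td><td>Emma</td>
<tr align="right"><td>3</td><td>Emma</td><td>Madison</td>
"""


def parse_names(text):
    """Return [year, 'name rank', ...] sorted by name, or None if no year is found."""
    year_match = re.search(r'Popularity\sin\s(\d\d\d\d)', text)
    if not year_match:
        # re.search returns None when nothing matches, so check before calling group().
        return None
    names_to_rank = {}
    for rank, boyname, girlname in re.findall(
            r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text):
        for name in (boyname, girlname):
            # Assumes rows are in rank order, so the first rank seen is the best one.
            if name not in names_to_rank:
                names_to_rank[name] = rank
    return [year_match.group(1)] + [
        name + ' ' + names_to_rank[name] for name in sorted(names_to_rank)]


print(parse_names(SAMPLE_HTML))
print(parse_names('no year here'))
```

In the sample, Emma appears at rank 2 and again at rank 3, and only the rank 2 entry is kept.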

Published at DZone with permission of Hod Benbinyamin. See the original article here.
