Topic Modeling in Python and R: The Enron Email Corpus, Part 2
After posting my analysis of the Enron email corpus, I realized that the regex patterns I set up to capture and filter out the cautionary/privacy messages at the bottoms of people's emails were not working. Let’s have a look at my revised Python code for processing the corpus:
from os import listdir, chdir
import re

docs = []

# Here's the section where I try to filter useless stuff out.
# Notice near the end all of the regex patterns where I've called
# "re.DOTALL". This is pretty key here. What it means is that the
# .+ I have referenced within the regex pattern should be able to
# pick up alphanumeric characters, in addition to newline characters
# (\n). Since I did not have this in the first version, the cautionary/
# privacy messages people were pasting at the ends of their emails
# were not getting filtered out and were being entered into the
# LDA analysis, putting noise in the topics that were modelled.
email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
dash_pat = re.compile("--+.+--+", re.DOTALL)
star_pat = re.compile(r'\*\*+.+\*\*+', re.DOTALL)
uscore_pat = re.compile(" __+.+__+", re.DOTALL)
equals_pat = re.compile("==+.+==+", re.DOTALL)

# (the below is the same note as before)
# The enron emails are in 151 directories representing each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each
# email text file into one long nested list.
# I've used readlines() to read in the emails because read()
# didn't seem to work with these email files.
chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name, d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])

# (the below is the same note as before)
# Here I go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list. This is probably inefficient, but oh well - python
# is pretty fast anyway!
docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = received_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = dash_pat.sub('', email_new)
        email_new = star_pat.sub('', email_new)
        email_new = uscore_pat.sub('', email_new)
        email_new = equals_pat.sub('', email_new)
        docs_final.append(email_new)

# (the below is the same note as before)
# Here I proceed to dump each and every email into about 126 thousand separate
# txt files in a newly created 'data' directory. This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.
for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n, 'w')
    outfile.write(doc)
    outfile.close()
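To make the re.DOTALL point concrete, here is a small sketch that runs the same dash pattern with and without the flag. The sample email and its disclaimer text are invented for illustration; only the pattern itself comes from the code above.

```python
import re

# Hypothetical email body; the disclaimer wording is made up for this demo.
email = (
    "See you at the meeting tomorrow.\n"
    "----------\n"
    "This message is confidential\n"
    "and intended only for the addressee.\n"
    "----------\n"
)

# Same pattern as dash_pat, compiled once without and once with re.DOTALL.
dash_no_dotall = re.compile("--+.+--+")
dash_dotall = re.compile("--+.+--+", re.DOTALL)

# Without the flag, "." stops at each newline, so the disclaimer text survives.
without_flag = dash_no_dotall.sub("", email)

# With the flag, ".+" runs across newlines and the whole block is removed.
with_flag = dash_dotall.sub("", email)

print(repr(without_flag))
print(repr(with_flag))
```

One thing to keep in mind: .+ is greedy, so with re.DOTALL the match runs from the first run of dashes to the last one in the email. If an email uses dashed lines for something other than a disclaimer, everything between the first and last of them goes too.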
As I did not change the R code since the last post, let’s have a look at the results:
terms(lda.model,20)
      Topic 1   Topic 2   Topic 3     Topic 4
 [1,] "enron"   "time"    "pleas"     "deal"
 [2,] "busi"    "thank"   "thank"     "gas"
 [3,] "manag"   "day"     "attach"    "price"
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"
 [6,] "compani" "week"    "agreement" "market"
 [7,] "vinc"    "look"    "fax"       "chang"
 [8,] "report"  "talk"    "call"      "rate"
 [9,] "time"    "hope"    "copi"      "trade"
[10,] "energi"  "ill"     "file"      "day"
[11,] "inform"  "tri"     "messag"    "month"
[12,] "pleas"   "bit"     "inform"    "compani"
[13,] "trade"   "guy"     "phone"     "energi"
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product"
[16,] "regard"  "weekend" "kay"       "term"
[17,] "team"    "love"    "review"    "custom"
[18,] "plan"    "item"    "receiv"    "cost"
[19,] "servic"  "email"   "question"  "thank"
[20,] "offic"   "peopl"   "draft"     "purchas"
One at a time, I will try to interpret what each topic is trying to describe:
- Topic 1: This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
- Topic 2: Similar to the last model that I derived, this topic has a lot of time-related words in it, such as: time, day, week, night, friday, weekend. I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
- Topic 3: Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question. It even has please and thank you! I suppose it’s very formal and you could perhaps interpret this as professional-sounding administrative emails.
- Topic 4: This topic seems to be another case of emails containing a lot of ‘shop talk’: deal, gas, price, contract, power, market.
Okay, let’s see if we can find some examples for each topic:
sample(which(df.emails.topics$"1" > .95),3)
27771 45197 27597

enron[]
Christi's call. Christi has asked me to schedule the above meeting/conference call. September 11th (p.m.) seems to be the best date. Question: Does this meeting need to be a 1/2 day meeting? Christi and I were wondering. Give us your thoughts.
Yup, business process, meeting. This email fits the bill! Next!
enron[] Bob, I didn't check voice mail until this morning (I don't have a blinking light. The assistants pick up our lines and amtel us when voice mails have been left.) Anyway, with the uncertainty of the future business under the Texas Desk, the following are my goals for the next six months: 1) Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas business. 2) Develop operations processes and controls for the new Texas Desk. 3) Develop a replacement a. Strong push to improve Liz (if she remains with Enron and ) b. Hire new person, internally or externally 4) Assist in develop a strong logisitcs team. With the new business, we will need strong performers who know and accept their responsibilites. 1 and 2 are open-ended. How I accomplish these goals and what they entail will depend how the Texas Desk (if we have one) is set up and what type of activity the desk will be invovled in, which is unknown to me at this time. I'm sure as we get further into the finalization of the sale, additional and possibly more urgent goals will develop. So, in short, who knows what I need to do. D
This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:
sample(which(df.emails.topics$"2" > .95),3)
50356 22651 19259

enron[]
I agree it is Matt, and I believe he has reviewed this tax stuff (or at least other turbine K's) before. His concern will be us getting some amount of advance notice before title transfer (ie, delivery). Obviously, he might have some other comments as well. I'm happy to send him the latest, or maybe he can access the site? Kay

Given that the present form of GE world hunger seems to be more domestic than international it would appear that Matt Gockerman would be a good choice for the Enron- GE tax discussion. Do you want to contact him or do you want me to. I would be interested in listening in on the conversation for continuity.
Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!
enron[] LOVE HONEY PIE
Well, that was pretty social, wasn’t it? Okay, one more from the same topic:
enron[] Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items X-Origin: GIRON-D X-FileName: darron giron 6-26-02.PST Sorry. I've got a UBS meeting all day. Catch you later. I was looking forward to the conversation. DG It seems everyone agreed to Ninfa's. Let's meet at 11:45; let me know if a different time is better. Ninfa's is located in the tunnel under the JP Morgan Chase Tower at 600 Travis. See you there. Schroeder
Whoops, header info that I didn’t manage to filter out. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!
sample(which(df.emails.topics$"3" > .95),3)
24147 51673 29717

enron[]
Kaye: Can you please email the prior report to me? Thanks. Sara Shackleton Enron North America Corp. 1400 Smith Street, EB 3801a Houston, Texas 77002 713-853-5620 (phone) 713-646-3490 (fax)

04/10/2001 05:56 PM At Alan's request, please provide to me by e-mail (with a Thursday of this week your suggested changes to the March 2001 Monthly Report, so that we can issue the April 2001 Monthly Report by the end of this week. Thanks for your attention to this matter. Nita
This one definitely fits in with the professional-sounding administrative emails interpretation: emailing reports and such. Next!
I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading. Thanks FYI...another executed capacity transaction on EOL for Transwestern. This message is to confirm your EOL transaction with Transwestern Pipeline Company. You have successfully acquired the package(s) listed below. If you have questions or concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932 prior to placing your nominations for these volumes. Product No.: 39096 Time Stamp: 3/27/01 09:03:47 am Product Name: US PLCapTW Frm CenPool-OasisBlock16 Shipper Name: E Prime, Inc. Volume: 10,000 Dth/d Rate: $0.0500 /dth 1-part rate (combined Res + Com) 100% Load Factor + applicable fuel and unaccounted for TW K#: 27548 Effective Points: RP- (POI# 58649) Central Pool 10,000 Dth/d DP- (POI# 8516) Oasis Block 16 10,000 Dth/d Alternate Point(s): NONE Note: In order to place a nomination with this agreement, you must log off the TW system and then log back on. This action will update the agreement's information on your PC and allow you to place nominations under the agreement number shown above. Contact Info: Michelle Lokay Phone (713) 345-7932 Fax (713) 646-8000
Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:
sample(which(df.emails.topics$"4" > .95),3)
39100 31681 6427

enron[]
Randy, your proposal is fine by me. Jim
Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!
enron[] Attached is the latest version of the Wildhorse Entrada Letter. Please review. I reviewed the letter with Jim Osborne and Ken Krisa yesterday and should get their comments today. My plan is to Fedex to Midland for Ken's signature tomorrow morning and from there it will got to Wildhorse.
This one makes me feel a little better, referencing a specific business letter that the emailer probably wants the emailed person to see. Let’s find one more for good luck:
enron[] At a ratio of 10:1, you should have your 4th one signed and have the fifth one on the way... 09/19/2000 05:40 PM ONLY 450! Why, I thought you guys hit 450 a long time ago. Marie Heard Senior Legal Specialist Enron Broadband Services Phone: (713) 853-3907 Fax: (713) 646-8537 09/19/00 05:34 PM Well, I do believe this makes 450! A nice round number if I do say so myself! Susan Bailey 09/19/2000 05:30 PM We have received an executed Master Agreement: Type of Contract: ISDA Master Agreement (Multicurrency-Cross Border) Effective Enron Entity: Enron North America Corp. Counterparty: Arizona Public Service Company Transactions Covered: Approved for all products with the exception of: Weather Foreign Exchange Pulp & Paper Special Note: The Counterparty has three (3) Local Business Days after the receipt of a Confirmation from ENA to accept or dispute the Confirmation. Also, ENA is the Calculation Agent unless it should become a Defaulting Party, in which case the Counterparty shall be the Calculation Agent. Susan S. Bailey Enron North America Corp. 1400 Smith Street, Suite 3806A Houston, Texas 77002 Phone: (713) 853-4737 Fax: (713) 646-3490
That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).
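The sampling call used for each topic above, sample(which(df.emails.topics$"k" > .95),3), picks three emails whose estimated probability for topic k is above 0.95. For anyone who would rather do this step in Python, a rough equivalent might look like the sketch below. The function name sample_topic_docs and the toy doc_topics values are made up for illustration; the real probabilities in this post live in the R data frame df.emails.topics. Note also that Python indices start at 0 while R's start at 1.

```python
import random

def sample_topic_docs(doc_topics, topic, threshold=0.95, k=3, seed=None):
    """Pick up to k document indices whose probability for `topic` exceeds `threshold`."""
    # Equivalent of which(...): indices of documents dominated by this topic.
    candidates = [i for i, probs in enumerate(doc_topics) if probs[topic] > threshold]
    # Equivalent of sample(..., k), capped so a small candidate pool does not raise.
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))

# Toy per-document topic probabilities (hypothetical values, one row per email).
doc_topics = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.96, 0.02, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.98, 0.01, 0.005, 0.005],
]

picked = sample_topic_docs(doc_topics, topic=0, k=3, seed=42)
print(picked)  # three indices drawn from {0, 3, 5}, in random order
```

The 0.95 cutoff keeps only emails that the model assigns almost entirely to one topic, which is why the examples tend to be fairly clean illustrations of each topic.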
All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of people's emails was a big boon to the LDA analysis here.
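That said, the Topic 2 example with the leftover Mime-Version and X-Folder lines shows that some header noise still gets through. One visible reason is case: mimver_pat looks for "MIME-Version:", while that email has "Mime-Version:". A possible follow-up, sketched below and not part of the code I actually ran, is a single case-insensitive, line-anchored pattern that also catches the X- headers. The name header_pat and the list of header names are my own assumptions, and I have not run this against the full corpus.

```python
import re

# Assumption: any line that starts with one of these header names (in any case),
# or with "X-", is mail-header residue and can be dropped whole.
header_pat = re.compile(
    r"^(?:mime-version|content-type|content-transfer-encoding|x-[\w-]*):.*\n?",
    re.IGNORECASE | re.MULTILINE,
)

# Header lines modelled on the leaked example, followed by the actual message.
raw = (
    "Mime-Version: 1.0\n"
    "Content-Transfer-Encoding: 7bit\n"
    "X-Folder: \\ExMerge - Giron, Darron C.\\Sent Items\n"
    "X-Origin: GIRON-D\n"
    "X-FileName: darron giron 6-26-02.PST\n"
    "Sorry. I've got a UBS meeting all day. Catch you later.\n"
)

# re.MULTILINE lets ^ match at the start of every line, not just the string.
cleaned = header_pat.sub("", raw)
print(repr(cleaned))
```

Because this pattern works line by line, it does not need re.DOTALL; the flag only matters for the multi-line disclaimer blocks.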
Let that be a lesson: half the battle in LDA is in filtering out the noise!
Published at DZone with permission of Matthew Dubins, DZone MVB. See the original article here.