Introduction

The Enron Corpus is one of the largest datasets of emails available to the public. Emails are normally personal and private, and shouldn’t be made public. However, the Federal Energy Regulatory Commission acquired these emails during its investigation of the company in 2002 and placed the email corpus in the public domain once the investigation was over. The Enron Corpus contains over 500,000 emails generated by over 150 employees. Its size makes it ideal for learning how to mine unstructured (text) data.

In this article we will explore mining unstructured data by tidying the data into a data frame, analyzing the data using sentiment analysis, and grouping the emails with an unsupervised learning method known as Latent Dirichlet Allocation (LDA).

The Enron Corpus can be found on the website of Carnegie Mellon University’s School of Computer Science. Carnegie Mellon provides the corpus in its raw format: each email is given its own text file and is stored in a folder representing the original user’s email folders. Luckily, William Cukierski was kind enough to provide the corpus as a CSV on Kaggle’s website. This article will use the CSV.

Finally, all analysis will be done on my local machine, which maxes out at about 2GB of RAM. My machine will not be able to compute an LDA model for a dataset of 500,000+ observations, so I will gradually shrink the dataset down to a manageable size. Each reduction should make sense on its own terms rather than simply removing random observations.

Tidying the Data

First we will need to load the tidyverse. It is a wrapper for a number of useful packages including dplyr, ggplot2, and readr.

library(tidyverse)

We will now load the data into R by using the readr function read_csv.

enron_emails <- read_csv('emails.csv')

The CSV provides only two variables: file and message. The file variable contains the original directory and filename of each email. The top-level directory of this path identifies the employee to whom the email belongs.

enron_emails$file %>%
  sample(5)
## [1] "beck-s/_sent_mail/52."       "mann-k/_sent_mail/2018."    
## [3] "mann-k/_sent_mail/1661."     "mann-k/_sent_mail/2047."    
## [5] "schoolcraft-d/_sent_mail/7."
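Because the mailbox owner is the first component of each path, it can be pulled out with a regular expression. The snippet below is a minimal sketch on three of the sampled paths; the names owner and folder are my own illustration and are not used elsewhere in this article.

```r
library(stringr)

# Three file paths taken from the sample output above
files <- c("beck-s/_sent_mail/52.",
           "mann-k/_sent_mail/2018.",
           "schoolcraft-d/_sent_mail/7.")

# Everything before the first slash is the employee's mailbox
owner <- str_extract(files, '^[^/]+')

# The text between the first and second slash is the top-level folder
folder <- str_extract(files, '(?<=/)[^/]+(?=/)')

data.frame(file = files, owner = owner, folder = folder)
```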

The message variable contains the email text. This includes email metadata such as the Message-ID, Date, To, and Subject fields, as well as the body of the email.

enron_emails$message %>%
  sample(1)
## [1] "Message-ID: <153208.1075846034300.JavaMail.evans@thyme>\nDate: Wed, 23 May 2001 08:35:00 -0700 (PDT)\nFrom: kay.mann@enron.com\nTo: greg.krause@enron.com\nSubject: RE: AEW's backup\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Kay Mann\nX-To: Greg Krause\nX-cc: \nX-bcc: \nX-Folder: \\Kay_Mann_June2001_4\\Notes Folders\\'sent mail\nX-Origin: MANN-K\nX-FileName: kmann.nsf\n\nCould you email me the option agreement?\n\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 02:19 PM\nTo: Kay Mann/Corp/Enron@Enron\ncc:  \n\nSubject: RE: AEW's backup\n\nI think the invoice was for between $20,000 and $30,000, but I can't \nremember.  AEW has the invoice.  This probably will not be the final as we \nwill need to work with them in discussions with DERM on delaying the landfill \nclosure and in moving jurisdiction of the project from CZAB to the County \nCommissioners.   I think Shutts & Bowen (who may get stiffed by their client \nif this deal blows up) and Certosa Holdings would be open to any suggestions \nand we need to renegotiate the option anyway.  Shall I call and suggest this \nthis flat fee for retooling the option agreement to them or shall we do it \ntogether?  I,m not sure I could explain to them acequately the idiosyncrasies \nof our accounting requirements.\n\n -----Original Message-----\nFrom:  Mann, Kay  \nSent: Wednesday, May 23, 2001 2:00 PM\nTo: Krause, Greg\nSubject: RE: AEW's backup\n\nHow much is it and should this be the final amount?  One thought I have is \nthat maybe we can retool the option agreement so that we pay them a flat fee, \nwhich is enough to cover the expenditures.  Don't know if this works, but it \nis one thought.  
What do you think?\n\nKay\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 01:50 PM\nTo: Kay Mann/Corp/Enron@Enron\ncc:  \n\nSubject: RE: AEW's backup\n\nOne more thing on the SDEC project:  According to the option agreement we \nexecuted last October, we agreed to reimburse Certosa Holdings for actual \nthird party costs that they incurred in support of our necessary \napplications, submittals and in seeking local approval.  Several weeks ago, \nwe recieved an invoice from Shutts & Bowen, attorney for Certosa Holdings \nrequesting reimbursement pursuant to the contract.  I forwarded this invoice \non th Ann Elizabeth not necessarily to pay for but to review considering this \nwhole soft cost hard cost discussion.  I recieved another call this morning \nfrom Shutts & Bowen asking about the invoice.  What should I do? \n -----Original Message-----\nFrom:  Mann, Kay  \nSent: Wednesday, May 23, 2001 11:03 AM\nTo: Krause, Greg\nSubject: RE: AEW's backup\n\nGreg,\n\nYou can call me on whatever you have, including Midway, SDEC and Medley \nDunn.  If I have a problem getting to something, I'll find help.\n\nKay\n\n\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 10:50 AM\nTo: Kay Mann/Corp/Enron@Enron\ncc:  \n\nSubject: RE: AEW's backup\n\n\nKay,\n\nAnn Elizabeth did not provide a designated hitter for the South Dade Energy \nCenter (Dade Development Company LLC is Optionee, Certosa Holdings is \nOptionor) nor did she provide one for tne Medley Dunn project.  I have been \ntold that the Dunns are considering backing off their ultimatums that they \ngave Ann Elizabeth and I regarding taxes to the town and assumption of \nenviromental liability.  Who do I talk to about the Dunn contract while Ann \nElizabeth is out?  
\n -----Original Message-----\nFrom:  White, Ann Elizabeth  \nSent: Tuesday, May 22, 2001 10:33 PM\nTo: gkrause@enron.com; Krimsky, Steven; Ben Jacoby/HOU/ECT@ENRON; Carnahan, \nKathleen\nCc: Milligan, Taffy\nSubject: AEW's backup\n\nKay Mann is the designated hitter for the Pompano and Deerfied projects while \nI'm on vacation.  I've given her a down load of the status of Greg and \nSteve's projects.  Chris Boehler at A&K will be the designated hitter for \nMidway.  I'm not going to check my voice mail while I'm gone but, if \nnecessary, here are the contact numbers while I'm gone.\n\nWalter and Marlena Schilling  011-49-8218-89351 schilling.jun@freenet.de\n\nMonika and Bernhard Steinacher 011-49-8232-8932 m.steinacher@schwabmuenchen.de\n\nIf you call, Walter and Bernhard and Bernhard's daughter, Susanne, speak very \ngood English.  Monika's isn't bad.  Marlena may get flustered and hang up on \nyou.\n\nBest of luck at Deerfield and hope to see Pompano on track when I get back in \nthe office on June 11th.  Kay is planning on going to Florida on June 12 for \nthe moratorium hearing and the rezoning hearing.\n\n\n\n\n\n\n\n"

To tackle this large dataset, we should first create a smaller, more manageable subset. After looking at the different folders, I decided that a decent subset would be the emails found in the users’ _sent_mail folders. This seems like a good choice because it should only contain emails written by Enron employees. The stringr package provides str_detect(), a function that checks whether a pattern can be found within a string. We can combine str_detect() with dplyr’s filter() to keep only the observations that have /_sent_mail/ in their file path.

library(stringr)

enron_emails <- enron_emails  %>%
  filter(str_detect(file, '/_sent_mail/'))

There are now only 30,237 observations to work with.
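To see what the filter pattern does and does not catch, here is a small sketch on made-up paths. The folder names other than _sent_mail are hypothetical examples chosen for illustration. Note that str_detect() treats its pattern as a regular expression by default; wrapping it in fixed() asks for a literal match instead.

```r
library(stringr)

# Hypothetical paths covering several folder names
paths <- c("mann-k/_sent_mail/1.", "mann-k/sent/2.",
           "mann-k/sent_items/3.", "mann-k/inbox/4.")

# The pattern is interpreted as a regular expression by default
is_sent_regex <- str_detect(paths, '/_sent_mail/')

# fixed() matches the literal text and skips regex interpretation
is_sent_fixed <- str_detect(paths, fixed('/_sent_mail/'))

is_sent_regex
```

Only the first path matches, so folders with similar names are left out of the subset.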

Next we need to split the message variable into new header and body variables. Luckily, the two are neatly separated by two consecutive newlines: the header comes before them and the body comes after. We should also clean the body a bit by replacing all remaining newlines and tabs with spaces and by replacing all referenced emails with a FORWARDED_MESSAGE token. A referenced email is an email that was forwarded or copied from a previous email. Finally, we will replace all email addresses with an EMAIL_ADDRESS token so that individual addresses don’t have a strong influence on the analysis.

enron_emails <- enron_emails %>%
  # Strip carriage returns so that line endings are consistent
  mutate(message = message %>% str_replace_all('\r', '')) %>%
  # The header is everything before the first blank line
  mutate(header = str_sub(message, end = str_locate(message, '\n\n')[,1] -1)) %>%
  # The body is everything after it, flattened onto a single line
  mutate(body = str_sub(message, start = str_locate(message, '\n\n')[,2] + 1) %>% 
           str_replace_all('\n|\t', ' ') %>% 
           # Collapse referenced (quoted or forwarded) emails into one token
           str_replace_all('---Original Message .*', 'FORWARDED_MESSAGE') %>% 
           str_replace_all('--- Forwarded by .*', 'FORWARDED_MESSAGE') %>%
           str_replace_all('From: .*', 'FORWARDED_MESSAGE') %>%
           str_replace('To:.*', 'FORWARDED_MESSAGE') %>%
           # Mask every email address
           str_replace_all('\\S*@\\S*', 'EMAIL_ADDRESS')) 
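The header and body split can be checked on a single message. The following is a sketch on a shortened, made-up message laid out like the ones in the corpus; the variable names msg and gap are assumptions for illustration only.

```r
library(stringr)

# A shortened, made-up message in the same layout as the corpus
msg <- paste0(
  "Message-ID: <1.example@thyme>\n",
  "From: kay.mann@enron.com\n",
  "To: greg.krause@enron.com\n",
  "Subject: RE: AEW's backup\n",
  "\n",
  "Could you email the agreement to greg.krause@enron.com?\n",
  "\n",
  "Kay"
)

# Position of the first blank line (two consecutive newlines)
gap <- str_locate(msg, '\n\n')

header <- str_sub(msg, end = gap[, 1] - 1)
body   <- str_sub(msg, start = gap[, 2] + 1)

# Flatten newlines and tabs to spaces, then mask addresses
body <- str_replace_all(body, '\n|\t', ' ')
body <- str_replace_all(body, '\\S*@\\S*', 'EMAIL_ADDRESS')

body
```

One thing the sketch makes visible: the address pattern matches any run of non-space characters around an @, so punctuation attached to an address (the question mark here) is swallowed by the token.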

Now we will split the header into its separate metadata fields.

enron_emails <- enron_emails %>%
  mutate(date = str_extract(header, 'Date:.*') %>% 
           str_replace('Date: ', '') %>% 
           str_replace('.+, ', '') %>% 
           strptime(format = '%d %b %Y %H:%M:%S %z') %>%
           as.POSIXct()) %>%
  mutate(from = str_extract(header, 'From:.*') %>% 
           str_replace('From: ', '')) %>%
  mutate(to = header %>% str_replace_all('\n|\t', ' ') %>%
           str_extract('To:.*Subject:') %>%
           str_replace_all('To: |Subject:', '')) %>%
  mutate(subject = str_extract(header, 'Subject:.*') %>% 
           str_replace('Subject: ', '')) %>%
  mutate(xfrom = str_extract(header, 'X-From:.*') %>% 
           str_replace('X-From: ', '')) %>%
  mutate(xto = str_extract(header, 'X-To:.*') %>% 
           str_replace('X-To: ', '')) %>%
  mutate(xcc = str_extract(header, 'X-cc:.*') %>% 
           str_replace('X-cc: ', '')) %>%
  mutate(xbcc = str_extract(header, 'X-bcc:.*') %>% 
           str_replace('X-bcc: ', '')) %>%
  arrange(date)
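The date parsing is the least obvious step, so here it is in isolation on the Date line from the sample email shown earlier. This is a sketch under two assumptions of mine: I pass tz = 'UTC' so the result does not depend on the machine's time zone (the code above uses the local time zone), and the session locale is assumed to use English month abbreviations.

```r
library(stringr)

# Date line copied from the sample header shown earlier
date_line <- "Date: Wed, 23 May 2001 08:35:00 -0700 (PDT)"

date_text <- str_replace(date_line, 'Date: ', '')  # drop the field name
date_text <- str_replace(date_text, '.+, ', '')    # drop the weekday prefix

# %z reads the -0700 offset; trailing text such as "(PDT)" is ignored.
# %b assumes English month abbreviations in the current locale.
parsed <- as.POSIXct(strptime(date_text,
                              format = '%d %b %Y %H:%M:%S %z',
                              tz = 'UTC'))

format(parsed, '%Y-%m-%d %H:%M:%S', tz = 'UTC')
```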

Sentiments

In order to analyse the sentiment of each email, we will need to split the body of each email into its individual words. We will use the AFINN sentiment lexicon to score each word with a positive or negative number depending on that word’s sentiment.

library(tidytext)
library(tm)

To do this, we will create a matrix that records the number of times each word shows up in each email. The tm package’s DocumentTermMatrix() function creates such a matrix: a document term matrix provides a word count for each document, and for us a document is the body of a single email. Before we can create the document term matrix, however, we need to convert the vector of email bodies into a corpus using the Corpus() function. A corpus is a data structure for a collection of texts, and many text-mining functions expect a corpus rather than a character vector.

We will need to provide a control list for the document term matrix that handles punctuation, numbers, and stop words. A stop word is a word that shows up so frequently in the English language that it has little analytical value. For example, ‘the’ and ‘a’ are neither positive nor negative, and they appear in almost every sentence, so they can’t be used to differentiate how different people write.

dtm.control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,  # tm's option name is plural; 'removeNumber' is silently ignored
  stopwords = stopwords('english')
)

enron_dtm <- Corpus(VectorSource(enron_emails$body)) %>%
  DocumentTermMatrix(dtm.control)
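To get a feel for what the control list does, the sketch below builds a document term matrix from two made-up email bodies. The names bodies, toy_control, and toy_dtm are hypothetical. The option for dropping numbers is spelled removeNumbers (plural) in tm.

```r
library(tm)

# Two made-up email bodies
bodies <- c("Please send the invoice, thanks!",
            "The invoice is 20000 dollars.")

toy_control <- list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,              # plural option name in tm
  stopwords = stopwords('english')
)

toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(bodies)), toy_control)

# Rows are documents, columns are the terms that survived the cleaning
as.matrix(toy_dtm)
```

The stop word ‘the’ and the number 20000 do not appear as columns, while ‘invoice’ is counted once in each document.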

Once the document term matrix is created, we will convert it into a tidy data frame using tidytext’s tidy() function. The tidy data frame has three variables: document, term, and count. The document variable identifies an email body, where a document of 1 is the first email body in our enron_emails data frame. The term variable contains each word found in each document, and each document-term combination is given its own row. The count variable is the number of times a particular term (or word) is found in a document.

enron_sentiments <- tidy(enron_dtm)

enron_sentiments %>% head()
## # A tibble: 6 × 3
##   document    term count
##      <chr>   <chr> <dbl>
## 1        1    fuel     1
## 2        1 include     1
## 3        1   tetco     1
## 4        1 tickets     1
## 5        1 updated     1
## 6        1   usage     1

Once we have the tidy data frame, we will join it with the AFINN lexicon and create the score variable by multiplying each word’s sentiment score by the number of times the word appears in a document. We can then sum the scores of the words within each document to create a sentiment variable for each document.

enron_sentiments <- enron_sentiments %>%
  inner_join(get_sentiments('afinn'), by = c(term = 'word')) %>%
  mutate(score = score * count) %>%
  group_by(document) %>%
  nest() %>%
  mutate(sentiment = sapply(seq_along(.$data), function(i){
    .$data[[i]]['score'] %>%
      sum()
  })) %>%
  select(-data)

enron_sentiments %>% head()
## # A tibble: 6 × 2
##   document sentiment
##      <chr>     <dbl>
## 1        2        -2
## 2        3         3
## 3        4         5
## 4        6         1
## 5        9         4
## 6       11         1
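The arithmetic behind the sentiment variable is easier to follow on a tiny example. The sketch below uses a hypothetical three-word lexicon in place of get_sentiments('afinn') so that it runs without downloading anything, and it sums the scores with summarise() rather than the nest() and sapply() combination used above; both approaches should give the same per-document totals. As far as I know, recent tidytext releases name the AFINN column value rather than score, so the column name may need adjusting.

```r
library(dplyr)

# Tidy word counts for two made-up documents
toy_counts <- tibble(
  document = c('1', '1', '2', '2'),
  term     = c('good', 'bad', 'terrible', 'meeting'),
  count    = c(2, 1, 1, 3)
)

# Hypothetical mini lexicon standing in for get_sentiments('afinn')
toy_lexicon <- tibble(
  word  = c('good', 'bad', 'terrible'),
  score = c(3, -3, -3)
)

toy_sentiments <- toy_counts %>%
  inner_join(toy_lexicon, by = c(term = 'word')) %>%  # drops 'meeting'
  mutate(score = score * count) %>%                   # weight by frequency
  group_by(document) %>%
  summarise(sentiment = sum(score))

toy_sentiments
```

Document 1 scores 2 × 3 + 1 × (−3) = 3 and document 2 scores −3. Words missing from the lexicon, such as ‘meeting’, contribute nothing.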

The document term matrix numbers the documents according to their order of appearance in the data frame, so in order to combine the information with our original enron_emails data frame we will need to create a new variable called document that runs from 1 to nrow(enron_emails).

We will also clean the date variable so we can analyze the sentiments for each day of the week and each month. We can do this by using lubridate’s wday() and month() functions.

library(lubridate)

enron_emails <- enron_emails %>%
  mutate(weekday = wday(date, label = T)) %>%
  mutate(month = month(date, label = T)) %>%
  mutate(document = 1:nrow(.) %>% as.character()) %>%
  left_join(enron_sentiments) 

Emails that don’t have any sentiment are given NA values. We will need to convert these values into 0.

enron_emails$sentiment[is.na(enron_emails$sentiment)] = 0
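The join and the NA replacement can also be written as a single pipeline. This is a sketch on made-up data, with toy_emails and toy_scores as hypothetical stand-ins for enron_emails and enron_sentiments; it names the join key explicitly and uses dplyr’s coalesce() to fill in the zeros.

```r
library(dplyr)

# Three made-up emails, only two of which received a sentiment score
toy_emails <- tibble(document = c('1', '2', '3'),
                     subject  = c('budget', 'lunch', 'outage'))
toy_scores <- tibble(document = c('1', '3'),
                     sentiment = c(2, -1))

toy_joined <- toy_emails %>%
  left_join(toy_scores, by = 'document') %>%   # unmatched rows get NA
  mutate(sentiment = coalesce(sentiment, 0))   # treat no score as neutral

toy_joined
```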

Some of the emails were sent to multiple people. We should split the to variable so that each recipient of an email is accounted for. Then we can find the average email sentiment over the course of all the correspondence between two people.

enron_emails <- enron_emails %>%
  mutate(to = str_split(to, ',')) %>%
  unnest() %>%
  group_by(from, to) %>%
  nest() %>%
  mutate(mean_sentiment = sapply(seq_along(.$data), function(i){.$data[[i]]$sentiment %>% mean()})) %>%
  mutate(correspondence = sapply(seq_along(.$data), function(i){.$data[[i]] %>% nrow()})) %>%
  arrange(desc(correspondence))
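Here is the same idea on made-up data, written with summarise() instead of nested data frames. It is a sketch, not a replacement for the code above: the addresses are invented, and the str_trim() step is my addition. Splitting on a bare comma leaves a leading space on every recipient after the first, which would make ' c@x.com' and 'c@x.com' count as different people; I have not checked how often this happens in the real to field.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up emails; the first one has two recipients
toy_emails <- tibble(
  from      = c('a@x.com', 'a@x.com', 'd@x.com'),
  to        = c('b@x.com, c@x.com', 'b@x.com', 'c@x.com '),
  sentiment = c(2, 4, -1)
)

pair_stats <- toy_emails %>%
  mutate(to = str_split(to, ',')) %>%   # list column of recipients
  unnest(to) %>%                        # one row per sender-recipient pair
  mutate(to = str_trim(to)) %>%         # remove stray spaces (my addition)
  group_by(from, to) %>%
  summarise(mean_sentiment = mean(sentiment),
            correspondence = n()) %>%
  ungroup() %>%
  arrange(desc(correspondence))

pair_stats
```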

Visualizations

Let’s check the mean sentiment for each correspondence. Since there are a lot of email recipients, we will need to shrink the number of ‘to’ fields. I think the best way to do this is to focus only on correspondences that contain more than 10 emails. This lets us focus on people who write to each other somewhat frequently.

ggplot(enron_emails %>% filter(correspondence > 10), aes(to, from)) +
  geom_point(aes(color = mean_sentiment)) +
  scale_color_gradient2(low = 'brown',
                        mid = 'yellow',
                        high = 'blue') +
  theme(axis.text.x = element_blank(), 
        axis.ticks.x = element_blank())

We don’t have too much variation in the sentiments. This shouldn’t be too surprising since we’re analyzing work emails.

We can also see whether an employee wrote a lot of emails to a single person, or multiple people, or both.

ggplot(enron_emails %>% filter(correspondence > 10), aes(to, from)) +
  geom_point(aes(size = correspondence)) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Vince Kaminski seems to be sending a lot of emails to one particular person, while Kay Mann, Eric Bass, and Chris Germany seem to be sending multiple emails to many different people.

Let’s look into this further…

enron_emails %>%
  unnest() %>%
  group_by(from, to, correspondence) %>%
  nest() %>%
  arrange(desc(correspondence)) %>%
  select(-data)
## # A tibble: 16,380 × 3
##                        from                           to correspondence
##                       <chr>                        <chr>          <int>
## 1  vince.kaminski@enron.com           vkaminski@aol.com            1059
## 2        kay.mann@enron.com     suzanne.adams@enron.com             412
## 3  vince.kaminski@enron.com  shirley.crenshaw@enron.com             291
## 4        kay.mann@enron.com              nmann@erac.com             265
## 5  robin.rodrigue@enron.com    gabriel.monroy@enron.com             229
## 6      kate.symes@enron.com    evelyn.metoyer@enron.com             229
## 7      kate.symes@enron.com    kerri.thompson@enron.com             215
## 8        kay.mann@enron.com kathleen.carnahan@enron.com             210
## 9        kay.mann@enron.com       carlos.sole@enron.com             179
## 10    drew.fossum@enron.com     martha.benner@enron.com             170
## # ... with 16,370 more rows

It turns out that Vince is forwarding many of his work emails to his personal email account. We should probably get rid of this correspondence because it doesn’t actually represent email communication between two different people.

Also, to improve the visualization and provide a more concise dataset to model, we should keep only the top ten most prolific email writers.

top_ten_writers <- enron_emails %>% 
  unnest()  %>%
  filter(!(str_detect(from, 'vince.kaminski') & str_detect(to, 'vkaminski'))) %>%
  group_by(from, body) %>%
  nest() %>%
  group_by(from) %>%
  nest() %>%
  mutate(tot_emails_sent = sapply(seq_along(.$data), function(i){
      nrow(.$data[[i]])
    })) %>%
  arrange(desc(tot_emails_sent)) %>%
  .[1:10,] %>%
  unnest() 
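The counting logic above groups by from and body first so that an email sent to several recipients is counted once per sender. A compact sketch of that idea on made-up data, using distinct() and count() (toy_sent and top_writers are hypothetical names, and I keep the top two writers instead of ten):

```r
library(dplyr)

# Made-up rows, one per sender-recipient pair; 'hi' went to two recipients
toy_sent <- tibble(
  from = c('a@x.com', 'a@x.com', 'a@x.com', 'a@x.com',
           'b@x.com', 'b@x.com', 'c@x.com'),
  body = c('hi', 'hi', 'report', 'budget',
           'lunch?', 'agenda', 'ok')
)

top_writers <- toy_sent %>%
  distinct(from, body) %>%                           # one row per unique email
  count(from, sort = TRUE, name = 'tot_emails_sent') %>%
  head(2)                                            # keep the top writers

top_writers
```

Sender a@x.com has four rows but only three distinct emails, so the duplicated ‘hi’ is not double counted.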

Let’s visualize both the sentiment and correspondence once more, but with just the top ten email writers.

ggplot(top_ten_writers %>% unnest(), aes(to, from)) +
  geom_point(aes(color = mean_sentiment)) +
  scale_color_gradient2(low = 'brown',
                        mid = 'yellow',
                        high = 'blue') +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank())

I will need to check my code on this one. The mean_sentiment scale changed drastically… If everything is correct, then we learn that Susan Scott is a more positive writer in general and Chris Germany is really positive with one other person.

top_ten_writers %>%
  unnest() %>% 
  arrange(desc(sentiment)) %>%
  .$body %>%
  .[1]
## [1] "There are five CNG postings listed below.  Posting #1 APPALACHIAN PRODUCER NOTICE  POSTED May 23, 2000  Dear West Virginia Producer,  Over the last several years, Dominion Transmission has worked hard providing  additional value to producers on its gathering system and through innovative  projects and joint ventures.  In continuing with that process to bring you  more value for your local production and to improve the efficiency of our  gathering system, we would like to continue a process where compression  into our gathering system may be approved.  We have been able to grant a  limited number of these requests in the past, and would now like to make  this opportunity somewhat more widely available to the producers.  For this second offering, the process we will use is as follows:  For a  period of 30 days, beginning 12:00 Noon on May 24, 2000, the Gathering and  Production Division of Dominion Transmission will accept applications for  compression rights into its gathering system for gas feeding its SCHUTTE  compressor station.  The Dominion \"bubbles\" affected are 4404 & 4405.    Dominion recognizes that, although limited, there does exist the incremental  capacity for throughput through these compressor stations.  The incremental  capacity authorized through new compression agreements will not exceed the  limit Dominion determines to be the capacity of the existing compressors.   Dominion will attempt to be flexible in approving the applications it  receives, but will limit approved applications so as to not unreasonably  interfere with the production from the existing non-compressed gas entering  the system.  Please submit your written application to Joe Thompson at the  Dominion Gathering & Production Office.  The application must indicate the  amount of \"incremental\" gas for the specific compression agreement.   Dominion recognizes that the installation of compression requires differing  time frames.  
We will balance that consideration with our need to insure  that the application process is not abused to the disadvantage of other  producers; therefore, once granted, the applicant must have the compression  in place and operable within six months of Dominion's approval, or the rights  to all unconstructed capacity will be forfeited.  In the event that more  requests for compression rights for incremental gas are received than for  which there is capacity, Dominion will award the rights on a prorated basis.  Please include in your written request the mid number or a map of the  proposed  location where you will be requesting to install compression.  Also include  the incremental volume you are wishing to transport under the compression  agreement.  Your name, company name and address, state of incorporation,  phone and fax number, as well as an e-mail address, if available, should  be included (please print or type).  Forward this information to:        Joseph A. Thompson     Manager, Business Development     Dominion Transmission Gathering & Production Division     500 Davisson Run Road     Clarksburg, WV  26301  After all requests are received, we will review the request and make a  determination on granting compression based on capacity available and the  capacity requests.  You will be contacted as soon as possible after  June 23, 2000 to notify you of those requests that have been approved.   Please realize that this is a \"window\" and the opportunity will close  after Noon on June 23, 2000.  I hope that this will provide opportunity for you to increase the value  of your production facilities, and I look forward to working with you on  this opportunity to add compression.  You can contact Joe Thompson at  (304) 623-8709 or Dan Stuart at (304) 623-8705 with any questions you may  have regarding this compression  \"open season\" request on Dominion Gathering.  Sincerely,  H. 
Dale Rexrode Director, Producer Services & Business Development   Posting #2 APPALACHIAN PRODUCER NOTICE  POSTED May 23, 2000  Dear West Virginia Producer,  Over the last several years, Dominion Transmission has worked hard providing  additional value to producers on its gathering system and through innovative  projects and joint ventures.  In continuing with that process to bring you  more value for your local production and to improve the efficiency of our  gathering system, we would like to continue a process where compression  into our gathering system may be approved.  We have been able to grant a  limited number of these requests in the past, and would now like to make  this opportunity somewhat more widely available to the producers.  For this second offering, the process we will use is as follows:  For a  period of 30 days, beginning 12:00 Noon on May 24, 2000, the Gathering and  Production Division of Dominion Transmission will accept applications for  compression rights into its gathering system for gas feeding its SMITHBURG,  COLLINS, MAXWELL AND NEW OXFORD  compressor stations.  The Dominion \"bubbles\"  affected are 4401, 4403, 4413.   Dominion recognizes that, although very  limited, there does exist the incremental capacity for throughput through  these compressor stations.  The incremental capacity authorized through new  compression agreements will not exceed the limit Dominion determines to be  the capacity of the existing compressors.  Dominion will attempt to be  flexible  in approving the applications it receives, but will limit approved  applications  so as to not unreasonably interfere with the production from the existing  non-compressed gas entering the system.  Please submit your written  application  to Joe Thompson at the Dominion Gathering & Production Office.  The  application  must indicate the amount of \"incremental\" gas for the specific compression  agreement.  
Dominion recognizes that the installation of compression requires  differing time frames.  We will balance that consideration with our need to  insure that the application process is not abused to the disadvantage of  other  producers; therefore, once granted, the applicant must have the compression  in  place and operable within six months of Dominion's approval, or the rights to  all unconstructed capacity will be forfeited.  In the event that more  requests  for compression rights for incremental gas are received than for which there  is  capacity, Dominion will award the rights on a prorated basis.  Please include in your written request the mid number or a map of the  proposed  location where you will be requesting to install compression.  Also include  the  incremental volume you are wishing to transport under the compression  agreement.  Your name, company name and address, state of incorporation,  phone and fax number, as well as an e-mail address, if available, should be  included (please print or type).  Forward this information to:        Joseph A. Thompson     Manager, Business Development     Dominion Transmission Gathering & Production Division     500 Davisson Run Road     Clarksburg, WV  26301  After all requests are received, we will review the request and make a  determination on granting compression based on capacity available and the  capacity requests.  You will be contacted as soon as possible after  June 23, 2000 to notify you of those requests that have been approved.   Please realize that this is a \"window\" and the opportunity will close  after Noon on June 23, 2000.  I hope that this will provide opportunity for you to increase the value of  your production facilities, and I look forward to working with you on this  opportunity to add compression.  You can contact Joe Thompson at  (304) 623-8709 or Dan Stuart at (304) 623-8705 with any questions you may  have regarding this compression  \"open season\" request on Dominion Gathering.  
Sincerely,  H. Dale Rexrode Director, Producer Services & Business  Development                               Posting #3 APPALACHIAN PRODUCER NOTICE  POSTED May 23, 2000  Dear West Virginia Producer,  Over the last several years, Dominion Transmission has worked hard providing  additional value to producers on its gathering system and through innovative  projects and joint ventures.  In continuing with that process to bring you  more value for your local production and to improve the efficiency of our  gathering system, we would like to continue a process where compression into  our gathering system may be approved.  We have been able to grant a limited  number of these requests in the past, and would now like to make this  opportunity somewhat more widely available to the producers.  For this second offering, the process we will use is as follows:  For a  period of 30 days, beginning 12:00 Noon on May 24, 2000, the Gathering and  Production Division of Dominion Transmission will accept applications for  compression rights into its gathering system for gas feeding its CAMDEN  compressor station.  The Dominion \"bubbles\" affected are 3306, 3307, 3401,  3402, 3403, 3406, 3407, 3408 & 3409.   Dominion recognizes that, although  limited, there does exist the incremental capacity for throughput through  these compressor stations.  The incremental capacity authorized through  new compression agreements will not exceed the limit Dominion determines  to be the capacity of the existing compressors.  Dominion will attempt to  be flexible in approving the applications it receives, but will limit  approved  applications so as to not unreasonably interfere with the production from the  existing non-compressed gas entering the system.  Please submit your written  application to Joe Thompson at the Dominion Gathering & Production Office.   The application must indicate the amount of \"incremental\" gas for the  specific compression agreement.  
Dominion recognizes that the installation  of compression requires differing time frames.  We will balance that  consideration with our need to insure that the application process is not  abused to the disadvantage of other producers; therefore, once granted,  the applicant must have the compression in place and operable within six  months of Dominion's approval, or the rights to all unconstructed capacity  will be forfeited.  In the event that more requests for compression rights  for incremental gas are received than for which there is capacity, Dominion  will award the rights on a prorated basis.  Please include in your written request the mid number or a map of the  proposed location where you will be requesting to install compression.   Also include the incremental volume you are wishing to transport under  the compression agreement.  Your name, company name and address, state  of incorporation, phone and fax number, as well as an e-mail address, if  available, should be included (please print or type).  Forward this information to:        Joseph A. Thompson     Manager, Business Development     Dominion Transmission Gathering & Production Division     500 Davisson Run Road     Clarksburg, WV  26301  After all requests are received, we will review the request and make a  determination on granting compression based on capacity available and the  capacity requests.  You will be contacted as soon as possible after  June 23, 2000 to notify you of those requests that have been approved.   Please realize that this is a \"window\" and the opportunity will close  after Noon on June 23, 2000.  I hope that this will provide opportunity for you to increase the value  of your production facilities, and I look forward to working with you on  this opportunity to add compression.  You can contact Joe Thompson at  (304) 623-8709 or Dan Stuart at (304) 623-8705 with any questions you may  have regarding this compression  \"open season\" request on Dominion Gathering.  
Sincerely,  H. Dale Rexrode Director, Producer Services & Business Development                             Posting #4                                           Posted May 9 ,2000 9:30 a.m.          Dear West Virginia Producer,  Over the last several years, Dominion Transmission has worked hard providing additional value to producers on its gathering system and through innovative  projects and joint ventures.  In continuing with that process to bring you  more value for your local production and to improve the efficiency of our gathering system, we would like to continue a process where compression into our  gathering system may be approved.  We have been able to grant a limited number of these requests in the past, and would now like to make this opportunity somewhat  more  widely available to the producers.  For this second offering, the process we will use is as follows:  For a period of 30 days, beginning 12:00 Noon on May 10, 2000, the Gathering  and Production Division of Dominion Transmission will accept applications for compression rights into its gathering system for gas feeding its JONES and  ORMA compressor stations.  The Dominion \"bubbles\" affected are 3205, 3206 and 3207  for Jones station and 3201, 3202, 3203 and 3204 for Orma station.   Dominion  recognizes that, although limited, there does exist the incremental capacity  for throughput through these compressor stations.  The incremental capacity authorized through new compression agreements will not exceed the limit Dominion determines to be the capacity of the existing compressors.  Dominion will attempt to be flexible in approving the applications it  receives, but will limit approved applications so as to not unreasonably interfere with  the production from the existing non-compressed gas entering the system.  Please submit your written application to Joe Thompson at the Dominion  Gathering & Production Office.  
The application must indicate the amount of \"incremental\" gas for the specific compression agreement.  Dominion  recognizes  that the installation of compression requires differing time frames.  We will balance that consideration with our need to insure that the application  process is not abused to the disadvantage of other producers; therefore, once granted, the applicant must have the compression in place and operable within six  months of Dominion's approval, or the rights to all unconstructed capacity will be forfeited.  In the event that more requests for compression rights for  incremental gas are received than for which there is capacity, Dominion will award the rights on a prorated basis.  Please include in your written request the mid number or a map of the proposed location where you will be requesting to install compression.  Also include the incremental volume you are wishing to transport under the compression agreement.  Your name, company name and address, state of  incorporation, phone and fax number, as well as an e-mail address, if  available, should be included (please print or type).  Forward this information to:        Joseph A. Thompson     Manager, Business Development     Dominion Transmission                                 Gathering & Production Division     500 Davisson Run Road     Clarksburg, WV  26301  After all requests are received, we will review the request and make a determination on granting compression based on capacity available and the capacity requests.  You will be contacted as soon as possible after June 12, 2000 to notify you of those requests that have been approved.   Please realize that this is a \"window\" and the opportunity will close after Noon on June 12, 2000.  I hope that this will provide opportunity for you to increase the value of  your production facilities, and I look forward to working with you on this  opportunity to add compression.  
You can contact Joe Thompson at  (304) 623-8709 or Dan Stuart at (304) 623-8705 with any questions you may have regarding this compression  \"open season\" request on Dominion Gathering.  Since rely,  H. Dale Rexrode Di rector, Producer Services & Business Development  Posting #5                                                   Posted May 9, 2000 9:07 a.m.     Dear West Virginia Producer,  Over the last several years, Dominion Transmission has worked hard providing  additional value to producers on its gathering system and  through innovative projects and joint ventures.  In continuing with  that process to bring you more value for your local production and to  improve the efficiency of our gathering system, we would like to continue a process where compression into our gathering system may be approved.   We have been able to grant a limited number of these requests in the past,  and would now like to make this opportunity somewhat more widely available to  the producers.  For this second offering, the process we will use is as follows:   For a period of 30 days, beginning 12:00 Noon on May 10, 2000,  the Gathering and Production Division of Dominion Transmission will accept applications for compression rights into its gathering system for gas  feeding the Barbour County High Pressure System.  The Dominion \"bubble\" affected is 4205.   Dominion recognizes that, although limited, there does exist the incremental capacity for throughput through these compressor  stations. The incremental capacity authorized through new compression agreements  will not exceed the limit Dominion determines to be the capacity of the  pipeline system.  Dominion will attempt to be flexible in approving the  applications it receives, but will limit approved applications so as to not  unreasonably interfere with the production from the existing non-compressed  gas entering the system.  Please submit your written application to  Joe Thompson at the Dominion Gathering & Production Office.   
The application must indicate the amount of \"incremental\" gas for the  specific compression agreement.  Dominion recognizes that the installation  of compression requires differing time frames.  We will balance that  consideration with our need to insure that the application process is  not abused to the disadvantage of other producers; therefore, once granted,  the applicant must have the compression in place and operable within  six months of Dominion's approval, or the rights to all unconstructed  capacity will be forfeited.  In the event that more requests for compression  rights for incremental gas are received than for which there is capacity,  Dominion will award the rights on a prorated basis.  Please include in your written request the mid number or a map of the  proposed location where you will be requesting to install compression.  Also include the incremental volume you are wishing to transport under  the compression agreement.  Your name, company name and address,  state of incorporation, phone and fax number, as well as an e-mail address, if available, should be included (please print or type).  Forward this information to:        Joseph A. Thompson     Manager, Business Development     Dominion Transmission                                 Gathering & Production Division     500 Davisson Run Road     Clarksburg, WV  26301  After all requests are received, we will review the request and make a  determination on granting compression based on capacity available and the  capacity requests.  You will be contacted as soon as possible after June 12,  2000 to notify you of those requests that have been approved.  Please realize that this is a \"window\" and the opportunity will close after Noon on June 12, 2000.  I hope that this will provide opportunity for you to increase the value of your production facilities, and I look forward to working with you on this  opportunity to add compression.  
You can contact Joe Thompson at (304)  623-8709 or Dan Stuart at (304) 623-8705 with any questions you may have regarding  this compression  \"open season\" request on Dominion Gathering.  Sincerely,  H. Dale Rexrode Director, Producer Services & Business Development                      "

It turns out the most positive email was a compilation of five postings to producers in West Virginia. While the overall tone of these postings is cordial, I wouldn’t say that they are particularly positive. This is an instance where a document's sheer length skews its sentiment score: the more words it contains, the more positive terms it can accumulate.
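To make the length effect concrete, here is a minimal base R sketch. It assumes the sentiment score is a sum of per-word values; the tiny lexicon and the two toy documents are made up for illustration and are not taken from the corpus or from any real sentiment lexicon.

```r
# Hypothetical mini-lexicon: word -> score (illustrative values only)
lexicon <- c(good = 1, great = 2, happy = 2, bad = -1, value = 1)

score_document <- function(text) {
  # Lowercase, split on anything that is not a letter, drop empty tokens
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  scores <- lexicon[words]            # NA for words missing from the lexicon
  total <- sum(scores, na.rm = TRUE)
  c(total = total, per_word = total / length(words))
}

short_doc <- "Great news, we are happy!"
long_doc  <- paste(rep("We hope to add good value to the process.", 10),
                   collapse = " ")

# The long, mildly cordial document wins on the summed score,
# while the short, enthusiastic one wins on the per-word score
rbind(short = score_document(short_doc), long = score_document(long_doc))
```

Dividing by the word count is only one possible normalisation, but it shows why a long, mildly cordial document can outrank a short, enthusiastic one on a summed score.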

top_ten_writers %>%
  unnest() %>% 
  arrange(desc(sentiment)) %>%
  .$body %>%
  .[5]
## [1] "We got back from Dallas yesterday about 5:00 and it was a fun and very  fruitful trip for the team.   Schools are categorized by size - small, medium, large and super.  Klein is  super-sized.  The Bearkadettes won first place and the grand champion trophy  in the super category.  Best in Category awards are given for each category  of team dance entered, regardless of school size.  Klein won best in category  for all four team dances entered - kick, pom, military and jazz - a clean  sweep.  Klein also had the first place winner in the solo category and first  place in the duet category.  Also won first place in officer dances for super  sized schools.  They also won two special judges awards - not given out every  time, but only when the judges want to especially recognize achievements.   They got a special judges award for Creativity and Originality (commentary on  both our fabulous costumes made by a group of Bearkadette moms and the  choreography) and a special judges award for Perfect Score (a score of 100  awarded by all 3 judges for a particular routine - apparently very rare).   The Perfect Score award was for their pom routine, where they are dressed  like penguins (one of the ones that Meagan was in).  The girls were pretty  excited and the moms were cheering a lot, too.  They all work so hard and  spend so many hours practicing that it is nice to see that pay off!!  I am ordering the video from competition, so we can show you the routines  sometime.    I had a lot of fun.  The girls are cute and I enjoyed getting to know some of  the moms who were chaperoning as well.  But it was nice to get back to  Houston yesterday."

The second most positive email recounts a school dance competition in which the writer’s daughter’s team won first place. Unlike the previous email, it is not a professional message, its body is relatively short, and its tone is incredibly positive.

Now that we have removed Vince’s emails to himself, let’s move on to the correspondence chart for the top ten writers.

ggplot(top_ten_writers %>% unnest(), aes(to, from)) +
  geom_point(aes(size = correspondence)) +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank())

It turns out Kay Mann writes the most emails overall and also seems to send the most emails to individual recipients. Let’s verify this by listing every sender and recipient pair with more than 100 emails.

top_ten_writers %>% 
  unnest() %>% 
  filter(correspondence > 100) %>% 
  group_by(from, to, correspondence) %>% 
  nest() %>%
  arrange(desc(correspondence)) %>%
  select(-data)
## # A tibble: 12 × 3
##                        from                           to correspondence
##                       <chr>                        <chr>          <int>
## 1        kay.mann@enron.com     suzanne.adams@enron.com             412
## 2  vince.kaminski@enron.com  shirley.crenshaw@enron.com             291
## 3        kay.mann@enron.com              nmann@erac.com             265
## 4      kate.symes@enron.com    evelyn.metoyer@enron.com             229
## 5      kate.symes@enron.com    kerri.thompson@enron.com             215
## 6        kay.mann@enron.com kathleen.carnahan@enron.com             210
## 7        kay.mann@enron.com       carlos.sole@enron.com             179
## 8     drew.fossum@enron.com     martha.benner@enron.com             170
## 9        kay.mann@enron.com        ben.jacoby@enron.com             139
## 10      eric.bass@enron.com     shanna.husser@enron.com             133
## 11       kay.mann@enron.com      sheila.tweed@enron.com             122
## 12      eric.bass@enron.com      jason.bass2@compaq.com             108

ggplot(top_ten_writers %>% unnest(), aes(month, sentiment)) +
  geom_boxplot()

ggplot(top_ten_writers %>% unnest(), aes(weekday, sentiment)) +
  geom_boxplot()

Machine Learning: Latent Dirichlet Allocation

We will use the top_ten_writers data frame for modelling. It is a subset of the original Enron Corpus with only 14238 observations. The model we will be using is called the Latent Dirichlet Allocation (LDA) model. LDA creates topics by analyzing the co-occurrences of words within different documents. If two words are frequently present in the same documents, then they are more likely to be part of the same topic. If two words are almost never present in the same document, then they are more likely to be part of different topics. It is important to note that a word can appear in multiple topics. LDA belongs to the Bayesian family of models, and you can learn more about the statistics involved in the Journal of Machine Learning Research.
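The co-occurrence idea can be illustrated without fitting a model. The sketch below is not LDA itself; it only counts how often pairs of words share a document, using four made-up documents. Word pairs with high counts are the ones a topic model would tend to place in the same topic.

```r
# Four toy documents (made up for illustration)
docs <- c(
  "gas pipeline capacity",
  "gas capacity contract",
  "dance team trophy",
  "dance trophy"
)

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: 1 if the word occurs in the document
dtm <- t(sapply(tokens, function(w) as.integer(vocab %in% w)))
colnames(dtm) <- vocab

# Entry [i, j] counts the documents that contain both word i and word j
cooccurrence <- crossprod(dtm)

cooccurrence["gas", "capacity"]  # words that share documents
cooccurrence["gas", "dance"]     # words that never share a document
```

Here "gas" and "capacity" appear together in two documents while "gas" and "dance" never do, which is the kind of signal LDA uses to separate topics.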

The topicmodels package provides the function LDA() for Latent Dirichlet Allocation (not to be confused with the similarly named R function lda(), which is used for Linear Discriminant Analysis). LDA() requires a document term matrix as input. When creating the document term matrix, we will need to remove sparse terms, that is, words that rarely show up in the Enron Corpus. On a machine with limited memory, removing infrequent words is necessary to keep the matrix small enough for LDA(). Documents left with no terms afterwards are dropped as well, since LDA() requires every document to contain at least one term.

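library(tm)  # Corpus(), DocumentTermMatrix(), stopwords(), removeSparseTerms()

# Preprocessing options applied while building the document term matrix
dtm.control = list(
  tolower = TRUE,
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = stopwords('english'),
  weighting = weightTf,  # raw term counts
  seed = 0
)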

enron_dtm <- Corpus(VectorSource(top_ten_writers$body)) %>%
  DocumentTermMatrix( control = dtm.control) %>%
  removeSparseTerms(.999) %>%
  .[rowSums(as.matrix(.))>0,]
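To see what the 0.999 threshold means in practice, here is a small base R sketch. It assumes, as I understand the tm documentation, that removeSparseTerms() keeps a term only when the share of documents missing it is below the threshold. The toy matrix and its 0.6 threshold are made up for illustration.

```r
# With 14238 documents and sparse = 0.999, a term must appear in
# more than 14238 * 0.001 = 14.238 documents to survive
n_docs   <- 14238
sparse   <- 0.999
min_docs <- floor(n_docs * (1 - sparse)) + 1
min_docs

# Toy document-term matrix of counts (illustrative only)
toy <- matrix(c(1, 1, 0,
                1, 0, 0,
                1, 1, 0,
                0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("enron", "gas", "penguin")))

doc_freq <- colSums(toy > 0)                 # documents containing each term
keep     <- doc_freq > nrow(toy) * (1 - 0.6) # same rule with sparse = 0.6
trimmed  <- toy[, keep, drop = FALSE]

# doc4 only contained the dropped term, so it is now empty and must go
trimmed <- trimmed[rowSums(trimmed) > 0, , drop = FALSE]
trimmed
```

The last step mirrors the rowSums() filter above: once rare terms are gone, some documents may have nothing left in them.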

Once the document term matrix is created, we can make the model with LDA(). We will arbitrarily choose to create 4 different topics. Once the model is created, then we can see the most frequently used words in each topic by using the terms() function.

library(topicmodels)
enron_body_lda <- LDA(enron_dtm, k = 4)
terms(enron_body_lda, 20)
##       Topic 1            Topic 2     Topic 3 Topic 4           
##  [1,] "forwardedmessage" "will"      "will"  "deal"            
##  [2,] "emailaddress"     "enron"     "get"   "know"            
##  [3,] "thanks"           "agreement" "know"  "let"             
##  [4,] "kay"              "may"       "good"  "thanks"          
##  [5,] "vince"            "power"     "can"   "forwardedmessage"
##  [6,] "please"           "can"       "think" "kate"            
##  [7,] "ben"              "gas"       "just"  "deals"           
##  [8,] "can"              "group"     "like"  "just"            
##  [9,] "call"             "capacity"  "going" "changed"         
## [10,] "send"             "also"      "time"  "need"            
## [11,] "john"             "project"   "week"  "ive"             
## [12,] "email"            "energy"    "dont"  "now"             
## [13,] "shall"            "one"       "well"  "please"          
## [14,] "enron"            "rate"      "work"  "ill"             
## [15,] "corp"             "ena"       "back"  "change"          
## [16,] "sent"             "business"  "one"   "ces"             
## [17,] "respond"          "market"    "see"   "price"           
## [18,] "north"            "company"   "next"  "new"             
## [19,] "shirley"          "contract"  "hope"  "contract"        
## [20,] "america"          "risk"      "last"  "term"
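Because LDA() starts from a random initialisation, rerunning the model can produce different topics. The sketch below shows one way to make a fit repeatable by passing a seed through the control argument of LDA(). It uses the small AssociatedPress document term matrix that ships with topicmodels rather than the Enron data, and the subset size of 50 documents and k = 2 are arbitrary choices for speed. As far as I can tell, the seed entry in dtm.control above is not an option that DocumentTermMatrix() uses, so the seed needs to be given to LDA() directly.

```r
library(topicmodels)

# Small built-in document term matrix shipped with topicmodels
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:50, ]

# Fit the same model twice with the same seed
fit_a <- LDA(ap_small, k = 2, control = list(seed = 0))
fit_b <- LDA(ap_small, k = 2, control = list(seed = 0))

# With identical seeds the top terms should match
identical(terms(fit_a, 10), terms(fit_b, 10))

# topics() returns the most likely topic for each document
table(topics(fit_a))
```

The same control list could be added to the Enron call, for example LDA(enron_dtm, k = 4, control = list(seed = 0)), although the topics it yields may differ from the table printed above.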

Networks

It would be interesting to see the relationships between different employees within Enron. One way to do this is to filter for only Enron employees within the to variable. Better yet, let’s filter for email recipients who have a sent email on record in the Enron Corpus. This guarantees that there is at least a one-way connection.

enron_network <- enron_emails %>%
  unnest() %>%
  group_by(from, to, correspondence) %>%
  nest() %>%
  filter(to %in% from) %>%
  select(-data) %>%
  mutate(correspondence = sapply(correspondence, function(x){
    if(x < 9){
        'Few'
      } else if ( x < 17) {
        'Medium'
      } else {
        'Many'
      }}) %>%
      as.factor()) 
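The two key steps above, keeping recipients who also appear as senders and binning the counts, can be sketched in base R on a toy edge list. The addresses and counts are made up, and cut() is swapped in here as a vectorised alternative to the sapply() with if/else, using the same cut-offs of 9 and 17.

```r
# Toy edge list (made-up addresses and counts)
edges <- data.frame(
  from = c("ann@enron.com", "ann@enron.com", "bob@enron.com", "cat@enron.com"),
  to   = c("bob@enron.com", "outside@example.com", "ann@enron.com", "bob@enron.com"),
  correspondence = c(3, 40, 9, 17),
  stringsAsFactors = FALSE
)

# Keep only recipients who also appear as senders
internal <- edges[edges$to %in% edges$from, ]

# [-Inf, 9) -> Few, [9, 17) -> Medium, [17, Inf) -> Many
internal$correspondence <- cut(
  internal$correspondence,
  breaks = c(-Inf, 9, 17, Inf),
  labels = c("Few", "Medium", "Many"),
  right  = FALSE
)

internal
```

One side effect of cut() is that the factor levels come out in the order Few, Medium, Many, whereas as.factor() on character values sorts them alphabetically. The ordered version tends to give a more readable plot legend.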

ggraph is a new ggplot2 extension that is all the rage. It makes neat network graphs with little effort. However, proper use of ggraph depends on the user being literate in the igraph package. It is absolutely worth checking out Katherine Ognyanova’s igraph tutorial ‘Network Analysis and Visualization with R and igraph’.
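As a small taste of the igraph side, the sketch below builds a directed graph from a toy data frame with made-up addresses. graph_from_data_frame() treats the first two columns as the edge list and stores any remaining columns as edge attributes, which is why a column such as correspondence can later be mapped to edge colour.

```r
library(igraph)

# Toy edge list (made-up addresses); the third column becomes an edge attribute
edges <- data.frame(
  from = c("a@enron.com", "a@enron.com", "b@enron.com"),
  to   = c("b@enron.com", "c@enron.com", "a@enron.com"),
  correspondence = c("Many", "Few", "Medium"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = TRUE)

vcount(g)                 # number of distinct addresses
ecount(g)                 # number of sender -> recipient edges
E(g)$correspondence       # edge attribute carried over from the data frame
degree(g, mode = "out")   # how many recipients each sender writes to
```

Inspecting a graph this way before plotting is a quick check that the nodes, edges, and attributes are what the plotting code expects.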

library(ggraph)
library(igraph)

set.seed(4321)
graph <- graph_from_data_frame(enron_network) 

ggraph(graph, layout = 'kk') +
  geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_node_point() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())

ggraph(graph, layout = 'fr') +
  geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_node_point() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())

ggraph(graph, layout = 'linear') +
  geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_node_point() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())

ggraph(graph, layout = 'linear', circular = T) +
  geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
  geom_node_point() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())

Conclusion