The Enron Corpus is one of the largest datasets of emails available to the public. Normally, emails are personal and private, and shouldn’t be made available to the public. However, the Federal Energy Regulatory Commission acquired these emails during its investigation of the company in 2002 and placed the email corpus in the public domain once the investigation was over. The Enron Corpus contains over 500,000 emails generated by over 150 employees. It is a very large dataset and is ideal for learning how to mine unstructured (text) data.
In this article we will explore mining unstructured data by tidying the data into a data frame, analyzing the data using sentiment analysis, and also grouping the emails by using the unsupervised learning method known as the Latent Dirichlet Allocation Model (LDA).
The Enron Corpus can be found on the website of Carnegie Mellon University’s School of Computer Science. Carnegie Mellon provides the corpus in its raw format: each email is given its own text file and is stored in a folder representing the original user’s email folders. Luckily, William Cukierski was kind enough to provide the corpus as a CSV on Kaggle’s website. This article will use the CSV.
Finally, all analysis will be done on my local machine, which maxes out at about 2GB of RAM. My machine will not be able to compute an LDA model for a dataset of 500,000+ observations, so I will gradually shrink the dataset down to a manageable size. I will do this by narrowing it to a subset that makes sense, not simply by removing random observations.
First we will need to load the tidyverse. It is a wrapper for a number of useful packages including dplyr, ggplot2, and readr.
library(tidyverse)
We will now load the data into R by using the readr function read_csv().
enron_emails <- read_csv('emails.csv')
The CSV provides only two variables: file and message. The file variable contains the original directory and filename of each email. The root level of this path is the employee to whom the email belongs.
enron_emails$file %>%
sample(5)  # assumed call (the original line is cut off); the output shows five paths
## [1] "beck-s/_sent_mail/52." "mann-k/_sent_mail/2018."
## [3] "mann-k/_sent_mail/1661." "mann-k/_sent_mail/2047."
## [5] "schoolcraft-d/_sent_mail/7."
The message variable contains the email text. This includes email metadata such as the Message-ID, Date, To, and Subject fields, as well as the body of the email.
enron_emails$message %>%
sample(1)  # assumed call (the original line is cut off); the output shows a single message
## [1] "Message-ID: <153208.1075846034300.JavaMail.evans@thyme>\nDate: Wed, 23 May 2001 08:35:00 -0700 (PDT)\nFrom:\nTo:\nSubject: RE: AEW's backup\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Kay Mann\nX-To: Greg Krause\nX-cc: \nX-bcc: \nX-Folder: \\Kay_Mann_June2001_4\\Notes Folders\\'sent mail\nX-Origin: MANN-K\nX-FileName: kmann.nsf\n\nCould you email me the option agreement?\n\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 02:19 PM\nTo: Kay Mann/Corp/Enron@Enron\ncc: \n\nSubject: RE: AEW's backup\n\nI think the invoice was for between $20,000 and $30,000, but I can't \nremember. AEW has the invoice. This probably will not be the final as we \nwill need to work with them in discussions with DERM on delaying the landfill \nclosure and in moving jurisdiction of the project from CZAB to the County \nCommissioners. I think Shutts & Bowen (who may get stiffed by their client \nif this deal blows up) and Certosa Holdings would be open to any suggestions \nand we need to renegotiate the option anyway. Shall I call and suggest this \nthis flat fee for retooling the option agreement to them or shall we do it \ntogether? I,m not sure I could explain to them acequately the idiosyncrasies \nof our accounting requirements.\n\n -----Original Message-----\nFrom: Mann, Kay \nSent: Wednesday, May 23, 2001 2:00 PM\nTo: Krause, Greg\nSubject: RE: AEW's backup\n\nHow much is it and should this be the final amount? One thought I have is \nthat maybe we can retool the option agreement so that we pay them a flat fee, \nwhich is enough to cover the expenditures. Don't know if this works, but it \nis one thought. 
What do you think?\n\nKay\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 01:50 PM\nTo: Kay Mann/Corp/Enron@Enron\ncc: \n\nSubject: RE: AEW's backup\n\nOne more thing on the SDEC project: According to the option agreement we \nexecuted last October, we agreed to reimburse Certosa Holdings for actual \nthird party costs that they incurred in support of our necessary \napplications, submittals and in seeking local approval. Several weeks ago, \nwe recieved an invoice from Shutts & Bowen, attorney for Certosa Holdings \nrequesting reimbursement pursuant to the contract. I forwarded this invoice \non th Ann Elizabeth not necessarily to pay for but to review considering this \nwhole soft cost hard cost discussion. I recieved another call this morning \nfrom Shutts & Bowen asking about the invoice. What should I do? \n -----Original Message-----\nFrom: Mann, Kay \nSent: Wednesday, May 23, 2001 11:03 AM\nTo: Krause, Greg\nSubject: RE: AEW's backup\n\nGreg,\n\nYou can call me on whatever you have, including Midway, SDEC and Medley \nDunn. If I have a problem getting to something, I'll find help.\n\nKay\n\n\n\n\nFrom: Greg Krause/ENRON@enronXgate on 05/23/2001 10:50 AM\nTo: Kay Mann/Corp/Enron@Enron\ncc: \n\nSubject: RE: AEW's backup\n\n\nKay,\n\nAnn Elizabeth did not provide a designated hitter for the South Dade Energy \nCenter (Dade Development Company LLC is Optionee, Certosa Holdings is \nOptionor) nor did she provide one for tne Medley Dunn project. I have been \ntold that the Dunns are considering backing off their ultimatums that they \ngave Ann Elizabeth and I regarding taxes to the town and assumption of \nenviromental liability. Who do I talk to about the Dunn contract while Ann \nElizabeth is out? 
\n -----Original Message-----\nFrom: White, Ann Elizabeth \nSent: Tuesday, May 22, 2001 10:33 PM\nTo:; Krimsky, Steven; Ben Jacoby/HOU/ECT@ENRON; Carnahan, \nKathleen\nCc: Milligan, Taffy\nSubject: AEW's backup\n\nKay Mann is the designated hitter for the Pompano and Deerfied projects while \nI'm on vacation. I've given her a down load of the status of Greg and \nSteve's projects. Chris Boehler at A&K will be the designated hitter for \nMidway. I'm not going to check my voice mail while I'm gone but, if \nnecessary, here are the contact numbers while I'm gone.\n\nWalter and Marlena Schilling 011-49-8218-89351\n\nMonika and Bernhard Steinacher 011-49-8232-8932\n\nIf you call, Walter and Bernhard and Bernhard's daughter, Susanne, speak very \ngood English. Monika's isn't bad. Marlena may get flustered and hang up on \nyou.\n\nBest of luck at Deerfield and hope to see Pompano on track when I get back in \nthe office on June 11th. Kay is planning on going to Florida on June 12 for \nthe moratorium hearing and the rezoning hearing.\n\n\n\n\n\n\n\n"
To tackle this large dataset we should first create a smaller, more manageable subset. After looking at the different files, I decided that a decent subset would be the emails found in the users’ _sent_mail folders. This seems like a good choice because it will only contain emails written by Enron employees. The stringr package provides us with str_detect(), a function that checks whether a pattern can be found within a string. We can combine str_detect() with dplyr’s filter() to keep only the observations which have _sent_mail in their file path.
enron_emails <- enron_emails %>%
filter(str_detect(file, '/_sent_mail/'))
There are now only 30237 observations to work with.
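As a small illustration of the filtering step, the sketch below applies the same path test to a handful of made-up file paths. It uses base R’s grepl() in place of str_detect() so that it runs without any packages; the toy_files vector is hypothetical and is not taken from the corpus.

```r
# Hypothetical file paths in the same style as the `file` variable
toy_files <- c("mann-k/_sent_mail/2018.",
               "mann-k/inbox/14.",
               "beck-s/_sent_mail/52.",
               "beck-s/discussion_threads/7.")

# TRUE where the path contains the literal substring '/_sent_mail/'
is_sent <- grepl("/_sent_mail/", toy_files, fixed = TRUE)

# Keep only the sent-mail paths, much as filter(str_detect(...)) keeps rows
sent_files <- toy_files[is_sent]
print(sent_files)
```

Matching on the slashes as well as the folder name avoids picking up other folders whose names merely contain the words sent mail.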
Next we need to split the message variable into the new header and body variables. Luckily, the two are neatly separated by two consecutive newlines. Before the two newlines is the header, and after the two newlines is the body. We should also clean the body a bit by replacing all subsequent newlines and tabs with spaces, as well as removing all referenced emails. A referenced email is an email that was forwarded or copied from a previous email. Also, we will replace all email addresses with an EMAIL_ADDRESS token so that different email addresses don’t have a strong influence on the analysis.
enron_emails <- enron_emails %>%
mutate(message = message %>% str_replace_all('\r', '')) %>%
mutate(header = str_sub(message, end = str_locate(message, '\n\n')[,1] -1)) %>%
mutate(body = str_sub(message, start = str_locate(message, '\n\n')[,2] + 1) %>%
str_replace_all('\n|\t', ' ') %>%
str_replace_all('---Original Message .*', 'FORWARDED_MESSAGE') %>%
str_replace_all('--- Forwarded by .*', 'FORWARDED_MESSAGE') %>%
str_replace_all('From: .*', 'FORWARDED_MESSAGE') %>%
str_replace('To:.*', 'FORWARDED_MESSAGE') %>%
str_replace_all('\\S*@\\S*', 'EMAIL_ADDRESS'))
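To make the split easier to follow, here is a minimal sketch of the same idea on a single made-up message. It uses base R functions (regexpr(), substr(), gsub()) rather than stringr so that it runs on its own; the msg string and the example.com address are invented for illustration.

```r
# A made-up message in the same layout as the `message` variable
msg <- paste0("Message-ID: <1.JavaMail>\nSubject: lunch\n\n",
              "Are you free at noon?\nReply to someone@example.com\n\n",
              "From: Someone Else\nOld text")

# Position of the first blank line (two consecutive newlines)
split_at <- as.integer(regexpr("\n\n", msg, fixed = TRUE))

# Everything before the blank line is the header ...
header <- substr(msg, 1, split_at - 1)
# ... and everything after it is the body
body <- substr(msg, split_at + 2, nchar(msg))

# Flatten newlines and tabs, cut off the quoted email, mask addresses
body <- gsub("\n|\t", " ", body)
body <- sub("From: .*", "FORWARDED_MESSAGE", body)
body <- gsub("\\S*@\\S*", "EMAIL_ADDRESS", body, perl = TRUE)

print(header)
print(body)
```

Because the newlines are flattened before the quoted email is cut off, the pattern 'From: .*' removes everything to the end of the body, not just the rest of one line.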
Now we will split the header into its separate metadata fields.
enron_emails <- enron_emails %>%
mutate(date = str_extract(header, 'Date:.*') %>%
str_replace('Date: ', '') %>%
str_replace('.+, ', '') %>%
strptime(format = '%d %b %Y %H:%M:%S %z') %>%
as.POSIXct()) %>%
mutate(from = str_extract(header, 'From:.*') %>%
str_replace('From: ', '')) %>%
mutate(to = header %>% str_replace_all('\n|\t', ' ') %>%
str_extract('To:.*Subject:') %>%
str_replace_all('To: |Subject:', '')) %>%
mutate(subject = str_extract(header, 'Subject:.*') %>%
str_replace('Subject: ', '')) %>%
mutate(xfrom = str_extract(header, 'X-From:.*') %>%
str_replace('X-From: ', '')) %>%
mutate(xto = str_extract(header, 'X-To:.*') %>%
str_replace('X-To: ', '')) %>%
mutate(xcc = str_extract(header, 'X-cc:.*') %>%
str_replace('X-cc: ', '')) %>%
mutate(xbcc = str_extract(header, 'X-bcc:.*') %>%
str_replace('X-bcc: ', ''))
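The date handling is the least obvious part of the block above, so here is a small base R sketch of it on one Date line copied from the sample message shown earlier. It assumes an English locale (so that %b recognises 'May'), and it parses into UTC purely so that the printed result does not depend on the machine’s time zone.

```r
# A Date line in the format seen in the headers
date_line <- "Date: Wed, 23 May 2001 08:35:00 -0700 (PDT)"

# Drop the field name, then everything up to and including the weekday's comma
stamp <- sub("Date: ", "", date_line, fixed = TRUE)
stamp <- sub(".+, ", "", stamp)

# Parse day, month abbreviation, year, time and UTC offset;
# the trailing ' (PDT)' is not part of the format and is ignored
parsed <- as.POSIXct(strptime(stamp, format = "%d %b %Y %H:%M:%S %z", tz = "UTC"))

print(format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

The %z field carries the offset from UTC, so two emails sent at the same instant from different time zones end up with the same parsed time.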
In order to analyze the sentiment of each email we will need to split the body of each email into its individual words. We will be using the AFINN sentiment lexicon to score each word with either a positive number or a negative number depending on the individual word’s sentiment.
To do this, we will create a matrix that gives the number of times each word shows up in each email. We can create such a matrix by using the tm package’s DocumentTermMatrix() function. A document term matrix is a matrix that provides a word count for each document provided. For us, a document is the body of a single email. Before we can create the document term matrix, however, we will need to convert the vector of email bodies into a corpus by using the Corpus() function. A corpus is a data structure for a collection of texts, and converting a character vector into a corpus is necessary for many text-mining tasks.
We will need to provide a control list for the document term matrix that accounts for punctuation, numbers, and stop words. A stop word is a word that shows up so frequently in the English language that it doesn’t have any analytical value. For example, ‘the’ and ‘a’ are neither positive nor negative words, and they are used in almost every sentence, so they can’t be used to differentiate how different people write.
library(tm)
dtm.control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = stopwords('english')
)
enron_dtm <- Corpus(VectorSource(enron_emails$body)) %>%
  DocumentTermMatrix(control = dtm.control)
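For readers who have not met a document term matrix before, the following base R sketch builds one by hand for three made-up documents. It is not how tm implements DocumentTermMatrix(); the docs vector and the two-word stop list are hypothetical and only meant to show the shape of the result.

```r
# Three tiny made-up "email bodies"
docs <- c("please send the agreement",
          "thanks for the agreement",
          "please call")

# A toy stop word list (tm's stopwords('english') is far longer)
stop_words <- c("the", "for")

# Split each document into lowercase words and drop the stop words
tokens <- lapply(strsplit(tolower(docs), " ", fixed = TRUE),
                 function(words) words[!words %in% stop_words])

# Count every (document, term) pair: rows are documents, columns are terms
doc_id <- rep(seq_along(tokens), lengths(tokens))
dtm <- table(document = doc_id, term = unlist(tokens))
print(dtm)
```

Most cells are zero even in this tiny example, which is why real document term matrices are stored in a sparse format.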
Once the document term matrix is created, we will need to convert the matrix into a tidy data frame using tidytext’s tidy() function. The tidy data frame will have three variables: document, term, and count. The document variable identifies an email body, where a document of 1 is the first email body in our enron_emails data frame. The term variable contains each word found in each document. Each document-word combination is given its own row. The count variable is the number of times a particular term (or word) is found in a document.
enron_sentiments <- tidy(enron_dtm)
enron_sentiments %>% head()
## # A tibble: 6 × 3
## document term count
## <chr> <chr> <dbl>
## 1 1 fuel 1
## 2 1 include 1
## 3 1 tetco 1
## 4 1 tickets 1
## 5 1 updated 1
## 6 1 usage 1
Once we create the tidy data frame we will join it with the AFINN lexicon and create the score variable by multiplying each word’s sentiment score by the number of times the word appears in a document. We can then sum the scores of all the words within each document to create a sentiment variable for each document.
enron_sentiments <- enron_sentiments %>%
  inner_join(get_sentiments('afinn'), by = c(term = 'word')) %>%
  mutate(score = score * count) %>%
  group_by(document) %>%
  nest() %>%
  mutate(sentiment = sapply(seq_along(.$data), function(i){
    .$data[[i]]['score'] %>%
      sum()  # total score of all scored words in the document
  })) %>%
  select(document, sentiment)
enron_sentiments %>% head()
## # A tibble: 6 × 2
## document sentiment
## <chr> <dbl>
## 1 2 -2
## 2 3 3
## 3 4 5
## 4 6 1
## 5 9 4
## 6 11 1
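The scoring step can be sketched on a toy example as follows. This uses base R’s merge() and aggregate() instead of dplyr, and a hypothetical three-word lexicon standing in for AFINN; the words, scores, and counts are made up for illustration.

```r
# Tidy word counts: one row per document-term pair (made-up values)
counts <- data.frame(document = c("1", "1", "2", "2"),
                     term     = c("good", "problem", "thanks", "agreement"),
                     count    = c(2, 1, 1, 3),
                     stringsAsFactors = FALSE)

# A hypothetical three-word lexicon standing in for AFINN
lexicon <- data.frame(word  = c("good", "problem", "thanks"),
                      score = c(3, -2, 2),
                      stringsAsFactors = FALSE)

# Inner join: terms missing from the lexicon ('agreement') drop out
scored <- merge(counts, lexicon, by.x = "term", by.y = "word")

# Weight each word's score by how often it appears, then sum per document
scored$score <- scored$score * scored$count
sentiment <- aggregate(score ~ document, data = scored, FUN = sum)
print(sentiment)
```

A document whose words are all missing from the lexicon disappears in the inner join, which is why some document numbers are absent from the output above.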
The document term matrix numbers the documents according to their order of appearance in the data frame, so in order to combine the sentiment scores with our original enron_emails data frame we will need to create a new variable called document that runs from 1 to nrow(enron_emails).
We will also clean the date variable so we can analyze the sentiments for each day of the week and each month. We can do this by using lubridate’s wday() and month() functions.
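library(lubridate)
enron_emails <- enron_emails %>%
  mutate(weekday = wday(date, label = TRUE)) %>%
  mutate(month = month(date, label = TRUE)) %>%
  mutate(document = 1:nrow(.) %>% as.character()) %>%
  left_join(enron_sentiments, by = 'document')  # reconstructed step: attach each email's sentiment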
Emails that don’t have any sentiment are given NA values. We will need to convert these values into 0.
enron_emails$sentiment[is.na(enron_emails$sentiment)] = 0
Some of the emails were sent to multiple people. We should split the to variable so that each recipient of an email is accounted for. Then we can find the average email sentiment over the course of all the correspondence between two people.
enron_emails <- enron_emails %>%
mutate(to = str_split(to, ',')) %>%
unnest() %>%
group_by(from, to) %>%
nest() %>%
mutate(mean_sentiment = sapply(seq_along(.$data), function(i){.$data[[i]]$sentiment %>% mean()})) %>%
mutate(correspondence = sapply(seq_along(.$data), function(i){.$data[[i]] %>% nrow()}))
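The reshaping above is easier to see on a tiny example. The sketch below does the same thing in base R for three made-up emails: it gives each recipient its own row and then computes the mean sentiment and the number of emails for each sender-recipient pair. All names and values are hypothetical.

```r
# Made-up emails: a sender, comma-separated recipients, one sentiment each
emails <- data.frame(from      = c("a", "a", "b"),
                     to        = c("x,y", "x", "y"),
                     sentiment = c(4, 0, -2),
                     stringsAsFactors = FALSE)

# One row per (email, recipient): repeat each email once per recipient
recipients <- strsplit(emails$to, ",", fixed = TRUE)
long <- data.frame(from      = rep(emails$from, lengths(recipients)),
                   to        = unlist(recipients),
                   sentiment = rep(emails$sentiment, lengths(recipients)),
                   stringsAsFactors = FALSE)

# Mean sentiment and number of emails for every sender-recipient pair
mean_sent <- aggregate(sentiment ~ from + to, data = long, FUN = mean)
n_emails  <- aggregate(sentiment ~ from + to, data = long, FUN = length)
pairs <- data.frame(mean_sent[c("from", "to")],
                    mean_sentiment = mean_sent$sentiment,
                    correspondence = n_emails$sentiment)
print(pairs)
```

An email sent to two people counts once towards each pair, so the correspondence totals can add up to more than the number of emails.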
Let’s check the mean sentiment for each correspondence. Since there are a lot of email recipients, we will need to shrink the number of ‘to’ fields. I think the best way to do this is by focusing only on correspondence that contains more than 10 emails. This lets us focus on people who write to each other somewhat frequently.
ggplot(enron_emails %>% filter(correspondence > 10), aes(to, from)) +
geom_point(aes(color = mean_sentiment)) +
scale_color_gradient2(low = 'brown',
mid = 'yellow',
high = 'blue') +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
We don’t have too much variation in the sentiments. This shouldn’t be too surprising since we’re analyzing work emails.
We can also see whether an employee wrote a lot of emails to a single person, or multiple people, or both.
ggplot(enron_emails %>% filter(correspondence > 10), aes(to, from)) +
geom_point(aes(size = correspondence)) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Vince Kaminski seems to be sending a lot of emails to one particular person, while Kay Mann, Eric Bass, and Chris Germany seem to be sending multiple emails to many different people.
Let’s look into this further…
enron_emails %>%
unnest() %>%
group_by(from, to, correspondence) %>%
nest() %>%
arrange(desc(correspondence)) %>%
select(-data)
## # A tibble: 16,380 × 3
## from to correspondence
## <chr> <chr> <int>
## 1 1059
## 2 412
## 3 291
## 4 265
## 5 229
## 6 229
## 7 215
## 8 210
## 9 179
## 10 170
## # ... with 16,370 more rows
It turns out that Vince is forwarding many of his work emails to his personal email account. We should probably get rid of this correspondence because it doesn’t actually represent email communication between two different people.
Also, to improve the visualization and provide a more concise dataset to model, we should only keep the top ten most prolific email writers.
top_ten_writers <- enron_emails %>%
unnest() %>%
filter(!(str_detect(from, 'vince.kaminski') & str_detect(to, 'vkaminski'))) %>%
group_by(from, body) %>%
nest() %>%
group_by(from) %>%
nest() %>%
mutate(tot_emails_sent = sapply(seq_along(.$data), function(i){
nrow(.$data[[i]])  # reconstructed body: count the emails nested under each sender
})) %>%
arrange(desc(tot_emails_sent)) %>%
.[1:10,]
Let’s visualize both the sentiment and correspondence once more, but with just the top ten email writers.
ggplot(top_ten_writers %>% unnest(), aes(to, from)) +
geom_point(aes(color = mean_sentiment)) +
scale_color_gradient2(low = 'brown',
mid = 'yellow',
high = 'blue') +
theme(axis.text.x = element_blank(), axis.ticks = element_blank())
I will need to check my code on this one. The mean_sentiment scale changed drastically… If everything is correct, then we learn that Susan Scott is a more positive writer in general and Chris Germany is really positive with one other person.
top_ten_writers %>%
unnest() %>%
arrange(desc(sentiment)) %>%
.$body %>%
.[1]
It turns out the most positive email was a compilation of five postings to producers in West Virginia. While the overall tone of these postings is cordial, I wouldn’t say that they are particularly positive. This is an instance where the sheer number of words in a document can skew its sentiment score.
top_ten_writers %>%
unnest() %>%
arrange(desc(sentiment)) %>%
.$body %>%
.[2]
## [1] "We got back from Dallas yesterday about 5:00 and it was a fun and very fruitful trip for the team. Schools are categorized by size - small, medium, large and super. Klein is super-sized. The Bearkadettes won first place and the grand champion trophy in the super category. Best in Category awards are given for each category of team dance entered, regardless of school size. Klein won best in category for all four team dances entered - kick, pom, military and jazz - a clean sweep. Klein also had the first place winner in the solo category and first place in the duet category. Also won first place in officer dances for super sized schools. They also won two special judges awards - not given out every time, but only when the judges want to especially recognize achievements. They got a special judges award for Creativity and Originality (commentary on both our fabulous costumes made by a group of Bearkadette moms and the choreography) and a special judges award for Perfect Score (a score of 100 awarded by all 3 judges for a particular routine - apparently very rare). The Perfect Score award was for their pom routine, where they are dressed like penguins (one of the ones that Meagan was in). The girls were pretty excited and the moms were cheering a lot, too. They all work so hard and spend so many hours practicing that it is nice to see that pay off!! I am ordering the video from competition, so we can show you the routines sometime. I had a lot of fun. The girls are cute and I enjoyed getting to know some of the moms who were chaperoning as well. But it was nice to get back to Houston yesterday."
The second most positive email was recounting a school dance competition where the writer’s daughter’s team won first place. Unlike the previous email, this was not a professional email, the body of the email was relatively short, and the tone of the email was incredibly positive.
Let’s move on to the top-ten writers correspondence chart now that we removed Vince’s emails to himself.
ggplot(top_ten_writers %>% unnest(), aes(to, from)) +
geom_point(aes(size = correspondence)) +
theme(axis.text.x = element_blank(), axis.ticks = element_blank())
It turns out Kay Mann writes the most emails in general and also seems to write the most emails to individuals. Let’s verify this by checking all correspondence greater than 100.
top_ten_writers %>%
unnest() %>%
filter(correspondence > 100) %>%
group_by(from, to, correspondence) %>%
nest() %>%
arrange(desc(correspondence)) %>%
select(-data)
## # A tibble: 12 × 3
## from to correspondence
## <chr> <chr> <int>
## 1 412
## 2 291
## 3 265
## 4 229
## 5 215
## 6 210
## 7 179
## 8 170
## 9 139
## 10 133
## 11 122
## 12 108
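ggplot(top_ten_writers %>% unnest(), aes(month, sentiment)) +
  geom_boxplot()  # assumed layer: the original geom is not shown
ggplot(top_ten_writers %>% unnest(), aes(weekday, sentiment)) +
  geom_boxplot()  # assumed layer, as above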
We will use the top_ten_writers data frame for modelling. It is a subset of the original Enron Corpus with only 14238 observations. The model we will be using is called the Latent Dirichlet Allocation (LDA) model. LDA creates topics by analyzing the co-occurrences of words within different documents. If two words are frequently present in the same documents, then they are more likely to be part of the same topic. If two words are almost never present in the same document, then they are more likely to be part of different topics. It is important to note that a word can appear in multiple topics. LDA belongs to the Bayesian family of models, and you can learn more about the statistics involved in the Journal of Machine Learning Research.
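The co-occurrence signal described above can be made concrete with a toy example. The base R sketch below is not LDA itself; it only counts, for a made-up document term matrix, how many documents each pair of words shares, which is the kind of pattern a topic model picks up on. The words and counts are hypothetical.

```r
# Toy document-term counts: rows are documents, columns are words
dtm <- matrix(c(2, 1, 0, 0,
                1, 3, 0, 0,
                0, 0, 1, 2,
                0, 0, 2, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("gas", "power", "lunch", "weekend")))

# Presence (1) or absence (0) of each word in each document
present <- (dtm > 0) * 1

# Entry [i, j] is the number of documents containing both word i and word j
cooccur <- crossprod(present)
print(cooccur)
```

Here 'gas' and 'power' always appear together and never alongside 'lunch' or 'weekend', so a two-topic model would be expected to separate the two pairs.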
The topicmodels package provides the function LDA() for Latent Dirichlet Allocation (not to be confused with the similarly named R function lda(), which is used for Linear Discriminant Analysis). LDA() requires a document term matrix as an input. When creating the document term matrix, we will need to remove sparse terms; that is, we will need to remove words that rarely show up in the Enron Corpus. Removing infrequent words is necessary in order to use the LDA() function.
dtm.control = list(
  tolower = TRUE,
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = stopwords('english'),
  weighting = weightTf,
  seed = 0
)
enron_dtm <- Corpus(VectorSource(top_ten_writers$body)) %>%
  DocumentTermMatrix(control = dtm.control) %>%
  removeSparseTerms(.999)
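To show what removing sparse terms means in practice, here is a base R sketch of the idea on a made-up matrix. It assumes that a term is kept when its sparsity (the share of documents that do not contain it) is below the threshold, which is my reading of removeSparseTerms(); the matrix and the 0.7 threshold are hypothetical.

```r
# Toy document-term counts: 5 documents, 3 terms
dtm <- matrix(c(1, 0, 2,
                1, 0, 0,
                3, 0, 0,
                1, 1, 0,
                2, 0, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("enron", "tetco", "fuel")))

# Sparsity of a term: share of documents in which it does not appear
sparsity <- colMeans(dtm == 0)

# Keep only the terms whose sparsity is below the chosen threshold
threshold <- 0.7
dense_dtm <- dtm[, sparsity < threshold, drop = FALSE]
print(colnames(dense_dtm))
```

Under the same reading, a threshold of .999 on 14238 documents would keep roughly those words that appear in 15 or more emails.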
Once the document term matrix is created, we can make the model with LDA(). We will arbitrarily choose to create 4 different topics. Once the model is created, we can see the most frequently used words in each topic by using the terms() function.
enron_body_lda <- LDA(enron_dtm, k = 4)
terms(enron_body_lda, 20)
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "forwardedmessage" "will" "will" "deal"
## [2,] "emailaddress" "enron" "get" "know"
## [3,] "thanks" "agreement" "know" "let"
## [4,] "kay" "may" "good" "thanks"
## [5,] "vince" "power" "can" "forwardedmessage"
## [6,] "please" "can" "think" "kate"
## [7,] "ben" "gas" "just" "deals"
## [8,] "can" "group" "like" "just"
## [9,] "call" "capacity" "going" "changed"
## [10,] "send" "also" "time" "need"
## [11,] "john" "project" "week" "ive"
## [12,] "email" "energy" "dont" "now"
## [13,] "shall" "one" "well" "please"
## [14,] "enron" "rate" "work" "ill"
## [15,] "corp" "ena" "back" "change"
## [16,] "sent" "business" "one" "ces"
## [17,] "respond" "market" "see" "price"
## [18,] "north" "company" "next" "new"
## [19,] "shirley" "contract" "hope" "contract"
## [20,] "america" "risk" "last" "term"
It would be interesting to see the relationships between different employees within Enron. The best way to do this is by filtering for only Enron employees within the to variable. Better yet, let’s filter for email recipients who have a sent email on record in the Enron Corpus. This guarantees that there is at least a one-way connection.
enron_network <- enron_emails %>%
unnest() %>%
group_by(from, to, correspondence) %>%
nest() %>%
filter(to %in% from) %>%
select(-data) %>%
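mutate(correspondence = sapply(correspondence, function(x){
if(x < 9){
'1-8'   # placeholder label: the original value is missing
} else if ( x < 17) {
'9-16'  # placeholder label
} else {
'17+'   # placeholder label
}}))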
ggraph is a new ggplot2 extension that is all the rage. It can make pretty neat network graphs pretty easily. However, proper use of ggraph depends on the user being literate in the igraph package. It is absolutely worth checking out Katherine Ognyanova’s igraph tutorial ‘Network Analysis and Visualization with R and igraph’.
graph <- graph_from_data_frame(enron_network)
ggraph(graph, layout = 'kk') +
geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_node_point() +
theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())
ggraph(graph, layout = 'fr') +
geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_node_point() +
theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())
ggraph(graph, layout = 'linear') +
geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_node_point() +
theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())
ggraph(graph, layout = 'linear', circular = T) +
geom_edge_fan(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_edge_loop(aes(color = correspondence), width = 1, arrow = arrow(length = unit(4, 'mm')), start_cap = circle(3, 'mm'), end_cap = circle(3, 'mm')) +
geom_node_point() +
theme(axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank())