Email Analysis with Python 3 (Part II)

Welcome back! This is a sequel to Part I, where we covered making changes to our Gmail account, getting the subject and sender of each email, and visualising some of the email data.
The emphasis of this part is on getting the body of the emails.

If you read the first part and tried it out, I hope it was without hitches. If you did not but you are interested in learning how to get the body of the email, you can follow from here. Let's jump right in.

Prerequisites

Python 3

Pandas

A Gmail account

Getting The Data

This was covered in Part I, as it involves making changes to your Gmail account so that imaplib can work with it.

Step 1: Importing the required libraries to get the email data

Here we import the libraries we need, which are imaplib, email, getpass and pandas. If you do not have pandas, you can install it with pip install pandas.

import imaplib
import email
import getpass
import pandas as pd

Step 2: Gaining access to the email server

Here we log into the email server with our credentials.

username = input("Enter the email address: ")
password = getpass.getpass("Enter password: ")
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(username, password)
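If the credentials are rejected, imaplib raises imaplib.IMAP4.error. The sketch below wraps the login call so that a failed attempt prints a message instead of stopping the script. The helper name try_login is hypothetical and not part of imaplib.

```python
import imaplib


def try_login(connection, username, password):
    """Return True if login succeeds, False if the server rejects it.

    `connection` is any object with a login() method, such as the
    imaplib.IMAP4_SSL object created above. This helper is hypothetical.
    """
    try:
        connection.login(username, password)
    except imaplib.IMAP4.error as exc:
        # imaplib raises IMAP4.error when the server refuses the login
        print(f"Login failed: {exc}")
        return False
    return True
```

With the variables from this step, it could be called as try_login(mail, username, password) in place of mail.login(username, password).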

Step 3: Specifying the mailbox to get data from

Here we print out the mail list to see the available mailboxes and we select one.

print(mail.list())
mail.select("inbox")
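The raw output of mail.list() is a list of byte strings that is hard to read. As a rough sketch, the mailbox names can be pulled out with a regular expression. The sample lines are made up to resemble what Gmail typically returns, and the helper name mailbox_names is hypothetical.

```python
import re

# Each line looks like: (flags) "delimiter" "mailbox name"
LIST_LINE = re.compile(r'\((?P<flags>.*?)\) "(?P<delimiter>[^"]*)" (?P<name>.*)')


def mailbox_names(list_lines):
    """Extract mailbox names from the data part of a list() response."""
    names = []
    for raw in list_lines:
        match = LIST_LINE.match(raw.decode("utf-8"))
        if match:
            # Names are usually quoted; drop the surrounding quotes
            names.append(match.group("name").strip('"'))
    return names


# Made-up sample in the shape of the second element returned by mail.list()
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
    b'(\\All \\HasNoChildren) "/" "[Gmail]/All Mail"',
]
print(mailbox_names(sample))
```

Any of the printed names can then be passed to mail.select().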

Step 4: Searching and Fetching the data

The block of code below searches the selected mailbox with the given criteria, fetches the matching emails and stores them in the variable messages.
Here I am searching for emails from FreeCodeCamp.

result, numbers = mail.search(None, '(FROM "[email protected]")')
# numbers[0] is a space-separated byte string of message numbers
uids = numbers[0].split()
uids = [uid.decode("utf-8") for uid in uids]
result, messages = mail.fetch(','.join(uids), '(RFC822)')
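When the search matches nothing, numbers[0] is empty and the fetch call is sent an empty message set, which the server is likely to reject. A minimal sketch of a guard is shown below. The helper name build_message_set is hypothetical.

```python
def build_message_set(search_data):
    """Turn the data part of a search() response into a fetch() message set.

    Returns None when the search matched nothing.
    """
    ids = search_data[0].split() if search_data and search_data[0] else []
    if not ids:
        return None
    return ",".join(item.decode("ascii") for item in ids)


# Made-up responses in the shape returned by mail.search()
print(build_message_set([b"1 4 9"]))
print(build_message_set([b""]))
```

In the code above, mail.fetch() would only be called when build_message_set(numbers) returns something other than None.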

Step 5: Preparing the data to be exported

The block of code below loops through the fetched emails and gets, for each one, the date it was received, who sent it, the subject and the body.

We use the walk() method in the email library to get the parts and subparts of the message.

We use the get_content_type() method to get the content type of each part in maintype/subtype form, for example text/plain.

We use the get_payload() method to get the content of the part as a string, or as bytes when decode=True is passed.

body_list = []
date_list = []
from_list = []
subject_list = []
for _, message in messages[::2]:
    email_message = email.message_from_bytes(message)

    # decode_header returns (value, charset) pairs; we use the first one
    subject, charset = email.header.decode_header(email_message['Subject'])[0]
    if isinstance(subject, bytes):
        subject = subject.decode(charset or "utf-8", errors="ignore")
    subject_list.append(subject)

    # Keep the first text/plain part so each email adds exactly one body
    body = ""
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            body = payload.decode(part.get_content_charset() or "utf-8", errors="ignore")
            break
    body_list.append(body)

    date_list.append(email_message.get('date'))
    sender = email_message.get('From')
    sender = sender.split("<")[0].replace('"', '')
    from_list.append(sender)
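To see walk(), get_content_type() and get_payload() on their own, without a mail server, here is a small sketch that builds a two-part message in memory and reads it back. The subject, sender and body text are made up, and the helper name first_plain_text is hypothetical.

```python
import email
from email.message import EmailMessage

# Build a small multipart message in memory (made-up content)
original = EmailMessage()
original["Subject"] = "Weekly update"
original["From"] = '"Example Sender" <sender@example.com>'
original.set_content("Hello, this is the plain text body.")
original.add_alternative("<p>Hello, this is the HTML body.</p>", subtype="html")

# Round-trip through bytes, as if the message had come from mail.fetch()
parsed = email.message_from_bytes(original.as_bytes())


def first_plain_text(message):
    """Return the decoded text of the first text/plain part, or ''."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


# walk() visits the multipart container first, then each subpart
content_types = [part.get_content_type() for part in parsed.walk()]
print(content_types)
print(first_plain_text(parsed))
```

The list of content types shows why the loop has to check for text/plain: the container and the HTML alternative are visited too.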

Here we convert the strings in date_list to datetime objects using the to_datetime() method. Because each timestamp has its UTC offset attached, we slice the offset off.
The retrieved information is then converted to a pandas DataFrame and exported to a CSV file.

date_list = pd.to_datetime(date_list)
date_list = [item.isoformat(' ')[:-6] for item in date_list]
data = pd.DataFrame(data={'Date': date_list, 'Sender': from_list, 'Subject': subject_list, 'Body': body_list})
data.to_csv('emails.csv',index=False)
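Slicing off the last six characters assumes every timestamp ends with an offset such as +01:00. As an alternative sketch, the standard library can parse a Date header directly and drop the offset explicitly. The sample header is made up, and the helper name to_naive_string is hypothetical.

```python
from email.utils import parsedate_to_datetime


def to_naive_string(date_header):
    """Parse an email Date header and return it without its UTC offset."""
    parsed = parsedate_to_datetime(date_header)
    # Drop the timezone information but keep the local clock time
    return parsed.replace(tzinfo=None).isoformat(" ")


print(to_naive_string("Mon, 18 Nov 2019 09:30:00 +0100"))
```

This keeps the same clock time as the slicing approach, so the resulting Date column would look the same.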

Data Cleaning

Now we are going to view the data and clean it where necessary to make it readable. First we read in the CSV file and view it:

data = pd.read_csv("emails.csv")
data.head()

The output can be seen below. Looking through it, there are escape characters in the Body column:

The function below removes the escape characters from the text, in this case "\r\n" as seen in the screenshot above. This makes the text more readable.

def clean_data(data, column, i):
    text = data.loc[i, column]
    # Empty cells are read back from the CSV as NaN, so only clean strings
    if isinstance(text, str):
        # Split on "\r\n" and join the pieces back with single spaces
        data.loc[i, column] = " ".join(text.split("\r\n"))
    return data

The code below uses the function above to clean every email body in the file.

for n in range(len(data)):
    new_data = clean_data(data, column="Body", i=n)
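The same cleaning can also be done on the whole column at once with the pandas string methods, without a loop. This is only a sketch on a made-up two-row DataFrame, not the approach used in this article.

```python
import pandas as pd

# Made-up sample standing in for the Body column read from emails.csv
sample = pd.DataFrame(
    {"Body": ["Hello,\r\nwelcome back.\r\n", "No line breaks here"]}
)

# Replace every "\r\n" with a space, then trim leading and trailing spaces
sample["Body"] = (
    sample["Body"]
    .str.replace("\r\n", " ", regex=False)
    .str.strip()
)
print(sample["Body"].tolist())
```

Passing regex=False treats "\r\n" as a literal string rather than a pattern.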

The output can be seen below:

You are advised to write a new function to match whatever escape characters you find in the Subject or Body of the emails you retrieve.

Conclusion

I encountered errors while writing this. The most frequent one was being unable to sign in even after following the instructions on the Google help page. This happened because I had more than one Gmail account signed in and I was not using my default account. In case you encounter the same problem, the solution is outlined below:

The instruction said, "If the tips above didn't help, visit https://www.google.com/accounts/DisplayUnlockCaptcha and follow the steps on the page." This opens in a new tab.

The link on the new tab was "https://accounts.google.com/b/0/DisplayUnlockCaptcha" where the digit 0 is for the default account logged in.

Check your accounts in the order in which they are listed and change the digit accordingly (e.g., "1" is the next account, and so on).

You can find the full code on GitHub below: