An explanation of Mass Email
We define Mass Email as any email that is sent to many different recipients without making the recipients aware of it.
Mass Email thus differs from Newsletters or Mailing Lists, since these are sent to many different recipients, but the recipients are aware of it. In fact, the whole idea behind Mailing Lists or Newsletters is that lots of different people receive the same content. Our Mass Email Detector, on the other hand, is meant to let people verify that an email is genuine in cases where they are not sure whether they are the only person who received it.
Furthermore, Mass Email is not just ordinary Spam either. While Spam is also sent to lots of recipients, it mostly consists of unsolicited advertising or of shady and malicious scams. Spam is comparatively easy to detect, since verbatim copies of the same content are sent to many recipients. Additionally, Spam is usually of rather poor quality and often contains words and expressions that give it away as Spam. The less straightforward cases of Mass Email, however, such as a job application that is sent to many different companies, are more difficult to detect. When sending the same application to many different companies, an applicant may choose to change a small percentage of the content in order to make the email appear more genuine than it actually is. In contrast to ordinary Spam filters, the Mass Email Detector tries to capture the finer nuances of what it means for two texts to be basically the same, so that users can verify that emails that appear genuine actually are.
An explanation of how we check if your mail is a mass email
Flee-Mail relies on a database of known emails that can only be accessed by the backend server. Whenever a user of the Flee-Mail Gmail Add-on clicks on the Flee-Mail icon to check if an email is genuine or not, that email [1] is sent to our backend server over an encrypted connection. The server then stores the email in the database and checks if the database contains any other emails from the same sender that also have the same content. If any such emails are found, the server sends the number of matches back to the user’s computer, where it is displayed in the Add-on sidebar.
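The store-then-count step described above can be pictured with a small sketch. This is only an illustration, not Flee-Mail's actual backend: the `EmailStore` class, its `check` method and the `exact_match` placeholder are hypothetical names, and an in-memory dictionary stands in for the real database.

```python
from collections import defaultdict
from typing import Callable


def exact_match(a: str, b: str) -> bool:
    """Placeholder comparison (assumption): equal after collapsing whitespace."""
    return a.split() == b.split()


class EmailStore:
    """Hypothetical in-memory stand-in for the backend database."""

    def __init__(self, same_content: Callable[[str, str], bool] = exact_match):
        # Maps a sender address to the contents of all emails seen from it.
        self._by_sender = defaultdict(list)
        self._same_content = same_content

    def check(self, sender: str, content: str) -> int:
        """Store the email and return how many earlier emails
        from the same sender have the same content."""
        earlier = self._by_sender[sender]
        matches = sum(1 for other in earlier if self._same_content(content, other))
        earlier.append(content)
        return matches


store = EmailStore()
print(store.check("applicant@example.org", "Dear team, I would love to join."))
print(store.check("applicant@example.org", "Dear team,  I would love to join."))
```

The comparison is passed in as a function so that a stricter or more tolerant notion of "same content" can be swapped in without touching the storage logic.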
Flee-Mail is specifically designed to detect emails that have been slightly personalised in order to make them appear more genuine than they actually are. This means that the backend server has to figure out whether two emails that are not exactly the same are really different emails, or whether they still basically contain the same content. We achieve this by measuring the overlap between any two emails. If this overlap is above 90 %, we conclude that two emails are basically the same and report them as matches to the user. If the overlap is below 90 %, we view them as different emails and don’t count the email in question as a match.
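The threshold decision can be sketched in a few lines. The function names below are hypothetical, and two details are assumptions because the text does not settle them: the overlap is measured against the longer of the two texts, and an overlap of exactly 90 % is treated as "not a match".

```python
THRESHOLD = 0.90  # the 90 % threshold named in the text


def overlap(distance: int, len_a: int, len_b: int) -> float:
    """Share of the text that does not have to be changed.

    Assumption: the number of changes is measured against the longer text.
    """
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0  # two empty texts are trivially identical
    return 1.0 - distance / longest


def is_match(distance: int, len_a: int, len_b: int,
             threshold: float = THRESHOLD) -> bool:
    """True if the overlap is strictly above the threshold (assumption)."""
    return overlap(distance, len_a, len_b) > threshold


print(is_match(5, 100, 100))   # 5 changes in 100 words
print(is_match(80, 100, 120))  # 80 changes between a 100- and a 120-word text
```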
If you are quite technical, you might be interested to hear that we use the Levenshtein Distance Algorithm to measure the overlap, but don’t worry if you haven’t heard of it. It is a rather straightforward measure. Levenshtein Distance simply calculates the minimum number of words [2] one needs to insert, delete or replace to transform some text, let’s call it text A, into some other text, which we might call text B. If the number of changes is low, the overlap between the texts is high, and vice versa.
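For readers who like to see it spelled out, here is a minimal sketch of Levenshtein Distance computed over words instead of characters. It is the textbook dynamic-programming version, not necessarily the code Flee-Mail runs; the function name `word_levenshtein` is made up for this example, and splitting on whitespace (so that punctuation stays attached to its word) is a simplifying assumption.

```python
def word_levenshtein(text_a: str, text_b: str) -> int:
    """Minimum number of word insertions, deletions and replacements
    needed to turn text_a into text_b."""
    a, b = text_a.split(), text_b.split()
    # previous[j]: distance between the words of `a` handled so far
    # (initially none) and the first j words of `b`.
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]  # turning i words into nothing takes i deletions
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # keep the word or replace it
            ))
        previous = current
    return previous[-1]


print(word_levenshtein("I would love to work at Acme",
                       "I would love to work at Globex"))
```

Only two rows of the table are kept in memory at a time, which is enough because each row depends only on the one before it.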
Here is an example. We might have some text A that is 100 words long, and calculating the Levenshtein Distance we find that we can transform it into text B if we replace 5 words with 5 different ones. That is 5 changes in 100 words, or, in other words, 5 % of differing text. The overlap would thus be the complement: 95 % of text that doesn’t have to be changed. Since this percentage is quite high and exceeds our threshold of 90 %, we would conclude that text A and text B are pretty much the same text and report back to the user that we found a match.
Now let’s imagine we have two more texts, let’s call them text C and text D. Like text A, text C is also 100 words long. We compare texts C and D and find that, in order to transform text C into text D, we would need to remove 60 words and insert 80 different words in their stead. That amounts to 60 replacements plus 20 further insertions, or 80 changes in total, and it makes text D 120 words long. Whether we measure those 80 changes against the 100 words of text C or the 120 words of text D, the overlap comes out at somewhere between 20 % and 33 %. Since this overlap is far below our threshold of 90 %, we conclude that, while there are slight similarities between the two texts, they are not the same. Therefore, we would not count text D as a match for text C.
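The two examples can also be written as a short calculation. The symbols are introduced here for illustration only: d is the word-level distance and n is the length of the reference text. Taking the longer of the two texts as n is an assumption, since the choice of reference length is not spelled out above; measured against the shorter text, the second result would be lower still.

```latex
\[
  \text{overlap} = 1 - \frac{d}{n}
\]
\[
  \text{A vs. B:}\quad d = 5,\; n = 100
  \;\Rightarrow\; 1 - \frac{5}{100} = 0.95 > 0.90
\]
\[
  \text{C vs. D:}\quad d = 60 + 20 = 80,\; n = 120
  \;\Rightarrow\; 1 - \frac{80}{120} \approx 0.33 < 0.90
\]
```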
[1] To allow the backend processing, the Add-on transfers the sender and recipient addresses as well as the email’s timestamp and content.
[2] Levenshtein Distance is more commonly used with characters rather than words, but we decided words would make more sense in our case. If we just considered characters, the difference between “house” and “mouse” would be 1, since you just need to replace one character (“h” with “m”). The difference between “house” and “mansion”, on the other hand, would be 6, since you need to replace 4 characters (“h”, “o”, “u” and “e” with “m”, “a”, “n” and “i”) and add 2 new ones (“o” and “n”). While this sort of reasoning is great when dealing with spelling errors, it doesn’t make that much sense for comparing the content of emails. We thus calculate the number of words that need to be changed, rather than the number of characters.