How to Properly Filter Scunthorpe
How to Properly Filter Scunthorpe
Words embedded in other words is a classic profanity filtering challenge. This is commonly referred to as the Scunthorpe problem. In this entertaining piece, Kelly Strain uncovers the root of this problem, offering many insights into filtering along the way.
Join the DZone community and get the full member experience.Join For Free
According to The Independent, Facebook has banned users from boosting posts with the word Scunthorpe.
This left the band October Drift enraged when trying to promote their Scunthorpe show. John Jarman, an advertiser from Scunthorpe, also had a similar experience and shared his opinion on the topic:
"My ad not approved because of the word Scunthorpe. Seriously, Facebook, are your algorithms written by 5-year-olds? I don’t need to see what is and isn’t approved—there’s nothing wrong with the advert, it’s just the fact that word Scunthorpe is in it. As soon as I type the word 'Scunthorpe' I get an immediate warning that my ad contains inappropriate language."
This is not the first time this word has proven difficult to properly classify. In fact, Scunthorpe is rather infamous in the world of profanity filtering.
The Scunthorpe Problem
Words embedded in other words is a classic profanity filtering challenge. This is commonly referred to as the Scunthorpe problem. To recap, Scunthorpe is a town in England where residents have had a hard time registering for services like AOL and Google because the town’s name contains an embedded 4-letter swear word.
AOL first encountered this problem in 1996. It appears Facebook’s profanity algorithm is still stumped in 2016. Inversoft tackled this issue by introducing a Filter Mode. Here’s how it works...
CleanSpeak Filter Mode
For each entry on the CleanSpeak blacklist, a Filter Mode is assigned to tell the filter how the entry should be filtered, particularly when it’s next to other letters and numbers.
I’ll use smurf to demonstrate. For each example, the filter sees smurf and then goes through a process of elimination to decide whether to keep the match or not based on the Filter Mode of the smurf blacklist entry.
If smurf is embedded within random letters and numbers, there is only one filter mode that can be assigned to determine a match. That filter mode is Distinguishable. This is the most aggressive filter mode as entries are matched no matter what the entry is embedded in. Entries that are distinguishable will be filtered similarly when users try various tricks to get around the filter—the stuff CleanSpeak finds for you automatically. These tactics include:
- Replacement characters such as a dollar sign for an S
- Phonetic replacements like a ph in place of an f
- Repeat characters
- Leet speak, in this case, a pipe-underscore-pipe to create a U
- Inflections from parts of speech such as smurfing when smurf is configured as a verb
- And embedding punctuation and spaces between the letters.
Set entries to be Distinguishable when they are phrases, when the entry is unique, or when you simply want to filter it more aggressively and remove some of the false-positive checks that CleanSpeak provides. The latter is common with children’s applications.
What is unique? For a word to be unique, it means that the character sequence would not normally exist in user-generated messages. One trick to determine whether a word is unique is searching for it in the blacklist dictionary. As you can see, there are 550 results when searching for the word "ass" because a-s-s is embedded in many normal words. That’s hardly unique. When searching the dictionary for smurf there are zero results. Therefore, smurf is a good candidate to be Distinguishable, whereas ass is not.
Facebook is most likely using this type of aggressive, Distinguishable filter.
Let's look at smurfhead. The filter sees smurf and this time it’s embedded next to the word head. The filter returns smurfhead as a valid match when smurf is set to distinguishable OR embeddable. Embeddable treats the entry as a match when it’s embedded next to words or numbers. Since head is a word in the dictionary, smurfhead is a valid match.
The Embeddable filter mode solves the original Scunthorpe issue by setting the c-u-n-t entry to Embeddable. The filter knows that scunthorpe shouldn’t be a match. It isn’t a bad word because c-u-n-t isn’t next to a word or number.
Most entries in the blacklist should be set to Embeddable. Exceptions include phrases and unique words that should be Distinguishable and acronyms, numbers, ASCII art, and some small words. For those cases, the "safety" checks for the Embeddable filter mode can still be too aggressive. That’s where Non-Embeddable and Exact Match come in.
The Non-Embeddable filter mode finds an entry (and also looks for repeat characters, replacement characters/symbols, and separators) as long as the entry is not embedded.
The Exact Match filter mode takes it another step further and only treats an entry as a match if it is found exactly as it looks.
Scunthorpe Problem Solved
Facebook simply needs an intelligent profanity filter that can properly identify embedded entries so that the people of Scunthorpe and those around the world using the word in a genuine fashion can freely discuss this town.
Opinions expressed by DZone contributors are their own.