Spelling-Uncorrector

I present perhaps my most useless project to date: a tool which takes correctly spelt (English) text and introduces random, often elaborate spelling errors.

Try meeh

Type a (correctly spelt) sentence below and watch as it is ruined. Click the output text to randomise the spelling.

How does it work?

The spelling uncorrector works by trying to find homophones of the words in the input. Homophones are words which are pronounced the same way but spelt differently, for example "write", "right" and "rite". Rather than just substituting homophones which are real words, the spelling uncorrector takes it a step further and also uses made-up words like "ruyte" and "wryte".

In order to check if two words are homophones, you need to see if they're pronounced the same way. Doing this manually would be tedious, so instead we can abuse a text-to-speech programme to automate the task. Espeak has a mode in which, instead of speaking, it outputs text in the International Phonetic Alphabet (IPA). To try it out, open up a terminal and run:

$ espeak -q --ipa -v en-gb

Type a word and press enter, and espeak will print out the IPA for that word. Press Ctrl+D to exit. For example, entering "write", "right", "rite", "ruyte" or "wryte" will all produce "ɹˈaɪt".
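
If you'd rather query espeak from Python than type into a terminal, you can shell out to it. Here's a minimal sketch using the subprocess module (the word_to_ipa helper is just for illustration, not part of the project):

>>> import subprocess
>>> def word_to_ipa(word):
...     # Ask espeak for the IPA of a word instead of speaking it aloud.
...     result = subprocess.run(
...         ["espeak", "-q", "--ipa", "-v", "en-gb", word],
...         capture_output=True, text=True, check=True,
...     )
...     return result.stdout.strip()
...
>>> word_to_ipa("write")
'ɹˈaɪt'
>>> word_to_ipa("wryte")
'ɹˈaɪt'

Spawning a process per word is far too slow for the bulk runs below, which is why the pipelines that follow stream everything through a single espeak process instead.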

Now we have a way of discovering homophones automatically: we can simply group together all words whose IPA representation is the same, since these must be pronounced identically. If we generate the IPA for every word in a dictionary we can find all homophones which are real words. To find the non-real-word ones (i.e. thoroughly misspelt words) we must also find the IPAs of all possible sequences of letters. Python makes it embarrassingly easy to generate the set of every possible word:

>>> from itertools import count, product
>>> from string import ascii_lowercase
>>> def all_words():
...     for length in count(1):
...         for word in product(ascii_lowercase, repeat=length):
...             yield "".join(word)
...
>>> for word in all_words():
...     print(word)
a
b
c
(...snip...)
aa
ab
ac
(...you get the idea...)

We can now set a computer running on the infinite task of finding the pronunciation of every possible word:

$ python all_words.py | espeak -q --ipa -v en-gb | gzip > all_words_ipa.txt.gz

In the pipeline above I've opted to gzip the output since it compresses well (it's repetitive text) and because it grows rapidly otherwise.

You'll want to stop this after a while... On my laptop espeak took about a millisecond to work out the IPA for one word. At that rate, generating every possible word will take roughly the following amounts of time (the back-of-the-envelope arithmetic is sketched just after the list):

  • All 1 letter words: < 1 second
  • All 2 letter words: approx 1 second
  • All 3 letter words: approx 20 seconds
  • All 4 letter words: approx 10 minutes
  • All 5 letter words: approx 3 hours
  • All 6 letter words: approx 3 days
  • All 7 letter words: approx 3 months
  • All 8 letter words: approx 7 years
  • All 9 letter words: approx 150 years
  • All 10 letter words: approx 4 millennia
  • ...
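
Those estimates are nothing more than 26^n words of length n at roughly a millisecond apiece; a couple of lines of Python reproduce them:

>>> # Rough estimate: 26**n words of length n at ~1 ms per word.
>>> from datetime import timedelta
>>> for length in range(1, 11):
...     print(length, timedelta(milliseconds=26**length))
...
1 0:00:00.026000
2 0:00:00.676000
3 0:00:17.576000
4 0:07:36.976000
5 3:18:01.376000
6 3 days, 13:48:35.776000
(...snip...)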

OK, so exponential growth isn't fun. Luckily, however, there are still plenty of short words in the English language (and it's much more fun to misspell short words anyway).

Next we can generate the IPAs for each word in the dictionary. On Unix systems the 'words' file conveniently provides a list of newline-separated dictionary words.

$ espeak -q --ipa -v en-gb < /usr/share/dict/words > dict_words_ipa.txt

Finally, we can use Python to create a lookup from dictionary words to a list of 'words' which are pronounced the same way:

>>> # Read dictionary words and their pronunciations
>>> from collections import defaultdict
>>> ipa_to_dict_words = defaultdict(list)
>>> with open("/usr/share/dict/words", "r") as dict_file:
...     with open("dict_words_ipa.txt", "r") as dict_ipa_file:
...         for dict_word, ipa in zip(dict_file, dict_ipa_file):
...             ipa_to_dict_words[ipa.strip()].append(dict_word.strip())
...
>>> # Find homophones already in the dictionary
>>> homophones = defaultdict(list)
>>> for ipa, words in ipa_to_dict_words.items():
...     for word in words:
...         homophones[word.lower()].extend(words)
...
>>> # Find homophones for dictionary words which occur in the "all_words"
>>> # list. Since the all words list is massive, we don't try and find
>>> # homophones for all words in this list, just dictionary ones. This
>>> # avoids the need to load the whole mapping into memory at once!
>>> import gzip
>>> try:
...     with gzip.open("all_words_ipa.txt.gz", "rb") as all_words_ipa_file:
...         for word, ipa in zip(all_words(), all_words_ipa_file):
...             for dict_homophone in ipa_to_dict_words[ipa.decode("utf-8").strip()]:
...                 homophones[dict_homophone.lower()].append(word)
... except EOFError:
...     # The gzip file will probably be corrupted at the end since we
...     # interrupted the pipeline...
...     pass
...
>>> # Save all homophones which aren't just themselves.
>>> import json
>>> with open("homophones.json", "w") as f:
...     json.dump({w: ws for w, ws in homophones.items() if ws != [w]}, f)

The resulting homophone dictionary can be used to look up the full variety of misspellings available for a given word.
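
As a rough illustration of how that lookup might be used (the uncorrect helper below is just a sketch, not the project's actual code, and it ignores punctuation and capitalisation):

>>> import json, random
>>> with open("homophones.json", "r") as f:
...     homophones = json.load(f)
...
>>> def uncorrect(sentence):
...     # Swap each word for a randomly chosen 'homophone', leaving words
...     # we don't have an entry for unchanged.
...     return " ".join(
...         random.choice(homophones.get(word.lower(), [word]))
...         for word in sentence.split()
...     )

Calling uncorrect("write right rite") might then produce something like "wryte ruyte right", though since the choice is random you'll get a different mangling each time.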