Update: I have written a part 2 of this blogpost!
This December, I was fascinated by the Boolean satisfiability problem (SAT) and its applications. I was looking for a problem that combined text analysis with SAT, and while searching for one, this answer on CS Theory caught my attention. There have been many crossword solvers which, given a list of solutions, can arrange them in the grid. My goal was to add an extra layer of difficulty: predicting the probable guesses from the clues provided by the crossword. Since I was already familiar with Python's various NLP libraries, it was the obvious choice for this project.
What does a computer need to solve a crossword?
Moby's Thesaurus,
WordNet (as bundled in NLTK),
glove-wiki-gigaword-100 (loaded via Gensim)
and Z3 Theorem Prover.
In a crossword, a clue is a word or sentence from which the crossword solutions (also known as guesses) are to be determined. For example:
Clue 1: Distant (6 letters)
Guesses: beyond, places, remote, become, planet
Clue 2: Indian Grandmaster (5 letters)
Guesses: barua, india, world, blitz, humpy, arjun, vidit, anand
The above guesses were produced by my code, with the answers being remote and anand.
To find probable guesses for a clue, two methods are used. If it is a one-word clue, the word is looked up in Moby's Thesaurus or passed to Gensim's most_similar() method. The Gensim implementation can be done as follows:
import gensim.downloader as api
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

word_vectors = api.load("glove-wiki-gigaword-100")
stop = stopwords.words('english') + list(string.punctuation)

clue = "distant"   # the one-word clue from the example above
guess_length = 6   # length of the word required for our crossword slot
topn = 100         # number of candidates returned by Gensim's most_similar()

pos_words = [word for word in word_tokenize(clue.lower()) if word not in stop]
# most_similar() returns (word, similarity) pairs; keep only words of the right length
probable_guesses = [pair for pair in word_vectors.most_similar(positive=pos_words, topn=topn)
                    if len(pair[0]) == guess_length]
To handle sentence clues, like Director of Raging Bull (8 letters) and Curtain fabric (5 letters), two implementations work well. The first calls the exact implementation above and uses the guesses collected in probable_guesses as an upper bound on the words queried through NLTK's wordnet.synsets() method. Cosine similarity is then measured between each definition returned by WordNet and the original clue, and checked against a threshold to decide whether the word is accepted. Cosine similarity can be computed with Gensim's n_similarity() method.
Using most_similar() followed by n_similarity() is based on my observation that the WordNet pass helps filter out some false positives.
The second method is to query Wikipedia's MediaWiki API with the clue and look for solutions in the titles and descriptions of the results.
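A sketch of this method might look as follows. The endpoint and parameters are the public MediaWiki search API; the helper names and the length-filtering heuristic are illustrative assumptions, not the project's exact code.

```python
import requests


def extract_guesses(search_results, guess_length):
    """Collect words of the required length from search titles and snippets."""
    guesses = set()
    for result in search_results:
        text = (result.get("title", "") + " " + result.get("snippet", "")).lower()
        for word in text.split():
            if len(word) == guess_length and word.isalpha():
                guesses.add(word)
    return guesses


def wiki_guesses(clue, guess_length):
    # Query the MediaWiki search API with the clue itself
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": clue, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return extract_guesses(resp.json()["query"]["search"], guess_length)

# e.g. wiki_guesses("Indian Grandmaster", 5) should surface names like "anand"
```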
Once probable guesses for each clue have been collected, an SMT solver like Z3 can be used to construct a solution. With the length constraint already handled above, two constraints remain: each slot must be filled by one of its probable guesses, and intersecting slots must agree on their shared letter.
I found that performing string operations in Z3 can be daunting 😄. Constraints can be simplified by encoding each guess as the list of its characters' decimal ASCII codes: "delhi" becomes [100, 101, 108, 104, 105]. A blog post following a similar implementation for a regex crossword solver can be found here.
The complete code for the project can be found on GitHub.
What kind of accuracy does the above model provide?
For simple to moderate clues, the clue-hunting accuracy is about 80%. As of 16th December 2019, the solver on certain occasions outputs a solution with one or two words differing from the required solution. This happens because the search bound is not tight enough; I am in the process of resolving this issue.
Can this method solve crosswords like the NY Times crossword?
Unfortunately, the answer is no. I have observed that sophisticated crosswords like the NY Times crossword require properties like tense detection and correction. They also require detecting whether the clue is singular or plural and applying the guesses accordingly (I have implemented singular/plural detection in my project; see the GitHub page). Some clues even require finding an anagram of the clue itself. These types of constraints blow up the required search space and complicate the guess-finding process. I am currently reading Dr. Fill: Crosswords and an Implemented Solver for Singly Weighted CSPs and will add a part 2 to this blog later.
I am reluctant to use any newspapers' past crossword data as they are proprietary. Nevertheless, I will try experimenting with those as well!
Here are some related links:
If you have any suggestions/questions, feel free to drop an email!