Note: Part 1 is available here!
It is surprising how a simple implementation with a humongous amount of data can outsmart it's complex and seemingly efficient counterparts! After writing the last blog-post, I started reading Dr.Fill: Crosswords and an Implemented Solver for Singly Weighted CSPs, a paper explaining a competitive crossword solver. While reading, a dataset on page 15 caught my attention. The dataset was http://www.otsys.com/clue.
Now, as of 20th December 2019, if you visit the above link there are three files provided. The problem with these files is that they are targeted towards people who want to create crosswords rather than those who want to solve them! A quick web search brought me here. This page provides the source code of the project hosted by Otsys. It appeared to me that the cluedata file contained their project's dataset. Unfortunately, I could not cleanly decode the file (latin-1 was the closest working solution) in Python.
The eureka moment for this project came when I found out that Otsys had previously uploaded another file apart from those three files, called clues.bz2. This file contained all the clues they were using for their crossword generator in a non-encoded fashion. Thankfully Internet Archive had a copy of that file!
After that, cosine similarity and sentence tokenization on this data gave me a never-imagined accuracy. For newspaper crosswords, the accuracy was nearly 100%!
Can this work against a customized crossword?
A customized crossword heavily inspired by newspapers like LA Times and NY Times should work well with the dataset mentioned above. From what I have learned, many crossword enthusiasts use Otsys' crossword-generator to create crosswords. However, for clues that appear to break the tradition, the method mentioned in Part-1 should work reasonably well (compared to the above method). I have yet to implement a hybrid implementation, nevertheless, in theory, I believe it can give much better accuracy for unique crosswords.
How is the runtime performance of the above-described method?
Based on my observation, using PyPy can drastically reduce the runtime (compared to CPython). In my case, the difference was 5x! Presently using PyPy, the solver can fetch clues for a 10-clue crossword and solve them in under a minute. Interestingly the SMT implementation using Z3 (as mentioned in the previous blog-post) takes about 2 milliseconds 😄.
I do have plans to rewrite the entire code in Golang. This has been a fun project!
If you have any suggestions/questions, feel free to drop an email!