Can NLP help us decipher the enigmatic manuscript?
Last semester in school, I worked on a project where we attempted to build word embeddings from the Voynich manuscript and use them to learn something about the linguistic structure (or lack thereof) within the text. You can check out our codebase or our paper. What makes this methodology exciting is that building word embeddings is completely unsupervised, which makes it very appealing for decipherment-like tasks. If done properly, we should be able to visualize the word embedding space and understand something about the relationship between words and characters in the manuscript!
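To make the general idea concrete, here is a minimal, illustrative sketch (not our actual pipeline): train embeddings on a tokenized corpus with no labels at all, then project them into two dimensions for inspection. The toy sentences, the use of gensim's Word2Vec, and the hyperparameters below are all stand-ins.

```python
# Illustrative only: unsupervised embeddings on a toy corpus, projected to 2-D.
from gensim.models import Word2Vec       # assumes gensim >= 4.0
from sklearn.decomposition import PCA

# Stand-in "corpus": each sentence is a list of word tokens (here, a few
# frequent Voynichese words in EVA transcription, purely as an example).
sentences = [
    ["daiin", "chedy", "qokeedy", "shedy"],
    ["qokaiin", "chedy", "daiin", "okeedy"],
    ["shedy", "qokeedy", "okaiin", "daiin"],
]

# No labels anywhere: the model learns purely from token co-occurrence.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=200)

# Project the learned vectors to two dimensions so the space can be eyeballed.
words = model.wv.index_to_key
coords = PCA(n_components=2).fit_transform(model.wv[words])
for word, (x, y) in zip(words, coords):
    print(f"{word:>10}  {x:+.3f}  {y:+.3f}")
```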
However, several issues make this methodology more complicated for Voynich. First, there are problems with transcription and with word and sentence segmentation. Several digitized transcriptions of the manuscript exist, and they disagree in a surprising number of places. A second concern is data sparsity. Canonical methods for word embeddings are designed to be architecturally simple so that they can be applied to massive data sets (for example, all of Wikipedia). These models overcome their architectural shortcomings by being hit over the head with a lot of data: there is no inductive bias telling them that "relocate" and "relocates" are systematically related, but if the model gets to see all of Wikipedia, it should be able to figure this out. In a low-resource setting like Voynich, on the other hand, the distribution of words is sparse enough that we will probably never see every form of each verb. We need a more complex model that can learn a generalizable notion of morphology. It seems plausible that there is enough signal in the data for a model to actually learn something meaningful about Voynichese morphology, even if the data is too sparse to build a meaningful representation of the syntactic space.
There’s not much we can do to address the transcription concern beyond being principled about which transcription we adopt. There is, however, recent work on building morphologically informed word embeddings. In particular, our project used fastText embeddings, developed by Facebook AI Research. FastText works by learning separate embeddings for the character n-grams (subword sequences) of each word and representing the full word as the sum of these subword embeddings. Intuitively, subsequences that are morphemes should contribute a meaningful vector to the full word vector, while subsequences that do not encode morphological information should end up with representations close to zero. You can find the code to build these embeddings in the voynich2vec repository.
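The actual embedding code lives in the voynich2vec repository; the snippet below is only a hedged sketch of how subword-aware embeddings like these can be trained with gensim's fastText implementation. The transcription file name, tokenization, and hyperparameters are placeholders, not the settings we used.

```python
# Sketch only (assumes gensim >= 4.0). The file name, tokenization, and
# hyperparameters below are placeholders, not the actual voynich2vec settings.
from gensim.models import FastText

# Assume one transcribed line per row, word tokens separated by whitespace.
with open("voynich_transcription.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = FastText(
    sentences,
    vector_size=100,    # dimensionality of the word/subword vectors
    window=5,
    min_count=2,
    sg=1,               # skip-gram tends to behave better on small corpora
    min_n=2,            # shortest character n-gram used as a subword unit
    max_n=5,            # longest character n-gram used as a subword unit
    epochs=50,
)

# A word's vector is the sum of its character n-gram vectors, so even unseen
# forms get a representation built from their subwords.
# ("daiin" is a frequent Voynichese word in the EVA transcription.)
print(model.wv.most_similar("daiin", topn=5))
```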
Our original project tried to analyze these vectors in three different ways:
Sadly, I haven’t been able to work on this project much since the beginning of the summer, but I’m keeping track of a central list of ideas for future work here. In particular, I think the most promising direction is to extend our preliminary work on morphosyntactic analysis. As I describe in one of the GitHub issues, methods have been developed in the NLP literature for inducing morphology directly from a vector space of word embeddings, and they could be used to generate a full morphological profile of Voynichese.
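To gesture at what that might look like, here is a rough, hypothetical sketch in the spirit of offset-based morphology induction (along the lines of Soricut and Och, 2015), not a faithful implementation of any published method: collect pairs of words that differ by a suffix substitution, then score each candidate rule by how consistently its pairs point in the same direction of the embedding space. None of this is from our codebase; the thresholds and helper names are made up.

```python
# Hypothetical sketch (not from voynich2vec): score suffix-substitution rules
# by how consistently they correspond to a single direction in embedding space.
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def candidate_suffix_rules(vocab, min_stem=3, max_suffix=3, min_support=3):
    """Collect (suffix_a, suffix_b) rules, each supported by several word pairs
    that share a stem of at least `min_stem` characters."""
    rules = {}
    for w1, w2 in combinations(sorted(vocab), 2):
        i = 0
        while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
            i += 1
        s1, s2 = w1[i:], w2[i:]
        if i >= min_stem and s1 != s2 and len(s1) <= max_suffix and len(s2) <= max_suffix:
            rules.setdefault((s1, s2), []).append((w1, w2))
    return {r: pairs for r, pairs in rules.items() if len(pairs) >= min_support}

def rule_consistency(wv, pairs):
    """Mean cosine similarity between each pair's vector offset and the rule's
    average offset; high values suggest a systematic (morphological) relation."""
    offsets = np.array([wv[w2] - wv[w1] for w1, w2 in pairs])
    mean_offset = offsets.mean(axis=0)
    return float(np.mean([cosine(o, mean_offset) for o in offsets]))

# Hypothetical usage, assuming `model` is a trained fastText model like the
# one sketched earlier:
# rules = candidate_suffix_rules(model.wv.index_to_key)
# best = sorted(rules, key=lambda r: rule_consistency(model.wv, rules[r]), reverse=True)
# for rule in best[:10]:
#     print(rule, round(rule_consistency(model.wv, rules[rule]), 3))
```

The intuition behind the scoring is that a rule whose pairs all share nearly the same vector offset is a good candidate for a real morphological process, while a rule whose offsets scatter in different directions is probably just an orthographic coincidence.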