Computational Linguistics, Digital Classics, and A Bunch of Stuff Over My Head…

I’m at the very beginning stages of a research project that’s got me pretty excited and pretty terrified. In speaking with some of my students who are beginning to read Greek, it occurred to me that the transition from reading the Greek of the New Testament to the Greek of the world in which the New Testament was formed can be a daunting one for many seminary students. There are many great New Testament Greek grammars, but the problem with them is just that: they are New Testament Greek grammars. The confidence gained by being able to read the Gospel of John is quickly dashed as soon as one picks up Philo, Plutarch, or Lucian, not to mention “real Greek” like Plato and Aristotle. Many students who have been through a couple of years of Greek want to make this transition, but there is no clear way to do so. Most are forced to do what I did: struggle through texts that are way over one’s head, clinging tightly to the Liddell-Scott-Jones lexicon and any (often nineteenth-century) English translation one can find.

My project is designed to address this problem. What I would like to do is help students identify which texts are closest to the NT in syntax and grammar, let them begin with those, and then have them branch out to more and more difficult texts as they become more competent. Now, this might be an easy project if one were simply to interview people who work with the NT and Greek literature. Heck, I myself have some sense of which texts are the place to start (the Apostolic Fathers? Epictetus?). But I am using this idea to expand my abilities and step into the daunting world of computational linguistics. It seems to me that there should be a way to analyze the syntax and grammar of the New Testament, and then create a system by which a given text could be compared to the NT as a standard. For example, if we could determine that the frequency of the optative mood in NT Greek is X, then we could point students to texts where the use of the optative is similar. If one could determine enough idiosyncrasies of the Greek of the NT, then one could begin to search for Greek texts that demonstrate a similar set of idiosyncrasies. Doing this by hand would be incredibly labor intensive, but computers are designed for exactly this type of repetitive work. If we could teach a computer to identify grammatical structures, it could give us these statistics quickly and over a large corpus of texts.
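To make the idea concrete, here is a minimal sketch of what such a comparison might look like. Everything here is hypothetical: the feature names and the toy data are invented for illustration, and in practice the per-token analyses would come from a real morphological analyzer rather than being typed in by hand.

```python
from collections import Counter

def mood_profile(tokens):
    """Relative frequency of each verbal mood in a list of analyzed tokens.

    Each token is a dict of morphological features; in practice these would
    come from a parser such as Morpheus, but the feature names here are
    invented for illustration.
    """
    moods = Counter(t["mood"] for t in tokens if "mood" in t)
    total = sum(moods.values())
    return {mood: count / total for mood, count in moods.items()}

def profile_distance(a, b):
    """Sum of absolute differences between two frequency profiles."""
    features = set(a) | set(b)
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in features)

# Toy data: an NT baseline against a hypothetical candidate text.
nt_tokens = [{"mood": "indicative"}] * 95 + [{"mood": "optative"}] * 5
candidate_tokens = [{"mood": "indicative"}] * 85 + [{"mood": "optative"}] * 15

# A smaller distance means the candidate looks more NT-like on this feature.
print(profile_distance(mood_profile(nt_tokens), mood_profile(candidate_tokens)))
```

The point is just this: once every token carries a morphological analysis, “how NT-like is this text?” reduces to comparing frequency profiles, which is exactly the kind of repetitive arithmetic computers are good at.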

But how do you do that?

Fortunately, many people much smarter than I am have been at this for a long time. There is an entire field called computational linguistics whose very purpose is the description of natural language by computers. And though working with a language like English is much easier than working with a highly inflected language like Greek, the super smart people have worked out ways of describing language patterns in Greek, too. There is an entire sub-field called digital classics that builds these tools; you’ve probably encountered a few of the well-known ones, such as the TLG and Perseus. Just like students learning to identify forms, computers can learn the “rules” of ancient languages, and they are therefore able to identify forms in texts just like a struggling Greek student. Rather than tag a text (in XML, for example) with grammatical forms, as many projects do, a truly computational approach encodes the rules of the language and then lets the computer apply those rules to whatever text you pass it. For my project, what’s exciting is that these tools already exist (see the Morpheus engine from Perseus), they are available as open source, and the corpus of digital texts is growing all the time. So I’m proposing to take an engine like Morpheus, work with it to identify the idiosyncrasies of NT Greek (much has been written about these, but this will be an interesting project in and of itself), and then build a system by which I can rank Greek texts using the NT as a baseline. How similar is a given text to the syntax and grammar (not to mention the vocabulary) of the NT?
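Here is a rough sketch of what the ranking step might look like once per-text frequency profiles have been extracted. The text names and numbers are invented, real profiles would come from running an analyzer like Morpheus over each corpus, and I have not settled on a similarity measure, so the cosine similarity below is just one plausible choice.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature-frequency profiles stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_against_baseline(baseline, corpus):
    """Rank texts by how closely their profiles match the baseline profile."""
    scored = [(name, cosine_similarity(baseline, profile))
              for name, profile in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented profiles; real numbers would come from analyzing each text.
nt_profile = {"optative": 0.01, "participle": 0.15, "infinitive": 0.08}
corpus_profiles = {
    "Didache": {"optative": 0.01, "participle": 0.14, "infinitive": 0.09},
    "Plato, Republic": {"optative": 0.05, "participle": 0.12, "infinitive": 0.10},
}

for name, score in rank_against_baseline(nt_profile, corpus_profiles):
    print(f"{score:.3f}  {name}")
```

Cosine similarity is attractive here because it compares the shape of two profiles without being skewed by the overall length of the texts, but other measures would be worth trying once real data is in hand.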

But I’m really doing this as an entrée into the world of digital classics. I have a million more ideas about how we can use computers to tell us something about the language of early Christianity, but to pursue those, I’ve got to get started on this little project. The crazy thing is that there is a ton of support out there to help me. Not only are projects like these well documented, but there is an enthusiastic and supportive user community there to help. So I’m going to get started with something that is likely well over my head. Yes, I have training in classics and in computer science, but I’ve never really put the two together. But as Luther would say, sin boldly. And so I will. This is the absolute beauty of the IT world: there are few barriers to entry. Got an idea? Got some sense of how to do it? Then just try. The worst that can happen is that you fail but learn something along the way.

Look back here for a lot more information, as I’ll be journaling this project here. The first step is already complete: I downloaded the Morpheus source code and am working through it. I also encourage you to read a fantastic analysis of digital classics, “Rome Wasn’t Digitized in a Day: Building a Cyberinfrastructure for Digital Classics,” by Alison Babeu of the Perseus Project. I learned an amazing amount from reading it.

Thoughts? Ideas? Help? Please send all those things along. I’m already sinning quite boldly.