Project Wulfila

Background outlining various problems and concerns that led to the development of Project Wulfila, a language reference tool for Germanic philology, soon to support other features.

Today I shipped the first feature for my reference tool, called Wulfila.

The search command performs a search on an internal SQLite database to retrieve terms that match the given pattern.

user@hostname$ wulfila search "thorpkarl"

þorp-karl
 m. = þorpari, a churl, Fms. x. 372, Þiðr. 231. 
 þorpkarl-ligr, adj. churlish, Hkr. iii. 129.

The push also includes the database, built from the 1875 edition of An Icelandic-English Dictionary by Richard Cleasby and Gudbrand Vigfuson, which is rather like the Oxford English Dictionary for scholars of Old Norse.

History

Early in 2020, before the outbreak of the novel coronavirus pandemic, I found myself with a problem both technical and linguistic.

The Germanic philology certification program at Signum University features a two-semester introductory course in the field. The first semester involves a very high-level survey of the half-dozen or so medieval languages of the Germanic family, which prominently includes Old English and Old Norse, as well as less accessible languages like Old Frisian and Old Saxon. In the second semester, we moved from this introduction to a deep dive into three Germanic languages that were only touched upon in the first semester and do not generally receive the same degree of attention: Gothic, Old Saxon, and Old High German.

With some experience in Latin, Classical Greek, and Old Norse, I found it a thoroughly enjoyable class, right in my wheelhouse. But in the second semester the rigors of the process began to wear on me. Unlike with those three previous languages, I did not have print editions of the introductory texts for quick reference. While I can usually guess well enough to differentiate nominative and genitive singulars from dative plurals, it was difficult to keep in mind the full paradigm of verb conjugations.

Initial Purpose

Wulfila began as a reference tool with the idea of being able to quickly look up grammatical paradigms. Think about Classical Latin, for instance. If you see a word with an -a ending, gender and number are easy guesses: feminine singular or neuter plural. If feminine singular, it's first declension and could be a nominative or vocative -a, or an ablative printed without the macron. If neuter plural, it's either nominative or accusative, depending on context.

The initial purpose of Wulfila was to store the paradigms of several languages in internal JSON files, to be pulled up for quick reference as needed.

Since we were studying Gothic in class at the time, it was given a name of some significance in Gothic: 𐍅𐌿𐌻𐍆𐌹𐌻𐌰 was an Orthodox Catholic bishop and missionary in the fourth century, who produced a translation of the Bible from Greek into Gothic. Like Saint Cyril after him, he even went so far as to develop a unique alphabet to represent the Germanic sounds. Unlike the work of Cyril, the Wulfila Bible is almost all that remains of its language.

Despite having definite utility, the project stalled as job-work and classwork took priority but, typical of me, I left the GitHub repository alive on the notion that I would probably come back later.

Current Problem

Following Germanic Philology II, I took the summer off and focused on work and other things and thence, in the autumnal semester, began my final course in the philology cert: a seminar-level course on the translation of medieval Eddic Poetry.

While I don't have my paradigms for Old Norse memorized any more than I memorized those of Gothic at the start of the year, I do have a much stronger handle on the former than I do on the latter. The greater problem of this semester is not so much the grammatical structure of the language as its lexicon.

Before each class I have to prepare twenty stanzas of verse to sight read. This easily eats up three to six hours a week, hours I'd much rather spend on personal projects or socially distanced engagements.

The main issue is the amount of time it takes to prepare text. In order to sight read, you need to first look the words up in the dictionary. Then, finding their meaning, determine from the inflection their grammatical arrangement. Lastly, the texts typically require some coaxing (usually done during class) to render into a reasonably comprehensible passage of Modern English.

In doing this work, I have three dictionaries to reference:

  • Wiktionary

  • The third volume of A New Introduction to Old Norse, which contains the glossary and index of names.

  • The Cleasby-Vigfuson Icelandic-English Dictionary

Wiktionary provides the most accessible results. It can sort through paradigms in some words and with others matches modern languages, providing a path for you to work back to medieval meanings. But it has problems. In practice, the search index requires an exact match on most entries. Only rarely with very common words is it able to parse a þ or ð from some alternate forms. It is almost certainly ignorant of the point that ø and ǫ are often both represented by ö in Modern Icelandic and normalized texts. Additionally, there is no clear path for searching by a specific language. Where a word like horskr may only exist in Old Norse, enn or um requires wading through entries for dozens of other languages. Most damning of all, its use requires jumping from a terminal where I'm writing out my translations to a web browser, which is always an aggravation when you need to do real work.

A New Introduction to Old Norse is a pretty good resource for learning the language. A couple of years ago, when that was the case for me, it was a great reference in doing my homework. But Eddic Poetry is not covered in the Introduction, and it is only rarely that I am referencing a common word with an entry contained therein.

My copy of Cleasby-Vigfuson is near eight hundred pages in print and requires a magnifying glass to read. While the entries provide a lot of useful information, it's a bear to read and even more difficult to reference.

The Germanic Lexicon Project provides a digital edition of several public domain dictionaries on old Germanic languages, including the 1875 edition of Cleasby-Vigfuson, but the technology is not quite where I'd like it for fast and easy reference.

When engineers get aggravated, a part of the mind slips into gear to ponder a solution. I need an easily searchable dictionary to facilitate my classwork. It happens that, as a writer and Python developer working for a database company, I have just the skills and background for thinking about how to solve this problem.

Solution

My solution to these problems was to rework Wulfila, bringing a secondary or even tertiary task of lexicon development forward, superseding its original purpose.

Data Extraction

It should come as no surprise to those who know me that I delight in the process of lexicography. Even in profane subjects like technology it holds. The prospect of building out the Reference pages for MariaDB leaves me warm and tingly inside.

That said, even I will acknowledge that, despite the sincere pleasure it would give me, I do not have the time to manually transcribe to plain text a dictionary that runs in excess of twenty-six thousand entries.

Cleasby-Vigfuson as a work in public domain is freely available from the Germanic Lexicon Project with the express note that readers are free to copy the text and use it as they like in software. Fortunately for me, GLP includes an 11MB text file with the complete dictionary, updated periodically with additions and corrections. This provided me with a pretty solid starting point, the delightful and arduous transcription work having been handled by others.

I began with the module to which I passed the URL for the Cleasby-Vigfuson text. This is downloaded to the data/ directory in the repository, where I can use Git to version control the text. The text is written to file only when the file doesn't exist or when the --force option is passed to the utility, to limit the requests I make of the GLP server. Remaining operations are performed locally on this text file.
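A minimal sketch of that download-once behavior, assuming the standard library's urllib.request rather than whichever module Wulfila actually uses. The function name fetch_text and its parameters are hypothetical, and the force argument stands in for the --force option.

```python
import urllib.request
from pathlib import Path


def fetch_text(url: str, dest: Path, force: bool = False) -> bool:
    """Download url to dest; return True only if a request was made.

    Hypothetical helper: the name and signature are illustrative.
    """
    if dest.exists() and not force:
        # Reuse the local copy so the remote server is not hit again.
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return True
```

Called a second time without force=True, the function returns False and leaves the file on disk untouched, so every later step works from the local copy.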

Next, I pre-process the Cleasby-Vigfuson text. The text by itself is not in a format that facilitates easy data extraction, so I need to break it down into a set of entries. I manage this by first splitting out the pages and then separating the individual lines from the pages. I then pass the line list through a grouping filter that identifies lines containing headwords (the term). These groups are collected into a Word class, which processes the lines to generate a string representation of the term and the term definition.
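The grouping step might look something like the sketch below. The details are assumptions made for illustration: that pages are separated by a form feed, and that a headword line opens with the term wrapped in <B> tags. The real GLP file may mark both differently, and this Word class is a simplified stand-in.

```python
import re
from typing import Iterable, Iterator, List

PAGE_BREAK = "\f"  # assumed page delimiter, not confirmed by the source
# Assumed headword shape: "<B>term,</B> rest of the line"
HEADWORD = re.compile(r"^<B>(?P<term>.+?),?</B>\s*(?P<rest>.*)$")


def iter_lines(text: str) -> Iterator[str]:
    """Split the text into pages, then yield each non-empty line."""
    for page in text.split(PAGE_BREAK):
        for line in page.splitlines():
            line = line.strip()
            if line:
                yield line


def group_entries(lines: Iterable[str]) -> List[List[str]]:
    """Start a new group at each headword line; attach other lines to it."""
    groups: List[List[str]] = []
    for line in lines:
        if HEADWORD.match(line):
            groups.append([line])
        elif groups:  # lines before the first headword are dropped
            groups[-1].append(line)
    return groups


class Word:
    """Simplified stand-in that derives a term and definition from a group."""

    def __init__(self, lines: List[str]) -> None:
        match = HEADWORD.match(lines[0])
        self.term = match.group("term")
        self.definition = " ".join([match.group("rest"), *lines[1:]]).strip()


SAMPLE = (
    "front matter\n"
    "<B>hval-fiskr,</B> m. a whale.\n"
    "<B>hvalr,</B> m., pl. hvalar;\n"
    "a whale.\f"
    "<B>hvass,</B> adj. sharp."
)
words = [Word(group) for group in group_entries(iter_lines(SAMPLE))]
```

Because the grouping runs over one flat list of lines, an entry that straddles a page break stays in a single group.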

The __iter__() magic method is implemented on the Word class in a way that facilitates rendering its class data into a dict.

def __iter__(self):
    # Yield key/value pairs so that dict(word) builds a dictionary
    yield "term", self.term
    yield "def", self.definition

Implementing this method allows me to convert a ``Word`` instance into a ``dict``. It also allows me to cache the list of words in the dictionary using the json module. The cache serves no purpose for later reference or performance, but it does generate an artifact to check during development in the event that something goes sideways with the SQLite database.
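To see why this works: dict() accepts any iterable of key/value pairs, so an object whose __iter__() yields two-item tuples converts directly. The sketch below uses a stripped-down stand-in for the Word class with hypothetical constructor arguments, and the sample entries are shortened.

```python
import json


class Word:
    """Stand-in for the real class; only what the example needs."""

    def __init__(self, term: str, definition: str) -> None:
        self.term = term
        self.definition = definition

    def __iter__(self):
        yield "term", self.term
        yield "def", self.definition


words = [
    Word("þorp-karl", "m. = þorpari, a churl."),
    Word("vestr", "n. the west."),
]

# dict() consumes the (key, value) pairs yielded by __iter__().
records = [dict(word) for word in words]

# ensure_ascii=False keeps þ and ð readable in the cached artifact.
cache = json.dumps(records, ensure_ascii=False, indent=2)
```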

Lastly, I connect through the sqlite3 module to a SQLite database file within the repository and use the list of dict instances to generate a series of INSERT statements. Wulfila runs INSERT statements on two tables for each entry. The first adds the term and definition to the ``entries`` table. The second adds the pattern and entry index to the patterns table. It inserts a pattern based on the headword found in Cleasby-Vigfuson and then inserts a second pattern based on normalized text with digraphs (that is, replacing a dictionary þ with the digraph th), ensuring that both forms are available to search. The normalized pattern also removes hyphens, which are useful in headwords but problematic in search.
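A rough sketch of the two-table load follows. The column names, the normalize() helper, and the load() function are all hypothetical; only the þ to th mapping and the hyphen removal come from the description above, and any further mappings are left out rather than guessed.

```python
import sqlite3

# Only the mapping named in the text; others would be assumptions.
DIGRAPHS = {"þ": "th", "Þ": "Th"}


def normalize(term: str) -> str:
    """Replace þ with its digraph and drop hyphens."""
    for char, digraph in DIGRAPHS.items():
        term = term.replace(char, digraph)
    return term.replace("-", "")


def load(conn: sqlite3.Connection, records) -> None:
    """Insert each record into entries, then its search patterns."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(id INTEGER PRIMARY KEY, term TEXT, definition TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patterns (pattern TEXT, entry_id INTEGER)"
    )
    for record in records:
        cursor = conn.execute(
            "INSERT INTO entries (term, definition) VALUES (?, ?)",
            (record["term"], record["def"]),
        )
        entry_id = cursor.lastrowid
        # A set avoids a duplicate row when both forms are identical.
        for pattern in {record["term"], normalize(record["term"])}:
            conn.execute(
                "INSERT INTO patterns (pattern, entry_id) VALUES (?, ?)",
                (pattern, entry_id),
            )
    conn.commit()
```

With this in place, a headword such as þorp-karl becomes reachable under both its dictionary form and thorpkarl.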

search Command

The actual search command is implemented through the main wulfila program, called through a sub-command as outlined in the specification. For each term argument passed to the program, Wulfila executes a SELECT statement to find matches in the patterns table and then a set of SELECT statements to find rows in entries with matching keys.
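The two-step lookup could be sketched as below. The schema, the sample rows, and the search() function are hypothetical, and an exact-equality match is assumed where the real query may use a different comparison.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, term TEXT, definition TEXT);
    CREATE TABLE patterns (pattern TEXT, entry_id INTEGER);
    INSERT INTO entries VALUES (1, 'þorp-karl', 'm. = þorpari, a churl.');
    INSERT INTO patterns VALUES ('þorp-karl', 1), ('thorpkarl', 1);
""")


def search(conn: sqlite3.Connection, term: str):
    """Find matching patterns first, then fetch each entry by its key."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT entry_id FROM patterns WHERE pattern = ?", (term,)
        )
    ]
    results = []
    for entry_id in ids:
        results.extend(
            conn.execute(
                "SELECT term, definition FROM entries WHERE id = ?",
                (entry_id,),
            ).fetchall()
        )
    return results
```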

In the future I would like to refactor these statements into a single JOIN operation, or refactor the tables to remove the need, or throw the whole thing up onto MariaDB Server, where it would benefit from my having greater familiarity with SQL optimization on InnoDB and ColumnStore.
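As an illustration of the first option, a single JOIN might look like this. The table and column names are hypothetical, and the sample data exists only to make the sketch runnable.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, term TEXT, definition TEXT);
    CREATE TABLE patterns (pattern TEXT, entry_id INTEGER);
    INSERT INTO entries VALUES (1, 'vestr', 'n. the west.');
    INSERT INTO patterns VALUES ('vestr', 1);
""")

# One round trip instead of one SELECT per matching pattern row.
# DISTINCT guards against an entry being listed twice if duplicate
# pattern rows exist.
JOIN_QUERY = """
    SELECT DISTINCT entries.term, entries.definition
    FROM patterns
    JOIN entries ON entries.id = patterns.entry_id
    WHERE patterns.pattern = ?
"""


def search_joined(conn: sqlite3.Connection, term: str):
    """Return (term, definition) rows for a pattern in a single query."""
    return conn.execute(JOIN_QUERY, (term,)).fetchall()
```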

The result of the SQL queries is a series of matching rows (currently, Wulfila returns an empty line when there are no matches, which I would also like to fix at some point).

Wulfila loops over the rows, building out a result string. Headwords are wrapped in Bash escape characters to put the text in bold. The <I> and <B> tags are converted to escape sequences that open italics and bold, and the closing tags to sequences that remove the markup. Then the textwrap module is used to break definitions into a series of lines at terminal width, which are then indented.
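A minimal sketch of that formatting pass, assuming standard ANSI escape codes (1 for bold, 3 for italics, 0 to reset) and a two-space indent. The render() function and its width parameter are hypothetical.

```python
import shutil
import textwrap

BOLD, ITALIC, RESET = "\033[1m", "\033[3m", "\033[0m"
TAGS = (("<B>", BOLD), ("<I>", ITALIC), ("</B>", RESET), ("</I>", RESET))


def render(term: str, definition: str, width=None) -> str:
    """Bold the headword, convert tags, then wrap and indent the body."""
    if width is None:
        width = shutil.get_terminal_size().columns
    for tag, code in TAGS:
        definition = definition.replace(tag, code)
    # The escape codes count toward the line width here, so wrapped
    # lines can come out visibly shorter than the terminal.
    body = textwrap.fill(
        definition,
        width=width,
        initial_indent="  ",
        subsequent_indent="  ",
    )
    return f"{BOLD}{term}{RESET}\n{body}"
```

Counting the invisible escape codes toward the width would produce ragged line lengths; measuring the visible text instead would need a custom wrapper.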

user@hostname$ wulfila search vestr

vestr
  n., gen. vestrs, [A.S., Engl., and Germ.
  west; Dan. vester] :-- the
  west; sól í vestri, K.Þ.K., Landn. 276; til
  vestrs, Sks. 179; í vestri miðju, Rb. 92; í vestr,
  towards west. II. as adv. to the
  westward; ríða vestr eða vestan, Ld. 126; vestr
  til Breiðafjarðar, Nj. 1: of western Icel., þykki
  þér eigi gott vestr þar, 11; vestr, in the
  west, Bs. i. 4, 31. 2.
  westwards, towards the British Isles, a
  standing phrase (cp. the use of Hesperia in
  Lat.); sigla vestr um haf, to sail westwards
  over the sea, Fms. i. 22, Orkn. 144; sækja vestr
  til Eyja, west to the Orkneys
  (Shetland), Orkn. 136; vestr fór ek of ver,
  I journeyed westward over the sea, Höfuðl.
  1; in which last passage it is even used of a voyage
  from Iceland to England; til ríkja þeirra er liggja
  vestr þar, Orkn. 144.

Lastly, the entire string is passed to stdout through a single ``print()`` call.

Note, the bold and italics are present in output but not copied out to the Sphinx rendering shown here.

Further Development

The Old Icelandic dictionary search for Wulfila is far enough along that I can begin using it as a tool in preparing translations for Eddic Poetry. A few outstanding issues remain.

Consider the entry for hvalr:

user@hostname$ wulfila search hvalr

hvalr
  m., pl. hvalar, Sks. 180 B; hvala, acc. pl., K. þ.
  K. 138; hvalana, Grág. ii. 387; hvala alla, 359;
  mod. hvalir: [A. S. hwœl;  Germ, wall-
  fiscb; Dan. hval] :-- a whale,
  Hým. 21, Rb. 1812. 17, Grág. 1. 159, ii.337: as to
  the right to claim whales as jetsum, see the law in
  Grág. and Jb., the Reka-bálkr and the Sagas passim,
  e. g. Grett. ch. 14,Eb. ch. 57, Háv. ch. 3, Fbr. ch.
  9 :-- there was always a great stir when a whale was
  driven ashore, flýgr fiskisaga ferr hvalsaga; í
  hvals líki, Fms. xi. 182, Fas. ii. 131; hvals auki,
  amber, old Dan. hvals- öky, Sks.;
  hvals hauss, a whale's head; hvals ván,
  expectation of a whale being drifted ashore,
  Vm. 174; hvals verð, a whale's value, Grág.
  ii. 373; hvala blástr, the blowing of a
  whale; hvala-kváma, arrival of shoals of
  whales, Eg. 135; hvala-kyn, a species of
  whale, Sks. 121; in Edda (Gl.) and in Sks. 1. c.
  no less than twenty-five kinds of whales are
  enumerated and described; hvala-skúfr, whale
  guts, a nickname, Landn.; hvala- vetr, a
  winter when many whales were caught, Ann. 1375:
  in local names, Hvals-á, Hvals-nes,
  Hval-fjörðr, Hvals-eyrr, Landn. etc.
  COMPDS: hval-ambr, m. whale amber.
  hval-fiskr, m. a whale.  b0296">

You'll note that it ends with the artifact b0296">. This is part of the page reference in the text retrieved from GLP. I need to add some logic to excise it during preprocessing so that it stays out of the way.
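One possible shape for that cleanup, assuming every artifact looks like the one above: a lowercase b, four digits, then a double quote and a closing angle bracket. That pattern is a guess from a single example, and strip_page_refs() is a hypothetical name.

```python
import re

# Assumed artifact shape, inferred from the single example b0296">
PAGE_REF = re.compile(r'\s*\bb\d{4}">')


def strip_page_refs(text: str) -> str:
    """Remove leftover page-reference fragments and their leading spaces."""
    return PAGE_REF.sub("", text)
```

If other pages leave differently shaped fragments, the pattern would need widening after checking the cached JSON artifact.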

Consider the text at the end: Hval-fjörðr and Hvals-nes are toponyms, likely deriving their names from whale sightings around geographic features. The hval-ambr is literally whale-amber, but likely refers to ambergris. These are terms that would certainly appear in text but which do not appear as explicit entries in Cleasby-Vigfuson. Finding them requires the user to consider word roots in their search. I would like to expand the patterns table to include these, so that someone searching for hvalfiskr would still find the hvalr entry with minimal effort.
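One way this could work is to scan a definition for hyphenated compounds built on the headword's stem and register each as an extra pattern. The sketch below is only an idea under stated assumptions: compound_patterns() is hypothetical, the stem is supplied by the caller rather than derived, and results are lowercased with hyphens removed.

```python
import re
from typing import List


def compound_patterns(definition: str, stem: str) -> List[str]:
    """Collect hyphenated compounds that begin with stem, in first-seen order."""
    # Allows an optional space after the hyphen, as in "hvala- vetr".
    pattern = re.compile(rf"\b{re.escape(stem)}\w*-\s?\w+", re.IGNORECASE)
    found: List[str] = []
    for match in pattern.finditer(definition):
        normalized = re.sub(r"[-\s]", "", match.group(0)).lower()
        if normalized not in found:
            found.append(normalized)
    return found


SAMPLE = (
    "hvala-kyn, a species of whale; in local names, Hvals-nes, "
    "Hval-fjörðr. COMPDS: hval-ambr, m. whale amber. hval-fiskr, m. a whale."
)
extra = compound_patterns(SAMPLE, "hval")
```

Each string in extra could then be inserted into the patterns table with the entry index of hvalr.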

Paradigm Reference

The paradigm reference is also a high priority. I will likely start in Old Norse and then work my way out into other languages. My conlang projects include a set of Germanic languages, so it would be reasonable to have these pages for reference as I start work building out that content.

Other Languages

In addition to Old Norse, GLP also provides a text file of *An Anglo-Saxon Dictionary*, an 1898 text by Joseph Bosworth and T. Northcote Toller for Old English, as well as a few other texts of interest, some of which may require more manual effort on my part.

I currently intend to relegate these extended projects to after I finish my certification.