Transcription and Programmatic Markup for the DigitalSEL

Medieval writers were keenly aware that their basic mode of textual reproduction, scribal transmission, introduced all sorts of variation and errors into their texts.1 Ælfric of Eynsham, the most prolific Old English author, included a well-known colophon in many of his works in which he prays that:

hwá þas bóc awritan wylle þæt hé hí geornlice gerihte be ðære bysne
whoever wishes to copy this book, will eagerly correct it according to the exemplar.2

More than three hundred years later, Chaucer complains that he is forced to “correcte” transcriptions of his own work after his scribe negligently copied them.3

[Image: Scribal Error] Scribal transmission is error prone.

As modern editors and scholars, we would do well to consider Ælfric’s and Chaucer’s complaints when we transcribe and reproduce texts we find in manuscript form. To say nothing of the errors we might introduce in a transcription, the editorial decisions we make when we represent abbreviations, flaws, or other textual features have the potential to change our texts in ways that might have struck our medieval subjects as negligent.

In this post, I describe my strategy for transcribing and editing the DigitalSEL. Ultimately, my goal is to create the most conservative machine-readable transcription of each manuscript text that I can, and then use programming to re-write and store these transcriptions in other formats. My goal is to make an edition that has different strata of intervention, rather than an edition that tries to represent editorial decisions in one pass.

How Print Editions Deal with Manuscript Readings

Print editions represent manuscript readings in ways that are bound to the physical limitations of the medium. It is just really difficult to represent complicated manuscripts on a printed page. Abbreviations are a good example of this constraint. Though it would be possible to print an edition in a typeface with medieval characters, those characters are hard for modern readers to understand without training. Print editions, therefore, tend either to expand abbreviations silently or to expand them and set the expanded letters in italics to show interested readers where the abbreviations were. Take, for example, this little line from the London, British Library, Stowe 949 version of Saint Michael:4

[Image: Saint Michael Snippet]

I might transcribe this for print like this:

After þat vr lord vor vs  in is moder was alyȝt

This is fine, of course, but we don’t really get any information about exactly what the abbreviation was in my transcription, and it comes into the world in an altered form.

Using Metadata to Deal with Manuscript Readings

XML is like violence — if it doesn’t solve your problems, you are not using enough of it.5

[Image: Excited] Library science folks get really excited about metadata.

A popular way to capture some of this extra information in digital editions is to use XML and, specifically, an XML schema established by the Text Encoding Initiative, or TEI.6 In this editorial strategy, an editor uses XML tags to wrap text with metadata. After a complete TEI-encoded transcription is made, editors use XSLT to parse or transform the transcription into something reader-friendly.7

A really simple TEI version of our line from Saint Michael might look something like this:

<l xml:id="some_id_number">
    <choice><abbr>Aft&er;</abbr><expan>After</expan></choice>
    <choice><abbr>þ&tsup;</abbr><expan>þat</expan></choice>
    vr lord vor vs <pc>&punctelev;</pc> in is moder was alyȝt
</l>

Though using TEI encoding is probably the most common strategy for making a modern digital edition, it has some problems. Besides the trouble of working with complex XML and the difficulty of parsing and transforming it into something that is easy for humans to read, it produces documents with editorial decisions hard-coded into them. If I change my mind about how to handle an abbreviation, for example, it would mean that I would have to go back and emend my original transcription.

A Layered Approach

Rather than start with XML, I plan to create conservative transcriptions and then layer on any encoding or editorial decisions in subsequent versions of the text. For the DigitalSEL, this will happen in two steps:

  1. First, I will use Junicode or Andron Scriptor to make very close Unicode transcriptions of each text. These transcriptions will represent the base version of my edition, and I am going to try to make them with as little editorial intervention as possible.
  2. Second, I will write text parsers that will convert these transcriptions into other formats and use these as the basis of any editorial versions.

[Image: String Manipulation] It is easy to use Ruby for string manipulation.

On the first front, I am lucky to be working on this project at a moment when there has been some great work done on Unicode fonts for medieval studies.8 This makes it possible to use recently made font characters to create a fairly good digital representation of most of the things one might find in a medieval or ancient text. Granted, converting a manuscript into any type of font will obscure paleographic and orthographic information, but fonts make it possible to do something pretty amazing: search and parse a text programmatically. Good Unicode-encoded transcriptions will make it possible to program parsers to rewrite these texts in any way we want. Let’s take our line from Saint Michael for an example.

First, we make a close Unicode transcription of our line, something like this:

Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt

Then we can use this text in a little Ruby string parser.9

Let’s store the line in a variable:10

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
# The text editor has trouble rendering "punctus elevatus", but trust me, it's there!

Next, let’s store the special “er” abbreviation in a separate variable.

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
er = /͛/ # I am using a "regular expression" here rather than a string.
# It looks a little wacky because the character is set above the one in front of it.

Now, we can use gsub to replace the abbreviated character with “er” and, just for the fun of it, I’m going to wrap it in some HTML:11

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
er = /͛/
puts text.gsub(er, "<em>er</em>")
# This renders "Aft<em>er</em> þᵗ vr lord vor vs  in is moder was alyȝt"

If we flex our muscles a little bit, we can extend the standard gsub method into something that will accept key => value pairs so that we can replace as many characters as we want in a single pass:12

class String
  def mgsub(key_value_pairs=[].freeze)
    regexp_fragments = key_value_pairs.collect { |k,v| k }
    gsub(Regexp.union(*regexp_fragments)) do |match|
      key_value_pairs.detect{|k,v| k =~ match}[1]
    end
  end
end
text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
replacements = [[/͛/, "<em>er</em>"], [/ᵗ/, "<em>a</em>t"]]
puts text.mgsub(replacements)
# This gives us "Aft<em>er</em> þ<em>a</em>t vr lord vor vs  in is moder was alyȝt"

Rendered in the browser with a medieval font, our line looks like an edited text:

After þat vr lord vor vs  in is moder was alyȝt
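As a side note, Ruby's built-in gsub can also take a Hash as its second argument, which may do the same job without reopening the String class. The sketch below is only one way to write it. The constant and method names are my own, and I am assuming that the "er" mark is U+035B (combining zigzag above) and the superscript t is U+1D57; a real transcription might use different code points.

```ruby
# Assumed code points (hypothetical stand-ins for the marks in the post):
# U+035B for the "er" abbreviation, U+1D57 for the superscript t.
ER_MARK = "\u035B"
SUP_T   = "\u1D57"

# Each abbreviation mark maps to the markup that should replace it.
EXPANSIONS = {
  ER_MARK => "<em>er</em>",
  SUP_T   => "<em>a</em>t"
}.freeze

# One pattern that matches any of the marks listed above.
ABBREVIATION = Regexp.union(EXPANSIONS.keys)

# gsub looks up every match in the hash and substitutes the stored value.
def expand(text)
  text.gsub(ABBREVIATION, EXPANSIONS)
end

line = "Aft#{ER_MARK} þ#{SUP_T} vr lord vor vs in is moder was alyȝt"
puts expand(line)
```

Writing the marks as escapes rather than pasting them in also sidesteps the rendering trouble that combining characters cause in some text editors.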

If we make our initial transcriptions in CSV format, then we will also be able to manipulate entire lines of poetry because we can use regexp to detect and iterate over new line characters.13
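Here is a rough sketch of what iterating over lines could look like. It uses a plain newline-separated string rather than a real CSV file, the method name is made up for illustration, and only the first line of the sample comes from the manuscript; the second is a placeholder. The code point for the "er" mark (U+035B) is again an assumption.

```ruby
# Assumed code points: U+035B for the "er" mark, U+1D57 for superscript t.
ER = "\u035B"

# A hypothetical two-line transcription. The second line is a placeholder,
# not a manuscript reading.
transcription = "Aft#{ER} þ\u1D57 vr lord vor vs in is moder was alyȝt\n" \
                "placeholder for the next line of verse\n"

# Split on any line break (\R matches \n, \r\n, and similar) and pair
# each line with its number, counting from 1.
def numbered_lines(text)
  text.split(/\R/).each_with_index.map do |line, index|
    { number: index + 1, text: line }
  end
end

# Expand the "er" mark line by line and print each line with its number.
numbered_lines(transcription).each do |entry|
  expanded = entry[:text].gsub(ER, "<em>er</em>")
  puts "#{entry[:number]}: #{expanded}"
end
```

Keeping the line number next to the text means any later rewrite can still point back to a specific line of the base transcription.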

This is just a little example, but I hope it demonstrates the ease with which we can start with a conservative Unicode transcription and rewrite the text any way we might feel like. This strategy has the advantage, I think, of creating a base document that won’t be altered too much by editorial intervention and then enabling us to programmatically rewrite the text, layering in either XML metadata for storing information or HTML for display.
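To make the idea of layers a bit more concrete, here is a minimal sketch in which one untouched base string feeds two different outputs, one for display and one for storage. Everything named here (the table, the two methods) is hypothetical, the code points are assumptions, and the TEI-style output is only a gesture at the real schema, not a validated document.

```ruby
# Assumed code points: U+035B for the "er" mark, U+1D57 for superscript t.
# This table is the only place an editorial decision is recorded.
ABBREVIATIONS = {
  "\u035B" => "er",
  "\u1D57" => "at"
}.freeze

MARKS = Regexp.union(ABBREVIATIONS.keys)

# Display layer: show each expansion in italics.
def to_html(base)
  base.gsub(MARKS) { |mark| "<em>#{ABBREVIATIONS[mark]}</em>" }
end

# Storage layer: keep both the original mark and its expansion.
def to_tei(base)
  base.gsub(MARKS) do |mark|
    "<choice><abbr>#{mark}</abbr><expan>#{ABBREVIATIONS[mark]}</expan></choice>"
  end
end

base = "Aft\u035B þ\u1D57 vr lord"
puts to_html(base)
puts to_tei(base)
puts base # the base transcription itself is never modified
```

If I change my mind about an expansion, I edit the table and regenerate both outputs; the base transcription stays exactly as it was.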

  1. The header image comes from London, British Library, Harley 603 (the “Harley Psalter”), 14r.

  2. My translation. The colophon is found in a number of Ælfric’s works. This quotation, with abbreviations silently expanded, is taken from Peter Clemoes, ed., Ælfric’s Catholic Homilies: The First Series, EETS s.s. 17 (Oxford: Oxford University Press, 1997), Old English Preface, 177; see also Malcolm Godden, ed., Ælfric’s Catholic Homilies: Second Series, EETS s.s. 5 (Oxford: Oxford University Press, 1979), Latin Preface, 42–9; Ælfric’s Lives of Saints, ed. and trans. W. W. Skeat, 2 vols. in 4 pts. EETS o.s. 76, 82, 94, 114 (London: Oxford University Press, 1881–1900), I, 1.74–76; Julius Zupitza, ed., Aelfrics Grammatik und Glossar (Berlin: Weidmann, 1880), 3.20–25; and Richard Marsden, ed., The Old English Heptateuch and Ælfric’s Libellus de Veteri Testamento et Novo I, EETS o.s. 330 (Oxford: Oxford University Press, 2008), 80.117–21. The prefaces were edited as a group by Jonathan Wilcox, ed., Ælfric’s Prefaces, Durham Medieval Texts 9 (Durham: Durham Medieval Texts, 1994).

  3. This short poem goes by many names, including “His Owne Scriveyn”.

  4. An image of the entire page is available from the British Library.

  5. From the Nokogiri page of rubygems.org.

  6. XML stands for Extensible Markup Language. It is basically a very flexible markup language that makes it easy to append metadata to text.

  7. XSLT stands for Extensible Stylesheet Language Transformations.

  8. All hail the Medieval Unicode Font Initiative!

  9. I am using Ruby, but I bet that this would be easy and probably even process faster in Python.

  10. We don’t have to store our text in a variable, but it makes it easier to look at. Programming languages treat different types of data differently. Our medieval transcriptions are “strings,” or strings of characters.

  11. gsub is a standard Ruby String class method for iterating over a string and replacing things in it.

  12. Thanks to Lucas Carlson’s Ruby Cookbook for this.

  13. regexp or “regular expression” is a powerful pattern-matching method found in most computer languages. CSV stands for Comma Separated Values, AKA spreadsheets.

Tools for the DigitalSEL

Conflabunt gladios suos in vomeres 1

In passus VI of Piers Plowman appears Langland’s famous allegorical description of utopian labor—the plowing of the half-acre. Here, Piers the Plowman describes the purpose of his work and, notably, the tools with which he does it:

I wil worschip þer-with · treuthe bi my lyue
And ben his pilgryme atte plow · for pore mennes sake
My plow-[p]ote shal be my pyk-staf · and picche atwo þe rotes
And helpe my culter to kerue · and clense þe forwes2

After I finished writing my last post, as I was making a new project directory and looking around in the Ruby Toolbox to compare libraries, it occurred to me that I should probably offer some sort of description of the tools I plan to use to develop the project. Though I don’t really plan to make this a “how to program” blog, it would probably be useful to outline the technologies I will be talking about and why I think they will be effective for building the project.

Readers of this blog are doubtless aware of the degree of excitement right now in Medieval Studies for using digital technology in academic work. Though there are some really impressive digital manuscript projects and reference materials, there is opportunity for innovation in an old-fashioned corner of philology: textual studies. I plan to write much more about the theoretical underpinnings of what I am up to later, but suffice it to say now that I am building a digital critical edition, and so I am fundamentally interested in storing, retrieving, and parsing texts. Although digital tools for making editions of Middle English texts don’t really exist, it should be possible to recast existing technologies for my purposes.

Our Field to Plow: Some Basics

It sort of goes without saying that the internet is a great platform for a project like the DigitalSEL. It is widespread, does a great job of handling text, and is stable enough that you can watch live video on the subway, for goodness’ sake. What is less clear, however, is which tools and software would be appropriate for building this kind of digital edition. This is made more complicated because the computation for the internet happens at different stages and in different places: the browser and the server.

How the Internetz Works:

[Image: Cat surfing the web]

Browsers

The work that is done in a web browser like Chrome, Safari, Firefox, or (shudders) Internet Explorer deals with how a webpage looks and “acts.” The browser renders content and structure on the page (HTML), implements the page’s style (CSS), and determines the sort of behavior you might notice when a button changes color when you drag your mouse over it (JavaScript). The fact that this work happens on your personal computer or phone accounts for the fact that webpages look different in different browsers, like Chrome or Safari.

Servers

[Image: Daamn] Checking your bank account after the book fair

Most of the data-heavy computing determining what information is sent to your browser occurs on a server. When you log into your bank account, you are communicating with a server in order to fetch your specific information. Since your bank’s server is basically a special, big computer that you communicate with remotely, the programmer who set it up could install and use whatever software she thought would be good for the project. You can run whatever computer language or software you might like on one. Common server-side programming languages are Java, C++, PHP, Ruby, and Python.

Model View Controllers

Over the years, a number of software frameworks written in these languages have emerged that make it easy to build a new web application complete with server-side components. For the most part, they include elements that constitute a popular pattern called a Model View Controller (MVC):

  • Model: The model, as you could guess from last week’s post, is the part of the application that maps onto and deals with objects in the database. When I wrote about how Aquinas belongs_to the Dominicans, I was talking about logic that is written in the model.

  • Controller: The controller makes the decisions about the requests that it gets from users. When the cat above requested information about Julian of Norwich, it was the controller that decided what to do with the request. After that, it sends the appropriate response back to the user.

  • View: The view section of an MVC includes all of the code that gets sent to your browser so that you can “view” the information you requested. This includes all of the code that makes up the structure, style, and behavior discussed above.

So, for example with the DigitalSEL, if you were to choose to look at the “Life of Mary of Egypt,” the controller would decide whether that was a reasonable request (it is!), and it would send that information along to the model. The model would then figure out if you had made any other special requests about the text you wanted to look at, and then it would retrieve Mary’s life from the database and send it back to the controller. After the controller heard from the model, it would then determine which views it needed, queue them up, and send them back to your browser which would display them for you to take a look at.
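The round trip described above can be mimicked in a few lines of plain Ruby. This is a toy sketch, not how Rails actually wires things together: the class names are invented, a hash stands in for the database, and the stored text is a placeholder.

```ruby
# A toy model: a hash stands in for the database table of saints' lives.
class LifeModel
  TEXTS = {
    "mary-of-egypt" => "(text of the Life of Mary of Egypt would go here)"
  }.freeze

  # Returns the stored text, or nil when no such life exists.
  def self.find(slug)
    TEXTS[slug]
  end
end

# A toy view: wraps the text in the HTML that would be sent to the browser.
class LifeView
  def self.render(title, body)
    "<h1>#{title}</h1>\n<p>#{body}</p>"
  end
end

# A toy controller: checks the request, asks the model, then picks a view.
class LivesController
  def show(slug)
    body = LifeModel.find(slug)
    return "404 Not Found" if body.nil?

    LifeView.render(slug, body)
  end
end

controller = LivesController.new
puts controller.show("mary-of-egypt")
puts controller.show("no-such-saint")
```

The point of the split is that each piece can change without the others noticing: the model could swap its hash for a real database, or the view could change its markup, and the controller's logic would stay the same.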

[Image: bread_cat] In a medieval bread-meme, Mary of Egypt is famous for bringing three loaves with her into desert-exile.

The DigitalSEL on Rails

There are a number of good MVC frameworks that could work for a project like the DigitalSEL, but I’m using Ruby on Rails (Rails). It is perfectly suited for what I want to do, is actively supported, has a ton of libraries, is easy to get running, and it has the strong advantage of being the tool I have and know. Additionally, Rails is completely free and open source, meaning it is a distributed and community-supported framework that anyone can use.

Caveat Lector

As I suggest above, I probably won’t spend time explaining basic programming concepts (though I will link to them). Nevertheless, I do plan to document every single step I take on the project so it will be possible to follow along and learn from and build upon my work. If you are just interested in tracking my progress, some terminology will be helpful for understanding basic concepts in Rails development:

  • Command line: It is ironic, but a great deal of modern web and software development takes place in a text interface inherited from Unix, an operating system developed in the 1970s. This interface, found on Unix, Linux, and their descendants, is commonly called the “command line” or command-line interface, abbreviated “CLI.” On a Mac, whose operating system is itself a descendant of Unix, the program that gives you access to the command line is called Terminal.

  • Ruby: Ruby is a programming language that was developed in the 1990s by a developer called “Matz,” who, by all accounts, is a really nice fellow. Ruby is the language upon which Rails, the framework, runs.

  • Gem: A gem is the Ruby term for a package of code that you can download and include in your own project. Gems are generally all free and community supported. They are called “modules” or “libraries” in other languages. For example, the Prawn gem is a PDF generator that I plan to use in my project.

  • Git: Git is a command line application for version control. Basically, it allows you to take a snapshot of your project so that if you mess everything up, you can go back and recover previous versions of your work. It is an extremely powerful tool, but is also very confusing. For example, Git is the program and GitHub is a website where people store Git repositories. I plan to blog about it in the future.

  • Text editor: A text editor is just that: a program on your computer that reads and writes text documents. This is different from a document editor, like MS Word, in that a text editor will only save files in the format you tell it to. I use Sublime Text, but there are many others, including Vi or its younger cousin “VIM,” which, believe it or not, you probably already have installed on your computer if you are currently reading this on a Mac. We will talk more about VIM when we have to do programming on the server.

Next up, I plan to walk through the first steps of building the project, explain what comes in a Rails app, and explain how to fire up the server so we can start building the database. I am hoping that the posts get shorter as I plan to get down to brass tacks and just talk about building the app and what to do about mark-up.

  1. The header image is from London, British Library, Additional 42130 (The Luttrell Psalter), fol. 170r.

  2. BX.6.105-107. Text from the Piers Plowman Electronic Archive. A Modern English translation is available from Harvard