Transcription and Programmatic Markup for the DigitalSEL

Medieval writers were keenly aware that their basic mode of textual reproduction, scribal transmission, introduced all sorts of variation and errors into their texts.1 Ælfric of Eynsham, the most prolific Old English author, included a well-known colophon in many of his works in which he prays that:

hwá þas bóc awritan wylle þæt hé hí geornlice gerihte be ðære bysne
whoever wishes to copy this book will eagerly correct it according to the exemplar.2

More than three hundred years later, Chaucer complains that he is forced to “correcte” transcriptions of his own work after his scribe negligently copied them.3

Figure: Scribal transmission is error prone.

As modern editors and scholars, we would do well to consider Ælfric’s and Chaucer’s complaints when we transcribe and reproduce texts we find in manuscript form. To say nothing of the errors we might introduce in a transcription, the editorial decisions we make when we represent abbreviations, flaws, or other textual features have the potential to change our texts in ways that might have struck our medieval subjects as negligent.

In this post, I describe my strategy for transcribing and editing the DigitalSEL. I want to create the most conservative machine-readable transcription of each manuscript text that I can, and then use programming to rewrite and store these transcriptions in other formats. The result should be an edition with distinct strata of intervention, rather than one that tries to represent every editorial decision in a single pass.

How Print Editions Deal with Manuscript Readings

Print editions represent manuscript readings in ways that are bound to the physical limitations of the medium. It is just really difficult to represent complicated manuscripts on a printed page. Abbreviations are a good example of this constraint. Though it would be possible to print an edition in a typeface with medieval characters, those characters are hard for modern readers to understand without training. Print editions, therefore, tend either to expand abbreviations silently or to expand them and set the supplied letters in italics, showing interested readers where the abbreviations were. Take, for example, this little line from the London, British Library, Stowe 949 version of Saint Michael:4

Figure: Saint Michael snippet.

I might transcribe this for print like this:

After þat vr lord vor vs  in is moder was alyȝt

This is fine, of course, but we don’t really get any information about exactly what the abbreviation was in my transcription, and it comes into the world in an altered form.

Using Metadata to Deal with Manuscript Readings

XML is like violence — if it doesn’t solve your problems, you are not using enough of it.5

Figure: Library science folks get really excited about metadata.

A popular way to capture some of this extra information in digital editions is to use XML and, specifically, the XML schema established by the Text Encoding Initiative, or TEI.6 In this editorial strategy, an editor uses XML tags to wrap text with metadata. After a complete TEI-encoded transcription is made, editors use XSLT to parse or transform the transcription into something reader-friendly.7

A really simple TEI version of our line from Saint Michael might look something like this:

<l xml:id="some_id_number">
    <choice><abbr>aft&er;</abbr><expan>after</expan></choice>
    <choice><abbr>þ&tsup;</abbr><expan>þat</expan></choice>
    vr lord vor vs <pc>&punctelev;</pc> in is moder was alyȝt
</l>

Though TEI encoding is probably the most common strategy for making a modern digital edition, it has some problems. Besides the trouble of working with complex XML and the difficulty of parsing and transforming it into something that is easy for humans to read, it produces documents with editorial decisions hard-coded into them. If I change my mind about how to handle an abbreviation, for example, I have to go back and emend my original transcription.

A Layered Approach

Rather than start with XML, I plan to create conservative transcriptions and then layer on any encoding or editorial decisions in subsequent versions of the text. For the DigitalSEL, this will happen in two steps:

  1. First, I will use Junicode or Andron Scriptor to make very close Unicode transcriptions of each text. These transcriptions will represent the base version of my edition and I am going to try to make them with as little editorial intervention as possible.
  2. Second, I will write text parsers that will convert these transcriptions into other formats and use these as the basis of any editorial versions.
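A minimal sketch of how these two steps might fit together in Ruby. Everything named here (BASE_TEXT, LAYERS, render) is hypothetical and chosen only for illustration, and the code points are assumptions: U+035B for the "er" mark and U+1D57 for the superscript t.

```ruby
# Step 1: the conservative base transcription, frozen so that nothing
# can edit it in place.
# Assumed code points: U+035B (combining zigzag above, the "er" mark)
# and U+1D57 (modifier letter small t).
BASE_TEXT = "Aft\u035B þ\u1D57 vr lord vor vs".freeze

# Step 2: each editorial layer is a named table of replacements.
LAYERS = {
  diplomatic: {},                                          # no intervention
  silent:     { "\u035B" => "er", "\u1D57" => "at" },      # silent expansion
  italic:     { "\u035B" => "<em>er</em>", "\u1D57" => "<em>a</em>t" }
}.freeze

# Apply one layer's replacements in turn. gsub always returns a new
# string, so the base transcription itself is never modified.
def render(text, layer)
  LAYERS.fetch(layer).reduce(text) do |result, (mark, expansion)|
    result.gsub(mark, expansion)
  end
end

puts render(BASE_TEXT, :silent)
puts render(BASE_TEXT, :italic)
```

Changing an editorial decision then means editing one entry in a table rather than touching the transcription.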

Figure: It is easy to use Ruby for string manipulation.

On the first front, I am lucky to be working on this project at a moment when there has been some great work done on Unicode fonts for medieval studies.8 This makes it possible to use recently made font characters to create a fairly good digital representation of most of the things one might find in a medieval or ancient text. Granted, converting a manuscript into any type of font will obscure paleographic and orthographic information, but fonts make it possible to do something pretty amazing: search and parse a text programmatically. Good Unicode-encoded transcriptions will make it possible to program parsers to rewrite these texts in any way we want. Let’s take our line from Saint Michael as an example.

First, we make a close Unicode transcription of our line, something like this:

Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt

Then we can use this text in a little Ruby string parser.9

Let’s store the line in a variable:10

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
# The text editor has trouble rendering "punctus elevatus", but trust me, it's there!

Next, let’s store the special “er” abbreviation in a separate variable.

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
er = /͛/ # I am using a "regular expression" here rather than a string.
# It looks a little wacky because the character is set above the one in front of it.

Now, we can use gsub to replace the abbreviated character with “er” and, just for the fun of it, I’m going to wrap it in some HTML:11

text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
er = /͛/
puts text.gsub(er, "<em>er</em>")
# This renders "Aft<em>er</em> þᵗ vr lord vor vs  in is moder was alyȝt"

If we flex our muscles a little bit, we can extend the standard gsub method into something that will accept key => value pairs so that we can replace as many characters as we want in a single pass:12

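# Reopen String to add mgsub ("multiple gsub"), which takes an array of
# [pattern, replacement] pairs and applies them all in a single pass.
class String
  def mgsub(key_value_pairs=[].freeze)
    # Combine every pattern into one regular expression.
    regexp_fragments = key_value_pairs.collect { |k,v| k }
    gsub(Regexp.union(*regexp_fragments)) do |match|
      # Find the first pair whose pattern matches and use its replacement.
      key_value_pairs.detect{|k,v| k =~ match}[1]
    end
  end
end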
text = "Aft͛ þᵗ vr lord vor vs  in is moder was alyȝt"
replacements = [[/͛/, "<em>er</em>"], [/ᵗ/, "<em>a</em>t"]]
puts text.mgsub(replacements)
# This gives us "Aft<em>er</em> þ<em>a</em>t vr lord vor vs  in is moder was alyȝt"
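Ruby's built-in gsub can also take a hash as its second argument and look each match up in it, which gives the same result without reopening the String class. The sketch below is an alternative to mgsub, not a replacement for it. The variable names are illustrative, and the code points are assumptions (U+035B for the "er" mark, U+1D57 for the superscript t).

```ruby
text = "Aft\u035B þ\u1D57 vr lord vor vs"

# Keys are the abbreviation marks, values are their marked-up expansions.
expansions = {
  "\u035B" => "<em>er</em>",
  "\u1D57" => "<em>a</em>t"
}

# Regexp.union builds one pattern that matches any of the keys;
# gsub then looks up each match in the hash to find its replacement.
expanded = text.gsub(Regexp.union(expansions.keys), expansions)
puts expanded
```

One difference from mgsub is that the hash keys must be the literal matched text, so this form only suits replacements that do not need a full regular expression per rule.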

Rendered in the browser with a medieval font, our line looks like an edited text:

After þat vr lord vor vs  in is moder was alyȝt

If we make our initial transcriptions in CSV format, then we will also be able to manipulate entire lines of poetry, because we can use regexp to detect newline characters and iterate over the rows they separate.13
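As a rough sketch of that idea, suppose each row holds a number and a stretch of text. The two-column layout, the variable names, and the splitting of the Saint Michael line into two rows are all assumptions made purely for illustration, as is the code point U+035B for the "er" mark.

```ruby
# Hypothetical layout: "row_number,text", one row per line of the file.
transcription = <<~CSV
  1,Aft\u035B þ\u1D57 vr lord vor vs
  2,in is moder was alyȝt
CSV

# A regular expression detects the newline characters that separate rows;
# each row is then split once on its first comma.
rows = transcription.split(/\r?\n/).map do |row|
  number, verse = row.split(",", 2)
  { number: Integer(number), verse: verse }
end

# Any rewrite can now be applied row by row.
rows.each do |row|
  puts "#{row[:number]}: #{row[:verse].gsub("\u035B", "er")}"
end
```

For real data, Ruby's csv library is probably the safer choice, since it handles quoted fields and commas inside the text.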

This is just a little example, but I hope it demonstrates the ease with which we can start with a conservative Unicode transcription and rewrite the text any way we might feel like. This strategy has the advantage, I think, of creating a base document that won’t be altered too much by editorial intervention and then enabling us to programmatically rewrite the text, layering in either XML metadata for storing information or HTML for display.
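To illustrate the XML side of that layering, here is a hedged sketch that wraps each abbreviated word in TEI's choice, abbr, and expan elements. The names EXPANSIONS, MARKS, and to_tei_line are hypothetical, the code points (U+035B and U+1D57) are assumptions, and the sketch does no XML escaping, so it assumes the transcription contains no ampersands or angle brackets.

```ruby
# Hypothetical table: abbreviation mark => expansion.
EXPANSIONS = { "\u035B" => "er", "\u1D57" => "at" }.freeze
MARKS = Regexp.union(EXPANSIONS.keys)

# Wrap every word that contains an abbreviation mark in a <choice>
# element holding both the abbreviated and the expanded form.
def to_tei_line(text)
  words = text.split(" ").map do |word|
    next word unless word.match?(MARKS)
    expanded = word.gsub(MARKS, EXPANSIONS)
    "<choice><abbr>#{word}</abbr><expan>#{expanded}</expan></choice>"
  end
  "<l>#{words.join(' ')}</l>"
end

puts to_tei_line("Aft\u035B þ\u1D57 vr lord vor vs")
```

Because the markup is generated from the transcription, a different encoding policy only requires rerunning the script with a different table.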

  1. The header image comes from London, British Library, Harley 603 (the “Harley Psalter”), 14r.

  2. My translation. The colophon is found in a number of Ælfric’s works. This quotation, with abbreviations silently expanded, is taken from Peter Clemoes, ed., Ælfric’s Catholic Homilies: The First Series, EETS s.s. 17 (Oxford: Oxford University Press, 1997), Old English Preface, 177; see also Malcolm Godden, ed., Ælfric’s Catholic Homilies: Second Series, EETS s.s. 5 (Oxford: Oxford University Press, 1979), Latin Preface, 42–9; Ælfric’s Lives of Saints, ed. and trans. W. W. Skeat, 2 vols. in 4 pts. EETS o.s. 76, 82, 94, 114 (London: Oxford University Press, 1881–1900), I, 1.74–76; Julius Zupitza, ed., Aelfrics Grammatik und Glossar (Berlin: Weidmann, 1880), 3.20–25; and Richard Marsden, ed., The Old English Heptateuch and Ælfric’s Libellus de Veteri Testamento et Novo I, EETS o.s. 330 (Oxford: Oxford University Press, 2008), 80.117–21. The prefaces were edited as a group by Jonathan Wilcox, ed., Ælfric’s Prefaces, Durham Medieval Texts 9 (Durham: Durham Medieval Texts, 1994).

  3. This short poem goes by many names, including “His Owne Scriveyn”.

  4. An image of the entire page is available from the British Library.

  5. From the Nokogiri page of rubygems.org.

  6. XML stands for Extensible Markup Language. It is basically a very flexible markup language that makes it easy to append metadata to text.

  7. XSLT stands for Extensible Stylesheet Language Transformations.

  8. All hail the Medieval Unicode Font Initiative!

  9. I am using Ruby, but I bet that this would be easy and probably even process faster in Python.

  10. We don’t have to store our text in a variable, but it makes it easier to look at. Programming languages treat different types of data differently. Our medieval transcriptions are “strings,” or strings of characters.

  11. gsub is a standard Ruby String class method for iterating over a string and replacing things in it.

  12. Thanks to Lucas Carlson’s Ruby Cookbook for this.

  13. regexp or “regular expression” is a powerful pattern-matching method found in most computer languages. CSV stands for Comma Separated Values, a plain-text format for tabular data that spreadsheet programs can open.

William E. Bolton, PhD

William is a software developer and independent scholar living and working in Philadelphia. My academic work focuses on the lives of saints written in England between 850 and 1350.