Transcription and Programmatic Markup for the DigitalSEL
Medieval writers were keenly aware that their basic mode of textual reproduction, scribal transmission, introduced all sorts of variation and errors into their texts.1 Ælfric of Eynsham, the most prolific Old English author, included a well-known colophon in many of his works in which he prays that:
hwá þas bóc awritan wylle þæt hé hí geornlice gerihte be ðære bysne
whoever wishes to copy this book, will eagerly correct it according to the exemplar.2
More than three hundred years later, Chaucer complains that he is forced to “correcte” transcriptions of his own work after his scribe negligently copied them.3
Scribal transmission is error prone.
As modern editors and scholars, we would do well to consider Ælfric’s and Chaucer’s complaints when we transcribe and reproduce texts we find in manuscript form. To say nothing of the errors we might introduce in a transcription, the editorial decisions we make when we represent abbreviations, flaws, or other textual features have the potential to change our texts in ways that might have struck our medieval subjects as negligent.
In this post, I describe my strategy for transcribing and editing the DigitalSEL. I aim to create the most conservative machine-readable transcription of each manuscript text that I can, and then use programming to rewrite and store these transcriptions in other formats. The result should be an edition with distinct strata of intervention, rather than one that tries to represent every editorial decision in a single pass.
How Print Editions Deal with Manuscript Readings
Print editions represent manuscript readings in ways that are bound to the physical limitations of the medium. It is just really difficult to represent complicated manuscripts on a printed page. Abbreviations are a good example of this constraint. Though it would be possible to print an edition in a typeface with medieval characters, such characters are hard for modern readers to understand without training. Print editions, therefore, tend either to expand abbreviations silently or to expand them and set the supplied letters in italics, so that interested readers can see where the abbreviations were. Take, for example, this little line from the London, British Library, Stowe 949 version of Saint Michael:4
I might transcribe this for print like this:
After þat vr lord vor vs in is moder was alyȝt
This is fine, of course, but my transcription gives no information about exactly what the abbreviation was, and the text comes into the world in an altered form.
Using Metadata to Deal with Manuscript Readings
XML is like violence — if it doesn’t solve your problems, you are not using enough of it.5
Library science folks get really excited about metadata.
A popular way to capture some of this extra information in digital editions is to use XML and, specifically, an XML schema established by the Text Encoding Initiative, or TEI.6 In this editorial strategy, an editor uses XML tags to wrap text with metadata. After a complete TEI-encoded transcription is made, editors use XSLT to parse or transform the transcription into something reader-friendly.7
A really simple TEI version of our line from Saint Michael might look something like this:
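The snippet below is only a rough sketch of what such markup could look like, held in a Ruby string so that it sits alongside the other examples in this post. It assumes the standard TEI elements `<l>`, `<choice>`, `<abbr>`, and `<expan>`, and it assumes that the two abbreviation marks in the line are U+035B (combining zigzag above) and U+1D57 (modifier letter small t). The variable name `tei_line` is made up for illustration.

```ruby
# Hypothetical TEI encoding of the Saint Michael line.
# Each <choice> pairs the abbreviated form (<abbr>) with an editorial
# expansion (<expan>), so the editorial decision lives inside the markup.
# Assumption: \u035B is the "er" mark and \u1D57 is the superscript t.
tei_line = <<~XML
  <l>
    <choice><abbr>Aft\u035B</abbr><expan>After</expan></choice>
    <choice><abbr>þ\u1D57</abbr><expan>þat</expan></choice>
    vr lord vor vs in is moder was alyȝt
  </l>
XML

puts tei_line
```

Even in this tiny case, the expansions "After" and "þat" are written directly into the transcription itself.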
Though using TEI encoding is probably the most common strategy for making a modern digital edition, it has some problems. Besides the trouble of working with complex XML and the difficulty of parsing and transforming it into something that is easy for humans to read, it produces documents with editorial decisions hard-coded into them. If I change my mind about how to handle an abbreviation, for example, it would mean that I would have to go back and emend my original transcription.
A Layered Approach
Rather than start with XML, I plan to create conservative transcriptions and then layer on any encoding or editorial decisions in subsequent versions of the text. For the DigitalSEL, this will happen in two steps:
- First, I will use Junicode or Andron Scriptor to make very close Unicode transcriptions of each text. These transcriptions will represent the base version of my edition and I am going to try to make them with as little editorial intervention as possible.
- Second, I will write text parsers that will convert these transcriptions into other formats and use these as the basis of any editorial versions.
It is easy to use Ruby for string manipulation.
On the first front, I am lucky to be working on this project at a moment when there has been some great work done on Unicode fonts for medieval studies.8 This makes it possible to use recently made font characters to create a fairly good digital representation of most of the things one might find in a medieval or ancient text. Granted, converting a manuscript into any type of font will obscure paleographic and orthographic information, but fonts make it possible to do something pretty amazing: search and parse a text programmatically. Good Unicode-encoded transcriptions will make it possible to program parsers to rewrite these texts in any way we want. Let’s take our line from Saint Michael for an example.
First, we make a close Unicode transcription of our line, something like this:
Aft͛ þᵗ vr lord vor vs in is moder was alyȝt
Then we can use this text in a little Ruby string parser.9
Let’s store the line in a variable:10
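A minimal sketch of this step might look like the following. The variable name `line` is my own choice, and the two escapes are assumptions about the code points in the transcription: `\u035B` for the zigzag "er" mark and `\u1D57` for the superscript t. Writing them as escapes keeps the combining mark visible in the source.

```ruby
# Store the close Unicode transcription as a Ruby string.
# Assumed code points: \u035B = combining zigzag above ("er" mark),
#                      \u1D57 = modifier letter small t.
line = "Aft\u035B þ\u1D57 vr lord vor vs in is moder was alyȝt"

puts line           # prints the transcription as stored
puts line.encoding  # Ruby treats this literal as UTF-8
```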
Next, let’s store the special “er” abbreviation in a separate variable.
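Something along these lines would do it. The name `er_abbr` is hypothetical, and U+035B is again my assumption about which character the transcription uses for the mark.

```ruby
# The transcription from the previous step (code points are assumptions).
line = "Aft\u035B þ\u1D57 vr lord vor vs in is moder was alyȝt"

# Keep the "er" abbreviation mark in its own variable so that it is
# easy to refer to, and easy to change if the transcription convention changes.
er_abbr = "\u035B"

# Quick check that the mark really occurs in the line.
puts line.include?(er_abbr)
```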
Now, we can use gsub to replace the abbreviated character with “er” and, just for the fun of it, I’m going to wrap it in some HTML:11
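As a sketch, using Ruby's built-in `String#gsub`: the `<em>` tag is my assumption about the HTML wrapper (any inline element would work the same way), and the variable names are illustrative.

```ruby
# Transcription and abbreviation mark (code points are assumptions).
line    = "Aft\u035B þ\u1D57 vr lord vor vs in is moder was alyȝt"
er_abbr = "\u035B"

# gsub returns a new string; the conservative transcription in `line`
# is left untouched.
expanded = line.gsub(er_abbr, "<em>er</em>")

puts expanded
```

Because `gsub` does not modify its receiver, the base transcription survives unchanged and the expanded version is simply one more layer on top of it.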
If we flex our muscles a little bit, we can extend the standard gsub method into something that will accept key => value pairs so that we can replace as many characters as we want in a single pass:12
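One way to sketch this without redefining `gsub` at all is to lean on the fact that the built-in method already accepts a hash as its replacement argument. This is a swap for the extension described above, not a reconstruction of it. The helper `expand_abbreviations` and the table `EXPANSIONS` are hypothetical names, and the "at" expansion for the superscript t is inferred from the edited line shown in this post.

```ruby
# Hypothetical expansion table: abbreviation mark => HTML expansion.
# Code points are assumptions about the transcription.
EXPANSIONS = {
  "\u035B" => "<em>er</em>",  # combining zigzag above
  "\u1D57" => "<em>at</em>"   # modifier letter small t
}.freeze

# Replace every key of the table with its value in a single pass.
# Regexp.union builds one pattern matching any key; gsub then looks up
# each matched string in the hash to find its replacement.
def expand_abbreviations(text, table = EXPANSIONS)
  text.gsub(Regexp.union(table.keys), table)
end

line = "Aft\u035B þ\u1D57 vr lord vor vs in is moder was alyȝt"
puts expand_abbreviations(line)
# prints: Aft<em>er</em> þ<em>at</em> vr lord vor vs in is moder was alyȝt
```

Keeping the table separate from the method means that a change of mind about an expansion only touches the table, never the base transcription.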
Rendered in the browser with a medieval font, our line looks like an edited text:
After þat vr lord vor vs in is moder was alyȝt
If we make our initial transcriptions in CSV format, then we will also be able to manipulate entire lines of poetry because we can use regexp to detect and iterate over new line characters.13
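Here is a rough sketch of that idea. It assumes a very simple two-column layout (line number, then text) that I have made up for illustration, and the second row is a placeholder rather than a real manuscript reading. For anything more involved than this, Ruby's `csv` library would be the safer tool.

```ruby
# Hypothetical expansion table (code points are assumptions).
EXPANSIONS = {
  "\u035B" => "<em>er</em>",
  "\u1D57" => "<em>at</em>"
}.freeze

# Assumed layout: "line number,text". Row 2 is a placeholder, not a
# real manuscript reading.
transcription = <<~CSV
  1,Aft\u035B þ\u1D57 vr lord vor vs in is moder was alyȝt
  2,placeholder second line with þ\u1D57
CSV

# \R matches any line break; split(",", 2) separates the line number
# from the text without breaking on commas inside the text.
rows = transcription.split(/\R/).map { |row| row.split(",", 2) }

pattern = Regexp.union(EXPANSIONS.keys)
edited = rows.map do |number, text|
  [number, text.gsub(pattern, EXPANSIONS)]
end

edited.each { |number, text| puts "#{number}: #{text}" }
```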
This is just a little example, but I hope it demonstrates the ease with which we can start with a conservative Unicode transcription and rewrite the text any way we might feel like. This strategy has the advantage, I think, of creating a base document that won’t be altered too much by editorial intervention and then enabling us to programmatically rewrite the text, layering in either XML metadata for storing information or HTML for display.
The header image comes from London, British Library, Harley 603 (the “Harley Psalter”), 14r.
My translation. The colophon is found in a number of Ælfric’s works. This quotation, with abbreviations silently expanded, is taken from Peter Clemoes, ed., Ælfric’s Catholic Homilies: The First Series, EETS s.s. 17 (Oxford: Oxford University Press, 1997), Old English Preface, 177; see also Malcolm Godden, ed., Ælfric’s Catholic Homilies: Second Series, EETS s.s. 5 (Oxford: Oxford University Press, 1979), Latin Preface, 42–9; Ælfric’s Lives of Saints, ed. and trans. W. W. Skeat, 2 vols. in 4 pts. EETS o.s. 76, 82, 94, 114 (London: Oxford University Press, 1881–1900), I, 1.74–76; Julius Zupitza, ed., Aelfrics Grammatik und Glossar (Berlin: Weidmann, 1880), 3.20–25; and Richard Marsden, ed., The Old English Heptateuch and Ælfric’s Libellus de Veteri Testamento et Novo I, EETS o.s. 330 (Oxford: Oxford University Press, 2008), 80.117–21. The prefaces were edited as a group by Jonathan Wilcox, ed., Ælfric’s Prefaces, Durham Medieval Texts 9 (Durham: Durham Medieval Texts, 1994).
I am using Ruby, but I bet that this would be just as easy, and would probably even run faster, in Python.
We don’t have to store our text in a variable, but it makes it easier to look at. Programming languages treat different types of data differently. Our medieval transcriptions are “strings,” or sequences of characters.