A phrase frequently heard in connection with computing is “that’s
not a flaw, that’s a feature.” In the case of LBT the flaw or feature is a commitment to
unlimited extensibility on the one hand and semantic-web principles on the other. The object of study
is conceived less as a fixed set of paper documents to be digitized than as a dynamic nexus of
relationships among persons and documents to be articulated. Materials concerning Byron, his writings,
his public, and his critics are bottomless but not random or formless—consisting as they do of series
of events, collections of documents, and networks of human relationships. These can be charted and
graphed digitally in an archive of machine-readable documents.
Since the Byron literature is but a drop in an ocean of literary
information, it becomes necessary to combine unlimited extensibility with principles of selection and
definition. This is where the concept of a “semantic web” comes in. Search engines parse
strings of characters; they know nothing about documents or relationships until they are defined by
means of metadata, markup, and ontologies. Semantic computing implies working with the things
that strings of characters refer to—things like events, documents, and people. Defining semantic
objects within or across digital archives enables one to query and analyze extensive bodies of
information in sophisticated ways. But semantic computing is labor intensive: to be rendered
machine-readable, documents must be parsed and described by human editors.
LBT will consist of an archive of archives, each centered on some point of
contention related to Byron and his reading public. Its documents are linked to machine-readable data
files and human-readable commentary. As material is added and software improves, LBT will grow in depth
and complexity, but a project indefinitely extensible is by definition perpetually incomplete. Digital
archives challenge designers to create something operational and useful in the present while
anticipating changes to come. This project anticipates the advent of data- and document-exchange via
distributed computing—surely the quickest way to grow an archive—while undertaking the work of
constructing machine-readable documents for use in such a digital infrastructure.
Whether such reliance on human labor is ultimately more flaw or feature
remains to be seen: it will be years before we have large digital archives of literary material with
the requisite markup and metadata—with open access and the standards required for computing across
dispersed archives. It may not happen at all. But LBT will explore possibilities for semantic markup
with the ultimate end of being subsumed wholesale into a computational network. It will grow through
a series of published states not unlike those extensible works Childe Harold and Don
Juan: new “cantos” will be added as time and circumstances allow with the
expectation that the project will evolve in unexpected ways.
One aim of assembling a hypertext Byron archive is to bring context to bear
on documents by linking them to statements to which they responded and those they provoked, calling
attention to acts of reading performed by particular agents in particular circumstances—as opposed to
the generalities of period norms and implied readers. Byron’s “Sketch from Private Life,”
for example, reads rather differently when juxtaposed with the letter penned in response by Mary Anne
Clermont, the object of its personal abuse. By assembling varied responses of readers of this poem we
can better assess the kinds and degrees of personal agency involved in Byronic satire: its effects on
readers, but also the effects of their reactions on the writer. Mary Anne Clermont may have had little
sway over Byron but her friends in “private life,” and friends of her friends in public
life, presumably did.
Another aim is to consider how individual acts played out collectively, examining how
agents and documents participated in groups. The “Sketch from Private
Life” and Miss Clermont’s response are parts of a whole body of discourse concerned with the
Byron separation. We know that Byron was driven out of society, that he spent his remaining life
responding to his banishment, that the events of 1816 colored public response to all he said and did
afterwards. But just how personal and collective agency played out in these transactions remains
unclear, as does the extent to which decades of writing about the Byron separation contributed,
directly or indirectly, to changing attitudes about marriage. By bringing quantitative data to bear on
this and other Byron controversies—by linking statements to persons, and persons to groups—we might
hope to understand the social dynamics of literary production better than we presently do.
The code for finding, parsing, tabulating, and graphing this kind of
information is not yet written, the documents not yet collected, and it remains to be seen whether
machine-readable documents can be indexed and queried in ways sufficiently general and sufficiently
particular to address research questions like these. But it seems like the right time to experiment.
Quo vadis?
The archive of documents in LBT is intended to top out at perhaps 200
monographs and 2000 articles—about as much as could be managed over fifteen or twenty years working at
the present pace. The proposed scale limits the time that can be allocated for editing, markup, and
annotation. One could spend more time on fewer items, but the object here is to create a middle-sized
digital archive of Byron material done with a middle-level of markup and commentary. There are prefaces
and notes identifying persons and titles, but no passage-specific annotation. The plan is to devote
about equal time to “Lord Byron” and “his Times”—that is, to Byron-specific
documents and to contextual material related to his causes and contemporaries. 1870 seems like a
reasonable terminus ad quem since Byron’s contemporaries were mostly dead by then, though of
course their letters, memoirs, and commentary continued to appear.
At the time of writing the project has completed its third year. The first
two were spent experimenting with different kinds of documents and sub-archives, trying to distinguish
between the possible and the practical, learning how long things might take and estimating trade-offs:
how much editing and annotation would be practical given the constraints? Since in LBT the priority is
establishing connections between persons and documents, a first task was to design a tagging
implementation to link documents to the data files storing information about people and titles. This
needed to be done in conjunction with designing a screen-interface to render the machine-readable
documents and data files. By the second year scalability was already becoming an issue, making it
necessary to begin on an indexing system that remains a work in progress.
Several initial document collections were pursued, but one quickly became a
preoccupation: memoirs of contemporary poets and society figures from which names and document-titles
could be harvested. The life-and-letters genre turned out to be of particular interest since
collections of letters enable one to locate names and titles in thousands of time-specific documents
instead of just the larger books of which they are a part. The prospect of extracting from these
volumes a cross-searchable, decades-long, day-by-day chronology of persons, titles, and events is
proving irresistible. Although the text of the letters is generally cut, bowdlerized, or otherwise
tampered with, this seems like an efficient way to assemble social context for the Byron
material. The time will come when these user-friendly if unreliable letter-texts can be linked to
better versions outside the LBT archive.
From information in these life-and-letters volumes it becomes possible to
chart the friend-of-a-friend and friend-of-a-foe social networks in which the Byron controversies
played out. With the list of persons culled from these documents now approaching ten thousand (twice
the original estimate) and with convergences beginning to appear among them, efforts will next shift to
attaching to the names information about family connections and social ties in the form of a digital
prosopography. This information will enable querying LBT for references to or information about
“associates of Leigh Hunt,” “persons who wrote for the Edinburgh Review”
or “works written by persons at Cambridge in the 1820s.” If and when the semantic web
develops, lists of names and titles generated from within LBT could be used to search material outside
of the archive in accordance with the principle of extensibility.
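To give a merely illustrative sense of what such queries might look like: if the prosopography were
encoded in TEI, with affiliations and relations recorded on person records, a question like
“persons who wrote for the Edinburgh Review” could reduce to a single XPath expression. The
element choices and identifiers below are assumptions, not the actual LBT encoding:

    persons affiliated with a (hypothetical) #EdinburghReview record:
        //tei:person[tei:affiliation/@ref = '#EdinburghReview']

    associates of Leigh Hunt, under a hypothetical identifier:
        //tei:relation[@name = 'associate'][contains(@mutual, '#LeHunt1784')]

The second expression is rough (it matches on a substring of a space-separated list), but it
suggests how relation records could be made to answer friend-of-a-friend questions mechanically.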
In coming years emphasis will shift from memoirs of Byron’s contemporaries
to contemporary memoirs of Byron, and from there, finally, to developing collections of pamphlets and
articles related to the controversies in which Byron was involved. This is not a very methodical plan.
It seems best to work contextually, following one line of development while making occasional forays
into others that bring related persons and documents into view, then returning recursively to the
original train. The hypertext environment is inherently digressive, as are Byronic controversies. Is it
better to begin with the beginning and move chronologically down a document trail, or to begin at the
end, working backwards on the basis of the fuller knowledge supplied in later documents? Thus far, it
seems best just to walk the hermeneutic circle, proceeding backwards, forwards, and sideways as
knowledge accumulates.
Textual Editing and Encoding
Texts in LBT are basic transcriptions. Obvious and unambiguous compositor’s
errors are silently corrected but original spelling and punctuation are otherwise maintained. Hyphens
occurring in line breaks have been removed. There is no pretense made of presenting a critical edition:
where significant textual issues are known to exist they are mentioned in prefaces, but documents have
not been collated and textual variants are not recorded. In the case of newspaper material the text is
keyed by hand; otherwise it is derived by optical character recognition (OCR). While some documents are
scanned on site, recourse is made to OCR from the vast store of public-domain material published by
Google Books and the Internet Archive, the text being hand-corrected against the page images in the
process of markup. These sources have occasional missing or garbled pages that must be supplied
from other witnesses of the same edition but not necessarily the same printing, resulting in a
conflated text. Given the objectives of the project we have pursued amplitude at the expense of
exactitude, a compromise we have been the more willing to make since page images are usually readily
available elsewhere.
Documents are encoded according to the TEI P5 Guidelines using a basic set
of elements and no customization. TEI offers the advantages of standardization and the prospect of
longevity; as a form of XML encoding it is well suited to data manipulation and migration. TEI is
descriptive markup: the “p” element is not a formatting instruction as in HTML, but a
semantic statement indicating “this is a paragraph.” While LBT style sheets output HTML to
the screen, the underlying TEI-encoded documents are structured in ways that enable faceted searching
such as looking for personal names where they appear only in poems or letters, or place names where
they appear only in datelines. Data files in LBT are also encoded in TEI/XML so that at runtime the
style sheets can add notes and links using current information stored outside of the document itself.
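A small, made-up fragment suggests the kind of structure involved (the elements are standard TEI P5,
though the details of LBT’s own encoding may vary):

    <div type="letter">
      <head>Lord Byron to John Murray</head>
      <opener>
        <dateline><placeName>Venice</placeName>,
          <date when="1817-02-15">February 15, 1817</date></dateline>
      </opener>
      <p>Dear Sir, ...</p>
    </div>

Because the place name sits inside a dateline, a faceted search can distinguish “Venice” as a
place of writing from the same string occurring anywhere else in the text.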
With markup as with textual editing there are trade-offs between breadth
and depth. Lighter markup has the advantage of rendering machine-readable documents more intelligible
to human eyes and it enables encoders to participate in the project with less training than would
otherwise be the case. TEI elements are largely inert without the use of project-specific XML IDs and
ID-refs, which is where much of the encoding labor is concentrated: marking a title not just as a
title, but as a particular title, a person as a particular person (more about this below). Since LBT
does not offer page images, extensive use is made of the “rend” attribute used by the style
sheets to render the machine-readable document more attractive to the human eye.
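A single line of markup shows where the labor lies. In this invented example the element says what a
thing is, the “ref” attribute says which thing it is, and “rend” tells the style
sheets how to draw it (the title identifier is hypothetical; the Murray identifier anticipates the
example given below):

    <p rend="noindent">A letter from <persName ref="#JoMurra1793">Mr.
      Murray</persName> concerning <title ref="#DonJuan1819"
      rend="italic">Don Juan</title> arrived the same week.</p>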
Interface
The screen image in LBT is a rendering of an underlying TEI/XML document
which is itself a rendering of a print object that might be a book, an article, or a newspaper column.
It goes without saying that interface design is all about mediation: between paper and screen media,
between machine-readable and human-readable documents. The current norm for presenting books on the web
is page images overlaying text-searchable, usually uncorrected OCR. Photographic images of pages, while
they give more direct access to the material object than machine-rendered text, do have limitations:
they can be hard on the eye when lacking sufficient contrast and clarity, and are not suited to
hyperlinking. The underlying OCR lacks most of the capabilities of machine-readable text. The HTML
documents output by LBT lack the presence of page images but have the advantage of being designed for
presentation on the screen.
LBT strives to present text that resembles the appearance of the paper
source, adding page breaks and scaling fonts to correspond to the original. There are limits to what
can be done this way. Because of the overlapping hierarchy issue besetting XML documents, page-break
indications and objects tied to them like page numbers, running heads, and footnotes can be a
challenge to render. TEI does not directly handle screen formatting, which is added at run-time using
style-sheet rules. While rules can be written to handle things like flush-left first paragraphs,
there comes a point where trying to model an irregular document “according to rule” ceases
to be a worthwhile endeavor: style-sheet rules are no match for typesetting when it comes to
flexibility. Since different browsers break lines in paragraphs at different places and none add
end-line hyphens, HTML renderings of paper documents cannot be proper facsimiles.
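A minimal sketch of such a rule in XSLT 1.0, with match patterns and class names that are illustrative
rather than LBT’s own:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- every TEI paragraph becomes an HTML paragraph -->
      <xsl:template match="tei:p">
        <p>
          <!-- the first paragraph of its division is set flush left -->
          <xsl:if test="not(preceding-sibling::tei:p)">
            <xsl:attribute name="class">flush-left</xsl:attribute>
          </xsl:if>
          <xsl:apply-templates/>
        </p>
      </xsl:template>
    </xsl:stylesheet>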
Yet it does seem worthwhile to try to capture the visual style of a paper
document. While the literal screen-size will be the same, one can suggest the difference between a
quarto, an octavo, and a duodecimo by reproducing something of the original formatting. But in some
cases it has seemed best not to do so: why render a newspaper article as illegible as its four-column,
tiny-type original? A photographic image is much better suited to conveying that idea. Time constraints
also enter in: monographs get their own custom style sheets, while journal articles are rendered from a
generic style sheet with fewer formatting options. There are limitations to what is possible, and then
there are limitations to what is practical. In the digital medium a desire for systematic consistency
is always pulling against the contrary desire for idiosyncratic accuracy; knowing that these
TEI-encoded documents could be rendered differently using alternative style sheets encourages a kind of
aesthetic pragmatism.
Popups are used rather than end-of-document notes because they make better
use of the screen medium. Some of the documents, the Thomas Medwin Conversations archive for
instance, are extensively cross-linked through marginal notes. At runtime, these pull text from related
documents to create marginal glosses in the HTML output. The visual juxtaposition of a document with
its sources and commentary is a desideratum for LBT, though as yet experimental and imperfectly
implemented. The idea of getting the documents to visually “talk to each other” seems worth
pursuing. Down the road there is the prospect of using distributed computing to pull in text from
outside of LBT, which will doubtless raise a new set of interface issues.
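The mechanics can be sketched in outline: in XSLT the document() function retrieves another file at
transformation time, so a note pointing at a passage in a related document can be expanded into a
gloss. The target convention below (“file.xml#id”) is an assumption, not LBT’s
documented practice:

    <!-- expand a cross-reference into a marginal gloss at runtime -->
    <xsl:template match="tei:note[@type = 'margin']">
      <xsl:variable name="file" select="substring-before(@target, '#')"/>
      <xsl:variable name="id" select="substring-after(@target, '#')"/>
      <div class="marginal-gloss">
        <!-- pull the target passage out of the related document -->
        <xsl:apply-templates select="document($file)//*[@xml:id = $id]"/>
      </div>
    </xsl:template>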
After all, a machine-readable text is not a book. It is an “ordered
hierarchy of content objects,” which is to say a nested set of text-containers labeled
“chapter,” “paragraph,” or “footnote.” These digital objects can be
output to the screen to create a visual simulacrum of a book, but that is not all that can be done with
them. In LBT, machine-readable text is used to generate hyperlinked tables of contents like those
running down the left margin of the screen, and on index pages chronologically arranged links are
broken out into sub-lists according to appearances of an item in chapters, letters, and verse. In these
ways the machinery behind the interface takes advantage of the container-like structures of XML markup,
treating the parts of a document like fields in a database.
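A hyperlinked table of contents, for instance, falls out of the markup almost for free; a sketch in
XSLT 1.0 (element and class names illustrative):

    <!-- build a linked table of contents from the division headings -->
    <xsl:template name="toc">
      <ul class="contents">
        <xsl:for-each select="//tei:div[tei:head]">
          <li>
            <a href="#{generate-id(.)}">
              <xsl:value-of select="tei:head[1]"/>
            </a>
          </li>
        </xsl:for-each>
      </ul>
    </xsl:template>

The same generate-id() call, applied when the divisions themselves are rendered, supplies the
matching anchors.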
Marginal notes are one example of using the database structure to create
one document out of another by “querying” its content; the records for persons (to which
the popups link) are another. These pages do not exist as such but are created on the fly using
prosopographical data and search algorithms. The style sheet creates what is in effect a navigational
hub by linking to things the person wrote and things written about the person. As LBT develops these
pages will become more elaborate (identifying document exchanges in which the person participated, for
instance), and the expectation is that eventually information can be pulled in not only from LBT but
from across the semantic web—in real time, using similar search algorithms.
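In outline the mechanism is simple, even if the production version is not. Assuming a personography
file and a pre-built index of references (the file names, element names, and attributes below are all
hypothetical), a person page is a template parameterized by an identifier:

    <!-- assemble an on-the-fly hub page for one person -->
    <xsl:template name="person-hub">
      <xsl:param name="pid"/> <!-- e.g. 'JoMurra1793' -->
      <div class="person-hub">
        <h2>
          <xsl:value-of
              select="document('persons.xml')//tei:person[@xml:id = $pid]/tei:persName[1]"/>
        </h2>
        <ul>
          <!-- every indexed reference to this person, wherever it occurs -->
          <xsl:for-each select="document('index.xml')//entry[@ref = concat('#', $pid)]">
            <li><a href="{@href}"><xsl:value-of select="@label"/></a></li>
          </xsl:for-each>
        </ul>
      </div>
    </xsl:template>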
In short: by means of semantic markup, digital representations of printed
books can serve as interface-portals, pulling in text and information not just from the archive but
eventually from the web at large. But here again extensibility requires principles of selection if the
potentials of machine-readable text are to be realized.
Needles and Haystacks
Perhaps nothing illustrates the advantages of semantic markup so well as
the problem of searching for personal names on the web: how to identify references to a particular
Smith or Johnson from among the millions of documents containing the strings “Smith” or
“Johnson”? LBT manages this by using unique identifiers in the markup. For example, there
are currently thirteen distinct persons in the archive named “John Murray,” all
distinguishable by unique identifiers: JoMurra1793, JoMurra1859, etc. By means of this semantic markup
a search algorithm can identify particular Murrays even where the document text reads “Mr.
Murray,” “John M,” “Lord Murray,” “Duke of Atholl,”
“Anak of publishers,” or “**** ******”. Similarly, references to books can be
pulled up where titles are distorted or not given at all.
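In the markup this looks as follows. The sentence is invented, and which identifier belongs to which
John Murray is assumed for the sake of illustration, but the identifiers themselves are those used in
the archive:

    <p>A note from <persName ref="#JoMurra1793">the Anak of
      publishers</persName> crossed one from <persName
      ref="#JoMurra1859">Lord Murray</persName> in the same week.</p>

A search keyed to the “ref” value retrieves the periphrases and the asterisks alike; a
string-search for “John Murray” retrieves neither.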
Since LBT is about relationships between persons and documents,
considerable labor goes into this kind of semantic “collation.” Satires often bristle with
dashes and asterisks, and controversialists often do not know the names of their anonymous opponents.
But even where there is a name- or title-string to search for, it is often garbled in uncorrected OCR, a
greater problem than is sometimes realized. The text-handling algorithms used by Google Books and the
Internet Archive automatically “correct” Horne Tooke to “Home” Tooke and John
Galt to John “Gall,” rendering these names all but invisible to string-searches. Semantic
markup not only reduces the number of false hits by discriminating among Murrays, it can boost the
number of correct identifications by an order of magnitude.
Much of the labor going into LBT is simply the conventional philological
work of establishing identifications for “Miss Andrews,” “Mr. Roberts,” and
“Dr. Brown.” Unless there is some obvious risk of ambiguity, nineteenth-century documents
suppress the given name and use the more polite if more elliptical form of reference. In the original
index (if there is one) one often finds just “Miss Andrews” or nothing at all because even
contemporaries were having difficulties identifying persons. Where identifications can be made (more
often than not these days) machine-readable indexing can not only supply the reference but also link it
to additional information about Mr. Roberts or Dr. Brown.
For semantic searching to work across the web persons and titles need to be
uniquely identified. This can be accomplished by adding the prefix “lordbyron.org” to an
LBT ID, though in practice this is not much help. LBT links persons to canonical forms of a name as
used by the Library of Congress, the National Register of Archives, and the Oxford Dictionary of
National Biography. Where persons have yet to be uniquely identified elsewhere, every attempt
is made to supply a machine-readable, eight-digit date of death (“1848-12-03”) that can be
used computationally to discriminate among John Smiths. If and when RDF repositories for persons and
books become part of the web infrastructure for digital humanities, LBT data files will facilitate
identification by linking personal names to parents, spouses, and publication-titles—even to references
to them in books.
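A person record along these lines might look like the following sketch; the identifiers, names, and
relation are invented, but the “when” value illustrates the machine-readable date format just
described:

    <listPerson>
      <person xml:id="JoSmith1848"> <!-- hypothetical identifier -->
        <persName>Smith, John</persName>
        <death when="1848-12-03"/>
      </person>
      <person xml:id="MaSmith1852">
        <persName>Smith, Mary</persName>
      </person>
      <listRelation>
        <relation name="spouse" mutual="#JoSmith1848 #MaSmith1852"/>
      </listRelation>
    </listPerson>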
Acknowledgements
Before thanking the many persons who have been involved in Lord Byron and
his Times I must shift pronouns and acknowledge that to this point the “we” of the
preceding paragraphs has been largely “me.” Digital humanities projects, we are told—and I
agree—are necessarily collaborative. And yet, like Bottom in the play, I find myself appropriating all
the roles to myself: projector, editor, and commentator, data-designer, web-designer, and coder;
bibliographer, archivist, and genealogist. This is not how things ought to be and not how I would wish
them to be. But there are reasons.
In more innocent times I became accustomed to going it alone in an
unstructured environment. Doing one’s own design and coding, while it has its downside, has also this
advantage: where working contractually with a grant provider would require making decisions up front
and fulfilling terms in two or three years, I have had the luxury of operating within a much longer
time-horizon in a more flexible way. This enables experimentation and risk-taking. LBT is neither fish
nor fowl, neither a massive OCR-with-page-images digitization project nor a scholarly edition done to
exacting standards. While I believe that the implementation of XML/TEI adopted here makes sense, the
infrastructure and software required to fulfill its semantic-web ambitions is, if not quite vaporware,
as yet impractical.
Digital humanities is inherently risky business. No one knows how matters
will stand fifteen or twenty years from now, when or if the semantic web will become a reality, or how
archival projects will be sustained over the long term. As the digital infrastructure required for
humanities computing develops—as I hope it will—risks will decrease, collaboration will become easier,
and projects built on semantic-web principles will realize their potential in an environment where
“undefined extensibility” is unequivocally an asset. Archival projects with defined data
but undefined boundaries are well suited to sharing information and documents with unknown
collaborators in unanticipated ways. In the meantime, LBT is underway, and for that there are many
persons to thank.
Peter Graham of the English Department at Virginia Tech has supported the
project in large ways and small, but especially by introducing me to the very generous community of
Byron scholars. In the planning stage and since, we have had a series of fruitful conversations at the
NINES headquarters at the University of Virginia with Andy Stauffer, Jerome McGann, and Dana Wheeles.
They have given sage advice and cheerful encouragement when most needed, and have made building LBT a
more self-conscious and critical enterprise than it would otherwise have been. Andy Stauffer
magnanimously came to our rescue when LBT found itself without a server and continues to host the
project. The “networked infrastructure” fostered by NINES has been an essential incubator
for LBT and many another digital project.
Laura Mandell of 18thConnect, now of ARC, has likewise helped to guide LBT
through its early stages; her vision of an emergent digital infrastructure for the humanities
has been an inspiration; the prosopographical dimension of LBT, now getting underway, is a response to
her encouragement. Peter Robinson, too, has helped me to imagine what distributed computing might look
like and how it is to work. Paul Curtis and Peter Cochran, at work on digitizing the Byron letters and
getting them into XML/TEI, have been in conversation with us, and we look forward to partnering with them as a
means of building a connected web of Byron documents beyond LBT.
For technical advice and assistance I am indebted to Cage Slagel who helped
me to grasp the constraints and possibilities of sharing data on the web; the conversations we had will
bear fruit in years to come. I am also grateful to Nick Laiacona and Lou Foster of Performant Software
in Charlottesville for their help in getting LBT up and running and for their assistance with the
scalability issues that have been making life difficult. As a very sophomoric programmer I particularly
value their professional advice and patience.
My colleagues at Virginia Tech have likewise been instrumental in getting
LBT underway. CATH, our local home, is the creation of the unflappable Dan Mosser with whom it has long
been a pleasure to be associated. He and I have been partners in the digital humanities enterprise for
going on two decades now. Eve Trager in the English Department has been essential to the LBT physical
plant—assisting with technical knowledge and skill beyond my reach. Carolyn Rude, my department head,
has likewise supported the project in the most concrete sort of ways: with summer research assistants
to help with encoding documents, and a course off to pursue a grant proposal. She has been a constant
friend to the digital humanities, understanding the complexities of the work, with its long-term
horizons and short-term bumps and rubs.
Many students have had a hand in LBT, which has proven a useful vehicle for
teaching the principles of TEI-encoding. Melissa Smith worked on the Countess of Blessington in
connection with her 2010 MA thesis at Virginia Tech. In 2009 Anna Mackenzie Radcliffe at UVa worked as
a research assistant on The Last Days of Lord Byron, and in 2011 Hatley Clifford at Virginia
Tech on Astarte. In 2009 Thomas Minogue, a Virginia Tech honors student, worked with me on the
Blackwood’s-London Magazine controversy. In 2010 Daniel Perkins at Virginia Tech did
preliminary markup on W. H. Humphreys’ Journal of a Visit to Greece, and the same year worked on
Millingen’s Memoirs of the Affairs of Greece with his colleagues in the Byron Honors Seminar at
Virginia Tech, Amber Eames and Alex Pettingill.
MA students in the Digital Humanities course at Virginia Tech have also
worked on LBT documents: in 2011 on Cyrus Redding’s life of Thomas Campbell: Paul Spencer,
Jessica Bates, Bruce Blansett, Cacey Canipe, Hatley Clifford, Kaitlin Clinnin, David Charles Duckett,
Kate Natishan, Alex McCarthy, and Elizabeth Phelps. In 2010, on James Kennedy’s Conversations on
Religion, with Lord Byron: Pearl Blevins, Eric Boynton, Andrew Casto, Heather Draxl, Lindsay
Ehrlich, Daniel Helbert, Raymond R. Higgins Jr, Michael Lautenschlager, Jerry Liles, Ben McClure, Mary
Papadapolous, and Grace Marie Mike; and on Pietro Gamba’s Narrative of Lord Byron’s
Last Journey to Greece: Tess Szell, Chelsea Skelley, Todd Stafford, Zach Woods, and Sarah
Yakima.
David Hill Radcliffe
August 2011