The Life and Times of Lord Byron: Indexes

A phrase frequently heard in connection with computing is “that’s not a flaw, that’s a feature.” In the case of LBT the flaw or feature is a commitment to unlimited extensibility on the one hand and semantic-web principles on the other. The object of study is conceived less as a fixed set of paper documents to be digitized than as a dynamic nexus of relationships among persons and documents to be articulated. Materials concerning Byron, his writings, his public, and his critics are bottomless but not random or formless—consisting as they do of series of events, collections of documents, and networks of human relationships. These can be charted and graphed digitally in an archive of machine-readable documents.

Since the Byron literature is but a drop in an ocean of literary information it becomes necessary to combine unlimited extensibility with principles of selection and definition. This is where the concept of a “semantic web” comes in. Search engines parse strings of characters; they know nothing about documents or relationships until they are defined by means of metadata, markup, and ontologies. Semantic computing implies working with the things that strings of characters refer to—things like events, documents, and people. Defining semantic objects within or across digital archives enables one to query and analyze extensive bodies of information in sophisticated ways. But semantic computing is labor intensive: to be rendered machine-readable, documents must be parsed and described by human editors.

LBT will consist of an archive of archives, each centered on some point of contention related to Byron and his reading public. Its documents are linked to machine-readable data files and human-readable commentary. As material is added and software improves LBT will grow in depth and complexity, but a project indefinitely extensible is by definition perpetually incomplete. Digital archives challenge designers to create something operational and useful in the present while anticipating changes to come. This project anticipates the advent of data- and document-exchange via distributed computing—surely the quickest way to grow an archive—while undertaking the work of constructing machine-readable documents for use in such a digital infrastructure.

Whether such reliance on human labor is ultimately more flaw or feature remains to be seen: it will be years before we have large digital archives of literary material with the requisite markup and metadata, with open access, and with the standards required for computing across dispersed archives. It may not happen at all. But LBT will explore possibilities for semantic markup with the ultimate end of being subsumed wholesale into a computational network. It will grow through a series of published states not unlike those extensible works Childe Harold and Don Juan: new “cantos” will be added as time and circumstances allow with the expectation that the project will evolve in unexpected ways.

One aim of assembling a hypertext Byron archive is to bring context to bear on documents by linking them to statements to which they responded and those they provoked, calling attention to acts of reading performed by particular agents in particular circumstances—as opposed to the generalities of period norms and implied readers. Byron’s “Sketch from Private Life,” for example, reads rather differently when juxtaposed with the letter penned in response by Mary Anne Clermont, the object of its personal abuse. By assembling varied responses of readers of this poem we can better assess the kinds and degrees of personal agency involved in Byronic satire: its effects on readers, but also the effects of their reactions on the writer. Mary Anne Clermont may have had little sway over Byron but her friends in “private life,” and friends of her friends in public life, presumably did.

Another aim is to consider how individual acts played out collectively by considering how agents and documents participated in groups. The “Sketch from Private Life” and Miss Clermont’s response are parts of a whole body of discourse concerned with the Byron separation. We know that Byron was driven out of society, that he spent his remaining life responding to his banishment, that the events of 1816 colored public response to all he said and did afterwards. But just how personal and collective agency played out in these transactions remains unclear, as does the extent to which decades of writing about the Byron separation contributed, directly or indirectly, to changing attitudes about marriage. By bringing quantitative data to bear on this and other Byron controversies, linking statements to persons and persons to groups, we might hope to understand the social dynamics of literary production better than we presently do.

The code for finding, parsing, tabulating, and graphing this kind of information is not yet written, the documents not yet collected, and it remains to be seen whether machine-readable documents can be indexed and queried in ways sufficiently general and sufficiently particular to address research questions like these. But it seems like the right time to experiment.

Quo vadis?

The archive of documents in LBT is intended to top out at perhaps 200 monographs and 2000 articles—about as much as could be managed over fifteen or twenty years working at the present pace. The proposed scale limits the time that can be allocated for editing, markup, and annotation. One could spend more time on fewer items, but the object here is to create a middle-sized digital archive of Byron material done with a middle-level of markup and commentary. There are prefaces and notes identifying persons and titles, but no passage-specific annotation. The plan is to devote about equal time to “Lord Byron” and “his Times”—that is, to Byron-specific documents and to contextual material related to his causes and contemporaries. 1870 seems like a reasonable terminus ad quem since Byron’s contemporaries were mostly dead by then, though of course their letters, memoirs, and commentary continued to appear.

At the time of writing the project has completed its third year. The first two were spent experimenting with different kinds of documents and sub-archives, trying to distinguish between the possible and the practical, learning how long things might take and estimating trade-offs: how much editing and annotation would be practical given the constraints? Since in LBT the priority is establishing connections between persons and documents, a first task was to design a tagging implementation to link documents to the data files storing information about people and titles. This needed to be done in conjunction with designing a screen-interface to render the machine-readable documents and data files. By the second year scalability was already becoming an issue, making it necessary to begin on an indexing system that remains a work in progress.

Several initial document collections were pursued, but one quickly became a preoccupation: memoirs of contemporary poets and society figures from which names and document-titles could be harvested. The life-and-letters genre turned out to be of particular interest since collections of letters enable one to locate names and titles in thousands of time-specific documents instead of just the larger books of which they are a part. The prospect of extracting from these volumes a cross-searchable, decades-long, day-by-day chronology of persons, titles, and events is proving irresistible. Despite the fact that the text of the letters is generally cut, bowdlerized, or otherwise tampered with, this seems like an efficient way to assemble social context for the Byron material. The time will come when these user-friendly if unreliable letter-texts can be linked to better versions outside the LBT archive.

From information in these life-and-letters volumes it becomes possible to chart the friend-of-a-friend and friend-of-a-foe social networks in which the Byron controversies played out. With the list of persons culled from these documents now approaching ten thousand (twice the original estimate) and with convergences beginning to appear among them, efforts will next shift to attaching to the names information about family connections and social ties in the form of a digital prosopography. This information will enable querying LBT for references to or information about “associates of Leigh Hunt,” “persons who wrote for the Edinburgh Review” or “works written by persons at Cambridge in the 1820s.” If and when the semantic web develops, lists of names and titles generated from within LBT could be used to search material outside of the archive in accordance with the principle of extensibility.
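
A query such as “associates of Leigh Hunt” can be pictured as a two-step lookup: first gather the persons tied to Hunt in the prosopography, then find the documents that mention any of them. What follows is only a minimal sketch in Python under stated assumptions. The record layout, the identifiers (`hunt`, `personA`, `doc1`, and so on), and the function names are all hypothetical and are not drawn from the LBT data files.

```python
# Hypothetical prosopographical records: each person lists known associates.
PERSONS = {
    "hunt": {"name": "Leigh Hunt", "associates": {"personA", "personB"}},
    "personA": {"name": "Person A", "associates": {"hunt"}},
    "personB": {"name": "Person B", "associates": {"hunt"}},
    "personC": {"name": "Person C", "associates": set()},
}

# Hypothetical index of which person identifiers are tagged in which document.
MENTIONS = {
    "doc1": {"personA"},
    "doc2": {"personC"},
    "doc3": {"personB", "personC"},
}


def associates_of(person_id):
    """Return the set of identifiers recorded as associates of a person."""
    return PERSONS[person_id]["associates"]


def documents_mentioning_associates(person_id):
    """Return the documents that mention at least one associate."""
    circle = associates_of(person_id)
    return sorted(doc for doc, people in MENTIONS.items() if people & circle)


hits = documents_mentioning_associates("hunt")
print(hits)
```

The same pattern would presumably extend to the other queries named above by swapping the relation consulted in the first step (periodical contributed to, university attended) while leaving the second step unchanged.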

In coming years emphasis will shift from memoirs of Byron’s contemporaries to contemporary memoirs of Byron, and from there, finally, to developing collections of pamphlets and articles related to the controversies in which Byron was involved. This is not a very methodical plan. It seems best to work contextually, following one line of development while making occasional forays into others that bring related persons and documents into view, then returning recursively to the original train. The hypertext environment is inherently digressive, as are Byronic controversies. Is it better to begin with the beginning and move chronologically down a document trail, or to begin at the end, working backwards on the basis of the fuller knowledge supplied in later documents? Thus far, it seems best just to walk the hermeneutic circle, proceeding backwards, forwards, and sideways as knowledge accumulates.

Textual Editing and Encoding

Texts in LBT are basic transcriptions. Obvious and unambiguous compositor’s errors are silently corrected but original spelling and punctuation are otherwise maintained. Hyphens occurring in line breaks have been removed. There is no pretense made of presenting a critical edition: where significant textual issues are known to exist they are mentioned in prefaces, but documents have not been collated and textual variants are not recorded. In the case of newspaper material the text is keyed by hand; otherwise it is derived by optical character recognition (OCR). While some documents are scanned on site, recourse is made to OCR available from the vast store of public-domain material published by Google Books and the Internet Archive, with the text hand-corrected against the page images in the process of markup. These sources have occasional missing or garbled pages that must be supplied from other witnesses of the same edition but not necessarily the same printing, resulting in a conflated text. Given the objectives of the project we have pursued amplitude at the expense of exactitude, a compromise we have been the more willing to make since page images are usually readily available elsewhere.

Documents are encoded according to the TEI P5 Guidelines using a basic set of elements and no customization. TEI offers the advantages of standardization and the prospect of longevity; as a form of XML encoding it is well suited to data manipulation and migration. TEI is descriptive markup: the “p” element is not a formatting instruction as in HTML, but a semantic statement indicating “this is a paragraph.” While LBT style sheets output HTML to the screen, the underlying TEI-encoded documents are structured in ways that enable faceted searching such as looking for personal names where they appear only in poems or letters, or place names where they appear only in datelines. Data files in LBT are also encoded in TEI/XML so that at runtime the style sheets can add notes and links using current information stored outside of the document itself.
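
The kind of faceted search described here, finding personal names only where they occur in letters or in verse, can be sketched with Python's standard XML library. This is an illustration under assumptions, not the project's own code: the sample is a bare fragment rather than a complete TEI document, the use of the `key` attribute on `persName` to carry the identifier is assumed, the helper `names_in` is hypothetical, and the pairing of identifiers with passages is invented for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Illustrative fragment: a chapter containing prose, a letter, and verse.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div type="chapter">
    <p>A call from <persName key="JoMurra1793">Mr. Murray</persName>.</p>
    <div type="letter">
      <p>I have written to <persName key="JoMurra1859">Lord Murray</persName>.</p>
    </div>
    <lg><l>So <persName key="JoMurra1793">Murray</persName> said.</l></lg>
  </div>
</body></text></TEI>"""


def names_in(root, container_path):
    """Collect persName identifiers found inside the matching containers."""
    keys = []
    for container in root.iterfind(container_path, NS):
        for name in container.iter(f"{{{TEI_NS}}}persName"):
            keys.append(name.get("key"))
    return keys


root = ET.fromstring(SAMPLE)
letter_names = names_in(root, ".//tei:div[@type='letter']")  # letters only
verse_names = names_in(root, ".//tei:lg")                    # verse only
all_names = names_in(root, ".//tei:body")                    # whole text

print(letter_names, verse_names, all_names)
```

Because the containers are semantic statements rather than formatting instructions, narrowing the search is simply a matter of choosing which container to look inside.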

With markup as with textual editing there are trade-offs between breadth and depth. Lighter markup has the advantage of rendering machine-readable documents more intelligible to human eyes and it enables encoders to participate in the project with less training than would otherwise be the case. TEI elements are largely inert without the use of project-specific XML IDs and ID-refs, which is where much of the encoding labor is concentrated: marking a title not just as a title, but as a particular title, a person as a particular person (more about this below). Since LBT does not offer page images, extensive use is made of the “rend” attribute used by the style sheets to render the machine-readable document more attractive to the human eye.

Interface

The screen image in LBT is a rendering of an underlying TEI/XML document which is itself a rendering of a print object that might be a book, an article, or a newspaper column. It goes without saying that interface design is all about mediation: between paper and screen media, between machine-readable and human-readable documents. The current norm for presenting books on the web is page images overlaying text-searchable, usually uncorrected OCR. Photographic images of pages, while they give more direct access to the material object than machine-rendered text, do have limitations: they can be hard on the eye when lacking sufficient contrast and clarity, and are not suited to hyperlinking. The underlying OCR lacks most of the capabilities of machine-readable text. The HTML documents output by LBT lack the presence of page images but have the advantage of being designed for presentation on the screen.

LBT strives to present text that resembles the appearance of the paper source, adding page breaks and scaling fonts to correspond to the original. There are limits to what can be done this way. Because of the overlapping hierarchy issue besetting XML documents, page-break indications and objects tied to them like page numbers, running heads, and footnotes, can be a challenge to render. TEI does not directly handle screen formatting, which is added at run-time using style-sheet rules. While rules can be written to handle things like flush-left first paragraphs, there comes a point where trying to model an irregular document “according to rule” ceases to be a worthwhile endeavor: style-sheet rules are no match for typesetting when it comes to flexibility. Since different browsers break lines in paragraphs at different places and none add end-line hyphens, HTML renderings of paper documents cannot be proper facsimiles.

Yet it does seem worthwhile to try to capture the visual style of a paper document. While the literal screen-size will be the same, one can suggest the difference between a quarto, octavo, and a duodecimo by reproducing something of the original formatting. But in some cases it has seemed best not to do so: why render a newspaper article as illegible as its four-column, tiny-type original? A photographic image is much better suited to conveying that idea. Time constraints also enter in: monographs get their own, custom style sheets while journal articles are rendered from a generic style sheet with fewer formatting options. There are limitations to what is possible, and then there are limitations to what is practical. In the digital medium a desire for systematic consistency is always pulling against the contrary desire for idiosyncratic accuracy; knowing that these TEI-encoded documents could be rendered differently using alternative style sheets encourages a kind of aesthetic pragmatism.

Popups are used rather than end-of-document notes because they make better use of the screen medium. Some of the documents, the Thomas Medwin Conversations archive for instance, are extensively cross-linked through marginal notes. At runtime, these pull text from related documents to create marginal glosses in the HTML output. The visual juxtaposition of a document with its sources and commentary is a desideratum for LBT, though as yet experimental and imperfectly implemented. The idea of getting the documents to visually “talk to each other” seems worth pursuing. Down the road there is the prospect of using distributed computing to pull in text from outside of LBT, which will doubtless raise a new set of interface issues.

After all, a machine-readable text is not a book. It is an “ordered hierarchy of content objects,” which is to say a nested set of text-containers labeled “chapter,” “paragraph,” or “footnote.” These digital objects can be output to the screen to create a visual simulacrum of a book, but that is not all that can be done with them. In LBT, machine-readable text is used to generate hyperlinked tables of contents like those running down the left margin of the screen, and on index pages chronologically-arranged links are broken out into sub-lists according to appearances of an item in chapters, letters, and verse. In these ways the machinery behind the interface takes advantage of the container-like structures of XML markup, treating the parts of a document like fields in a database.
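
To make the “ordered hierarchy” concrete, here is a minimal Python sketch that walks nested `div` containers and emits a table of contents from their headings. It assumes that divisions carry a `type` attribute and a `head` child; the sample text and the function name `table_of_contents` are hypothetical, and the LBT style sheets may well go about this differently.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Illustrative fragment: two chapters, the first containing two letters.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="chapter"><head>Chapter I</head>
    <div type="letter"><head>To a Friend</head><p>Text of the letter.</p></div>
    <div type="letter"><head>To the Same</head><p>Text of the letter.</p></div>
  </div>
  <div type="chapter"><head>Chapter II</head><p>Text of the chapter.</p></div>
</body>"""


def table_of_contents(parent, depth=0):
    """Return (depth, type, title) for every div nested under parent."""
    entries = []
    for div in parent.findall(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        title = "".join(head.itertext()).strip() if head is not None else "[untitled]"
        entries.append((depth, div.get("type", "div"), title))
        # Recurse so that letters are listed beneath their chapter.
        entries.extend(table_of_contents(div, depth + 1))
    return entries


toc = table_of_contents(ET.fromstring(SAMPLE))
for depth, kind, title in toc:
    print("  " * depth + f"{title} ({kind})")
```

Filtering the same entries by their type would yield the sub-lists by chapter, letter, and verse mentioned above.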

Marginal notes are one example of using the database structure to create one document out of another by “querying” its content; the records for persons (to which the popups link) are another. These pages do not exist as such but are created on the fly using prosopographical data and search algorithms. The style sheet creates what is in effect a navigational hub by linking to things the person wrote and things written about the person. As LBT develops these pages will become more elaborate (identifying document exchanges in which they participated, for instance), and the expectation is that eventually information can be pulled in not only from LBT but from across the semantic web—in real time, using similar search algorithms.

In short: by means of semantic markup, digital representations of printed books can serve as interface-portals pulling in text and information not just from the archive but eventually from the web at large. But here again extensibility requires principles of selection if the potentials of machine-readable text are to be realized.

Needles and Haystacks

Perhaps nothing illustrates the advantages of semantic markup so well as the problem of searching for personal names on the web: how to identify references to a particular Smith or Johnson from among the millions of documents containing the strings “Smith” or “Johnson”? LBT manages this by using unique identifiers in the markup. For example, there are currently thirteen distinct persons in the archive named “John Murray,” all distinguishable by unique identifiers: JoMurra1793, JoMurra1859, etc. By means of this semantic markup a search algorithm can identify particular Murrays even where the document text reads “Mr. Murray,” “John M,” “Lord Murray,” “Duke of Atholl,” “Anak of publishers,” or “**** ******”. Similarly, references to books can be pulled up where titles are distorted or not given at all.
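
The difference between searching strings and searching identifiers can be shown in a few lines of Python. In this sketch each reference is reduced to a pair: the text as printed and the identifier an editor has attached to it. The two identifiers are those quoted above, but the pairing of particular printed forms with particular identifiers is invented for illustration, and the function names are hypothetical.

```python
# (text as printed, identifier supplied in the markup); pairings are illustrative.
REFERENCES = [
    ("Mr. Murray", "JoMurra1793"),
    ("John M", "JoMurra1793"),
    ("**** ******", "JoMurra1793"),
    ("Lord Murray", "JoMurra1859"),
]


def string_search(refs, needle):
    """Match on the printed characters alone, as a search engine would."""
    return [text for text, _ in refs if needle in text]


def id_search(refs, person_id):
    """Match on the identifier, whatever the printed text happens to say."""
    return [text for text, pid in refs if pid == person_id]


by_string = string_search(REFERENCES, "Murray")
by_id = id_search(REFERENCES, "JoMurra1793")

print(by_string)  # conflates two different Murrays and misses disguised forms
print(by_id)      # finds one person under every form of reference
```

The string search returns references to two different men and overlooks the abbreviated and asterisked forms, while the identifier search returns every reference to one man and none to the other.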

Since LBT is about relationships between persons and documents, considerable labor goes into this kind of semantic “collation.” Satires often bristle with dashes and asterisks and controversialists often do not know the names of their anonymous opponents. But even where there is a name- or title-string to search for it is often garbled in uncorrected OCR, a greater problem than sometimes realized. The text-handling algorithms used by Google Books and the Internet Archive automatically “correct” Horne Tooke to “Home” Tooke and John Galt to John “Gall,” rendering these names all but invisible to string-searches. Semantic markup not only reduces the number of false hits by discriminating among Murrays, it can boost the number of correct identifications by an order of magnitude.

Much of the labor going into LBT is simply the conventional philological work of establishing identifications for “Miss Andrews,” “Mr. Roberts,” and “Dr. Brown.” Unless there is some obvious risk of ambiguity, nineteenth-century documents suppress the given name and use the more polite if more elliptical form of reference. In the original index (if there is one) one often finds just “Miss Andrews” or nothing at all because even contemporaries were having difficulties identifying persons. Where identifications can be made (more often than not these days) machine-readable indexing can not only supply the reference, but link it to additional information about Mr. Roberts or Dr. Brown.

For semantic searching to work across the web persons and titles need to be uniquely identified. This can be accomplished by adding the prefix “lordbyron.org” to an LBT ID, though in practice this is not much help. LBT links persons to canonical forms of a name as used by the Library of Congress, the National Record of Archives, and the Oxford Dictionary of National Biography. Where persons have yet to be uniquely identified elsewhere, every attempt is made to supply a machine-readable, eight-digit date of death (“1848-12-03”) that can be used computationally to discriminate among John Smiths. If and when RDF repositories for persons and books become part of the web infrastructure for digital humanities, LBT data files will facilitate identification by linking personal names to parents, spouses, and publication-titles—even to references to them in books.
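
Both devices mentioned here, the prefixed identifier and the machine-readable date of death, are simple to express in code. The Python sketch below is illustrative only: the slash used to join prefix and identifier is an assumption, the two John Smith records and their identifiers are hypothetical, and only the date format follows the example given above.

```python
from datetime import date


def global_id(local_id, prefix="lordbyron.org"):
    """Qualify a local identifier with the project prefix (separator assumed)."""
    return f"{prefix}/{local_id}"


# Hypothetical records for two persons sharing a name.
PERSONS = [
    {"id": "JoSmith_a", "name": "John Smith", "died": "1848-12-03"},
    {"id": "JoSmith_b", "name": "John Smith", "died": "1831-05-17"},
]


def find_person(records, name, died):
    """Return identifiers of persons matching both name and date of death."""
    target = date.fromisoformat(died)
    return [
        r["id"]
        for r in records
        if r["name"] == name and date.fromisoformat(r["died"]) == target
    ]


matches = find_person(PERSONS, "John Smith", "1848-12-03")
print(global_id("JoMurra1793"), matches)
```

Parsing the dates rather than comparing raw strings means that a malformed entry raises an error instead of silently failing to match.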

Acknowledgements

Before thanking the many persons who have been involved in Lord Byron and his Times I must shift pronouns and acknowledge that to this point the “we” of the preceding paragraphs has been largely “me.” Digital humanities projects, we are told (and I agree), are necessarily collaborative. And yet, like Bottom in the play, I find myself appropriating all the roles to myself: projector, editor, and commentator, data-designer, web-designer, and coder; bibliographer, archivist, and genealogist. This is not how things ought to be and not how I would wish them to be. But there are reasons.

In more innocent times I became accustomed to going it alone in an unstructured environment. Doing one’s own design and coding, while it has its downside, has also this advantage: where working contractually with a grant provider would require making decisions up front and fulfilling terms in two or three years, I have had the luxury of operating within a much longer time-horizon in a more flexible way. This enables experimentation and risk-taking. LBT is neither fish nor fowl, neither a massive OCR-with-page-images digitization project nor a scholarly edition done to exacting standards. While I believe that the implementation of XML/TEI adopted here makes sense, the infrastructure and software required to fulfill its semantic-web ambitions are, if not quite vaporware, as yet impractical.

Digital humanities is inherently risky business. No one knows how matters will stand fifteen or twenty years from now, when or if the semantic web will become a reality, or how archival projects will be sustained over the long term. As the digital infrastructure required for humanities computing develops, as I hope it will, risks will decrease, collaboration will become easier, and projects built on semantic-web principles will realize their potential in an environment where “undefined extensibility” is unequivocally an asset. Archival projects with defined data but undefined boundaries are well suited to sharing information and documents with unknown collaborators in unanticipated ways. In the meantime, LBT is underway, and for that there are many persons to thank.

Peter Graham of the English Department at Virginia Tech has supported the project in large ways and small, but especially by introducing me to the very generous community of Byron scholars. In the planning stage and since, we have had a series of fruitful conversations at the NINES headquarters at the University of Virginia with Andy Stauffer, Jerome McGann, and Dana Wheeles. They have given sage advice and cheerful encouragement when most needed, and have made building LBT a more self-conscious and critical enterprise than it would otherwise have been. Andy Stauffer magnanimously came to our rescue when LBT found itself without a server and continues to host the project. The “networked infrastructure” fostered by NINES has been an essential incubator for LBT and many another digital project.

Laura Mandell of 18thConnect, now of ARC, has likewise helped to guide LBT through its early stages; her vision of an emergent digital infrastructure for humanities projects has been an inspiration; the prosopographical dimension of LBT, now getting underway, is a response to her encouragement. Peter Robinson, too, has helped me to imagine what distributed computing might look like and how it is to work. Paul Curtis and Peter Cochran, at work on digitizing the Byron letters and getting them into XML/TEI, have been in conversation and we look forward to partnering with them as a means of building a connected web of Byron documents beyond LBT.

For technical advice and assistance I am indebted to Cage Slagel who helped me to grasp the constraints and possibilities of sharing data on the web; the conversations we had will bear fruit in years to come. I am also grateful to Nick Laiacona and Lou Foster of Performant Software in Charlottesville for their assistance in getting LBT up and running and for assistance with the scalability issues that have been making life difficult. As a very sophomoric programmer I particularly value their professional advice and patience.

My colleagues at Virginia Tech have likewise been instrumental in getting LBT underway. CATH, our local home, is the creation of the unflappable Dan Mosser with whom it has long been a pleasure to be associated. He and I have been partners in the digital humanities enterprise for going on two decades now. Eve Trager in the English Department has been essential to the LBT physical plant—assisting with technical knowledge and skill beyond my reach. Carolyn Rude, my department head, has likewise supported the project in the most concrete sort of ways: with summer research assistants to help with encoding documents, and a course off to pursue a grant proposal. She has been a constant friend to the digital humanities, understanding the complexities of the work, with its long-term horizons and short-term bumps and rubs.

Many students have had a hand in LBT, which has proven a useful vehicle for teaching the principles of TEI-encoding. Melissa Smith worked on the Countess of Blessington in connection with her 2010 MA thesis at Virginia Tech. In 2009 Anna Mackenzie Radcliffe at UVa worked as a research assistant on The Last Days of Lord Byron, and in 2011 Hatley Clifford at Virginia Tech on Astarte. In 2009 Thomas Minogue, a Virginia Tech honors student, worked with me on the Blackwood’s-London Magazine controversy. In 2010 Daniel Perkins at Virginia Tech did preliminary markup on W. H. Humphreys’ Journal of a Visit to Greece, and the same year worked on Millingen’s Memoirs of the Affairs of Greece with his colleagues in the Byron Honors Seminar at Virginia Tech, Amber Eames and Alex Pettingill.

MA students in the Digital Humanities course at Virginia Tech have also worked on LBT documents. In 2011, on Cyrus Redding’s life of Thomas Campbell: Paul Spencer, Jessica Bates, Bruce Blansett, Cacey Canipe, Hatley Clifford, Kaitlin Clinnin, David Charles Duckett, Kate Natishan, Alex McCarthy, and Elizabeth Phelps. In 2010, on James Kennedy’s Conversations on Religion, with Lord Byron: Pearl Blevins, Eric Boynton, Andrew Casto, Heather Draxl, Lindsay Ehrlich, Daniel Helbert, Raymond R Higgins Jr, Michael Lautenschlager, Jerry Liles, Ben McClure, Mary Papadapolous, and Grace Marie Mike; and on Pietro Gamba’s Narrative of Lord Byron’s Last Journey to Greece: Tess Szell, Chelsea Skelley, Todd Stafford, Zach Woods, and Sarah Yakima.

David Hill Radcliffe

August 2011