We have a corpus.

Last time I wrote about my PhD, an age ago, back in November, data collection had just been completed and I was about to start building the corpus proper. Four months and a little bit of my soul later, I’m pleased to report that the corpus is now built. This is exciting, because it’s the first corpus of Dutch English that there is (that we’re aware of, anyway).

Creating the corpus, as I mentioned in my last post, consisted not just of converting many, many Word documents into XML files, but also of enriching those files with what we call textual markup or annotation. There are several reasons for doing this:

  1. To reinstate the formatting of the original text. An XML file is similar to a .txt file in that any text you copy into it is stripped of all its formatting. So to make sure you are producing a faithful representation of the original text, it’s important to reinstate this formatting using XML tags. For example, if a word in the original text was in italics, it is marked in the XML file like this: <it>word</it>. We do the same thing to indicate bold font and underlining, the start and end of paragraphs, headings, hyperlinks, footnotes, superscript and subscript, and changes of typeface. We also use special tags for quotes; if an academic text in the corpus quotes, say, a British scholar, then we mark the quote as ‘extra-corpus’ (<X>quote</X>) to make sure that it is not included in the analyses. And finally, we have various tags for untranscribed text, so for example if an original academic text had lots of mathematical formulae or tables, which are time-consuming to retype and irrelevant for the linguistic analysis anyway, they are simply marked as <untranscribed type="formula"/>, <untranscribed type="table"/>, etc.
  2. To ‘enrich’ the corpus for the purposes of linguistic research. By this we simply mean that tags are also used to provide useful information other than that related to formatting. As an example, Dutch words are marked as such: “The party was like oh my god totally <dutch>gezellig</dutch>.” This is because, in the future, someone, somewhere, may decide they want to research the phenomenon of code-switching, for example (where people switch back and forth between languages). Marking every instance of a Dutch word in these English texts means that you would then only need to search the corpus for the tag <dutch> rather than searching individually for, well, every Dutch word there is.
  3. To ensure anonymity for the contributors. All those lovely people who bravely handed over their texts to some random researcher via email need to be guaranteed their privacy. So emails will now read “Dear Ms <anonymisation type="family-name"/>” or “Didn’t <anonymisation type="first-name"/> look horrific the other day?” Naturally, this applies to all texts and all identifying items, like company names, addresses, phone numbers, bank account numbers and even pets’ names.
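
To give a feel for why this markup pays off, here is a rough sketch of how a marked-up text could be queried. It is not the tooling actually used for the corpus; it assumes Python's standard `xml.etree.ElementTree` module, and the sample text, the `<text>` and `<p>` wrapper elements and both function names are invented for illustration. Only the `<dutch>`, `<X>`, `<it>`, `<anonymisation>` and `<untranscribed>` tags come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <text> and <p> wrappers are assumptions,
# not necessarily the element names used in the real corpus files.
SAMPLE = """<text>
<p>Dear Ms <anonymisation type="family-name"/>, the party was like oh my god totally <dutch>gezellig</dutch>.</p>
<p>As one scholar puts it, <X>English is a global language</X>, and I find that <it>very</it> <dutch>leuk</dutch>.</p>
<untranscribed type="table"/>
</text>"""


def dutch_words(root):
    """Return the text of every <dutch> element, in document order."""
    return [el.text for el in root.iter("dutch")]


def text_without_extra_corpus(el):
    """Collect all text under el, skipping the content of <X> elements."""
    parts = [el.text or ""]
    for child in el:
        if child.tag != "X":
            # Keep the child's own text (italics, Dutch words, and so on).
            parts.append(text_without_extra_corpus(child))
        # Text following the child still belongs to the corpus author.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
print(dutch_words(root))
print(text_without_extra_corpus(root))
```

The first function is the "search for the tag <dutch>" idea from point 2; the second shows how quotes marked as extra-corpus could be left out before counting anything, while the anonymisation and untranscribed placeholders simply contribute no text.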

Not having umpteen undergraduates to do the legwork for me, this meant reading through every line of every text myself and inserting the appropriate markup. But all’s well that ends well: with all the XML files now complete, we can begin analyses! By which, of course, I mean we can begin the process of deciding what to analyse in excruciating detail. Tenses, as in “I am working here since five years”? Lexical items, like “Prof. dr. So-and-so”? Fabulously exciting word orders, like “The by critics highly praised movie was rubbish”? This means the coming months will be filled with a return to the literature: what sorts of analyses have been conducted using similar corpora, and what sorts of analyses seem as though they will be feasible, and interesting, in this corpus? Exciting times …


