Now that data collection is complete, it’s time to get down to the fun stuff. But first, just to clear something up: people often ask me what a ‘corpus’ actually is. Good question. I like the definition provided by Gilquin & Gries in their 2009 article: to summarise their description, a corpus is:
– machine readable
– representative, meaning that it contains data for each part of the variety/register/genre it is supposed to represent
– balanced, meaning that the sizes of its constituent parts are proportional to the parts of the variety/register/genre the corpus is supposed to represent (this being … er … a ‘theoretical ideal’ given the absence of reliable data on the proportional makeup of genres etc. in any given variety/language).
So: a corpus is a collection of texts that is stored and can be analysed electronically, and that is designed to promote generalisability of the findings to a wider population.
Prima. So now that we have collected the texts (in a carefully balanced, representative way, naturally), the next step is to convert them into the appropriate electronic form. For reasons I won’t go into here, that form is currently XML. To do this, we’re using a software development platform called Eclipse. You first need to set up a DTD (document type definition) file, which is a sort of template that sets up the rules for how you will present the data. From this file you then generate your (hundreds of!) XML files. In each file, you can then insert not just the text itself, but also all the metadata relating to the text: that is, the information that everyone who contributed a text indicated in the questionnaire. In the image below (excuse the quality; no time to fix it at present), you can see that this data is simply entered in the right-hand column.
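For anyone who wants to see the general idea outside Eclipse, here is a rough Python sketch of wrapping one text and its questionnaire metadata in a single XML document. It uses only the standard library. The element names (`text`, `metadata`, `body`) and the fields `gender`, `age` and `education` are invented for illustration; they are not our actual DTD.

```python
import xml.etree.ElementTree as ET


def build_text_file(text, metadata):
    """Wrap one text and its metadata in a single XML document (as a string)."""
    root = ET.Element("text")  # hypothetical root element name
    meta = ET.SubElement(root, "metadata")
    # One child element per questionnaire field, e.g. <gender>female</gender>
    for field, value in metadata.items():
        ET.SubElement(meta, field).text = str(value)
    ET.SubElement(root, "body").text = text
    return ET.tostring(root, encoding="unicode")


# Made-up example values, not real corpus data
xml_string = build_text_file(
    "Dear Sir or Madam ...",
    {"gender": "female", "age": 34, "education": "university"},
)
print(xml_string)
```

Note that this sketch only builds the file; it does not check it against a DTD, because Python's built-in `xml.etree.ElementTree` parser does not validate.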
In the source code, if you like this sort of thing, it looks like this:
These metadata fields are important because they will allow us, later, to search the corpus according to different properties; for example, you might want to look only at texts written by women, or only women with a high education level, or only acrobats aged 70+ with red hair from the south of the country who only eat cheese on Tuesdays, etc.
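To make that concrete, here is a small Python sketch of a metadata search. It assumes each file holds a `metadata` element with one child per questionnaire field, plus a `body` element for the text itself; those names, the field values and the two sample documents are all made up for the example and are not our real corpus format.

```python
import xml.etree.ElementTree as ET

# Two tiny stand-in documents; real corpus files would be read from disk.
SAMPLE_FILES = [
    "<text><metadata><gender>female</gender><education>high</education>"
    "</metadata><body>First text.</body></text>",
    "<text><metadata><gender>male</gender><education>high</education>"
    "</metadata><body>Second text.</body></text>",
]


def matches(xml_string, **criteria):
    """Return True if every requested metadata field has the given value."""
    meta = ET.fromstring(xml_string).find("metadata")
    return all(meta.findtext(field) == value for field, value in criteria.items())


def search(files, **criteria):
    """Return the body text of every document that matches all criteria."""
    return [ET.fromstring(f).findtext("body") for f in files if matches(f, **criteria)]


# Only texts by women with a high education level
print(search(SAMPLE_FILES, gender="female", education="high"))
```

Adding more keyword arguments narrows the search further, which is how you would eventually get down to the cheese-eating acrobats.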
So this, among a million other things, is what I’m up to at the moment. Along with textual markup, but more on that later …