Good news

The plan was to complete data collection by the end of September and, happily, we’re on track! I’d like to give precise figures here, but I’m still hoping to get a few more texts this week to fill a couple of final gaps and to replace a few less-than-ideal texts (e.g. those written by people who’ve spent longer than 10 years outside NL). So, in approximate figures only, below is a description of the data collected so far. The categories, as I think I’ve mentioned before, are based on the structure established by the International Corpus of English (ICE) project. Using this structure means that it will be possible to compare my ‘Dutch English’ findings with different national varieties of English (e.g. Australian, Jamaican or Indian English). So without further ado, we have:

  • 60,000 words of correspondence, divided into social and business categories. This mostly concerns emails, but for the first time in an ICE-based project it also includes Facebook messages.
  • 20,000 words of ‘apprentice academic writing’ (i.e. by graduate students under untimed circumstances; master’s theses mostly!)
  • 80,000 words of academic writing, divided into four categories of around 20,000 words each: humanities, social sciences, natural sciences and technology. These are mostly extracts from PhD dissertations, journal articles, book chapters and monographs.
  • 80,000 words of popular writing, divided into the same four categories as the academic writing. This is a bit of a catch-all category, but it mostly includes magazine articles, blog posts, webpages, etc.
  • 40,000 words of reportage, i.e. press news and feature stories. This was a difficult category to collect, because when Dutch journalists do write in English these texts are usually edited by a native speaker, such as when the NRC had its English section. To avoid this I mainly targeted journalists from smaller publications and foreign correspondents, who may be more likely to write in English.
  • 20,000 words of persuasive writing, i.e. press editorials. Again, a difficult category to collect because there were few to be found. So this section also includes other forms of persuasive writing, like advertorials, and – interestingly, because it hasn’t been done before! – journalists’ blogs. (I get excited about this because when ICE was originally conceived, in the early 1990s, blogs didn’t exist. So how the design needs to be changed to reflect social and technological developments is worthy of reflection. I include journalists’ blogs as a form of editorial given that they often comment on current affairs, but in a way that includes more opinion than e.g. a straight press news report.)
  • 40,000 words of instructional writing, divided into administrative/regulatory texts (e.g. contracts, course guides) and skills/hobbies texts (which, out of necessity, mostly came from tech blogs).
  • 40,000 words of creative writing, especially short fiction and – again, excitingly – fan fiction (another new element vis-à-vis other ICE-based projects, also stemming from tech developments!).

This comes to a total of about 380,000 words. To reach the full target of 400,000 words, I intend to use 20,000 words from the International Corpus of Learner English (ICLE). ICLE, the sister project to ICE, has already been carried out by researchers in Nijmegen. The texts in ICLE are deliberately conceived as being written by ‘learners’, whereas the contributors to my project may in many cases be better described as ‘users’. So including one category of ICLE texts, namely undergraduate student essays, should allow for some nice comparisons with e.g. my ‘apprentice academic’ and ‘academic’ categories above.

The basic idea in ICE is that the 400,000-word total is reached by way of 200 texts of around 2,000 words each. But of course, monographs are far longer than 2,000 words, so an extract is taken from longer texts like these. Conversely, few of us write e.g. social emails as long as 2,000 words. For this reason, a number of texts comprise several subtexts, which effectively means that the true number of contributors is closer to 300. Soon, once I’ve made some pretty tables based on the numbers, I will try to post some more specific stats here, e.g. percentage of men/women, demographic spread (i.e. town/province of birth and town/province in which the contributors were raised), age groupings etc.
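
As a quick sanity check on the approximate figures in this post, the arithmetic can be sketched in a few lines of Python. The category labels and variable names are just illustrative shorthand, and the word counts are the rounded figures from the list above, not final numbers.

```python
# Rounded word counts per category as listed in this post (approximate, not final).
WORDS_PER_TEXT = 2000  # nominal length of one ICE text

collected = {
    "correspondence": 60_000,
    "apprentice academic writing": 20_000,
    "academic writing": 80_000,
    "popular writing": 80_000,
    "reportage": 40_000,
    "persuasive writing": 20_000,
    "instructional writing": 40_000,
    "creative writing": 40_000,
}
icle_words = 20_000  # undergraduate essays to be taken from ICLE

collected_total = sum(collected.values())        # words gathered for this project
corpus_total = collected_total + icle_words      # full corpus including ICLE
nominal_texts = corpus_total // WORDS_PER_TEXT   # number of 2,000-word text slots

print(f"collected: {collected_total:,} words")
print(f"with ICLE: {corpus_total:,} words")
print(f"nominal texts: {nominal_texts}")
```

Note that the number of text slots is not the number of contributors: because short texts such as emails are bundled into one slot, the contributor count ends up higher than the slot count.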

In short, I’m excited that things are taking shape! Over the next few months I’ll be converting all these texts into XML and doing all sorts of fiddly housekeeping exercises in preparation for the fun stuff – the actual analyses. Stay tuned!


