Love but hate but love

The thing about PhDs is you love them, but you hate them, but really you love them. Right at this moment, I hate mine with a passion. I had just spent the better part of today manually removing hundreds of progressives from my dataset when I decided to go cook dinner. (Roasted butternut squash and cherry tomatoes with fettuccine, in case you were wondering.) Here’s the thing. I live in college housing here in Cambridge. King’s College – but by ‘college housing’ I don’t mean the lovely old buildings that you see on postcards near the oh-so-famous chapel. Nooo. Only undergraduates – who by no means appreciate it – get to live there. King’s also owns a number of buildings around town, of better and poorer quality, where it sticks its graduate students. Now, I’m not complaining; ours is a beautiful old Victorian house and mine is a truly cavernous room. But the kitchens and bathrooms leave a lot to be desired. Here’s the rub: the oven takes a long, looong while to heat up, and even then it doesn’t get much past lukewarm. So while my squash and tomatoes were a-roasting, I went back to my desk to have another poke around in my dataset.

But it was gone. Yes, I saved it. Yes, I know where I saved it. No, it’s not in the temp folder. In fact, it’s still there, only it doesn’t look like it did a half hour back. I’d exported the AntConc output to Excel and then set up a number of spreadsheets, but obviously it decided not to prompt me to save all of those. Equally obviously, I’m also an idiot. Anyway, I had a minor meltdown, then started again. PROPERLY this time. And got about halfway through to where I was before, all before the squash and tomatoes were cooked.

Anyway. Progressives and datasets and spreadsheets and AntConc, what am I on about? It’s a while since I’ve done a proper project update, so here goes. As I mentioned in half-hearted passing here, one of the case studies we’re doing is on the progressive aspect. How do the Dutchies in my corpus use the English progressive in their writing? There are some interesting hypotheses out there. You may be thinking, ‘You think that spending six months looking at words that end in –ing is interesting?’ Sure I do. I can think of lots of less interesting things. Watching grass grow without a microscope. Knitting one square after another but never making a quilt. Coming up with the optimal toilet scrubber, that sort of thing. (Scratch that, that last one might actually be interesting.)

But back to the hypotheses. Some researchers say that people whose native language doesn’t have a progressive form will have trouble acquiring the English progressive. Dutch has various takes on the progressive, of course, but these have been found to be much more restricted and much less frequently used than the English progressive. So we might find that my Dutchies use the progressive less than native speakers, or in a smaller range of circumstances. On the other hand, lots of researchers claim that in non-native varieties of English the progressive is overused, and extended to places where native speakers wouldn’t use it, such as with stative verbs (I’m knowing this). So, what will it be – overuse or underuse? And can we localise any of these differences to particular verbs or verb classes? For example, I suspect that few Dutch people would ever use knowing as in the example above, but you certainly do hear things like I am living here for three years (for I have lived here for three years) or I am working at the university (for I work at the university). In this case, would we be looking at a systematic pattern, a closed system, where we could say that the Dutch have developed their own new rule, or are these uses just a result of differences in proficiency level?

Let’s follow all those questions with another question; a meta-question if you will: How do we go about answering all those questions? Well. First, you build a corpus. Check. (That sounds like the simplest thing in the world to do, but, interestingly, it took me precisely as many months as there are characters in the phrase ‘you build a corpus’ – 18, that is.) Then you identify all the uses of progressives in the corpus. This means uploading all your lovely corpus files into a corpus analysis program like AntConc, writing a complicated-looking search query like (be|am|’m|are|’re|aren’t|is|’s|isn’t|was|wasn’t|were|weren’t|been)\b\W*(\b[a-z]*\W*){0,3}?[a-z]*ing\b to identify all forms of be followed by a word ending in –ing with a specified number of intervening words (so that you capture not just I was dancing but also I was so totally omg dancing, for example), and hitting go. Thankfully, this takes about 4 seconds.
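For the code-minded, here is a rough Python sketch of what that search query does outside AntConc. The pattern is the one quoted above, typed with straight apostrophes. The rest is assumption, not a record of the actual setup: the helper name `find_candidates` is hypothetical, matching is taken to be case-insensitive, and curly apostrophes are normalised before searching.

```python
import re

# The search query from the post, typed with straight apostrophes.
# Assumption: matching is case-insensitive, hence re.IGNORECASE.
PROGRESSIVE_CANDIDATE = re.compile(
    r"(be|am|'m|are|'re|aren't|is|'s|isn't|was|wasn't|were|weren't|been)"
    r"\b\W*(\b[a-z]*\W*){0,3}?[a-z]*ing\b",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return every stretch of text the query matches (hypothetical helper)."""
    # Assumption: curly apostrophes are converted to straight ones first,
    # so that forms like 'm and 's are found either way.
    text = text.replace("\u2019", "'")
    return [m.group(0) for m in PROGRESSIVE_CANDIDATE.finditer(text)]


for sentence in [
    "I was dancing",
    "I was so totally omg dancing",
    "I was on the loo, thinking",
    "It\u2019s all about losing control",
]:
    print(sentence, "->", find_candidates(sentence))
```

The `{0,3}?` part is what allows up to three intervening words between the form of be and the word ending in ing. Everything the pattern returns is only a candidate: it has no way of telling a real progressive from any other be ... ing sequence.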

I did this a few weeks ago for the Dutch English corpus and got about 2500 hits. Sadly, the job’s not over yet. About half of these turn out not to be genuine progressives, and these you have to manually remove. By which I mean, yes, going through thousands of lines in Excel to manually identify cases like This is boring; I was on the loo, thinking; and It’s all about losing control, all of which are returned by the search query but none of which are actually progressives (ok, I did do this semi-automatically, but please, lend me your pity). I ended up with a final dataset of about 1400 progressives.

Awesome, you say. Now it’s time to get analysing. Weeelll, I say. Soon-ish. First, we have to do the same thing at least four more times. Sigh. Because here’s the thing: you can’t say the Dutch overuse or underuse the progressive compared to native speakers if you don’t know how much native speakers use it. So how can we figure that out? From a corpus, of course! But not just any corpus. It has to be a corpus that’s a lot like mine; that is, roughly comparable in terms of size (i.e. number of words) and text categories (genres). Luckily (come to think of it, luck had nothing to do with it), my corpus just so happens to be designed exactly like the corpora included in the International Corpus of English (ICE) project, which I’ve mentioned in other posts. These ICE corpora are available for a range of different countries. Stellar, you say: just take the British one (ICE-GB) and compare the Dutch one with that. Weeelll, I say (again). Other researchers have done that for different areas of grammar and found that Germans and Norwegians overused the features in question (I’ve forgotten what areas of grammar they were exactly. Can look it up if needed … she offers half-heartedly, since no-one would dream of asking). Woot, great finding, right? No. As it turned out, when they compared these same findings with an American corpus, the German and Norwegian figures were lower than the American results. In other words, they fell right in between the two groups of native speakers. Not overuse, not underuse either. But you wouldn’t have known that from just comparing with the British data.

So, ideally we want to compare with several different native Englishes. Unfortunately, there’s no ICE-USA yet, but there is one for New Zealand (ICE-NZ). So now we have to do the same procedure, including the manual filtering, to extract all the progressives from ICE-GB and ICE-NZ as well. It was this that I’d been doing on the British dataset when I did whatever moronic thing it was that I did earlier to lose the work. As an aside, funnily enough – though not at all surprisingly – that wasn’t the only moronic thing I did today. As I said above, the dataset for the Dutch corpus has about 1400 progressives. When I ran the search query on the British corpus it came up with about 14,000 hits. Now, this is before the manual filtering process, of course, but even if I exclude half of these it still leaves me with about five times more progressives in the British corpus than the Dutch one. FABULOUS, I thought. Hypothesis upheld: the Dutch massively underuse progressives compared to Brits. A world-changing finding, you’ll agree. FALSE. I’d forgotten to check a box I was supposed to check in AntConc. The real number for the British corpus was more like 2600. Only a wee bit more than the unfiltered figure for the Dutch corpus.
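A side note for the number-crunchers: comparing raw hit counts like this only makes sense because the corpora are roughly the same size. A common precaution in corpus work is to turn counts into a rate, such as hits per million words, before comparing. Here is a minimal Python sketch of that idea. The helper name `per_million` is hypothetical and the figures are made up purely for illustration; none of them come from the corpora discussed here.

```python
def per_million(hits, corpus_words):
    """Normalised frequency: hits per million words of running text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


# Made-up figures purely for illustration, not counts from any real corpus.
small_corpus_rate = per_million(hits=50, corpus_words=20_000)
large_corpus_rate = per_million(hits=120, corpus_words=60_000)
print(small_corpus_rate, large_corpus_rate)
```

In this invented example the larger corpus has more hits in absolute terms but the lower rate, which is exactly the kind of thing raw counts hide when corpus sizes differ.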

But wait, there’s more. For reasons I’m now too sleepy to go into here, we’re also doing this for ICE India and Singapore. So we’ll be able to compare the Dutch quantitative results with two different native corpora, and two different second-language corpora. Oh, the joy that will bring – just as soon as I’ve compiled the datasets. At least they won’t take 18 months apiece this time.


