
Corpus Linguistics

Corpus linguistics combines computer-based research methods with linguistics. Since the 1960s, collections of data — or 'corpora' — have been used to further explore traditional areas of language study, including many of those discussed in the Linguistic toolbox. The major appeal of corpus linguistics is the huge amount of naturally-occurring data provided by the various types of software available. For example, the British National Corpus, which consists of 100 million words of spoken and written language, gives researchers the scope to investigate patterns in the way we use language based on a large and varied foundation of data. A researcher who wanted to find out more about the way that a term like 'terrorist' is used could search the BNC for occurrences of the word, and use the results as the basis for a wide-ranging analysis of the way it is used.

As well as providing a huge amount of data, corpus programmes provide a way for researchers to ensure that their findings are as objective as possible (although it is important to note that true objectivity is never possible!). The researcher interested in 'terrorist', for instance, might base their study on a large and complete corpus of all the articles in which a certain newspaper uses the word 'terrorist' in a particular year. They could do this by creating their own corpus or corpora, something that can be achieved quickly and easily through programmes such as LexisNexis, which allows a researcher to compile collections of data from various newspapers over chosen periods of time. While this approach limits a researcher's findings to a single source of language, it does provide a comprehensive foundation on which to conduct linguistic analysis, for example by looking at the way 'terrorist' is used in naming.

Corpus programmes can also be used to compare corpora. Programmes such as WMatrix allow a researcher not only to investigate the use of language in a single corpus, but also to compare language use across corpora: these could be collections of data that they have created themselves or the programme's own extensive corpora of various types of language. WMatrix, for instance, has collections of imaginative, educational and institutional language. The linguist researching 'terrorist' might compare its use in their newspaper corpus to that in a corpus of fictional writing, or compare the use of the term in corpora of newspaper language dating from before and after an event such as the September 11th attacks on the World Trade Center. This again allows the researcher to state their findings with confidence, whilst adding an extra element of interest to their research.
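When two corpora differ in size, raw counts of a term are not directly comparable, so comparisons are usually made on normalised frequencies. The following is a minimal sketch of that idea in Python, not a description of how WMatrix works internally. The `tokenize` and `per_million` helpers and the two tiny sample texts are invented purely for illustration.

```python
def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?'\"()") for word in text.lower().split())
    return [t for t in tokens if t]


def per_million(text, term):
    """Occurrences of `term` per million tokens of `text`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens) * 1_000_000


# Hypothetical miniature "corpora"; real corpora hold thousands or millions of words.
newspaper = "The terrorist threat and the terrorist suspect"
fiction = "The quiet village slept while the terrorist waited nearby"

news_rate = per_million(newspaper, "terrorist")
fiction_rate = per_million(fiction, "terrorist")
print(f"newspaper: {news_rate:.0f} per million; fiction: {fiction_rate:.0f} per million")
```

With corpora this small the figures mean little; the point is only that dividing by corpus size puts both counts on the same scale.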

As well as simply allowing researchers to search large amounts of data for uses of a particular term, corpus programmes have also developed new ways of searching and using data to aid research. Some of the most popular of these corpus methods are summarised below, with illustrative examples of projects that use these approaches.

Keywords

In addition to singling out the uses of a particular search term, corpus programmes such as WordSmith allow researchers to explore keywords. By comparing different corpora, a researcher can find out which words are used significantly more often in one corpus than in another: these are its keywords. This can be useful in finding out what makes one user of language different from another. Mulderrig (2008), for instance, compares corpora comprising education policy documents drawn from different UK governments between 1972 and 2005. The resultant keywords create a picture of key themes in the way different governments have discussed education: for example, keywords in the Thatcher corpus include terms with managerial connotations (standards, performance), while terms relating to economic performance (competitiveness, markets, firms) are key in the Major corpus. These findings might suggest something about Conservative educational policy over the years! In another study of political language, Jeffries and Walker (2012) compare a corpus of reporting on New Labour's time in power with a corresponding corpus for the John Major-led Conservative administration, identifying themes that are particularly prominent in New Labour language. They find that the words spin, terror, reform, choice and respect are all keywords in the New Labour corpus, creating a picture of the key themes differentiating New Labour under Tony Blair from the Conservatives under John Major.
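To make the idea of 'significantly more often' concrete, the sketch below scores each word with the log-likelihood statistic, one measure commonly used for keyness. It is an assumption that this is a suitable stand-in here: the article does not say which statistic WordSmith or the cited studies apply. The function names and the two toy texts are likewise invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?'\"()") for word in text.lower().split())
    return [t for t in tokens if t]


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness score for one word across two corpora."""
    combined = freq_study + freq_ref
    total = size_study + size_ref
    # Expected counts if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(study_text, reference_text, top_n=5):
    """Words relatively more frequent in the study corpus, ranked by score."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    size_study = sum(study.values())
    size_ref = sum(ref.values())
    scored = []
    for word, freq in study.items():
        # Keep only words over-represented in the study corpus.
        if freq / size_study > ref[word] / size_ref:
            score = log_likelihood(freq, size_study, ref[word], size_ref)
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy texts standing in for two much larger corpora.
study_text = "Reform and choice drive reform of schools and reform of hospitals"
reference_text = "Markets and firms shape the economy and markets reward firms"

top_keywords = keywords(study_text, reference_text)
for word, score in top_keywords:
    print(f"{word}: {score:.2f}")
```

On real data a researcher would normally also apply a significance cut-off and a minimum frequency, since very rare words can otherwise receive misleading scores.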

Concordances and collocates

Concordances allow a researcher to look at the use of particular terms in context. For example, the 'terrorist' researcher might use a corpus programme to uncover all the uses of 'terrorist' in a corpus, and then to compile a list of concordance lines in which 'terrorist' occurs, along with the 10 words that come before and after each instance. This provides the researcher with a more in-depth impression of the contexts in which a term is used. Programmes such as AntConc go a little further, creating graphs and statistics demonstrating where a search term occurs most frequently in a corpus.
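A concordance of this kind is simple to sketch in code. The example below collects each occurrence of a search term with a fixed number of words on either side, in the keyword-in-context layout that concordancers typically display. It is an illustrative sketch only, not how AntConc or any other programme is implemented; the `concordance` function, its `window` parameter and the sample sentence are all invented for the example.

```python
def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?'\"()") for word in text.lower().split())
    return [t for t in tokens if t]


def concordance(text, term, window=10):
    """Return (left context, term, right context) for each occurrence of `term`."""
    tokens = tokenize(text)
    term = term.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == term:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((" ".join(left), token, " ".join(right)))
    return lines


# Hypothetical sample text; a real study would load a full corpus.
sample = "The terrorist attack was condemned. Police said the suspected terrorist fled."

kwic_lines = concordance(sample, "terrorist", window=2)

# Right-align the left context so the search term lines up in one column.
width = max((len(left) for left, _, _ in kwic_lines), default=0)
for left, keyword, right in kwic_lines:
    print(f"{left.rjust(width)}  [{keyword}]  {right}")
```

The default of ten words mirrors the span mentioned above; a narrower window is used in the demonstration only because the sample text is so short.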

A further level of detail can be achieved through looking at collocation. Rather than painstakingly looking through concordance lists to find out which words (or 'collocates') a search term tends to occur near, a researcher can use a corpus programme to quickly and easily compile statistics listing the words that occur most commonly alongside a search term. In their study of the representation of immigration in newspapers, Gabrielatos and Baker (2008) use collocation to look at the words that occur most frequently within five words of search terms such as 'refugee', 'immigrant' and 'asylum'. The resultant lists of collocates give an immediate impression of how immigration issues are talked about, and provide the basis for a more in-depth investigation of their use. One of the researchers' findings is the frequent use of negative, liquid metaphors such as flood, pour and stream, providing evidence for the conviction that newspapers are partly responsible for negative portrayals of immigration and moral panic surrounding the issue.
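Counting collocates within a fixed span can be sketched as follows. This version ranks neighbouring words by raw frequency only, which is an assumption made for simplicity: corpus programmes generally also offer association measures such as mutual information or log-likelihood, and researchers often filter out function words. The `collocates` function, its `span` parameter and the sample text are invented for illustration and are not drawn from the cited study's data.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?'\"()") for word in text.lower().split())
    return [t for t in tokens if t]


def collocates(text, term, span=5, top_n=10):
    """Most frequent words occurring within `span` tokens of `term`."""
    tokens = tokenize(text)
    term = term.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == term:
            # Words to the left and right of this occurrence, excluding the hit itself.
            neighbours = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(neighbours)
    return counts.most_common(top_n)


# Hypothetical sample text, far smaller than any real newspaper corpus.
sample = (
    "Refugees flood into the town. A flood of refugees pour across the border. "
    "Officials say refugees stream in."
)

top_collocate = collocates(sample, "refugees", span=2, top_n=1)
print(top_collocate)
```

The default span of five matches the window described above; the demonstration narrows it to two so that the contexts of the three occurrences do not overlap in such a short text.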

Recommended reading

The following books provide an accessible introduction to corpus linguistics and key terms:

Baker, Paul. 2006. Using Corpora in Discourse Analysis. London: Continuum.

Baker, Paul, Andrew Hardie and Tony McEnery. 2006. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.

The following books provide guides to conducting a corpus study and summaries of existing research:

McEnery, Tony, Richard Xiao and Yukio Tono. 2006. Corpus-Based Language Studies: An Advanced Resource Book. Oxon: Routledge.

Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press.

Three journals feature new and original corpus linguistic research: the International Journal of Corpus Linguistics; Corpora; and Corpus Linguistics and Linguistic Theory.

References

Gabrielatos, Costas and Paul Baker. 2008. "Fleeing, Sneaking, Flooding: A Corpus Analysis of Discursive Constructions of Refugees and Asylum Seekers in the UK Press, 1996-2005." Journal of English Linguistics 36 (1): 5-38.

Jeffries, Lesley and Brian Walker. 2012. "Key Words in the Press: A Critical Corpus-Assisted Analysis of Ideology in the Blair Years (1998-2007)." English Text Construction 5 (2): 208-229.

Mulderrig, Jane. 2008. "Using Keywords Analysis in CDA: Evolving Discourses of the Knowledge Economy in Education." In Education and the Knowledge-Based Economy in Europe, ed. by Fairclough, Norman, Bob Jessop and Ruth Wodak, 149-170. Rotterdam: Sense Publishers.
