Corpus linguistics is a popular field of linguistics which involves the analysis of very large collections of electronically stored texts, aided by computer software. A corpus is a large collection of computer-readable texts of different text types, representing spoken and written usage. McEnery and Wilson (1996: 1) characterize corpus linguistics as a methodology rather than a traditional branch of linguistics like semantics, grammar, phonetics or sociolinguistics (Baker, 2010: 93).
1. Theoretical concepts
Corpus linguistics derives rules or explores trends about the ways people produce language. Humans do not always make accurate introspective judgments about language, relying instead on cognitive and social biases, whereas computers can calculate frequencies and carry out statistical tests quickly and accurately. A corpus gives researchers access to linguistic patterns and trends, such as collocation (instances where two words tend to co-occur, such as tell and story). Corpus analysis enables researchers to confirm or refute hypotheses about language use. A further advantage of the corpus linguistics approach is that it enables researchers to quantify linguistic patterns, allowing more solid conclusions to be reached. For example, given the hypothesis that “men swear more than women”, a corpus analysis would not only allow us to support or reject it, but also show proportionally how much more often men swear than women and the range of swear words that they use, along with their relative frequencies, as well as affording evidence about differences and similarities in particular contexts or functions of swearing (Baker, 2010: 94).
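The frequency and collocation counting described above can be sketched in a few lines of Python. This is a minimal illustration only: the toy sentence and the two-word collocation window are assumptions for the example, not figures from Baker (2010), and real corpus tools use far larger spans of data and statistical measures of collocational strength.

```python
from collections import Counter

# Toy corpus: in practice this would be millions of tokenised words.
corpus = "she decided to tell a story and he began to tell another story".split()

# Word frequencies: the most basic corpus statistic.
freq = Counter(corpus)

# Collocates of "tell" within a window of 2 words either side.
window = 2
collocates = Counter()
for i, word in enumerate(corpus):
    if word == "tell":
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                collocates[corpus[j]] += 1

print(freq.most_common(3))
print(collocates.most_common(3))
```

Even on this tiny sample, story surfaces as a repeated collocate of tell, which is the kind of pattern a frequency-based analysis makes visible at scale.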
There are two types of corpus linguistics: corpus-driven and corpus-based approaches. Corpus-driven linguists tend to use a corpus inductively to form hypotheses about language, without making reference to existing linguistic frameworks. Corpus-based linguists tend to use corpora in order to test or refine existing hypotheses taken from other sources.
2. Building and Annotating Corpora
Any text or collection of texts could theoretically be conceived of as a corpus. It is possible to carry out corpus analysis on very small texts. McEnery and Wilson (1996) note that a corpus normally consists of a sample that is ‘maximally representative of the variety under examination’, is ‘of a finite size’, exists in ‘machine-readable’ form, and ‘constitutes a standard reference for the language variety which it represents’. This means that it will be large enough to reveal something about the frequencies of certain linguistic phenomena, enabling researchers to examine what is typical, as well as what is rare, in language (Baker, 2010: 95).
Kennedy (1998: 68) suggests that for the study of prosody (rhythm, stress, and intonation), a corpus of 100,000 words will usually be big enough to make generalizations for most descriptive purposes. He also says that an analysis of verb-form morphology would require half a million words. For lexicography (the analysis of words and their uses, often for dictionary building), a million words is unlikely to be large enough, as up to half the words will occur only once. Biber (1993) suggests that a million words would be enough for grammatical studies. The British National Corpus covers a very wide range of written and spoken language genres.
Sampling, balance, and representativeness are key theoretical concepts in corpus linguistics. Because a corpus ought to be representative of a particular language, language variety, or topic, the texts within it must be chosen and balanced carefully in order to ensure that some texts do not skew the corpus as a whole. Corpora are often annotated with additional information, allowing more complex calculations to be performed on them. Such information can take several forms.
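One common form of the annotation mentioned above is part-of-speech tagging stored inline with the words. The sketch below is purely illustrative: the tiny lexicon, the tag labels, and the word_TAG format are assumptions for the example; real corpora such as the BNC are annotated with dedicated taggers (e.g. CLAWS) and much richer tag sets.

```python
# Minimal illustration of inline part-of-speech annotation.
# The lexicon and tag labels here are assumptions for the example.
lexicon = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "PREP", "mat": "NOUN"}

def annotate(tokens):
    # Attach a tag to each token in word_TAG format; unknown words get UNK.
    return [f"{t}_{lexicon.get(t, 'UNK')}" for t in tokens]

print(" ".join(annotate("the cat sat on the mat".split())))
# the_DET cat_NOUN sat_VERB on_PREP the_DET mat_NOUN
```

Once tags like these are in place, a researcher can search the corpus for grammatical patterns (e.g. every DET NOUN sequence) rather than only for word forms, which is what makes the "more complex calculations" on annotated corpora possible.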
3. Types and Applications of Corpora
A range of different types of corpora are in existence. First, a distinction needs to be made between general and specialized corpora. A general corpus is one which aims to be representative of a particular language. General corpora, such as the British National Corpus or the Bank of English, contain a large variety of both written and spoken language, as well as different text types, by speakers of different ages, from different regions and from different social classes (Baker, 2010: 99).
A specialized corpus, however, can be smaller and contains a more restricted set of texts: there may be restrictions on genre, time, place, or language variety. Specialized corpora are generally easier to build than general corpora.
Another distinction involves whether a corpus contains spoken, written or computer-mediated texts. Spoken corpora generally tend to be smaller than written or computer-based corpora, due to complexities surrounding gathering and transcribing data.
Written corpora are generally easier to build (and large archives of texts that were originally published on paper can be found on the internet, meaning that such texts are already electronically encoded). However, unless specifically encoded, formatting information such as font size and colour, as well as pictures, may be absent from written corpora.
Corpora of computer-mediated texts are expected to become increasingly popular, as societies make more use of electronic forms of communication. Such texts can be very easy to gather – mining programs can store whole websites at a time – although it ought to be pointed out that computer-mediated texts can contain a lot of noise, such as spam, hidden keywords designed to make a page more attractive to search engines, and navigation menus, which may need to be stripped out of individual pages before the text can be included in the corpus.
A third distinction involves the language or languages in which a corpus is encoded. A growing area of corpus linguistics involves the comparison of different languages, which is useful in fields such as language testing, language teaching and translation. A multilingual corpus usually contains equal amounts of texts from a number of different languages, often in the same genre. A parallel corpus is a more carefully designed type of multilingual corpus, where the texts are exact equivalents (i.e. translations) of each other. Parallel corpora are often sentence-aligned (i.e. tags are added to the corpus data which act as markers to indicate which sentences are translations of each other).
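Sentence alignment of the kind described above can be sketched very simply: each sentence pair carries a shared identifier marking which sentences are translations of each other. The sentence pairs and the id scheme below are illustrative assumptions, not taken from any real parallel corpus, where alignment is usually stored as standoff or inline markup.

```python
# A minimal sketch of a sentence-aligned parallel corpus: shared ids act
# as the alignment markers. The sentences and id scheme are assumptions.
aligned = [
    {"id": "s1", "en": "The meeting begins at nine.",
                 "fr": "La réunion commence à neuf heures."},
    {"id": "s2", "en": "Coffee will be provided.",
                 "fr": "Du café sera servi."},
]

# Retrieve the French equivalent of an English sentence via its alignment id.
by_id = {pair["id"]: pair for pair in aligned}
print(by_id["s1"]["fr"])
```

Alignment is what allows a translator or language learner to look up an expression in one language and immediately see its attested equivalents in the other.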
Finally, a learner corpus is a corpus of a particular language produced by learners of that language. Learner corpora can be useful in allowing teachers to identify common errors at various stages of development, as well as showing over- and underuse of lexis or grammar when compared to an equivalent corpus of native-speaker language.
4. Corpus Software and Analysis
Some corpora come with their own analytical interfaces, while other software can be used in conjunction with a range of corpora. For example, LOB (1961) and FLOB (1991) are both corpora a million words in size containing 15 genres of writing, and they can be used in order to answer research questions regarding language change (Baker, 2010: 102-103).
A related form of frequency analysis involves calculating keywords. A keyword is a word which occurs statistically significantly more frequently in one corpus (or file) when compared against another. We can refer to our own knowledge in order to hypothesize explanations for our results, although hypotheses are not always validated upon closer investigation. According to Leech, between 1961 and 1991 both American and British users showed a trend towards a decrease in the use of modal verbs (have to, need to).
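Keyness is typically measured with a statistic such as Dunning's log-likelihood, which compares a word's observed frequency in each corpus with what would be expected if the word were evenly distributed. A minimal sketch, with illustrative (invented) frequency figures rather than real corpus counts:

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Keyness score (Dunning's log-likelihood) for a word observed
    freq1 times in a corpus of size1 tokens versus freq2 times in a
    corpus of size2 tokens."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# Illustrative figures only: a word occurring 120 times in one
# million-word corpus versus 40 times in another.
print(round(log_likelihood(120, 1_000_000, 40, 1_000_000), 2))
```

The higher the score, the stronger the evidence that the frequency difference is not due to chance; scores above roughly 3.84 are conventionally treated as significant at p < 0.05.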
A concordance is simply a list of all occurrences of a word or phrase, each with a few words of context, so we can see at a glance how the word tends to be used. The examination of concordances also helps to reveal discourse prosodies, which are often indicative of attitudes. A concordance analysis therefore combines aspects of quantitative and qualitative analyses. A statistical procedure which helps make the information more manageable is collocation, which refers to the statistically significant co-occurrence of words.
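A concordance of the kind just described (often called a KWIC, Key Word In Context, display) can be sketched as follows. The sample sentence and the four-word context window are assumptions for the example; concordancers such as those bundled with corpus packages offer sorting and much larger contexts.

```python
def concordance(tokens, node, context=4):
    """Key Word In Context: list each occurrence of `node` with a few
    words of surrounding context, aligned for easy scanning."""
    lines = []
    for i, word in enumerate(tokens):
        if word == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30}  [{node}]  {right}")
    return lines

# Illustrative text, not a real corpus extract.
text = ("the committee decided to tell the press a story about the plan "
        "and later had to tell parliament the same story").split()
for line in concordance(text, "story"):
    print(line)
```

Reading down the aligned column of contexts is what lets an analyst spot, qualitatively, the recurring patterns that the quantitative counts point to.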
5. Critical Considerations
Corpus linguistics is not able to answer every research question in the area of linguistics. A few criticisms of corpora are:
Corpora can be time-consuming, expensive, and difficult to build, requiring careful decisions to be made regarding sampling and representativeness.
Researchers who are not computer literate may initially find it off-putting to have to engage with analytical software or statistical tests.
Corpus analysis works best at identifying certain types of patterns (Baker, 2010: 109-110).
A corpus analysis may produce interesting findings about language, but, as with many other methodologies, it is a task for humans to provide explanations for those findings. However, these criticisms should not preclude corpus analysis (all methods have limitations), but should instead make users aware of its potential limitations. The strength of the corpus approach lies in using fast and accurate techniques to identify patterns that human analysts would not notice. Corpus analysis offers a high degree of reliability and validity to linguistic research.
Baker, Paul. 2010. Corpus Methods in Linguistics. In Litosseliti, Lia (ed.), Research Methods in Linguistics. New York: Continuum International Publishing Group.