Type-Token Ratio
Type-token ratio of written language
Take a look at Text 1 below, which is an extract from something I wrote a while ago.
Text 1: Written Language
But what are thoughts? Well, we all have them. They are variously described as ideas, notions, concepts, impressions, perceptions, views, beliefs, opinions, values, and so on. At times they are brief, coming and going in an instant. On other occasions they seem to endure and we can mull them over again and again in the act we call thinking. We can put them aside, fall asleep, and then return to them later. We refer to them as things we can handle. However, this is just a metaphor.
If we count the number of words we get a total of 87. The number of words in a text is often referred to as the number of tokens. However, several of these tokens are repeated. For example, the token again occurs two times, the token are occurs three times, and the token and occurs five times. The following table lists each distinct word in Text 1, together with its frequency of occurrence.
rank | word | freq | rank | word | freq
1 | we | 6 | 32 | ideas | 1
2 | and | 5 | 33 | impressions | 1
3 | them | 5 | 34 | instant | 1
4 | are | 3 | 35 | is | 1
5 | can | 3 | 36 | just | 1
6 | they | 3 | 37 | later | 1
7 | to | 3 | 38 | metaphor | 1
8 | again | 2 | 39 | mull | 1
9 | as | 2 | 40 | notions | 1
10 | in | 2 | 41 | occasions | 1
11 | on | 2 | 42 | opinions | 1
12 | a | 1 | 43 | other | 1
13 | act | 1 | 44 | over | 1
14 | all | 1 | 45 | perceptions | 1
15 | an | 1 | 46 | put | 1
16 | aside | 1 | 47 | refer | 1
17 | asleep | 1 | 48 | return | 1
18 | at | 1 | 49 | seem | 1
19 | beliefs | 1 | 50 | so | 1
20 | brief | 1 | 51 | the | 1
21 | but | 1 | 52 | then | 1
22 | call | 1 | 53 | things | 1
23 | coming | 1 | 54 | thinking | 1
24 | concepts | 1 | 55 | this | 1
25 | described | 1 | 56 | thoughts | 1
26 | endure | 1 | 57 | times | 1
27 | fall | 1 | 58 | values | 1
28 | going | 1 | 59 | variously | 1
29 | handle | 1 | 60 | views | 1
30 | have | 1 | 61 | well | 1
31 | however | 1 | 62 | what | 1
TOTAL | | | | | 87
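The count can be reproduced with a few lines of Python. What follows is a minimal sketch rather than the method used to build the table: it assumes that a token is any run of letters and that case is ignored (so We and we count as the same word). The names TEXT_1 and tokenize are illustrative only.

```python
import re
from collections import Counter

TEXT_1 = (
    "But what are thoughts? Well, we all have them. They are variously "
    "described as ideas, notions, concepts, impressions, perceptions, views, "
    "beliefs, opinions, values, and so on. At times they are brief, coming "
    "and going in an instant. On other occasions they seem to endure and we "
    "can mull them over again and again in the act we call thinking. We can "
    "put them aside, fall asleep, and then return to them later. We refer to "
    "them as things we can handle. However, this is just a metaphor."
)


def tokenize(text):
    """Lower-case the text and return its words (runs of letters) in order."""
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize(TEXT_1)
freq = Counter(tokens)  # maps each distinct word to its number of occurrences

print(len(tokens))  # number of tokens
print(len(freq))    # number of distinct words

# Most frequent first, ties broken alphabetically, as in the table.
ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
```

Under these assumptions the script reports 87 tokens and 62 distinct words, in agreement with the table.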
We see, then, that the 87 tokens in this text comprise 62 distinct words, or so-called types. The relationship between the number of types and the number of tokens is known as the type-token ratio (TTR). For Text 1 above we can now calculate this as follows:
Type-Token Ratio = (number of types / number of tokens) * 100
= (62 / 87) * 100 = 71.3%
The more types there are in comparison to the number of tokens, the more varied the vocabulary is, i.e. there is greater lexical variety.
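The calculation can be sketched as a small Python function. This is an illustrative helper, not a standard library routine: the name type_token_ratio is an assumption, and it expects a list of tokens that has already been normalised (for example, lower-cased).

```python
def type_token_ratio(tokens):
    """Return the type-token ratio of a token list as a percentage."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    types = set(tokens)  # each distinct token counts once
    return len(types) / len(tokens) * 100


# Text 1 has 62 types among 87 tokens, so the same arithmetic gives:
print(round(62 / 87 * 100, 1))  # 71.3

# A toy example: "again and again" has 3 tokens but only 2 types.
print(round(type_token_ratio(["again", "and", "again"]), 1))  # 66.7
```

A text in which no word is repeated would score 100%, and the score falls as more tokens are repetitions of words already used.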
Type-token ratio of speech
Now take a look at Text 2. This is an extract from a transcribed conversation between two people, P and A.
Text 2: Speech
01 P: so: (.) er: (..) as you were saying about er:: (.)
02    where are you living now Andrew
03 A: Skipton Lodge
04 P: Skipton Lodge?
05 A: mm (…) Skipton Lodge
06 P: yeah (.) do you like it
07 A: yeah I do
08 P: yeah
09 A: I’ve settled in
10 P: you have (…) good (.) w w what are the things you
11    like about it
12 A: go out in the tow:n
13 P: you go out in the town (…)
14 A: yeah
15    (2.1)
16    with er: Tommy and Martin (.) and er:: (.) Noel
17 P: and?
18 A: NOEL
19 P: oh yes (.) oh he lives there does he?
20 A: yeah he live(s) in the flats
21 P: yeah (.) oh they have flats there do they
22 A: mm
23    (3.3)
24    and er::
25    (2.3)
26    and I went to see (..) (Elaine)
As before, I have set out the tokens and their frequency of occurrence in the table below (I have ignored pauses such as (…) and (2.1), the repetition of initial /w/ in line 10, and the inserts er, oh and mm).
rank | word | freq | rank | word | freq
1 | yeah | 6 | 24 | Andrew | 1
2 | you | 6 | 25 | as | 1
3 | and | 5 | 26 | does | 1
4 | in | 4 | 27 | Elaine | 1
5 | the | 4 | 28 | good | 1
6 | do | 3 | 29 | living | 1
7 | he | 3 | 30 | Martin | 1
8 | I | 3 | 31 | now | 1
9 | Lodge | 3 | 32 | saying | 1
10 | Skipton | 3 | 33 | see | 1
11 | about | 2 | 34 | settled | 1
12 | are | 2 | 35 | so | 1
13 | flats | 2 | 36 | things | 1
14 | go | 2 | 37 | to | 1
15 | have | 2 | 38 | Tommy | 1
16 | it | 2 | 39 | ’ve | 1
17 | like | 2 | 40 | went | 1
18 | lives | 2 | 41 | were | 1
19 | Noel | 2 | 42 | what | 1
20 | out | 2 | 43 | where | 1
21 | there | 2 | 44 | with | 1
22 | they | 2 | 45 | yes | 1
23 | town | 2 | | |
TOTAL | | | | | 88
We can now calculate the type-token ratio as before:
Type-Token Ratio = (number of types / number of tokens) * 100
= (45 / 88) * 100 = 51.1%
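The clean-up applied to Text 2 before counting can also be sketched in Python. This is an illustrative sketch under stated assumptions, not the procedure actually used to build the table: pauses are taken to be parenthesised dots or timings, colons are taken to mark lengthened sounds, the brackets in live(s) and (Elaine) are dropped, case is ignored, and the clitic ’ve is split off as its own token. The names TRANSCRIPT, IGNORED and transcript_tokens are hypothetical.

```python
import re

TRANSCRIPT = """\
01 P: so: (.) er: (..) as you were saying about er:: (.)
02    where are you living now Andrew
03 A: Skipton Lodge
04 P: Skipton Lodge?
05 A: mm (…) Skipton Lodge
06 P: yeah (.) do you like it
07 A: yeah I do
08 P: yeah
09 A: I’ve settled in
10 P: you have (…) good (.) w w what are the things you
11    like about it
12 A: go out in the tow:n
13 P: you go out in the town (…)
14 A: yeah
15    (2.1)
16    with er: Tommy and Martin (.) and er:: (.) Noel
17 P: and?
18 A: NOEL
19 P: oh yes (.) oh he lives there does he?
20 A: yeah he live(s) in the flats
21 P: yeah (.) oh they have flats there do they
22 A: mm
23    (3.3)
24    and er::
25    (2.3)
26    and I went to see (..) (Elaine)
"""

# Inserts and the repeated initial "w" of line 10, which are not counted.
IGNORED = {"er", "oh", "mm", "w"}


def transcript_tokens(transcript):
    """Return the countable, lower-cased tokens of a transcript."""
    tokens = []
    for line in transcript.splitlines():
        # Drop the line number and, if present, the speaker label.
        line = re.sub(r"^\d+\s+(?:[A-Z]:)?", "", line)
        # Drop pauses such as (.), (..), (…) and timed pauses such as (2.1).
        line = re.sub(r"\((?:[.…]+|\d+\.\d+)\)", " ", line)
        # Drop remaining brackets and lengthening colons, keeping the letters.
        line = re.sub(r"[():]", "", line)
        # Words are runs of letters; a clitic such as ’ve is its own token.
        words = re.findall(r"[a-z]+|[’'][a-z]+", line.lower())
        tokens.extend(word for word in words if word not in IGNORED)
    return tokens


tokens = transcript_tokens(TRANSCRIPT)
types = set(tokens)
print(len(tokens), len(types))                  # 88 45
print(round(len(types) / len(tokens) * 100, 1))  # 51.1
```

Different transcription conventions would call for different patterns, so the regular expressions here should be read as specific to this extract.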
Interpretation
You will see that the number of tokens in each of the texts is almost the same (87 in Text 1 and 88 in Text 2). However, the type-token ratios are different: 71% for the written text (Text 1) and just 51% for the spoken text (Text 2). We can say, therefore, that the vocabulary is less varied in the spoken text than in the written text. Or, to put it another way, the written text shows greater lexical variety. A high TTR indicates a large amount of lexical variation and a low TTR indicates relatively little lexical variation. This finding, that the type-token ratio of speech is less than that of written language, is typical.
A major difference between speech and written language is that speech in conversation is produced in real time. There is limited time to think about, and plan, what one wishes to say. Consequently, speakers tend to select words from a relatively restricted vocabulary. In contrast, an author of a written text has much more time to plan and select just the right vocabulary items that best communicate his or her meaning.
As with lexical density, the type-token ratio can also be used to monitor changes in vocabulary use, for example in children with under-developed vocabulary and/or word-finding difficulties, and in adults who have suffered a stroke and consequently exhibit word-retrieval and naming difficulties.
Reference
A few people have contacted me to enquire about a reference for TTR in order to include it in a report, a written assignment, or similar. Unfortunately, there is no reference for TTR as such. It is a well-known measure of lexical variation which is used in many linguistic analyses. If you search the internet for ‘type token ratio’ you will find several of these. I do not know who was the first person to use a measure of TTR in a study but (rather like lexical density) it is now well-known and, as it is in the public domain, no one really references its use anymore in articles, reports, and so on.
However, the book that I often refer to for definitions is:
- Biber, D., Conrad, S. and Leech, G. (2002) The Longman Student Grammar of Spoken and Written English. Harlow: Longman. [ISBN: 0 582 237262]. I have found this to be a useful reference text, as it is a corpus-based reference work, i.e. the findings are based on an analysis of real-world written and spoken texts.