What is the phonological problem?

The so-called phonological problem is related to linguistic processing and the question of how spoken utterances are understood. Specifically, it is the problem of knowing which particular units (words) are being uttered.

Patterns of sound

The speech signal is a pattern of sound, and sound consists of patterns of minute vibrations in the air. Sounds vary in their frequency distribution.

Harmonic sounds

The sound of a musical instrument playing is relatively harmonic. This means that the energy of the sound is concentrated at certain frequencies of vibration. A plot showing how much energy a sound has at each frequency, usually traced over time, is called a spectrogram. A spectrogram for a descant recorder’s note is shown in Figure 1.


Figure 1. Spectrogram of a single note from a descant recorder. Note that the greatest concentration of energy is in the lowest band (the fundamental frequency), with regular and increasingly faint harmonics at higher frequencies. The harmonics have clear gaps between them, which creates the feeling of a pure and tuneful note.

As you can see, there are slim, dark (almost black) bands with grey spaces in between. The dark bands are the regions of the frequency spectrum where the acoustic energy is concentrated, whereas in the grey areas there is little or no acoustic energy. The lowest dark band corresponds to the fundamental frequency of the sound. This is where the most energy is concentrated, and it is the fundamental frequency which gives the sensation of the pitch of the sound. The higher bands are called the formant frequencies. In a ‘pure’ tone, their frequencies are whole-number multiples of the fundamental (in acoustics in general, they are also called overtones or harmonics, but in relation to speech, they are always called formants). The relative strengths of the different formants determine the timbre or texture of the sound.
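The relationship between the fundamental and the higher bands can be checked numerically. The following is a minimal sketch, not a model of a real recorder: it builds a tone from a hypothetical 500 Hz fundamental plus two fainter components at whole-number multiples of it, then uses a plain discrete Fourier transform to find where the energy sits. The sample rate, the frequencies and the amplitudes are all assumptions chosen to keep the arithmetic tidy.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each discrete Fourier transform bin from 0 to N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        mags.append(abs(total) / n)
    return mags

SAMPLE_RATE = 8000   # samples per second (assumed)
N = 400              # 400 samples, so each bin is 8000 / 400 = 20 Hz wide

# Hypothetical note: frequency in Hz -> amplitude. The fundamental is strongest.
PARTIALS = {500: 1.0, 1000: 0.5, 1500: 0.25}

tone = [sum(amp * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
            for freq, amp in PARTIALS.items())
        for t in range(N)]

mags = dft_magnitudes(tone)
bin_width = SAMPLE_RATE / N

# Keep only the bins that hold a noticeable amount of energy.
peaks = [round(k * bin_width) for k, m in enumerate(mags) if m > 0.05]
print(peaks)
```

The only bins with appreciable energy are the fundamental and its multiples, with nothing in between, which is the numerical counterpart of the dark bands and grey gaps in Figure 1.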


In contrast to harmonic sounds, there are sounds in which the acoustic energy is dispersed across the frequency spectrum, like the sound of the box being dropped in Figure 2. These are experienced as noises rather than tones (it is impossible to hum them), and they appear on the spectrogram as a smear of color.


Figure 2. Spectrogram of a box of rice being dropped on a table. Note that, in contrast to the harmonic sound of the recorder in Figure 1, the acoustic energy is smeared across the whole frequency range, making it sound like a noise rather than a note.
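One way to put a number on the difference between a note and a noise is spectral flatness: the geometric mean of the power spectrum divided by its arithmetic mean. This measure is not used in the original text; it is brought in here purely as an illustration. The sketch below compares a synthetic two-component tone with random noise. The signals, the sample rate and the small floor value that avoids taking the logarithm of zero are all assumptions.

```python
import cmath
import math
import random

def power_spectrum(samples):
    """Power in DFT bins 1 .. N/2 - 1 (the constant and top bins are left out)."""
    n = len(samples)
    powers = []
    for k in range(1, n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        powers.append(abs(total / n) ** 2)
    return powers

def spectral_flatness(samples, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0 to 1)."""
    powers = [p + floor for p in power_spectrum(samples)]
    geometric = math.exp(sum(math.log(p) for p in powers) / len(powers))
    arithmetic = sum(powers) / len(powers)
    return geometric / arithmetic

N = 256
SAMPLE_RATE = 8000   # assumed; gives bins 31.25 Hz wide
random.seed(0)       # fixed seed so the noise is repeatable

# Harmonic stand-in for a musical note: energy at 500 Hz and 1000 Hz only.
note = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE)
        + 0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(N)]

# Noisy stand-in for the dropped box: random values with no periodic structure.
noise = [random.gauss(0.0, 1.0) for _ in range(N)]

note_flatness = spectral_flatness(note)
noise_flatness = spectral_flatness(noise)
print(note_flatness, noise_flatness)
```

The tone scores very close to zero because nearly all of its energy sits in two bins, while the noise scores far higher because its energy is spread across the whole range.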

Managing the phonological problem

The first stage in solving the phonological problem is to work out which phonemes are being uttered. Phonemes come in two major classes:

  1. vowels
  2. consonants


Vowels are harmonic sounds. They are produced by periodic vibration of the vocal folds, which in turn causes the vibration of air molecules. The frequency of this vibration determines the pitch (fundamental frequency) of the sound. Vowels differ from one another in quality, or timbre. The different qualities are made by changing the shape of the resonating space in front of the vocal folds, which is done by moving the lips and tongue relative to the teeth and palate. This produces different spectrogram shapes, as shown in Figure 3.


Figure 3. The spectrogram shapes of five vowels, spoken by the author. The vowels correspond to the vowel sounds in beat, boot, Bart, bite and bait. What distinguishes the different vowels is not the absolute frequency of the formants but their position relative to each other.


Consonants, in contrast to vowels, are not generally harmonic sounds. Vowels are made by the vibration of the vocal folds resonated through the throat and mouth with the mouth at least partly open. Consonants, by contrast, are the various scrapes, clicks and bangs made by closing some part of the throat, tongue or lips for a moment.

The acoustics of the consonants are rather varied. Some consonants produce a burst of broad-spectrum noise – look, for instance, at the /sh/ in Figure 4.


Figure 4. Spectrogram of the author saying /sh/.

Other consonants have relatively little acoustic energy of their own, and are most detectable by the way they affect the onset of the formants of the following vowel.


Phonemes are grouped into small units called syllables. Typically a syllable will be one consonant followed by one vowel, like me, you or we. Sometimes, though, the syllable will contain more consonants, as in them or string. Different languages allow different syllable shapes, from Hawai’ian, which only tolerates alternating consonants and vowels (which we can represent as CVCVCV), to languages like Polish, which seem to us to have heavy clusters of consonant sounds. English is somewhere in the middle, allowing things like bra and sta but not other combinations like tsa or pta.
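The idea of a permitted syllable shape can be sketched as a simple check. The code below is a rough illustration under several assumptions: each letter stands in for one phoneme (which is not true of English spelling in general), the list of permitted two- and three-consonant openings is a tiny hypothetical sample rather than the real inventory of English, and any single consonant is allowed to open a syllable.

```python
import re

VOWELS = {"a", "e", "i", "o", "u"}

# A tiny, hypothetical sample of consonant clusters that may open an
# English syllable. The real list is much longer.
ALLOWED_CLUSTERS = {"br", "st", "str", "pl", "tr"}

def cv_shape(sounds):
    """Describe a string of sounds as a pattern of C (consonant) and V (vowel)."""
    return "".join("V" if s in VOWELS else "C" for s in sounds)

def opening_consonants(sounds):
    """Return the consonants that come before the first vowel."""
    opening = []
    for s in sounds:
        if s in VOWELS:
            break
        opening.append(s)
    return "".join(opening)

def strictly_alternating(sounds):
    """True if the shape is CV repeated, as described for Hawai'ian."""
    return re.fullmatch(r"(CV)+", cv_shape(sounds)) is not None

def possible_english_opening(sounds):
    """True if the syllable starts with at most one consonant or a listed cluster."""
    opening = opening_consonants(sounds)
    return len(opening) <= 1 or opening in ALLOWED_CLUSTERS

for syllable in ["bra", "sta", "tsa", "pta"]:
    print(syllable, cv_shape(syllable), possible_english_opening(syllable))
```

Under these assumptions bra and sta pass while tsa and pta are rejected, even though all four have the same CCV shape. The shape alone does not decide the matter; the language also restricts which consonants may sit next to each other.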

Identifying phonemes in real speech

The task of identifying phonemes in real speech is made difficult by two factors.


The first is the problem of variation. Phonemes might seem categorically different to us, but that is the product of our brain’s activity, not the actual acoustic situation. Vowels differ from each other only by degree, and this is also true for many consonants. A continuum can be set up between a clear /ba/ and a clear /da/. Experiments have been conducted (Fitch et al., 1997) in which a computer-generated sound is varied continuously, in a linear fashion, between /ba/ and /da/. Although the signal changes linearly, what the hearer hears is a non-linear change: /ba/ until a certain point, after which the sound quite suddenly becomes /da/. What the listener does not experience is a sound with some /ba/ properties and some /da/ properties. It is heard as either one or the other, an effect known as categorical perception.
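The contrast between a linear change in the signal and a sudden change in what is heard can be pictured with a steep S-shaped curve. The sketch below is only an illustration: the eleven-step continuum, the position of the boundary and the steepness are hypothetical values, not figures taken from Fitch et al. (1997), and the logistic function is simply a convenient shape for the purpose.

```python
import math

def proportion_heard_as_da(step, boundary=5.5, steepness=4.0):
    """Hypothetical share of listeners reporting /da/ at a given continuum step."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - boundary)))

# The signal moves in equal steps from a clear /ba/ (0) to a clear /da/ (10).
steps = list(range(11))
responses = [proportion_heard_as_da(s) for s in steps]
labels = ["da" if r > 0.5 else "ba" for r in responses]

for s, r, label in zip(steps, responses, labels):
    print(s, round(r, 3), label)
```

Each step changes the signal by the same amount, yet almost all of the change in the response happens between steps 5 and 6. On either side of that boundary the steps are heard as the same phoneme.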

The hearer cannot rely on the absolute frequency of formants to identify vowels, since you can pronounce any vowel at any pitch, and different speakers have different depths of voices. So a transformation must be performed to extract the position of (at least) the first two formants relative to the fundamental frequency. But that is not all. Different dialects have slightly different typical positions of formants of each vowel; within dialects, different speakers have slightly different typical positions; and worst of all, within speakers, the relative positions of the formants change a little from utterance to utterance of the same underlying vowel. The hearer thus needs to make a best guess from very messy data.
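The transformation and the best guess described above can be sketched in a few lines. Everything numerical here is made up for illustration: the prototype ratios are hypothetical, not measurements, and the choice of logarithmic ratios with a nearest-prototype rule is one simple assumption about how such a guess could be made, not a claim about how the brain does it.

```python
import math

# Hypothetical prototypes: (first formant / fundamental, second formant / fundamental).
# These numbers are invented for the sketch and are not measured values.
PROTOTYPES = {
    "beat": (2.5, 19.0),
    "boot": (2.7, 7.5),
    "Bart": (6.0, 9.0),
}

def relative_position(f0, f1, f2):
    """Express the first two formants relative to the fundamental (log ratios)."""
    return (math.log(f1 / f0), math.log(f2 / f0))

def best_guess(f0, f1, f2):
    """Return the prototype vowel nearest to the token in relative-position space."""
    x, y = relative_position(f0, f1, f2)

    def distance(vowel):
        r1, r2 = PROTOTYPES[vowel]
        return math.hypot(x - math.log(r1), y - math.log(r2))

    return min(PROTOTYPES, key=distance)

# Two hypothetical speakers saying the same vowel at different pitches:
# every absolute frequency differs, but the relative positions are the same.
print(best_guess(100, 260, 1850))
print(best_guess(200, 520, 3700))
# A slightly "messy" token that matches no prototype exactly.
print(best_guess(120, 700, 1100))
```

The first two tokens share no absolute frequency, yet they land on the same vowel because only relative positions are compared. The third token matches no prototype exactly and is assigned to the nearest one, which is the sense in which the hearer is making a best guess.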


This is made more difficult by the second factor, which is called co-articulation. The realization of a phoneme depends on the phonemes next to it. The /b/ of bat is not quite the same, acoustically, as the /b/ of bit. We are so good at hearing phonemes as phonemes that it is difficult to consciously perceive that this is so, except by taking an extreme example, as in the exercise below.


Listen closely to the phoneme /n/ in your own pronunciation of the word ten, in the following three contexts – ten newts, ten kings, ten men. Say the words repeatedly but naturally to identify the precise qualities of the /n/ in each case. Are they the same? If not, what has happened to them?


You will probably find that the articulation of the /n/ is ‘dragged around’ by the following consonant – towards the /ng/ of long in ten kings, and towards /m/ in ten men. If this is not clear, try saying ten men tem men over and over again (or alternatively ten kings teng kings). You soon realize that there is no acoustic difference whatever between the two phrases. This is an example of assimilation, a phenomenon closely related to co-articulation.

Co-articulation makes the task of the hearer even harder, because they have to undo the co-articulation that the speaker has put in (unavoidably, since co-articulation is unstoppable in fast connected speech). A sound which is identical to an /m/ which the listener has previously heard might actually be an /m/, but it might equally be an /n/ with co-articulation induced by the context. An upward curving onset to a vowel might signal a /d/ if the vowel is /e/, but a /b/ if the vowel is /a/. Yet the listener does not yet know what the vowel is. They are having to identify the signal under multiple and simultaneous uncertainties. Listeners use their expectations about what is to follow to resolve these uncertainties.
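The ambiguity created by the ten men example can be made concrete with a small rule. The sketch below is a deliberate simplification: it covers only the three nasal sounds and a handful of following consonants, it treats the change of /n/ as all-or-nothing, and the function names are hypothetical. It shows the speaker's side (what /n/ turns into) and the hearer's side (which underlying phonemes could lie behind what was heard).

```python
# Where in the mouth each consonant is made (a simplified, partial table).
PLACE = {
    "m": "lips", "p": "lips", "b": "lips",
    "n": "ridge", "t": "ridge", "d": "ridge",
    "ng": "back", "k": "back", "g": "back",
}

# The nasal sound made at each place.
NASAL_AT = {"lips": "m", "ridge": "n", "back": "ng"}

def spoken_form_of_n(next_consonant):
    """How an underlying /n/ comes out before the given consonant."""
    return NASAL_AT[PLACE[next_consonant]]

def possible_sources(heard_nasal, next_consonant):
    """Underlying nasals that could have produced what was heard in this context."""
    sources = {heard_nasal}
    if spoken_form_of_n(next_consonant) == heard_nasal:
        sources.add("n")   # it may be an /n/ dragged towards the next sound
    return sorted(sources)

print(spoken_form_of_n("k"))         # ten kings
print(spoken_form_of_n("m"))         # ten men
print(possible_sources("m", "m"))    # heard before /m/: ambiguous
print(possible_sources("m", "t"))    # heard before /t/: unambiguous
```

An [m] heard before /t/ can only be an /m/, but an [m] heard before another /m/ has two possible sources. The hearer has to settle that uncertainty from context.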


Other problems arise in the extraction of words from sound. Consider Figure 5, which is a spectrogram of the author saying the phrase My dad’s tutor. Our brains so effortlessly segment speech into words that we are tempted to assume that the breaks between the words are ‘out there’ in the signal. But as you can see, there are no such gaps in speech. There are moments of low acoustic intensity (white vertical areas), but they do not necessarily coincide with word boundaries. Co-articulation of phonemes can cross word boundaries.


Figure 5. Spectrogram of the phrase My dad’s tutor, spoken by the author.

Strings of phonemes have multiple possible segmentations. My dad’s tutor could be segmented, among many other possibilities, as:

  • (7a) [my] [dads] [tu] [tor]
  • (7b) [mide] [ad] [stewt] [er]
  • (7c) [mida] [dstu] [tor]

The signal rarely contains the key explicitly. The hearer can exploit knowledge of English phonological rules, for example to exclude (7c) on the grounds that it contains an impossible English syllable. Beyond that, knowledge of words must come into play. Segmentation (7b) is phonotactically fine, but doesn’t mean anything in English. Our segmentations always alight on those solutions that furnish a string of real words, so (7a) would be chosen. If your name was (7b), you would have extreme trouble introducing yourself, however clearly you spoke, whereas any segmentation containing whole words is much easier to understand.
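The role of word knowledge in segmentation can be sketched as a search over every possible split of an unbroken string, keeping only the splits made entirely of known words. This is a minimal illustration under clear assumptions: ordinary spelling stands in for the phoneme string, the small word list is hypothetical, and nothing here is meant to describe how listeners actually carry out the search.

```python
def segmentations(sounds, lexicon):
    """Return every way of splitting `sounds` into a sequence of known words."""
    if not sounds:
        return [[]]          # one way to split nothing: no words at all
    results = []
    for end in range(1, len(sounds) + 1):
        head = sounds[:end]
        if head in lexicon:
            # Keep this word only if the remainder can also be fully segmented.
            for rest in segmentations(sounds[end:], lexicon):
                results.append([head] + rest)
    return results

# A hypothetical miniature vocabulary.
LEXICON = {"my", "dad", "dads", "tutor", "to", "a"}

# The phrase with the word breaks removed, as in the speech signal.
print(segmentations("mydadstutor", LEXICON))
```

With this word list, only the split into my, dads and tutor survives. Taking dad as the second word leaves stutor, which cannot be completed with known words, so that path is dropped.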


Fitch, R.H., Miller, S. and Tallal, P. (1997) ‘Neurobiology of speech perception’, Annual Review of Neuroscience, 20, 331–353.


[Information last accessed: 27 July 2017]

This article is adapted from ‘From sound to meaning: hearing, speech and language’. An OpenLearn (http://www.open.edu/openlearn/) chunk reworked by permission of The Open University copyright © 2016 – made available under the terms of the Creative Commons Licence v4.0 http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_GB. As such, it is also made available under the same licence agreement.