I think that over the years Gary has gotten more than a little frustrated with those darn foreign sounds.
Why is it hard to model speech perception?
MAY 4, 2013.
"Let’s take a similarly simple problem from the field of linguistics. You take a person, sit them down in a nice anechoic chamber, plop some high quality earphones on them and play a word that could be “bite” and could be “bike” and ask them to tell you what they heard. What do you need to know to decide which way they’ll go? Well, assuming that your stimuli is actually 100% ambiguous (which is a little unlikely) there a ton of factors you’ll need to take into account. Like, how recently and often has the subject heard each of the words before? (Priming and frequency effects.) Are there any social factors which might affect their choice? (Maybe one of the participant’s friends has a severe overbite, so they just avoid the word “bite” all together.) Are they hungry? (If so, they’ll probably go for “bite” over “bike”.) And all of that assumes that they’re a native English speaker with no hearing loss or speech pathologies and that the person’s voice is the same as theirs in terms of dialect, because all of that’ll bias the listener as well.
The best part?
All of this is incredibly hard to measure. In a lot of ways, human language processing is a black box. We can’t mess with the system too much, and taking it apart to see how it works, in addition to being deeply unethical, breaks the system. The best we can do is tap a hammer lightly against the side and use the sounds of the echoes to guess what’s inside. And, no, brain imaging is not a magic bullet for this. It’s certainly a valuable tool that has led to a lot of insights, but in addition to being incredibly expensive (MRI is easily more than a grand per participant, and no one has ever accused linguistics of being a field that rolls around in money like a dog in fresh-cut grass) we really need to resist the urge to rely too heavily on brain imaging studies, as a certain dead salmon taught us.
But! Even though it is deeply difficult to model, there has been a lot of really good work done towards a theory of speech perception. I’m going to introduce you to some of the main players, including:
Motor theory
Acoustic/auditory theory
Double-weak theory
Episodic theories (including Exemplar theory!)
(Source: https://makingnoiseandhearingthings.com ... erception/)

Theories of Speech Perception
(Stanford handout)
"1. Theories of speech perception must be able to account for certain facts about the acoustic
speech signal, e.g.:
• There is inter-speaker and intra-speaker variability among signals that convey information about equivalent phonetic events.
• The acoustic speech signal is continuous even though it is perceived as and represents a series of discrete units.
• Speech signals contain cues that are transmitted very quickly (20 to 25 sounds per second) and simultaneously.
They must also be able to account for various perceptual phenomena, e.g.:
• categorical perception
• phonemic restoration
• episodic memory
plus, various word recognition effects (e.g., frequency effects, priming, etc.)
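To make “categorical perception” a little more concrete: listeners receive a continuous acoustic cue (such as voice onset time, VOT), yet their identifications jump sharply between categories rather than shifting gradually. A minimal sketch of an identification function with this shape — the logistic form, boundary, and steepness values here are illustrative assumptions, not empirical fits:

```python
import math

def p_voiceless(vot_ms, boundary=30.0, steepness=0.5):
    """Probability of identifying a stop as voiceless /t/ (vs voiced /d/)
    as VOT increases. Boundary (ms) and steepness are made-up values."""
    return 1.0 / (1.0 + math.exp(-steepness * (vot_ms - boundary)))

# The cue varies continuously, but identification snaps near the boundary:
# well below it, listeners almost always report /d/; well above, /t/.
for vot in (10, 25, 30, 35, 50):
    print(vot, round(p_voiceless(vot), 2))
```

The steep slope around the boundary is the signature of categorical perception: small acoustic differences within a category are hard to hear, while equally small differences that straddle the boundary are easy.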
2. Theories of speech perception differ with respect to their views of what is perceived and how.
Auditory theories
Auditory Model (Fant, 1960; also Stevens & Blumstein, 1978)
• The assumption of this model is that invariance can always be found in the speech signal by means of extraction into distinctive features. Listeners, through experience with language, are sensitive to the distinctive patterns of the speech wave...
Motor theories
4. Motor Theory (Liberman, et al., 1967; Liberman & Mattingly, 1985)
• Given the lack of acoustic invariance, we can look for invariance in the articulatory domain (i.e., maybe the representational units are defined in articulatory terms).
Motor theory postulates that speech is perceived by reference to how it is produced; that is, when perceiving speech, listeners access their own knowledge of how phonemes are articulated. Articulatory gestures such as rounding or pressing the lips together are units of perception that directly provide the listener with phonetic information. Biological specialization for phonetic gestures prevents listeners from hearing the signal as ordinary sound, but enables them to use the systematic, special relation between signal and gesture to perceive the gestures.
• Originally, the objects of perception were taken to be the motor commands that control articulation; the revised theory (Liberman & Mattingly, 1985) identifies them instead with the speaker’s intended articulatory gestures.
Exemplar Models – Non-analytic approaches (e.g., Johnson, 1997; Goldinger, 1997; Pierrehumbert, 2002)
• In most models of speech perception, the objects of perception (or the representational units) are highly abstract. In fact, information about specific instances of a particular word is abstracted away from and ‘discarded’ in the process of speech perception. So information about a particular speaker or speech style or environmental context can play no role in the representation of words in memory.
Exemplar models postulate that information about particular instances (episodic information) is stored.
Mental representations do not have to be highly abstract.
They do not necessarily lack redundancy.
Categorization of an input is accomplished by comparison with all remembered instances of each category (rather than by comparison with an abstract, prototypical representation).
• Often, exemplars are modeled as categorizations of words, but they might also be categorizations of segments or syllables or whatever.
Stored exemplars are activated to a greater or lesser extent according to their degree of similarity to an incoming stimulus; activation levels determine categorization..."
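As a rough illustration, the similarity-driven activation described above can be sketched in code. The feature values, category labels, and exponential similarity function (in the style of Nosofsky’s Generalized Context Model) are illustrative assumptions, not part of the handout — real exemplar models store far richer, episode-specific detail (speaker, style, context):

```python
import math

# Hypothetical stored exemplars: (feature vector, category label).
# Features are made-up stand-ins for acoustic dimensions.
EXEMPLARS = [
    ((2.0, 9.0), "bite"),
    ((2.2, 8.8), "bite"),
    ((1.9, 9.1), "bite"),
    ((5.0, 3.0), "bike"),
    ((5.1, 3.2), "bike"),
]

def similarity(x, exemplar, c=1.0):
    """Similarity decays exponentially with distance from the stimulus."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, exemplar)))
    return math.exp(-c * dist)

def categorize(stimulus, exemplars=EXEMPLARS):
    """Activate every stored exemplar by similarity to the stimulus,
    sum activation per category, and choose by relative activation."""
    activation = {}
    for features, label in exemplars:
        activation[label] = activation.get(label, 0.0) + similarity(stimulus, features)
    total = sum(activation.values())
    probs = {label: a / total for label, a in activation.items()}  # Luce choice rule
    return max(probs, key=probs.get), probs
```

Note that no abstract prototype is ever computed: the stimulus is compared against every remembered instance, so frequent categories (more stored exemplars) naturally accrue more activation, which is one way exemplar models capture frequency effects.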