Impact of perceptual-learning strategies and background noise on disordered speech intelligibility

Paul M Evitts; Connie K Porcaro; Tom Gollery; Paul M Evitts; Connie K Porcaro; Tom Gollery

ISSN: 2455-1759

Archives of Otolaryngology and Rhinology

Review Article Open Access Peer-Reviewed

Impact of perceptual-learning strategies and background noise on disordered speech intelligibility

Paul M Evitts^1*, Connie K Porcaro² and Tom Gollery³

Author and article information

¹Communication Sciences and Disorders, Penn State University-Harrisburg, USA
²Department of Communication Sciences and Disorders, Valdosta State University, 1500 N. Patterson St. Valdosta, GA 31698, Georgia
³Department of Education, Southeastern University, 1000 Longfellow Blvd, Lakeland FL 33801, UK

*Corresponding authors: Paul M Evitts, Ph.D., CCC-SLP, Program Chair, Communication Sciences and Disorders, Penn State University-Harrisburg, USA, E-mail: [email protected]

ORCiD : https://orcid.org/0000-0001-5973-5317

doi : 10.17352/2455-1759.000153

Received: 06 March, 2024 | Accepted: 29 March, 2024 | Published: 30 March, 2024

Keywords: Speech perception; Dysphonia; Speech intelligibility; Background noise; Voice disorders

Cite this as

Evitts PM, Porcaro CK, Gollery T. (2024) Impact of perceptual-learning strategies and background noise on disordered speech intelligibility. Arch Otolaryngol Rhinol 10(1): 004-015. DOI: 10.17352/2455-1759.000153

Copyright

© 2024 Evitts PM, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Objective: There is a plethora of research showing reduced speech intelligibility for a variety of voice disorders (i.e., dysphonia, alaryngeal). Therapeutic approaches to improve intelligibility typically involve targeting the speaker (e.g., clear speech, reduced rate) with minimal attention to the listener. Therefore, there were three purposes of this study: 1) to determine the impact of background noise on the speech intelligibility of disordered speakers; 2) to determine the impact of providing listeners with perceptual-learning strategies on the speech intelligibility of speakers with a voice disorder; and 3) to determine if subjective ratings of voice quality can predict speech intelligibility.

Methods: Sentences were recorded from 12 speakers (2 typical, 3 alaryngeal, 7 dysphonic). Sentences were divided into one of three groups of signal-to-noise ratios (SNR: quiet, +5 dB SNR, and 0 dB SNR) and individually presented to 129 healthy listeners divided into one of three groups (i.e., control, acknowledgment of disorder, cognitive-perceptual strategies). Orthographic transcription was used to assess speech intelligibility. In addition, three expert listeners provided subjective voice quality ratings of all speakers.

Results: Listeners had significantly more intelligibility errors with increased background noise (p <.001) and providing strategies to listeners did not result in a statistical improvement level F(6, 486) = 1.53, p = .17, η²p = 0.02. Regression analysis showed that the subjective voice quality overall severity was able to predict speech intelligibility in the noisy condition (0 dB SNR), accounting for 37% of the variance, R² = .365, F(1,10) = 5.759, p = .037.

Conclusion: Results suggest that increased background noise has a deleterious effect on the speech intelligibility of those with a voice disorder but that providing listeners with strategies in hopes of improving speaker intelligibility was not successful. Results did provide support, however, for the use of subjective voice quality ratings as a potential index of speech intelligibility.

Main article text

Introduction

It has been said that “simply to exist as a normal human being requires interaction with other people” [1]. It then stands to reason that people with a communication disorder, including voice and speech disorders, may potentially be at risk for disrupted social interactions due to the nature of their impairment. There are three components inherent to this communicative interaction: the speaker, the listener, and the listening environment. Decades of research have shown that each plays a pivotal role in the successful verbal exchange of ideas. One group particularly vulnerable to impaired communication are those speakers with either a speech or voice disorder [2]. The two main outcome measures typically utilized in the research to determine a speaker’s ability to transmit ideas successfully to the listener are speech intelligibility and listener comprehension. Speech intelligibility has been defined as the ‘amount of speech understood from the signal alone’ [3,4] while listener comprehension (or speech comprehension as described by can be an extension of speech intelligibility as it measures a listener’s ability to either answer questions [5] or surmise the gist of the speaker’s message [6]. Although some research has shown a strong relationship between the two [7], most research shows a weak-moderate correlation [5,8,9] suggesting that listeners are not reliant on the acoustic signal alone to arrive at the speakers’ intended message and instead, rely on signal-independent information for comprehension purposes.

One recent model of speech perception that illustrates the role of the speaker and the listener and highlights the various factors that may impact both speech intelligibility and listener comprehension was developed by Evitts for alaryngeal speakers [10]. The model, originally based on dysarthric speech [11], depicts those factors that may have a positive or negative impact on the acoustic signal produced by the speaker but also incorporates the additional component of listener processing (Figure 1) as well as the new addition of environmental noise. The original model was based on the recognition of the importance of the listener in the communication exchange with speakers with dysarthria [11].

Additionally, reduced intelligibility was primarily attributed to the speech signal, hence the bulk of this body of research focused on dysarthria. A much smaller body of research has shown, however, that a disordered voice quality also plays a role in intelligibility [9,12,13], listener comprehension [9,14], and amount of listener effort [15]. Regardless, much of the research seeking to improve disordered speech intelligibility targeted speaker modifications (e.g., increased effort, reduced rate). Since speaker modifications may be inherently limited due to disease and existing disability, recent research has begun to recognize the importance of the listener. Such listener-targeted treatment utilizes a perceptual-learning approach which is well founded in the psychological literature [16]. Such treatments are based on the belief that while listeners implicitly experience perceptual learning when presented with minor acoustic degradations, more direct and explicit directions are needed to successfully process more severely disordered speech [17].

As shown in the model of speech intelligibility by Evitts (2019), any treatments aimed at the listener clearly are founded on listener processing. Multiple factors¹ may play a role in this step, including listener attitudes as well as the visual information provided by the speaker [18-20]. In fact, targeting listener attitudes has been shown to have a strong and positive correlation with speech intelligibility for speakers with dysarthria [21]. Research on the use of self-disclosure to improve listener attitudes has also shown promise in the stuttering and alaryngeal speech literature [22,23]. Visual information may include such factors as facial paralysis, the presence of a stoma, the use of an electrolarynx, increased muscular effort, or other factors inherent to speech production. Recently, the COVID-19 pandemic and the subsequent use of facial masks have also shown the variable effects of the altered acoustic and visual signal on speech intelligibility across different types of speech styles (i.e., casual, clear, positive-emotional) [24]. Any of these factors discussed thus far may either have a positive or negative impact on a listeners’ ability to perceive and process the incoming signal. Cognitive workload can then be considered an index of listener processing and listeners have been shown to exert more cognitive workload and experience more errors in intelligibility when presented with dysphonic voices [9].

-----------------------

¹Other factors inherent to the listener also clearly have an impact, including listeners with a hearing impairment, aphasia, and other neurologic or cognitive disorders. For the purposes of this paper, listener processing will focus on typical, healthy listeners.

-----------------------

To improve the speech intelligibility of speakers with dysarthria, several studies have targeted the listener with mixed results. Strategies such as providing topic knowledge to the listener and identifying that listeners use information on syllabic strength and lexical boundary analysis have been promising [25-27]. A recent review on dysarthria remediation, however, highlighted familiarization as perhaps the most efficacious approach to improving speech intelligibility for dysarthric speakers [17]. It should be noted, however, that most of the studies provided in the review only included speech samples from one speaker, thus significantly reducing the ability to generalize results.

Further limiting the ability to generalize the results of this body of research to a broad spectrum of disordered speakers is the fact that the speech samples used were from speakers with dysarthria. Since dysarthria involves disordered speech, there is limited information that has targeted such listener-treatment approaches for speakers with voice disorders. Recently, Porcaro, et al. (2019) provided listeners with specific perceptual-learning strategies (i.e., acknowledgment of disorder, cognitive-perceptual strategies) in hopes of improving the speech intelligibility of 12 speakers with varying degrees of dysphonia. Listeners were divided into three groups: control; acknowledgment of the disorder; and cognitive-perceptual processes. The acknowledgment tactic (self-disclosure) has been used as a strategy to increase communication or change psychosocial perceptions for people who stutter or had a laryngectomy and now use alaryngeal speech [22,23]. Listeners in the cognitive-perceptual processes group were provided with information (strategies) on how to potentially overcome reduced intelligibility due to the degraded acoustic stimuli [28]. While the dysphonic voices were significantly less intelligible than the healthy controls, the use of perceptual-learning strategies did not result in improved intelligibility [13]. A review of the speakers with dysphonia used in the study though showed that nine of the dysphonic speakers were >90% intelligible and the remaining three speakers were 84-85% intelligible. Additionally, only three of the dysphonic speakers had a subjective overall severity rating as moderate-severe as measured on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [29]. Thus, conclusions on the potential efficacy of perceptual-learning strategies to improve the speech intelligibility of dysphonic speakers, particularly those with moderate-severe dysphonia, are premature.

Aside from those speakers with a speech or voice disorder, there are other instances when typical healthy speakers can present with reduced speech intelligibility, particularly those speakers with a foreign accent. For example, when English listeners are presented with highly and moderately intelligible Korean-accented English with varying degrees of background noise, results suggest that the foreign-accented speech requires more cognitive effort which ultimately may affect both intelligibility and comprehension [30]. Other research using English-as-a-Second-Language (ESL) speakers also highlights the varying importance of segmental, prosodic, and temporal features on pronunciation as well as the role of segmental accuracy for nativelike pronunciations [31].

An additional factor that has also begun to receive more recent attention in the disordered speech and voice literature is the deleterious effect of background noise on speech intelligibility for disordered speakers. Based on Figure 1, this may directly impact listener processing, which in turn, may influence speech intelligibility and listener comprehension. The inclusion of background noise as a predicating factor on speech intelligibility is well-founded due to its increased ecological validity. That is, communication does not occur in a vacuum and intuitively, both speaker and listener may be performing their roles accordingly but the message is disrupted due to the presence of increased background noise. Research has repeatedly shown that increased signal-to-noise ratio (SNR) negatively impacts speech intelligibility [32,33].

While this is certainly true for typical, healthy speakers, a nascent body of research is showing that disordered speakers are particularly vulnerable to reduced speech intelligibility, especially in the presence of background noise [2]. Such results have been shown for speakers with Parkinson’s Disease [34,35], alaryngeal speakers [2,36], and speakers with dysphonia [37]. In fact, decreased speech intelligibility across the studies was found at similar noise levels, specifically +5 dB SNR and 0 dB SNR. Furthermore, Ishikawa, et al. (2017) found that although speakers with dysphonia were significantly less intelligible than typical speakers with background noise, there were no differences between the two speaker groups in quiet. Speakers used in that study were mild-moderately dysphonic, as evidenced by a mean overall severity rating of 31 mm on the CAPE-V. It should also be noted that two of the speakers, although categorized as dysphonic, had an overall severity rating of < 10 mm. Previous work suggests that speakers with an overall severity rating on the CAPE-V of moderately-deviant may have reduced intelligilbity [9]. Finally, Yoho and Borrie (2018) investigated the effect of background noise on dysarthric speech and found no multiplicative effect of the presence of noise on intelligibility. Contrary to Ishikawa, et al. (2017), this suggests that the presence of background noise has a similar effect on both disordered and typical, healthy speech [38].

In sum, speakers with a speech or voice disorder frequently experience reduced speech intelligibility which may have a negative impact on their ability to successfully communicate their ideas and place them at a disadvantage from numerous perspectives. While there is a plethora of research detailing this reduced speech intelligibility, there is limited research that targets the listener as a means of improving speech intelligibility, particularly for speakers with a disordered voice (i.e., dysphonia, alaryngeal). Furthermore, much of the existing research is marked by reduced ecological validity in that the impact of background noise has not been fully explored. Therefore, the purpose of this study is to investigate the impact of perceptual-learning strategies provided to the listener on the speech intelligibility of disordered speakers in the presence of background noise. One previous attempt at this approach proved unsuccessful [13]. The majority of speakers in that study, however, were highly intelligible (i.e., > 90%). Thus, continued investigation is warranted. The overall hypothesis is that speech intelligibility will decrease as background noise increases across all speakers but providing listeners with perceptual-learning strategies will result in higher intelligibility scores. Specific research questions are as follows:

What is the impact of providing listeners with perceptual-learning strategies (i.e., acknowledgment, cognitive-perceptual) on the speech intelligibility of speakers with a voice disorder in the presence of background noise?

Can subjective ratings of voice quality predict speech intelligibility in a quiet setting and with increased background noise?

The second research question is included to recognize the importance of clinical subjective voice

evaluation and a continued pursuit to determine if such perceptual judgments can be used to predict impairments in speech intelligibility for both dysphonic and alaryngeal speakers [9,13]. Currently, the gold standard of assessing speech intelligibility involves listener transcription and then calculating the percent of words correctly identified [4]. Clinically, this process can be cumbersome and time-consuming. Using the CAPE-V [29], Evitts, et al. (2016) found that the rating category of overall severity was able to successfully predict speech intelligibility while Porcaro, et al. (2019) found that breathiness predicted intelligibility. More work is needed to elucidate the potential use of subjective voice evaluations to predict deficits in speech intelligibility.

Methods

This study was approved by the Institutional Review Boards of Towson University, Johns Hopkins School of Medicine, and Florida Atlantic University.

Participants

Participants included two groups: speakers and listeners. The speaker selection was based on a corpus of recordings and previously published studies from the speech perception lab of the first author (PE). Those disordered speakers from previous studies with the lowest rates of speech intelligibility were included [9,13,19]. While it is recognized that alaryngeal speech is markedly different than laryngeal speech [39], both alaryngeal and dysphonic speakers were included as the primary purpose of the study was to determine if providing listeners with cognitive-perceptual processes could improve disordered speech intelligibility, regardless of nature of the voice disorder. Ultimately, 12 speakers were included in the final study, including seven speakers with dysphonia, three alaryngeal, and two healthy controls. The seven speakers with dysphonia were all female with a mean age of 25.3 years who received a diagnosis of phonotrauma as a result of vocal nodules, vocal polyps, or secondary muscle tension dysphonia by a board-certified laryngologist at Johns Hopkins School of Medicine, Department of Otolaryngology. Alaryngeal speakers consisted of three males with a mean age of 70.3 years representing each mode of alaryngeal speech (i.e., tracheoesophageal [TE], esophageal [ES], and electrolaryngeal [EL]). Finally, two healthy female speakers with perceptually normal voice quality served as controls (mean age = 34 years). Inclusion criteria for all speakers were no history of cognitive, hearing, speech, or language impairments that affected speech or voice production (other than those directly associated with laryngeal cancer or the presence of dysphonia), and English as their first language.

Listeners were recruited through two different university programs. Inclusion criteria included: English as their primary language; no reported history of a learning disability; no reported history of a language or cognitive disorder; no history of a traumatic brain injury or any other injury that affected hearing or cognition. All listeners passed a hearing screening at 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz. Ultimately, 129 listeners (115 female, 14 male, mean age = 21.9 years) were included in the study.

Stimuli recording and preparation

All speakers were recorded in a quiet room while seated. A headset microphone (AKG C-420 III, AKG Acoustics, Vienna, Austria) was placed two inches from the corner of their mouth and audio was recorded at a 48 kHz sampling rate. Audio files were analyzed using an acoustic analysis program (Computerized Speech Laboratory 4500, KayPentax, Montvale, NJ) and saved as individual wave files. To reduce the potential of familiarity, each speaker produced a different set of phonemically balanced sentences from the Hearing in Noise Test (HINT) [40]. Similar to Ishikawa, et al. (2017), edited files were downsampled to 22500 Hz, and intensity was stabilized at 71.2-71.5 dB SPL. Background noise (+5 SNR, 0 SNR) was added to two-thirds of the files using MATLAB thus creating three sets of audio stimuli (quiet, +5 SNR, 0 SNR) across the 12 speakers. Background noise was extracted from a recording of cafeteria noise (Auditec, St. Louis, MO). Using previous methods [9,13,19], three master lists of audio stimuli were created with each list containing the 161 audio stimuli in a randomized order with a 10-second pause between each stimulus. This was done to reduce the potential of familiarity as listeners would not be presented with the same sentences. In sum, each list contained 161 sentences randomized for speaker and noise level. Each list contained 53 quiet stimuli, 54 with + 5 dB SNR background noise added, and 54 stimuli with 0 dB SNR background noise added. The 161 total stimuli also included 10% additional stimuli added in order to determine inter- and intra-rater reliability.

Listening procedure

The procedure for the listening task was also based on previous work [9,13,19]. Briefly, listeners were individually presented with 161 audio stimuli at two separate university locations. All listeners passed a hearing screening at 500, 1000, 2000, and 4000 Hz dB SPL. Listeners were seated in a soundproof booth and were first provided instructions on the task, followed by two examples and an opportunity to adjust the volume to a desired level. Listeners were also randomly assigned to one of three perceptual-learning strategy groups: control, acknowledgment strategy, and cognitive-perceptual strategy group. Listeners in the control group were provided the following instructions: “You will hear a series of sentences produced by different speakers. Please write down exactly what you hear on the form in front of you”. Based on methods by Blood and Blood (1982), listeners in the acknowledgment group were provided the following instructions: “You are going to hear a series of sentences produced by different speakers. Some of the speakers you hear will have a voice disorder. The medical term for a voice disorder is dysphonia. Dysphonia occurs when a person’s vocal folds or vocal cords do not move vibrate or move like they should. For this task, please write down exactly what you hear on the form in front of you”. Finally, listeners in the cognitive-perceptual group were provided the instructions which were based on Klasner and Yorkston (2005): “You are going to hear a series of sentences produced by different speakers. The speech and voices you hear may be difficult to understand. Please use the following strategies to help out with the task of understanding them”. Listeners were then provided with a paper with descriptions of three cognitive-perceptual strategies and were able to follow along as the examiner read each strategy.

Segmental strategy: If the sounds are difficult to understand, try using the other sounds within the word to figure out what the word is.
Cognitive strategy: Some of the voices that you’re going to hear may sound distorted. For those voices, try to pay close attention to the words being said.
Linguistic strategy: If you’re having a hard time understanding some of the words, try using the other words around it to figure out what the words may be.

After reading the paper, the listener was instructed to “write down exactly what you hear on the form in front you using the strategies we discussed to help you”. Listeners were then presented the signals binaurally through two speakers (Bose Companion 2 Series II, Bose Corporation, Framingham, MA).

Scoring

Speaker intelligibility was determined by counting the number of correctly transcribed words in each sentence and dividing by the total number of words possible. Scores across listeners were then calculated. Since different words convey different levels of meaning, articles, and content words were also counted.

Voice quality ratings

As discussed earlier, subjective voice quality ratings are considered the gold standard of voice evaluation. In hopes of using such ratings as a more expedient and efficient measure of speech intelligibility, all speakers were evaluated by three expert, licensed, and certified speech-language pathologists, each with at least 10 years of specialized voice experience. Expert raters were provided with a randomized set of folders, each containing 10 HINT sentences produced by each speaker. Expert raters were asked to complete a CAPE-V [29] on each speaker. Expert rater responses were measured in millimeters using a digital caliper (Avenger Products, Henderson, NV).

Results

Results are divided into the following sections: preliminary analyses, descriptive results, and primary analysis.

Preliminary analyses

In order to appropriately determine the final data set, the following preliminary analyses were performed: comparison of intelligibility lists; assessment of intra- and inter-listener agreement, assessment of intra- and inter-expert rater agreement, and finally, assessment of intra-scorer measurement.

Comparison across intelligibility lists: To determine if the intelligibility lists had different rates, a one-way ANOVA was calculated. The one-way ANOVA showed no significant differences across the lists, F(2,404) = 0.0113, p = 0.989.

Intra- and inter-listener agreement: To determine the reliability between and across listeners, Cronbach’s α for a random 10% of the listeners was calculated. Values for intra-listener and inter-listener agreement ranged from 0.875 to 1.00 showing strong support for the use of the data [41,42].

Learning and fatigue effects: To determine if there were a significant number of errors in the first, middle, or third portions of the listening experiment, a one-way ANOVA was calculated for a random 10% of the listeners. The ANOVA showed there was not a significant difference in the number of errors across listening portions, F(2, 23)= 0.521, p = 0.601.

Intra- and inter-expert rater agreement: To determine the reliability of the expert listeners, Cronbach’s α was also calculated across the three listeners. All those values > 0.6 were considered acceptable [41,42]. α values were as follows: overall severity – 0.89; Roughness – 0.54; Breathiness – 0.9; Strain – 0.6; Pitch – 0.42; and Loudness – 0.48. Due to the low-reliability ratings for Roughness, Pitch, and Loudness, data from those ratings would not be used in the final analysis.

Descriptive results

Descriptive statistical techniques were used to assess mean intelligibility by speaker type and background noise level (Table 1) as well as mean intelligibility by speaker type, noise level, and strategy (Tables 2).

Table 2: Descriptive statistics for intelligibility rates by speaker by background noise level and strategy.
Variable	% Intel	SD	Min	Max	Skewness	Kurtosis
Alaryngeal Quiet Mean
Control	73	0.15	0.27	1.00	-0.40	-0.38
Acknowledgement	75	0.16	0.33	1.43	0.07	0.89
Cognitive-Perceptual Strategies	76	0.14	0.47	1.03	-0.06	-0.84
Alaryngeal 5 SNR Mean
Control	57	0.17	0.07	0.94	-0.21	0.03
Acknowledgement	55	0.19	0.00	1.00	-0.10	-0.08
Cognitive-Perceptual Strategies	56	0.17	0.17	0.93	-0.01	-0.36
Alaryngeal 0 SNR Mean
Control	46	0.21	0.06	1.00	0.27	-0.57
Acknowledgement	43	0.22	0.00	1.00	0.09	-0.47
Cognitive-Perceptual Strategies	46	0.22	0.00	1.00	0.16	-0.49
Dysphonic Quiet Mean
Control	77	0.04	0.61	0.83	-1.46	2.28
Acknowledgement	77	0.04	0.60	0.83	-1.37	2.81
Cognitive-Perceptual Strategies	78	0.03	0.69	0.83	-0.51	-0.29
Dysphonic 5 SNR Mean
Control	72	0.11	0.40	0.92	-0.50	-0.12
Acknowledgement	69	0.13	0.34	0.93	-0.58	-0.13
Cognitive-Perceptual Strategies	73	0.11	0.47	0.90	-0.50	-0.62
Dysphonic 0 SNR Mean
Control
Acknowledgement
Cognitive-Perceptual Strategies
Typical Quiet Mean
Control	92	0.13	0.50	1.00	-1.54	1.19
Acknowledgement	90	0.15	0.40	1.00	-1.50	1.26
Cognitive-Perceptual Strategies	91	0.13	0.50	1.00	-1.65	1.86
Typical 5 SNR Mean
Control	83	0.09	0.50	1.00	-0.66	1.12
Acknowledgement	84	0.10	0.42	1.00	-0.99	2.99
Cognitive-Perceptual Strategies	85	0.08	0.62	1.00	-0.58	0.02
Typical 0 SNR Mean
Control	87	0.17	0.30	1.10	-1.47	1.70
Acknowledgement	89	0.15	0.50	1.00	-1.25	0.40
Cognitive-Perceptual Strategies	90	0.16	0.40	1.00	-1.70	2.07
Note. % Intel = percent intelligilbity; SD = standard deviation.

Article and content errors

It was also of interest to determine if there was a significant difference in article vs. content word errors across listeners (Figure 2). A one-way ANOVA showed a significant difference among modes of voice, F(1,3)=49.10, p < .001. Bonferroni Post hoc analysis showed that the number of article errors with typical voice was significantly different than alaryngeal (p < .001), and that dysphonic voice was significantly different than alaryngeal voice (p < .001). A one-way ANOVA for content words also showed a significant difference among modes of voice, F(1, 3)=302.96, p < .001. Bonferroni post analysis showed that typical was significantly different than dysphonic voice (p < .001), typical was significantly different than alaryngeal voice (p < .001), and dysphonic voice was significantly different than alaryngeal voice (p < .001) (Figure 2).

Primary analysis

Prior to the primary analysis, it was of interest to ensure that the typical speakers used in the study were, indeed, typical. To determine this, a Repeated Measures ANOVA statistical technique with one within-subjects factor was conducted to determine whether significant differences in intelligibility exist among the three speaker types of Alaryngeal, Dysphonic, and Typical featured in the study. The main effect for the within-subjects factor was statistically significant, F(2, 492) = 1010.39; p < .001, indicating there were significant differences between the values for intelligibility rates for the speaker types of Alaryngeal, Dysphonic, and Typical. The effect for speaker intelligibility across typical speakers and disordered speakers (Alaryngeal and Dysphonic) was also statistically significant, F(1, 246) = 1488.38, p < .001, indicating there were significant differences between the intelligibility rates of typical and disordered speakers (alaryngeal + dysphonic speakers combined). The mean intelligibility rate difference of 0.26 (SD = 0.01) favoring that of typical speakers was statistically significant (t (246) = 38.58, p < .001.

The first two research questions focused on the number of speech intelligibility errors by the amount of background noise and by strategy. Figure 3 provides descriptive information on the speech intelligibility of each speaker type by level of background noise.

A multivariate analysis of variance (MANOVA) statistical technique was conducted to assess if there were statistically significant differences in the linear combination of speaker type (Alaryngeal; Dysphonic; and Typical) for the levels of strategy. As a result, the main effect for strategy was not manifested at a statistically significant, level F(6, 486) = 1.53, p = .17, η²p = 0.02, indicating the linear combination of the speaker types of Alaryngeal, Dysphonic, and Typical was similar for each level of the variable strategy. The effect of strategy upon intelligibility by typical speakers and disordered speakers (Alaryngeal and Dysphonic combined) was similarly non-statistically significant, F(4, 488) = 1.19, p = .31, η²p = 0.01, indicating that the linear combination of intelligibility rates for typical speakers and disordered speakers was similar for each level of strategy.

Intelligibility of speaker type for levels of strategy

A Mixed Model ANOVA with one within-subjects factor and one between-subjects factor was conducted to determine whether significant differences exist among intelligibility rates between the levels of Strategy. The main effect for strategy was not statistically significant, F(2, 244) = 1.20, p = .30, indicating the levels of strategy were all similar for the intelligibility rates of the three speaker types represented in the study. The main effect for the within-subjects factor was statistically significant, F(2, 488) = 1003.90, p < .001, indicating there were significant differences between the values of intelligibility rates across the three speaker types. The interaction effect between the within-subjects factor and strategy was not statistically significant, F(4, 488) = 0.56, p = .66, indicating that the relationships between intelligibility rates of speaker types were similar between the levels of strategy (Table 3).

A follow-up Mixed Model ANOVA with one within-subjects factor (intelligibility rates of typical and disordered speakers) and one between-subjects factor was conducted to determine whether significant differences exist among intelligibility rates between the levels of Strategy. The main effect for Strategy was not statistically significant, F(2, 244) = 1.64, p = .20, indicating the levels of Strategy were all similar for intelligibility rates for typical and disordered speakers. The main effect for the within-subjects factor was statistically significant, F(1, 244) = 1474.22, p < .001, indicating there were significant differences between the values of intelligibility rates for typical and disordered speakers. The interaction effect between the within-subjects factor and Strategy was not significant, F(2, 244) = 0.32, p = .73, indicating that the relationship between intelligibility rates of typical and disordered speakers was similar between the levels of Strategy. Tukey HSD comparisons were conducted to assess the differences in the estimated marginal means for each combination of between-subject and within-subject effects (Table 4).

Effect of strategy upon speaker intelligibility rates by background noise level

A multivariate analysis of variance (MANOVA) statistical technique was conducted to assess if there were statistically significant differences in the linear combination of speaker type among the levels of Strategy. Using strategy as the variable, results for alaryngeal speakers did not show a significant difference between speech intelligibility levels and background noise levels (Alaryngeal Quiet, Alaryngeal +5 dB SNR, Alaryngeal 0 dB SNR), F(6, 1128) = 1.61, p = .14, ηp² = .01. The main effect for strategy was also not statistically significant for dysphonic speakers, F(6, 116) = 1.56, p = .17, η²p = 0.07, indicating that the linear combination of the mean scores for Dysphonic Quiet, Dysphonic + 5 dB SNR, and Dysphonic 0 dB SNR was similar for each level of strategy. Finally, a multivariate analysis of variance (MANOVA) was conducted to assess if there were statistically significant differences in the linear combination of the mean scores for Typical Quiet, Typical +5 dB SNR, and Typical 0 dB SNR between the levels of strategy. The main effect for strategy was not statistically significant, F(6, 494) = 0.89, p = .50, η²p = 0.01, indicating that the linear combination of the mean scores for Typical Quiet, Typical +5 dB SNR, and Typical 0 dB SNR was similar for each level of strategy.

Predictive ability of voice quality ratings and speech intelligibility

The second research question focused on the ability of subjective voice quality ratings provided by expert listeners to predict the speech intelligibility of disordered speakers. As discussed earlier, this was included in a continued effort to explore the use of subjective voice quality ratings as a means of predicting speech intelligibility. If possible, this would be clinically useful as existing assessments such as the CAPE-V [29] could also serve to provide information on speech intelligibility in lieu of more time-consuming methods. Descriptive information on the speech intelligibility of speakers and results of the CAPE-V by expert listeners is provided in Table 5. Please note that +5 dB SNR was not used in the statistical analysis in an effort to increase the statistical power and to better reflect more severe listening conditions. Due to the low reliability of the rating roughness, pitch, and loudness, only overall severity, breathiness, and strain were included in the regression analysis.

Regression analysis showed that none of the voice qualities from the CAPE-V were able to predict speech intelligibility in the quiet condition. There were two different models that were predictive of speech intelligibility using the CAPE-V ratings as the independent variables and percent intelligibility in the noisy condition (0 dB SNR) as the dependent variable. The first model, which included the rating overall severity, was able to predict 37% of the variance associated with percent intelligibility: R² = .365, F(1, 10) = 5.759, p = .037. The beta coefficient for this model was -.605 indicating that for every unit increase in overall severity, there would be a .605 decrease in the percent intelligibility. The second model included the ratings of overall severity and strain from the CAPE-V ratings and was able to predict 66% of the variance associated with percent intelligibility in the noisy condition: R² = .663, F(2,9) = 8.849, p = .007. Beta coefficients in this model were -1.431 for overall severity and .990 for strain indicating that for every unit increase in overall severity and strain, there would be a 1.431 decrease and a .990 increase in speech intelligibility.

Discussion

The overarching purpose of this study was to investigate if providing listeners with cognitive-perceptual strategies would increase the speech intelligibility of disordered speakers in situations that reflect everyday communication (i.e., increased SNR/background noise). As discussed earlier, although numerous gains have been made to help disordered speakers improve their intelligibility, it is also advantageous to target reducing the effort on the part of the listener to recognize the speech [43]. With the current study, overall results showed that increased background noise did, for the most part, reduce speech intelligibility but that providing listeners with cognitive-perceptual strategies in hopes of overcoming the degraded acoustic stimuli did not help improve intelligibility scores.

Background noise and speech intelligibility

The finding that increased background noise negatively impacted speech intelligibility is not new. This is true for typical, healthy speakers [44] as well as speakers with dysarthria [38], dysphonia [37], and laryngectomy [2]. Across speaker modes, alaryngeal speakers experienced a mean 29% reduction, followed by a 13% reduction for dysphonic speakers, and a 3% mean reduction in speech intelligibility from the quiet listening condition to the noisy or 0 dB SNR. The finding that disordered speech was more vulnerable to background noise was also consistent with previous research on speakers with a disordered voice [2,45]. It should be noted that the speakers used in the Ishikawa, et al (2017) included two speakers with normal voice quality ratings (< 10 mm on the CAPE-V overall severity) and the remaining four speakers were rated as mild-moderately dysphonic (35 mm - 53 mm on the CAPE-V overall severity). Furthermore, three of the speakers’ ratings of breathiness on the CAPE-V were < 4 mm. Additionally, alaryngeal speakers used in the Eadie, et al. (2021) exhibited only ‘mild speech imprecisions’ and five speakers had ‘intact speech’. Previous research suggests that dysphonic speakers may not experience intelligibility deficits until subjective CAPE-V ratings of breathiness are moderate-severe [13] or ratings of overall severity are moderate-severe [9].

Results on the impact of background noise on the speech intelligibility of speakers with dysarthria are less conclusive. For example, Yoho and Borrie (2018) found that the combination of background noise and dysarthric speech did not have a multiplicative effect on speech intelligibility. That study only used one speaker with mild-moderate ataxia dysarthria. Conversely, other studies have suggested that dysarthric speech is more vulnerable to background noise due to the presence of a multiplicative effect of speech degradation and background noise [46,47]. Additional research is warranted to clarify this effect but, intuitively, it makes sense that disordered speakers are more susceptible to reduced speech intelligibility given adverse listening conditions.

Generally speaking, listeners utilize both bottom-up processing where phonemic and acoustic elements are analyzed as well as top-down processing where listeners rely on their own experience/knowledge to fill in the gaps of a message [43]. However, as an acoustic stimulus becomes more degraded, it forces the listener to employ more top-down processing which results in increased effort or increased cognitive workload [9,43]. Using reaction times to measure cognitive workload, Evitts, et al. (2016) found that as listeners employed more cognitive workload, intelligibility scores decreased. Listener comprehension, on the other hand, was not affected. This may speak to the relationship between intelligibility and comprehension. As discussed in Figure 1, most research shows a weak-moderate relationship between the two [5,8,9]. It may be that intelligibility tasks are inherently more difficult and require more cognitive workload than tasks aimed at comprehension where the listener only needs to arrive at the gist of the speakers’ message, not each and every individual phonetic unit.

Listener strategies and speech intelligibility

Aside from the issue of background noise, the primary aim of the study was to determine if providing listeners with cognitive-perceptual strategies improved speech intelligibility. Unfortunately, this hypothesis proved untrue. Even when the disordered speakers were combined in the current study (dysphonia + alaryngeal), providing listeners with information on the degraded auditory stimuli did not improve speech intelligibility. While this is consistent with previous results using similar methods and similar speakers [13], it is in stark contrast to numerous studies showing that listeners do, indeed, use such strategies and that intelligibility can increase when trained on such strategies. For example, initial research aimed at determining the source of listener transcription errors showed that lexical boundary errors accounted for many of those errors by typical listeners when presented with hypokinetic dysarthric speech [25]. Subsequent research showed that listeners also have different error patterns depending on the type of dysarthric speech [48]. Based on this listener-perceptual approach, Klasner and Yorkston (2005) investigated the effect of providing listeners with different strategies (segmental, suprasegmental, linguistic, cognitive) on the intelligibility of dysarthric speech. Results showed that listeners employed different strategies based on the type of dysarthric speech and that listeners used both top-down and bottom-up strategies to improve intelligibility [28].

The tactic of acknowledgment was also included in this study. This strategy included acknowledging the presence of the voice disorder for the listener prior to the listening task in hopes of reducing the cognitive demands of the listener by explaining the source of the different-sounding voice. The listener, in turn, could then theoretically focus on the intelligibility task. The approach showed promise in the psychosocial literature as a means of improving communication between speakers with a disorder and typical, healthy listeners [22,23]. Clearly, the impact of acknowledging a disorder may be limited to psychosocial perceptions related to personality and willingness to communicate, among others [22,23].

Perhaps the most promising listener-oriented approach thus far in recognizing the need to shift the burden of behavioral change from the speaker to the listener is the issue of familiarity [17]. Two comprehensive reviews of interventions targeted at listeners to improve speech intelligibility for dysarthric speakers both point to this strategy as having the most efficacy [16,17]. It is argued that familiarization with a disordered signal “induces an attentional shift toward more phonetically informative acoustic cues” [16]. This notion is supported by recent findings of increased cognitive workload when listeners are presented with dysphonic and alaryngeal speech [9,49]. In fact, research has shown that simply exposing listeners to disordered speech (passive familiarization) results in intelligibility gains, listeners trained with specific feedback on transcription tasks (explicit familiarization) made substantially higher gains in intelligibility with dysarthric speech [16]. Thus, continued efforts targeting the listener to improve speech intelligibility for disordered speakers, regardless of the nature of the disorder (voice or speech), should focus on this promising area.

Other than familiarization, there are other factors related to the listener that need to be addressed when discussing speech perception. Many of these are specifically explored by a consensus of interdisciplinary experts and resulted in a Framework for Understanding Effortful Listening (FUEL) [50]. The framework recognizes the principle that the presence of background noise results in depleted cognitive resources for the listener, thus affecting intelligibility and comprehension. Pertinent to studies targeting the listener and the perception of disordered speech, the issue of conation is also presented. Conation is a centuries-old neuropsychological concept that refers to the listener's ability to focus one’s attention on a task and, although there is overlap with motivation, is considered a separate and distinct factor [51] and should be considered in future research in the disordered speech perception literature.

In addition to conation, other factors inherent to the listener also need to be considered in future research. For example, although related to conation, motivation of the listener plays an integral role in speech perception tasks. In fact, when listeners have reduced motivation to process a degraded acoustic signal or in the presence of background noise, the result may still be no change in effort [52]. Furthermore, when listeners are given a severely degraded acoustic signal, they may implicitly determine that they will not be successful and will thus, reduce effort and cognitive resources [52]. Aside from conation and motivation, working memory and overall cognitive ability may be additional listener factors to consider in this area of research [50,52]. Given that older adults may have reduced working memory [53] and potential hearing loss, their ability to not only perform a transcription task but also to perceive individual phonetic units may be diminished, especially when presented with a degraded acoustic signal.

Voice quality ratings and speech intelligibility

This study also sought to further elucidate the role of subjective voice quality ratings in predicting speech intelligibility. This reflects a continued effort in the disordered speech literature to replace the time-consuming process of orthographic transcription to assess speech intelligilbity [54,55]. Results from this study indicate that the subjective rating of overall severity on the CAPE-V was able to account for 37% of the variance of speech intelligibility and that the ratings overall severity + strain were able to account for 66% of the variance associated with speech intelligibility with 0 dB SNR as background noise. This is promising given previous results using similar methods. Specifically, regression analysis by Evitts, et al. (2016) found that overall severity accounted for 32% of the variance and overall severity + strain accounted for 36% of the variance. However, similar regression analysis also found that only the voice quality rating breathiness was predictive of speech intelligibility, accounting for 41% of speech intelligibility [13]. Both of these results were in a quiet setting with no background noise. Regression analyses in the current study found that no voice quality ratings were able to predict speech intelligibility in a quiet setting. It should be noted that in the current study, expert listener ratings for rough, pitch, and loudness were found to be statistically unreliable and were thus, not included in the final analysis. The issue of reduced listener reliability of voice quality, even with expert listeners, certainly poses a problem as this area of research develops but clearly, there are issues with perceptual ratings of voice quality that need to be resolved [56,57].

The use of crowdsourcing may be helpful in addressing some of the issues with subjective voice quality ratings. Crowdsourcing through companies such as Amazon Mechanical Turk, offers researchers the ability to access a much larger pool of listeners than previous logistics would permit. Traditionally, researchers would need to recruit listeners and present those individual listeners with the intelligibility task in person and in a soundproof booth. All of which was very time-consuming. Crowdsourcing allows access to a potentially unlimited pool of listeners and thus, much stronger results and generalizability [58]. Since crowdsourcing platforms also generally require financial payment to participants, the issue of conation and motivation may also be better addressed through such recruiting methods.

Limitations

There were specific limitations to this study which prevented generalizing the results to all disordered speakers. First, the sample of speakers was relatively limited. Specifically, only three alaryngeal speakers were included, one from each mode of alaryngeal speech. Also, even though attempts were made to include dysphonic speakers with lower baseline intelligibility, this did not turn out to be the case. Speech intelligibility data in quiet for dysphonic speakers ranged from 85-100%. Granted, the study did include a larger sample of dysphonic speakers (n = 7) than other studies but, given the large degree of variability across dysphonic and alaryngeal speakers, a larger sample size is warranted prior to generalizing the results. In addition, other factors that may impact speech intelligibility were addressed (e.g., prosody, and articulation). The second limitation involved the reliability ratings of the expert listeners. Three of the voice quality ratings on the CAPE-V were not included in the final regression analysis. Although results were consistent with previous attempts (i.e., specific voice quality ratings were able to predict speech intelligibility), it is again, difficult to generalize the results based on the reduced reliability. Third, one could argue that the current study had reduced ecological validity. Specifically, listeners were presented with audio-only stimuli and not audiovisual. Mode of presentation (audio-only vs. audiovisual) has been shown to influence the amount of cognitive resources required to complete a listening task [59]. Since most communication includes a visual component, future research should include audiovisual stimuli. Additionally, the current study did not incorporate a dual-task paradigm when assessing speech intelligibility. Communication often occurs in situations where listeners have divided attention. Including a dual-task paradigm in future studies would afford insight into how disordered speakers are perceived in real-world situations. Fourth, the mean age of the listeners in the current study was 21.9 years. Previous research has shown that age impacts the degree of listener effort in varying degrees of background noise [60]. Future research should include a larger sample size of listeners to better reflect different age groups in order to better generalize the results.

Conclusion

The main purpose of the study was to determine the effects of background noise on disordered speech and to determine if providing listeners with strategies would improve speech intelligibility. This is in recognition of the fact that disordered speakers may have inherent limitations to the amount of improvement possible and that strategies targeting the listener are warranted. Overall, results confirmed previous findings that disordered speakers are more vulnerable to reduced speech intelligibility in the presence of background noise but that providing listeners with perceptual strategies in hopes of improving speech intelligibility was unsuccessful. In addition, results showed that the subjective voice quality rating overall severity provided by expert listeners was able to predict 37% of the variance associated with speech intelligibility. Clinically, continued investigation of voice quality ratings as an index of speech intelligibility can be a more efficient means of assessment.

Acknowledgment

Portions of the results were previously presented at the 2019 ASHA Convention, in Orlando, FL. The authors would like to thank Callan Cloonan, Elizabeth Coletti, Christie Getejanc, Anna Horn, Jillian Scott, Allison Stolz, and Sarah Yinger for their assistance with data collection and data entry. The authors would also like to thank speech-language pathologists Rina Abrams, Christina Dastolfo-Hromack, and Ashley Davis for serving as expert raters. Finally, the authors would like to thank Dr. Nirmal Srinivasan for his invaluable assistance with stimulus preparation.

References

Article Alerts

Subscribe to our articles alerts and stay tuned.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Quick Enquiry

Table 4: Marginal means contrasts for each combination of within-subject variables by strategy.
Contrast	Difference	SE	df	t	p
Controls
Alaryngeal Mean – Dysphonic Mean	-0.13	0.01	244	-10.93	< .001
Alaryngeal Mean – Typical Mean	-0.33	0.01	244	-22.13	< .001
Dysphonic Mean – Typical Mean	-0.20	0.01	244	-19.56	< .001
Acknowledgment Strategy
Alaryngeal Mean – Dysphonic Mean	-0.15	0.01	244	-11.85	< .001
Alaryngeal Mean – Typical Mean	-0.34	0.02	244	-21.00	< .001
Dysphonic Mean – Typical Mean	-0.19	0.01	244	-16.88	< .001
Cognitive-Perceptual Strategies
Alaryngeal Mean – Dysphonic Mean	-0.14	0.01	244	-12.16	< .001
Alaryngeal Mean – Typical Mean	-0.33	0.02	244	-21.57	< .001
Dysphonic Mean – Typical Mean	-0.18	0.01	244	-17.36	< .001
Note. Tukey HSD Pairwise Comparisons were used to test the differences in estimated marginal means.

Table 5: CAPE-V ratings and intelligibility for regression analysis.
	Intelligibility %			CAPE-V Ratings (mm)
Voice Type	Quiet	0 dB SNR	overall severity		Roughness	Breathiness	Strain	Pitch	Loudness
Alaryngeal	48	31	79.04		59.52	8.37	42.27	66.28	39.58
Alaryngeal	74	72	71.96		53.58	9.47	53.92	38.33	26.83
Alaryngeal	90	55	64.31		58.78	4.6	33.9	47.32	27.48
Dysphonic	85	62	64.1		10.48	68.46	40.72	15.57	31.62
Dysphonic	100	59	47.76		35.73	34.93	44.76	34.54	10.94
Dysphonic	92	79	23.63		24.39	12.42	20.76	13.21	4.08
Dysphonic	100	75	51.55		14.04	44.78	50.83	26.01	16.35
Dysphonic	100	88	70.79		32.34	15.51	73.26	41.12	26.84
Dysphonic	90	97	38.23		28.45	32.86	27.47	10.99	13.63
Dysphonic	97	91	54.47		46.46	28.47	57.54	13.8	14.15
Typical	100	83	10.19		9.68	0	4.89	0	0
Typical	94	100	6.28		8.35	0	4.42	0	0

Archives of Otolaryngology and Rhinology

Impact of perceptual-learning strategies and background noise on disordered speech intelligibility

Paul M Evitts^1*, Connie K Porcaro² and Tom Gollery³

Author and article information

Abstract

Indexing and Abstracting

Main article text

Introduction

Methods

Participants

Stimuli recording and preparation

Listening procedure

Scoring

Voice quality ratings

Results

Preliminary analyses

Descriptive results

Article and content errors

Primary analysis

Intelligibility of speaker type for levels of strategy

Effect of strategy upon speaker intelligibility rates by background noise level

Predictive ability of voice quality ratings and speech intelligibility

Discussion

Background noise and speech intelligibility

Listener strategies and speech intelligibility

Voice quality ratings and speech intelligibility

Limitations

Conclusion

Acknowledgment

References

Advertisement

Article Alerts

Table of Contents

Submit your next article Peertechz Publications, also join of our fulfilled creators. Submit a Manuscript

© Peertechz Publications Inc., 10880 Wilshire Blvd., Suite 1101, Los Angeles, California, 90024, USA

Table 1: Descriptive statistics for intelligibility rates by speaker type and background noise level.
Variable	% Intel	SD	Min	Max	Skewness	Kurtosis
Alaryngeal Quiet Total	74	0.15	0.27	1.43	-0.15	0.12
Alaryngeal 5 SNR Total	56	0.18	0.00	1.00	-0.12	-0.10
Alaryngeal 0 SNR Total	45	0.22	0.00	1.00	0.16	-0.48
Dysphonic Quiet Total	77	0.04	0.60	0.83	-1.34	2.67
Dysphonic 5 SNR Total	71	0.12	0.34	0.93	-0.58	-0.09
Dysphonic 0 SNR Total	64	0.18	0.19	0.98	-0.33	-0.64
Typical Quiet Total	91	0.14	0.40	1.00	-1.57	1.48
Typical 5 SNR Total Mean	84	0.09	0.42	1.00	-0.82	1.93
Typical 0 SNR Total Mean	88	0.16	0.30	1.10	-1.49	1.53
Note. % Intel = percent intelligilbity; SD = standard deviation.

Table 3: Main effects and interaction effect for speaker type by strategy.
Source	df	SS	MS	F	p	ηp2
Between-Subjects
Strategy	2	0.04	0.02	1.20	.30	0.01
Residuals	244	3.81	0.02
Within-Subjects
Within Factor	2	13.73	6.86	1003.90	< .001	0.80
Strategy Within Factor	4	0.02	0.00	0.56	.66	0.00
Residuals	488	3.34	0.01

Archives of Otolaryngology and Rhinology

Impact of perceptual-learning strategies and background noise on disordered speech intelligibility

Paul M Evitts1*, Connie K Porcaro2 and Tom Gollery3

Author and article information

Abstract

Indexing and Abstracting

Main article text

Introduction

Methods

Participants

Stimuli recording and preparation

Listening procedure

Scoring

Voice quality ratings

Results

Preliminary analyses

Descriptive results

Article and content errors

Primary analysis

Intelligibility of speaker type for levels of strategy

Effect of strategy upon speaker intelligibility rates by background noise level

Predictive ability of voice quality ratings and speech intelligibility

Discussion

Background noise and speech intelligibility

Listener strategies and speech intelligibility

Voice quality ratings and speech intelligibility

Limitations

Conclusion

Acknowledgment

References

Advertisement

Article Alerts

Table of Contents

Submit your next article Peertechz Publications, also join of our fulfilled creators. Submit a Manuscript

© Peertechz Publications Inc., 10880 Wilshire Blvd., Suite 1101, Los Angeles, California, 90024, USA

Paul M Evitts^1*, Connie K Porcaro² and Tom Gollery³