Session Poster I:

Poster I

Chair: Inger Moen, Allard Jongman
Date: Monday - August 06, 2007
Time: 14:20
Room: Poster Area


Uta Benner, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Ines Flechsig, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Grzegorz Dogil, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Bernd Möbius, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
  In a speech production model proposed by Levelt a distinction is made between two routes of phonetic implementation in speech. A syllabary route is used to retrieve the stored motor programs for the most frequent syllables of a language, and segment-by-segment assembly is used for the implementation of low-frequency syllables. One of the predictions of the model is that there should be a difference in coarticulation between motor programs retrieved from the syllabary and programs that are computed online. In this paper we present two laboratory experiments and a corpus study on German which were designed to verify this prediction. Our results support the hypothesis that articulatory programs for high-frequency syllables are implemented differently than those for rare syllables.
Angélique Amelot, Institut de la Communication Parlée Université Stendhal - 1180, Avenue Centrale, BP 25, 38040 GRENOBLE CEDEX 9 Tél. +33 (0)4 76 82 43 37
Solange Rossato, Institut de la Communication Parlée Université Stendhal - 1180, Avenue Centrale, BP 25, 38040 GRENOBLE CEDEX 9 Tél. +33 (0)4 76 82 43 37
  The contrast between oral and nasal vowels in French is known to involve secondary cues in addition to nasality; it is an open issue to what extent differences of velum height between the two sets of vowels are preserved in rapid speech. This study compares the velar movements for nasal vowels and consonants; it investigates contextual nasalisation; and it provides new data on how nasalisation is affected by speech rate. Velar position is measured with an electromagnetic articulatograph (EMA) for two French speakers. Our results confirm that (i) nasal vowels are produced with a lower velum height than nasal consonants; (ii) the contrast between nasal and oral vowels is maintained in nasal context; (iii) velum height targets for nasal and oral segments show some overlap, especially in sequences of nasal consonant + oral vowels or liquids; and (iv) nasal vowels have a relatively longer duration which is preserved under rapid speech rate.
Poster I-6 Voicing assimilation in journalistic speech
Pierre André Hallé, LPP/CNRS 19 rue des Bernardins, 75005 Paris
Martine Adda-Decker, LIMSI/CNRS, bat. 508 F-91403 Orsay cedex
  We used a corpus of radio and television speech to run a quantitative study of voicing assimilation in French. The results suggest that, although voicing may be gradient rather than all-or-none, voicing assimilation is essentially categorical. The amount of voicing assimilation depends little on underlying voicing but clearly varies with speech rate and also with consonant manner of articulation. The results also suggest that voicing assimilation, though largely regressive, is not purely unidirectional.
Poster I-8 Articulatory optimisation in perturbed vowel articulation
Jana Brunner, Humboldt-Universität zu Berlin, ICP-Gipsa-lab, INP Grenoble and ZAS Berlin
Phil Hoole, Institut für Phonetik und Sprachliche Kommunikation der Ludwig-Maximilians-Universität München
Pascal Perrier, ICP, Gipsa-lab, CNRS, INP Grenoble & Université Stendhal
  A two-week perturbation EMA experiment was carried out with palatal prostheses. Articulatory effort for five speakers was assessed by means of peak acceleration and jerk during the tongue tip gestures from /t/ towards /i, e, o, y, u/. After a period of no change, speakers showed an increase in these values. Towards the end of the experiment the values decreased. The results are interpreted as three phases of carrying out changes in the internal model. At first, the complete production system is shifted in relation to the palatal change. Afterwards, speakers explore different production mechanisms, which involve more articulatory effort. This second phase can be seen as a training of the internal model during which input-output pairs are tested with respect to their articulatory effort. In the third phase speakers start to select an optimal movement strategy to produce the sounds, so that the values decrease.
Shigeaki Amano, NTT Communication Science Laboratories, NTT Corporation
Ryoko Mugitani, NTT Communication Science Laboratories, NTT Corporation
Tessei Kobayashi, NTT Communication Science Laboratories, NTT Corporation
  Hirata and Whiton [2005] revealed that the production boundary between a single and a geminate stop in Japanese is invariant over speaking rates in terms of the ratio of stop closure duration to word duration (closure-word ratio). This study addressed the question of whether the ratio is also invariant for the perceptual boundary. An experiment was conducted to obtain the perceptual boundary between a single and a geminate stop at slow, normal and fast speaking rates. The results showed that the closure-word ratio at the perceptual boundary did not coincide with that at its production boundary. However, the closure-word ratio was consistent within each stimulus item for all speaking rates, although it was different among the stimulus items. The results suggest that the closure-word ratio at the perceptual boundary is invariant over speaking rates within an item, but some item-related factors affect it.
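The closure-word ratio named in the abstract above is simply the stop closure duration divided by the word duration. The following is a minimal sketch of that computation; the function name and all duration values are hypothetical and chosen only to illustrate a ratio that stays constant across speaking rates, not taken from the study.

```python
def closure_word_ratio(closure_ms: float, word_ms: float) -> float:
    """Ratio of stop closure duration to whole-word duration."""
    if word_ms <= 0:
        raise ValueError("word duration must be positive")
    return closure_ms / word_ms


# Hypothetical (closure, word) durations in ms for one item at three rates.
tokens = {
    "slow": (240.0, 800.0),
    "normal": (150.0, 500.0),
    "fast": (105.0, 350.0),
}

# Absolute durations shrink with rate, but the ratio is the same for each.
ratios = {rate: closure_word_ratio(c, w) for rate, (c, w) in tokens.items()}
print(ratios)
```

Because the measure is relative, it can be compared across rates without first normalizing the absolute durations.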
Thomas Ulrich Christiansen, Ørsted*DTU, Technical University of Denmark
Steven Greenberg, Silicon Speech
  The spectro-temporal coding of Danish consonants was investigated using an information-theoretic approach. Listeners were asked to identify eleven different consonants spoken in a CV[l] syllable context. Each syllable was processed so that only a portion of the original audio spectrum was present. Narrow speech-bands, with center frequencies of 750 Hz, 1500 Hz and 3000 Hz, were presented individually and in combination with each other. The modulation spectrum of each band was low-pass filtered at 24, 12, 6 and 3 Hz. Confusion matrices of the consonant-identification data were computed. From these, the amount of information transmitted for each of three phonetic features (voicing, manner and place) was calculated for each condition. Such analyses indicate that (1) accurate, robust decoding of place-of-articulation information requires broadband cross-spectral integration, and (2) place-of-articulation information is most closely associated with the modulation spectrum above 12 Hz.
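One common way to obtain "information transmitted" from a confusion matrix is to compute the mutual information between stimulus and response, after collapsing the consonants into feature classes. The sketch below assumes that formulation; the abstract does not spell out the exact procedure, and the four consonants, their feature labels and all counts are invented for illustration.

```python
import math
from collections import defaultdict


def transmitted_information(confusions):
    """Mutual information in bits; confusions maps (stimulus, response) -> count."""
    total = sum(confusions.values())
    row = defaultdict(float)  # stimulus totals
    col = defaultdict(float)  # response totals
    for (s, r), n in confusions.items():
        row[s] += n
        col[r] += n
    bits = 0.0
    for (s, r), n in confusions.items():
        if n > 0:
            # p(s, r) * log2( p(s, r) / (p(s) * p(r)) )
            bits += (n / total) * math.log2(n * total / (row[s] * col[r]))
    return bits


def collapse(confusions, feature_of):
    """Merge consonant-level counts into feature-level counts."""
    merged = defaultdict(int)
    for (s, r), n in confusions.items():
        merged[(feature_of[s], feature_of[r])] += n
    return dict(merged)


# Hypothetical data: voicing is always heard correctly, place is at chance.
counts = {
    ("p", "p"): 5, ("p", "t"): 5, ("t", "p"): 5, ("t", "t"): 5,
    ("b", "b"): 5, ("b", "d"): 5, ("d", "b"): 5, ("d", "d"): 5,
}
voicing = {"p": "voiceless", "t": "voiceless", "b": "voiced", "d": "voiced"}
place = {"p": "labial", "b": "labial", "t": "alveolar", "d": "alveolar"}

voicing_bits = transmitted_information(collapse(counts, voicing))
place_bits = transmitted_information(collapse(counts, place))
print(voicing_bits, place_bits)
```

In this toy matrix the voicing feature carries its full one bit while place carries none, which is the kind of per-feature contrast such an analysis is meant to expose.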
Poster I-14 Prosodic conditioning of Portuguese subjects’ perception of vowel nasality
John Hajek, School of Languages and Linguistics, University of Melbourne
Ian Watson, Phonetics Laboratory and Christ Church, University of Oxford, Great Britain
  We examine the sensitivity of Portuguese subjects to a series of prosodic parameters previously shown to condition perception of vowel nasality, hypothesizing that the presence in Portuguese of long, strongly nasal vowels would (i) provoke lower nasality ratings than observed in English and French subjects and (ii) make these insensitive to prosodic parameters under investigation. The results confirm (i) but not (ii). Although there was some language-specificity in their responses, the subjects were sensitive to all the parameters in question, confirming their robustness.
Poster I-16 Comparing Human and Machine Vowel Classification
Uwe D. Reichel, Department of Phonetics and Speech Processing, University of Munich
Katalin Mády, Department of Phonetics and Speech Processing, University of Munich
  In this study we compare human ability to identify vowels with a machine learning approach. A perception experiment for 14 Hungarian vowels in isolation and embedded in a carrier word was conducted, and a C4.5 decision tree was trained on the same material. A comparison between the identification results of the subjects and the classifier showed that in three of four conditions (isolated vowel quantity and identity, embedded vowel identity) the performance of the classifier was superior and in one condition (embedded vowel quantity) equal to the subjects' performance. This outcome can be explained by perceptual limits of the subjects and by stimulus properties. The classifier's performance was significantly weakened by replacing the continuous spectral information with binary 3-Bark thresholds as proposed in the phonetic literature. Parts of the resulting decision trees can be interpreted phonetically, which could qualify this classifier as a tool for phonetic research.
Benjamin Munson, University of Minnesota
  Sex differences in vowel acoustics were found to be mediated by words' frequency of use and phonological neighborhood density. Larger sex differences in vowel-space expansion were found for words with high frequency of use and words with small phonological neighborhoods than for low-frequency and high-density words. Results suggest that talkers' production of social-indexical variants is constrained by the influence these might have on word recognition.
Poster I-20 Relationship between harmonic amplitudes and spectral zeros and glottal open quotient
Peter J. Murphy, University of Limerick
  An analysis of spectral details relating to the glottal flow waveform and its first derivative can be used to inform both formant and parametric synthesis strategies. Specifically, the current study presents a conceptual basis for the empirically known relationship between the difference in amplitude between the first and second harmonics (H1-H2) and open quotient (OQ). The position of the first spectral null and the pattern of spectral zeros are shown to contain information relevant to the duration of the open period. The analysis suggests conditions for optimum power output for specific pulse characteristics. These conditions may be important for improved naturalness of the resulting synthesized waveforms and may also be relevant to vocal performance issues.
Cyril Auran, Laboratoire Savoirs, Textes, Langage, UMR 8163 CNRS, Université Lille 3 - Charles de Gaulle
  This study is part of a wider project analyzing the roles of prosody and anaphora in discourse organization in English and French, and linking production and perception. The aim of this paper is twofold: it explores the interactions of prosody and anaphora in French discourse and their consequences in terms of cognitive processing cost for the hearer; these results are based on an indirect methodology which constitutes the second aspect of this work. More specifically, this study uses cross-modal semantic priming in French to test the hypothesis of an interplay between pronominal anaphora and the phonetic realization of intonation unit onsets.
Poster I-24 Visualizing Levels of Rhythmic Organization
Petra Wagner, Universität Bonn
  The paper presents a method to visualize the timing related levels of prosodic organization that have an influence on the rhythmic shape of an utterance. Timing relations can be characteristic of a language or a speaking style. The method is illustrated on various languages classified as stress timed or syllable timed, on a rhythmically unclassified language and L2 speech. The visualization method can be used to detect rhythmically relevant levels of organization within the prosodic hierarchy, e.g. whether rhythm manifests itself primarily on the level of prosodic feet, phrasal organization or reduction. Our method helps to identify language and speaking style related rhythmical preferences and can classify languages rhythmically. It is able to visualize subtle and large differences between stress timed and syllable timed languages and timing related performance problems of L2 speech.
Alex del Giudice, University of California, San Diego
Ryan K. Shosted, University of Illinois at Urbana-Champaign
Katherine Davidson, University of California, San Diego
Mohammad Salihie, University of California, San Diego
Amalia Arvaniti, University of California, San Diego
  The labeling of “elbows” in an F0 contour is considered an enterprise beset with difficulty due to the inability of humans to locate pitch elbows with accuracy, consistency and in a manner devoid of theoretical bias. This paper investigates the extent to which human labelers can agree with one another in locating elbows and how they fare by comparison to four algorithms. The results show that humans are more consistent than has been suggested and that the algorithm that best approximates their intuition is the least-squares fitting algorithm. The success of algorithmic elbow location, however, depends on the selection of the contour stretch in which the elbow is to be located; this selection is most consistent if performed by a theoretically informed human annotator, strongly suggesting that a completely a-theoretical annotation of F0 contours may be impossible to achieve, and ultimately undesirable.
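The abstract above does not describe how its least-squares fitting algorithm works. One common formulation, assumed here, fits two straight lines that meet at a candidate sample and picks the candidate with the smallest total squared error. The function names and the toy contour are hypothetical.

```python
def line_sse(xs, ys):
    """Sum of squared residuals of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def locate_elbow(times, f0):
    """Return the time of the sample that best splits the stretch into two lines."""
    if len(times) < 3:
        raise ValueError("need at least three samples")
    best_index, best_error = None, float("inf")
    # The candidate sample belongs to both segments, so each has >= 2 points.
    for k in range(1, len(times) - 1):
        error = line_sse(times[:k + 1], f0[:k + 1]) + line_sse(times[k:], f0[k:])
        if error < best_error:
            best_index, best_error = k, error
    return times[best_index]


# Toy contour: flat at 120 Hz, then rising by 10 Hz per 10 ms frame.
times_ms = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
f0_hz = [120, 120, 120, 120, 120, 130, 140, 150, 160, 170]
elbow_ms = locate_elbow(times_ms, f0_hz)
print(elbow_ms)
```

The result depends entirely on which stretch of the contour is passed in, which mirrors the abstract's point that selecting the stretch is itself a consequential annotation decision.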
Poster I-28 Perceptual evidence for direct acoustic correlates of stress in Spanish
Marta Ortega-Llebaria, University of Texas at Austin
Pilar Prieto, Universitat Autònoma de Barcelona and ICREA
Maria del Mar Vanrell, Universitat Autònoma de Barcelona
  This article provides evidence for the perception of the stress contrast in deaccented contexts in Spanish. Twenty participants were asked to identify oxytone words which varied orthogonally in two bi-dimensional paroxytone-oxytone continua: one of duration and spectral tilt, and the other of duration and overall intensity. Results indicate that duration and overall intensity were cues to stress, while spectral tilt was not. Moreover, stress detection depended on vowel type: the stress contrast was perceived more consistently in [a] than in [i]. Thus, in spite of lacking vowel reduction, stress in Spanish has its own phonetic material in the absence of pitch accents. However, we cannot speak of cues to stress in general since they depend on the characteristics of the vowel.
Ying Wai Wong, The Chinese University of Hong Kong
Yi Xu, University College London
  A systematic study of F0 perturbation by voiceless consonants in Cantonese is carried out. Apart from the voiceless interval that these consonants introduce, a production asymmetry is found: F0 contours are raised by prevocalic consonants but lowered by postvocalic consonants at the C-V and V-C transitions. Moreover, initial consonants are found to differ in the duration of the voiceless intervals they introduce. Based on the recent finding that F0 production is synchronized with the syllable, we demonstrate that such durational differences need to be taken into consideration before accurate measurement of F0 perturbations can be made.
Anne H. Fabricius, Roskilde University
  This paper examines formant data from a corpus of male speakers of RP born during the twentieth century. It compares average formant positions in the F1/F2 plane for the short vowels LOT and FOOT. The relative positions of the two vowels are represented by a single numerical value, the calculated angle from LOT to FOOT relative to the vertical. Changing angle values between the early and the later part of the twentieth century can be clearly seen in the data, reflecting a diachronic process of FOOT-fronting well documented in varieties of British English (Torgersen and Kerswill [9]), including RP (Hawkins and Midgley [5]). One aim of the paper is methodological, in that it demonstrates the versatility of an angle calculation method developed by Anon [1], used in combination with F1/F2 plots, in producing replicable quantified measures which demonstrate changing vowel juxtapositions in real time.
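The angle measure described above can be sketched with a single arctangent. The abstract does not give the exact conventions, so this sketch assumes a conventional vowel plot (F2 on the horizontal axis, F1 on the vertical axis with low F1 at the top), takes "vertical" to mean straight up from LOT, and uses raw, unnormalized formant means. The formant values are invented.

```python
import math


def angle_from_vertical(lot, foot):
    """Angle in degrees of the LOT-to-FOOT vector, measured from the vertical.

    Each vowel is an (F1, F2) pair of mean formant values.
    """
    d_f1 = lot[0] - foot[0]  # positive when FOOT sits higher on the plot (lower F1)
    d_f2 = foot[1] - lot[1]  # positive when FOOT is fronter (higher F2)
    return math.degrees(math.atan2(d_f2, d_f1))


# Hypothetical means: FOOT directly above LOT, then FOOT with a higher F2.
lot = (600.0, 1000.0)
earlier_angle = angle_from_vertical(lot, (400.0, 1000.0))
later_angle = angle_from_vertical(lot, (400.0, 1200.0))
print(earlier_angle, later_angle)
```

Under these assumptions a fronter FOOT yields a larger angle, so a rise in the value across birth cohorts would correspond to the FOOT-fronting the abstract describes.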
Irene Jacobi, Amsterdam Center for Language and Communication, University of Amsterdam
Louis Pols, Amsterdam Center for Language and Communication, University of Amsterdam
Jan Stroop, Amsterdam Center for Language and Communication, University of Amsterdam
  To judge the influence of speaker background on the quality of five Standard Dutch long vowels and diphthongs, the spectra of these vowel realizations in the spontaneous speech of 70 subjects were measured and analyzed with regard to the subjects’ age, sex, regions of education and residence, and their level of education and occupation. Besides the level of education/occupation, the factor ‘age group’ had a major effect on the variations in speech production. The vowel attributes ‘onset’ and ‘degree of diphthongization’ were affected differently. Highly educated speakers of the younger and middle-aged generation displayed systematic age patterns; lowly educated speakers and the older generation did not. A slight effect of region of residence was found for some females. An effect of sex was found for the higher educated speakers of the youngest age group. The vowel variations that were related to age reflected several pronunciation changes in progress.
Poster I-36 Language-specific production patterns in the first year of life
Izabelle Grenon, University of Victoria
Allison Benner, University of Victoria
John H. Esling, University of Victoria
  The production of sounds by infants from 1 to 12 months is evaluated according to place of articulation to verify the hypothesis that infants' production becomes language-specific towards the end of the first year. This study is based on an analysis of 4,499 sounds produced by 19 infants raised in one of 3 linguistic contexts: Canadian English, Moroccan Arabic, and Bai (a Tibeto-Burman language spoken in China). Our results reveal that towards the end of the first year (10-12 months), infants show a preference for producing sounds at places of articulation that reflect their linguistic background, a finding that parallels results obtained in perceptual studies. Contrary to our expectations, however, the infants' production at the end of the first year, albeit language-specific, does not directly correspond to the adult model.
Poster I-38 From Tone to Accent: the Tonal Transfer Strategy for Chinese L2 learners
Chen Yudong, University of Illinois at Urbana Champaign
  This paper investigates the acquisition of Spanish prosodic patterns by Chinese learners. Pitch plays different linguistic roles in Mandarin Chinese and in Spanish. In Chinese the tonal contour of individual syllables is lexically contrastive. In Spanish the tonal contours characterize utterances and convey pragmatic functions. Conversely, Spanish has lexically contrastive stress which serves as anchoring points for local pitch excursions. In this paper we find strong evidence for the hypothesis that Mandarin learners of Spanish interpret the contours of Spanish words in citation form as a lexical property of individual syllables. This interpretation leads these learners to employ contours with a tonal rise in the stressed syllable and a fall on the post-tonic syllable. For instance, a word with stress on the penultimate syllable is produced as having a rising tone on the penultimate syllable (=tone 2 in Chinese) and a falling tone on the final syllable (=tone 4).
Poster I-40 Acoustic realization of lexical accent and its effects on phrase intonation in English speakers' Japanese
Mariko Kondo, School of International Liberal Studies, Waseda University
  Acoustic manipulation of Japanese prosody by English speakers was investigated. The study examined how fluent Japanese speakers of English realize Japanese lexical accent in terms of mora duration and the fundamental frequency, and also whether they transfer acoustic features associated with English word stress to Japanese lexical accent. The experimental results showed that ‘more fluent’ speakers of Japanese used F0 to indicate lexical accent without increasing mora duration, whereas ‘less fluent’ speakers did not, and instead increased the duration of accented vowels while suppressing the F0 increase. The results also showed that the English speakers were unable to produce non-accented words and place an accent in a word, which triggers downstep. Moreover, they tended to place an accent in each word rather than using a phrase accent, which caused an overall impression of foreign accent despite a good control of speech rhythm.
Poster I-42 The relative contributions of intonation and duration to intelligibility in Norwegian as a second language
Snefrid Holm, Norwegian University of Science and Technology
  This paper describes an experiment designed to investigate the relative contributions of intonation and duration to the intelligibility of Norwegian as a second language (N2). Recordings of Norwegian sentences read by speakers of 7 different native languages (L1s) were used. The global intonation and the phoneme durations of each N2 utterance were manipulated so as to match a native Norwegian speaker’s productions of the same sentences. A perception experiment was carried out in which native Norwegian listeners wrote down what they perceived of each N2 sentence. Intonation manipulation is shown to enhance the N2 intelligibility for the English and German L1 groups. Duration manipulation is shown to enhance the N2 intelligibility for the French, Tamil and Persian L1 groups. For the English, German, Tamil and Russian L1 groups intonation contributes more to N2 intelligibility than duration. For the French speakers duration contributes more to N2 intelligibility than intonation.
Ineke Mennen, Queen Margaret University Edinburgh
Felix Schaeffler, Queen Margaret University Edinburgh
Gerard Docherty, Newcastle University
  This paper presents preliminary findings of a systematic comparison of various measures of pitch range for speakers of Southern Standard British English and Northern Standard German. The purpose of the study as a whole is to develop the methodology to allow comparisons of pitch range across languages and regional accents, and to determine how they correlate with listeners’ perceptual sensitivity to cross-language/accent differences. In this paper we report on how four measures of pitch range in read speech (text, sentences) compare across the languages. The results show that the measures of the difference between the 90th and 10th percentile, and +/- 2 standard deviations around the mean differentiate the groups of speakers in the direction predicted by the stereotypical beliefs described in the literature about German and English. These differences are most obvious in the read text and longer sentences and the effect disappears in sentences of short duration.
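Two of the pitch range measures named above, the span between the 90th and 10th percentiles and the span of +/- 2 standard deviations around the mean, can be computed directly from a list of F0 values. This is a minimal sketch; the use of Hz rather than semitones, the inclusive percentile method, the sample standard deviation and the F0 values are all assumptions, since the abstract does not specify them.

```python
import statistics


def pitch_range_measures(f0_hz):
    """Return two span measures (in Hz) for a list of voiced-frame F0 values."""
    # Nine cut points: index 0 is the 10th percentile, index 8 the 90th.
    deciles = statistics.quantiles(f0_hz, n=10, method="inclusive")
    percentile_span = deciles[8] - deciles[0]
    # Distance from (mean - 2 SD) to (mean + 2 SD), using the sample SD.
    sd_span = 4 * statistics.stdev(f0_hz)
    return {"p90_minus_p10": percentile_span, "four_sd_span": sd_span}


# Hypothetical F0 samples in Hz for one speaker.
f0 = [float(v) for v in range(100, 201, 10)]
measures = pitch_range_measures(f0)
print(measures)
```

The percentile span ignores the extreme 10% at each end and is therefore less sensitive to pitch-tracking errors than a raw maximum-minus-minimum range.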
Poster I-46 Consonant-labiovelar glide combinations in Spanish and Korean
Yunju Suh, SUNY at Stony Brook
  This paper investigates the acoustic properties of the combinations of a consonant and a labiovelar glide (Cw combinations), and shows that the universally favored and disfavored consonant places for Cw combinations exhibit the most and the least acoustic cues for C-Cw contrast, respectively. Spanish and Korean, different in how they phonetically implement the Cw combinations (one a consonant cluster and the other a labialized consonant), are used as subject languages.
Poster I-48 Motor Speech Disorders in Three Parkinsonian Syndromes: A Comparative Study
Heike Penner, Geriatrisches Zentrum Heidelberg
Maria Wolters, Centre for Speech Technology Research, University of Edinburgh
Nicholas Miller, School of Education, Communication and Language Sciences
  This paper presents results of an acoustic investigation of speech in progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and idiopathic Parkinson's disease (IPD). The study had two aims: (a) to provide a first acoustic description of the speech of people with PSP and MSA, and (b) to compare acoustic characteristics of the dysarthria associated with PSP and MSA with classic hypokinetic dysarthria. Four acoustic parameters (voice quality, pitch range, vowel space and rate in syllable repetition) were investigated in 17 patients with PSP and 9 patients with MSA and compared with data from a large-scale study of IPD patients. Participants with PSP and MSA performed significantly worse than the IPD group on Alternating Motion Rate tasks. In addition, the pitch range of PSP participants was restricted. We discuss the potential of these speech tasks for early differential diagnosis.
Poster I-50 Characterization of the Pathological Voices (Dysphonia) in the frequency space
Gilles Pouchoulin, Laboratoire Informatique d'Avignon (LIA)
Corinne Fredouille, Laboratoire Informatique d'Avignon (LIA)
Jean-François Bonastre, Laboratoire Informatique d'Avignon (LIA)
Alain Ghio, Laboratoire Parole et Langage (CNRS-LPL)
Joana Revis, Lab. Audio-Phonologie Expérimentale et Clinique (LAPEC)
  This paper addresses dysphonic voice assessment. It aims at characterizing dysphonia in the frequency domain. In this context, a GMM-based automatic classification system is coupled with a frequency subband architecture in order to investigate which frequency bands are relevant for dysphonia characterization. Across various experiments, the low frequencies (0-3000 Hz) tend to be more relevant for dysphonia discrimination than the higher frequencies.
Marion Coadou, Laboratoire Parole et Langage, Université de Provence
Abderrazak Rougab, Laboratoire Parole et Langage, Université de Provence
  This study is, to our knowledge, the first to compare the voice quality of several accents of the British Isles. Our hypothesis is that voice quality can vary according to the regional accent of the speaker. The Long Term Average Spectrum (LTAS) was measured for each of the 50 speakers. Then, in order to test our hypothesis, a Principal Component Analysis (PCA) was carried out to compare the spectra. The results showed that at least two accent groups could be isolated from the others. The spectra of the Belfast accent were particularly concentrated around the negative part of the first component. This can be explained by the fact that the Belfast accent is still strongly influenced by the Celtic languages spoken in the region.
Carlos Monzo, Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona, Spain
Francesc Alías, Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona, Spain
Ignasi Iriondo, Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona, Spain
Xavier Gonzalvo, Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona, Spain
Santiago Planet, Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona, Spain
  In this work, the capability of voice quality parameters to discriminate among different expressive speech styles is analyzed. To that effect, the data distribution of these parameters, directly measured from the acoustic speech signal, is used to train a Linear Discriminant Analysis that conducts an automatic classification. As a result, the most relevant voice quality patterns for discriminating expressive speech styles are obtained for a diphone and triphone Spanish speech corpus with five expressive speaking styles: neutral, happy, sad, sensual and aggressive.
Yi Xu, University College London, London
Suthathip Chuenwattanapranithi, King Mongkut's University of Technology Thonburi
  Human speech conveys emotions not only by words, but also by nonverbal acoustic cues. The hypothesis was tested that anger and joy can be conveyed in speech by displaying effort to sound larger or smaller, just as dominance and submission are expressed in animal communication. Human listeners perceived vowels synthesized with a statically lengthened vocal tract and lowered pitch as coming from a large person, but as coming from an angry person when the lengthening and lowering were dynamic. The opposite was true for perceiving small body size and joy. These results point to a “size code” shared by human and nonhuman communications.
Poster I-58 Mothers Are Less Efficient in Employing Prosodic Disambiguation in Child-Directed Speech than Non-Mothers: Is There a Trade-Off Between Affective and Linguistic Prosody?
Sonja Schaeffler, Queen Margaret University, Edinburgh
Vera Kempe, Stirling University
  This study examines prosodic disambiguation in child-directed (CD) speech. Twenty-four mothers addressed syntactically ambiguous sentences to their 2;0 to 3;8 year old child and to an adult confederate. Twenty-four non-mothers addressed an imaginary toddler and an imaginary adult. We found that only mothers increased pitch and produced the CD-typical pitch excursions when addressing their children. In contrast, non-mothers, but not mothers, used prosodic disambiguation in CD speech, which was corroborated by a forced choice test in which 48 listeners judged the intended meaning of each sentence. The results suggest that if speakers express genuine positive affect, they tend to emphasise affective prosody at the expense of linguistic prosody. In the case of CD speech, this communication strategy may be more effective as it serves to elicit the child’s attention.
Poster I-60 Speech and sign - it's all in the motion
Stina Ojala, Department of Information Technology, University of Turku
Olli Aaltonen, Department of Phonetics, University of Turku
  Speech research has shown that vowels are less categorical than consonants, but a similar correlation in sign, i.e. between handshapes and place of articulation, is not yet known. The handshapes seem similar to vowels: they are continuum-like and follow coarticulatory principles. Here categorization and discrimination of handshapes were studied from the perspective of vowel perception. According to the results, handshapes from the Finnish Sign Language handshape continuum transcribed as /G/-/X/ are perceived similarly to vowels varying systematically along a phonetic continuum. As in vowels, a phoneme boundary between signs can be found. In addition, there is a tendency for enhanced discrimination at the boundary zone. However, these results are typical of native signers only.
Poster I-62 Investigating HMMs as a parametric model for expressive speech synthesis in German
Sacha Krstulovic, DFKI GmbH
Anna Hunecke, DFKI GmbH
Marc Schröder, DFKI GmbH
  The paper investigates the potential of HMM based synthesis to support the parameterisation of expressive speech in German. First, we review the assets of HMMs in the perspective of previous works in speech modelling and speech transformation. It is shown that HMMs define a flexible parametric model of the speech acoustics. HMM-based synthesis has also supported cross-speaker and cross-speaking style transformations with a good level of perceptual quality, albeit in other languages than German and over a limited range of styles. To try these considerations in our research framework, we have therefore performed a preliminary application of HMM technology to the synthesis of excited football announcements in German. It is shown that a highly intelligible voice can be obtained, but that the rendering of the prosodic and voice quality correlates of excitement could benefit from some improvement in well identified areas.
Poster I-64 Automatic detection of foreign accent for automatic speech recognition
Katarina Bartkova, R&D France Telecom
Denis Jouvet, R&D France Telecom
  Recognition of foreign accented speech remains among the most difficult tasks in automatic speech recognition. It was observed that using models trained on foreign data together with native models improves the recognition for speakers with foreign accent. However, such an approach degrades the recognition performance on native speakers. In order to avoid such performance degradation, the degree of accent should be detected prior to the recognition process. In this paper an automatic method of detection of the degree of foreign accent is proposed and results are compared with accent labeling carried out by an expert phonetician. This made it possible to better target speakers having a heavy foreign accent, which allowed using the foreign accent dedicated model when necessary and thus improving recognition performance on non-native speech without major performance degradation on native speakers.
Poster I-66 Construction of perception stimuli with copy synthesis
Yves Laprie, LORIA
Anne Bonneau, LORIA
  A number of experiments in perception require the construction of speech-like stimuli whose acoustic content needs to be manipulated easily. Formant synthesis offers the possibility of editing all the parameters of speech. However, the construction of stimuli by hand is a very laborious task and therefore automatic tools are necessary. This paper describes two main extensions of a copy synthesis algorithm previously proposed. The first concerns formant tracking which relies on a concurrent curve strategy. The second is a pitch synchronous amplitude adjustment algorithm that enables the capture of fast varying amplitude transitions in consonants. In addition, the automatic determination of the source parameters through the computation of F0 and of the friction to voicing ratio enables the speech signals to be copied automatically. This copy synthesis is evaluated on sentences and V-Stop-V stimuli.
Maureen Stone, Dept of Biomedical Sciences and Orthodontics, University of Maryland Dental School, Baltimore, MD, USA
  Many applications require the production of intelligible speech from articulatory data. This paper outlines a research program (Ouisper: Oral Ultrasound synthetIc SPEech souRce) to synthesize speech from ultrasound acquisition of the tongue movement and video sequences of the lips. Video data is used to search in a multistream corpus associating images of the vocal tract and lips with the audio signal. The search is driven by the recognition of phone units using Hidden Markov Models trained on video sequences. Preliminary results support the feasibility of this approach.
Poster I-70 An update on phonetic symbols in Unicode
John Wells, Phonetics & Linguistics, UCL
  The problem of including phonetic symbols in popular computer applications such as word-processing, email, presentation graphics, and web pages has by now been largely, though not entirely, solved through the implementation of the Unicode standard. This paper traces the advances made in this field since the last ICPhS and assesses the current position. With the general availability of Unicode, the various unstandardized custom fonts that phoneticians previously used must now be treated as ‘legacy fonts’. A remaining issue is that of the input of special characters: but in this area, too, satisfactory solutions are now readily available.

