Yu, H.; Fingscheidt, T.
Elektronische Medien: Systemtheorie und Technik
in Proc. of ESSV 2009, Dresden, Germany, Sept. 2009.
Beamforming for car applications has gained much attention in the past. With upcoming wideband speech telephony, appropriate solutions operating at a sampling frequency of 16 kHz are required. While a number of proposals aim at quite idealistic conditions, our approach intentionally employs low-cost microphones, and it is optimized and tested with real multi-channel signals acquired using these sensors. Moreover, we assume the microphone array to be integrated into the head unit of the car. Although this is not an ideal location from a signal-to-noise ratio perspective, it is nevertheless very attractive, since no further wiring is necessary and manufacturers of radio navigation systems can offer compact and optimized solutions. To achieve the required level of directivity (and therefore noise reduction) in car noise, we exploit the a priori noise field coherence of diffuse noise. We propose an adaptive smoothing approach for post-filter estimation along with a new combination of the beamformer and the post-filter that is well suited to low-cost microphones. In addition, an intrusive instrumental evaluation methodology is introduced. We show that a significant level of noise attenuation can be achieved while the quality of the speech component is simultaneously improved compared to the state of the art.
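As a rough illustration of the "noise field coherence of diffuse noise" mentioned in the abstract above, the following sketch evaluates the textbook model for an ideal spherically isotropic noise field between two omnidirectional microphones, sin(x)/x with x = 2πfd/c. The microphone spacing, the speed of sound, and the function name are assumptions made for this example; the paper's actual array geometry and post-filter are not reproduced here.

```python
import math

def diffuse_coherence(freq_hz, mic_distance_m, speed_of_sound=343.0):
    """Coherence of an ideal diffuse (spherically isotropic) noise field
    between two omnidirectional microphones: sin(x)/x, x = 2*pi*f*d/c."""
    x = 2.0 * math.pi * freq_hz * mic_distance_m / speed_of_sound
    return 1.0 if x == 0.0 else math.sin(x) / x

# Hypothetical spacing of 5 cm, evaluated up to 8 kHz
# (the Nyquist frequency at 16 kHz sampling).
for f in (0, 500, 1000, 2000, 4000, 8000):
    print(f, round(diffuse_coherence(f, 0.05), 3))
```

The coherence is close to 1 at low frequencies, so closely spaced microphones see almost identical diffuse noise there, which is why post-filters that assume uncorrelated noise tend to struggle in that range.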
Fingscheidt, T.; Setiawan, P.; Höge, H.:
Signalverarbeitung für die Verkehrsinformationstechnik
(INTERSPEECH) 2009, pp. 2959-2962, Brighton, September 2009.
Balazs Fodor, David Scheler, Tim Fingscheidt
Signalverarbeitung für die Verkehrsinformationstechnik
4th Biennial Workshop on DSP for In-Vehicle Systems and Safety, Dallas, TX, USA, June 25-27, 2009
The obligation to press a push-to-speak button before issuing a voice command to a speech dialog system is not only inconvenient, it also leads to decreased recognition accuracy if the user starts speaking prematurely. In this paper, we investigate the performance of a so-called talk-and-push (TAP) system, which permits the user to begin an utterance within a certain time frame before or after pressing the button. This is achieved using a speech signal buffer in conjunction with an acoustic echo cancellation unit and a combined noise reduction and start-of-utterance detection. In comparison with a state-of-the-art system employing loudspeaker muting, the TAP system delivers significant improvements in the word error rate.
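The speech signal buffer described above can be pictured as a fixed-length ring buffer that always holds the most recent audio frames, so that speech uttered shortly before the button press is still available to the recognizer. This is a minimal sketch under that assumption; the class name `PreRollBuffer`, the frame representation, and the buffer length are hypothetical and not taken from the paper.

```python
from collections import deque

class PreRollBuffer:
    """Keeps the most recent `max_frames` audio frames (hypothetical helper)."""

    def __init__(self, max_frames):
        # A deque with maxlen silently drops the oldest frame when full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame):
        """Store one incoming frame, called continuously while idle."""
        self._frames.append(frame)

    def drain(self):
        """Return the buffered frames in time order and empty the buffer,
        e.g. when the push-to-speak button is pressed."""
        frames = list(self._frames)
        self._frames.clear()
        return frames

buf = PreRollBuffer(max_frames=3)
for frame_id in range(5):
    buf.push(frame_id)      # integers stand in for audio frames
print(buf.drain())          # the three most recent frames
```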
Balazs Fodor, David Scheler, Suhadi Suhadi, Tim Fingscheidt
Signalverarbeitung für die Verkehrsinformationstechnik
AES 36th International Conference, Dearborn, Michigan, USA, June 2-4, 2009
Speech dialog system users often issue their commands before or during push-to-speak (PTS) button use. This leads to degraded system performance already in the first turn. We propose a system called talk-and-push (TAP) that allows the user to start talking before or after pushing the PTS button, as is common when tapping on someone's shoulder. An acoustic echo cancellation optimized for in-car use reduces FM radio echoes, so that no muting of the FM radio signal is necessary. A notch filter to remove the beep, buffering of the speech signal, and an intelligent noise-robust voice activity detection that signals the start of utterance to the automatic speech recognizer are further core components of our proposed system. Significant word error rate improvements vs. the state of the art with muted FM radio signals are reported.
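To illustrate what a notch filter for beep removal might look like, the sketch below uses a standard second-order (biquad) notch with the widely used audio-EQ-cookbook coefficient formulas. The beep frequency of 1 kHz, the 16 kHz sampling rate, and the quality factor are assumptions chosen for the example; the abstract does not state the filter design actually used.

```python
import math

def notch_coefficients(f0_hz, fs_hz, q=10.0):
    """Biquad notch coefficients (normalized so that a[0] == 1)."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(x, b, a):
    """Direct-form I filtering of the sample sequence x."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 16000                               # assumed sampling rate
b, a = notch_coefficients(1000.0, fs)    # assumed beep frequency
beep = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
tone = [math.sin(2.0 * math.pi * 3000.0 * n / fs) for n in range(fs)]
# Compare the steady-state level of the beep and of a tone outside the notch.
print(rms(biquad(beep, b, a)[-1000:]), rms(biquad(tone, b, a)[-1000:]))
```

A narrow notch (high `q`) removes the beep while leaving most of the speech spectrum untouched, at the cost of a longer transient after the beep starts.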
C. Voges, V. Märgner, T. Fingscheidt
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of IS&T Archiving Conference, Arlington, VA, U.S.A., May 2009.
C. Voges, M. Siekmann, T. Fingscheidt
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of IS&T Archiving Conference, Arlington, VA, U.S.A., May 2009.
Suhadi, S.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of ITG-Fachtagung "Sprachkommunikation", Aachen, Germany, Oct. 2008, VDE-Verlag.
In our previous publication, we proposed a data-driven speech enhancement with so-called ideal gain averaging (IGA) weighting rules to estimate the clean speech spectra. Being implemented as a table look-up, the subband-individual weighting rules were trained separately for speech presence and speech absence by taking the average of all ideal gains computed from clean speech and noise training signals recorded in the environment of interest. In this contribution we present a new training methodology selecting appropriate ideal gains to compute the final IGA weighting rules for speech presence and speech absence. This selection of ideal gains effectively reduces the bias of the weighting rules under medium and low SNR conditions, which occurs due to the imperfect voice activity detection (VAD) computation. Compared to our previous publication, the proposed training methodology yields an improvement in terms of speech preservation and noise attenuation.
Steinert, K.; Schönle, M.; Beaugeant, C.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of ITG-Fachtagung "Sprachkommunikation", Aachen, Germany, Oct. 2008, VDE-Verlag.
Steinert, K.; Suhadi, S.; Fingscheidt, T.; Schoenle, M.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of IWAENC'08, Seattle, Washington, USA, Sept. 2008.
An important parameter in quality assessment of speech enhancement systems is speech distortion, measured in terms of quality of the speech component. In fact, in the context of noise reduction, the user tends to prefer a certain degree of residual noise over distorted speech with suppressed background noise. The challenge of instrumental speech component quality evaluation lies, among others, in the mere availability of the enhanced output signal mixture rather than its speech portion. In this paper we present a method to extract the speech component from the enhanced output signal with high accuracy, given the input signal components speech, noise, and echo. We apply this method to a black box speech component quality comparison of two speech enhancement systems and report on instrumental and subjective tests with focus on double-talk.
Bauer, P.; Fingscheidt, T.; Lieb, M.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of ESSV 2008, Frankfurt a.M., Germany, Sept. 2008.
Artificial bandwidth extension (ABWE) aims to improve the quality and intelligibility of narrowband speech signals. However, today's state-of-the-art techniques still perform rather poorly for some speech sounds. Evidently, some phonemes are more critical than others. This might be related to an irregular energy allocation in the spectral domain: when most of the energy is located at high frequencies, a definite phonetic identification based on the narrowband signal is rather difficult. Misclassification might be the consequence, producing lisping effects that form a serious obstacle to the acceptance of ABWE. Our paper therefore focuses on this problem. A phonetic analysis identifies the critical phonemes; whether they are misclassified at all is then investigated in several classification experiments. For this purpose, a redesign of the vector quantizer (VQ) codebook has been developed. It follows a phoneme-based approach whereby each codebook class directly represents a specific phoneme. Classification experiments are carried out for both narrowband and wideband speech phonemes in order to evaluate the influence of the acoustic bandwidth. It turns out that the most critical phonemes, in particular the fricatives /s/ and /z/, are classified relatively well by means of the narrowband speech. However, they do not seem to be represented adequately by the codebook representatives for the spectral reconstruction of the upper frequency band.
Voges, C.; Märgner, V.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of IS&T Archiving Conference, Bern, Switzerland, June 2008.
Fingscheidt, T.; Suhadi, S.; Stan, S.:
Signalverarbeitung für die Verkehrsinformationstechnik
in IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 825-834, May 2008.
In this paper, we present a training-based approach to speech enhancement that exploits the spectral statistical characteristics of clean speech and noise in a specific environment. In contrast to many state-of-the-art approaches, we do not model the probability density function (pdf) of the clean speech and the noise spectra. Instead, subband-individual weighting rules for noisy speech spectral amplitudes are separately trained for speech presence and speech absence from noise recordings in the environment of interest. Weighting rules for a variety of cost functions are given; they are parameterized and stored as a table look-up. The speech enhancement system simply works by computing the weighting rules from the table look-up indexed by the a posteriori signal-to-noise ratio (SNR) and the a priori SNR for each subband computed on a Bark scale. Optimized for an automotive environment, our approach outperforms known environment-independent speech enhancement techniques, namely the a priori SNR-driven Wiener filter and the minimum mean square error (MMSE) log-spectral amplitude estimator, both in terms of speech distortion and noise attenuation.
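The table look-up described above can be sketched as a two-dimensional array of gains indexed by quantized a posteriori and a priori SNR values. In the sketch below, the SNR grid, the function names, and the table contents are assumptions: since the trained weighting rules are not given in the abstract, the table is filled with the well-known Wiener gain ξ/(1+ξ) purely as a placeholder, and a real system would hold one such table per subband.

```python
SNR_MIN_DB, SNR_MAX_DB, STEP_DB = -20, 30, 1   # hypothetical SNR grid

def wiener_gain(prio_snr_db):
    """A priori SNR-driven Wiener gain xi / (1 + xi), used as placeholder."""
    xi = 10.0 ** (prio_snr_db / 10.0)
    return xi / (1.0 + xi)

def snr_index(snr_db):
    """Clip an SNR value to the grid and map it to a table index."""
    clipped = min(max(snr_db, SNR_MIN_DB), SNR_MAX_DB)
    return int(round((clipped - SNR_MIN_DB) / STEP_DB))

GRID_SIZE = snr_index(SNR_MAX_DB) + 1

# table[a posteriori index][a priori index]; a trained table would vary
# along both axes, the placeholder only depends on the a priori SNR.
table = [
    [wiener_gain(SNR_MIN_DB + j * STEP_DB) for j in range(GRID_SIZE)]
    for _ in range(GRID_SIZE)
]

def lookup_gain(gain_table, post_snr_db, prio_snr_db):
    """Gain to apply to the noisy spectral amplitude of one subband."""
    return gain_table[snr_index(post_snr_db)][snr_index(prio_snr_db)]

print(lookup_gain(table, post_snr_db=5.0, prio_snr_db=0.0))
```

Storing the rule as a table keeps the run-time cost to one index computation and one memory access per subband, regardless of how the gains were obtained in training.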
Steinert, K.; Schönle, M.; Beaugeant, C.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in ICASSP'08, Las Vegas, Nevada, USA, Apr. 2008.
Fingscheidt, T.; Martin, R.; Heute, U.; Antweiler, C.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Advances in Digital Speech Transmission, Eds., pp. 281-310, John Wiley & Sons, Ltd, West Sussex, England, 2008.
Bauer, P.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of ICASSP, Las Vegas, Nevada, USA, Apr. 2008.
Artificial bandwidth extension techniques can be employed in mobile terminals to improve the quality of the far-end speaker's signal at the receiver. To accomplish this, statistical models are usually trained, requiring wideband speech material from a language that is expected to be used in the conversation. In practice, however, the language of a certain phone conversation is not known to the user equipment. Therefore we investigated the performance of an HMM-based multilingually trained artificial bandwidth extension on speech signals of which the language was unseen in training. The cross-language training and test turned out to cause only minor degradations compared to the use of monolingually trained acoustic models of the language used in test. Our findings indicate that artificial bandwidth extension can be efficiently trained with multilingual speech data without significant losses in speech quality.
Fingscheidt, T.; Suhadi, S.; Steinert, K.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of ICASSP'08, Las Vegas, Nevada, USA, Apr. 2008.
Quality assessment of speech enhancement systems has to deal with aspects such as distortion of the near-end talker's speech, and with the attenuation and distortion of the noise and the echo in different test cases. We propose first steps in the direction of a new black box objective quality assessment of speech enhancement schemes, based on our previous work on decomposition of the (enhanced) speech signal into its components speech, (residual) noise, and (residual) echo. Having these signals available, to our knowledge, for the first time a black box objective quality assessment of an entire speech enhancement system is proposed allowing for simultaneous measurement of, e.g., noise attenuation, echo return loss enhancement (ERLE), and perceptual evaluation of speech quality (PESQ) of the speech component in a wide range of test scenarios including double-talk. The derived scheme proves to be very useful for testing hands-free devices in practice but also for objective evaluation of sophisticated algorithms in science.
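Once the enhanced signal has been decomposed into its components, measures such as noise attenuation and ERLE reduce to power ratios between a component before and after processing. The following sketch shows that computation in its simplest long-term form; the function names and the toy signals are assumptions for illustration, and the decomposition itself as well as PESQ are outside the scope of this example.

```python
import math

def power(x):
    """Mean power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def level_reduction_db(before, after, eps=1e-12):
    """Power ratio in dB between an unprocessed component and its
    processed (residual) counterpart; eps guards against log(0)."""
    return 10.0 * math.log10((power(before) + eps) / (power(after) + eps))

# Toy signals: the residual echo is the echo scaled by 0.1.
echo = [1.0, -1.0] * 100
residual_echo = [0.1 * v for v in echo]

erle_db = level_reduction_db(echo, residual_echo)   # ERLE of the echo path
print(round(erle_db, 2))
```

The same function applied to the noise component and the residual noise gives a noise attenuation figure, which is what makes the component-wise decomposition convenient for black box testing.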
Bauer, P.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of DAGA 2008, Dresden, Germany, Mar. 2008.
Artificial bandwidth extension techniques can be employed in mobile terminals to improve the intelligibility and quality of the far-end speaker's speech signal at the receiver. To accomplish this, statistical models are usually trained, requiring wideband speech material from the conversational partner, or at least from the language that is expected to be used in the conversation. In practice, however, both the speaker and the language of a certain phone conversation are not known to the user equipment. Therefore we investigated the performance of an HMM-based multilingually trained artificial bandwidth extension on speech signals of which the speaker and language were unseen in training. The cross-language training and test turned out to cause only minor degradations compared to the use of monolingually trained acoustic models of the language used in test. The experimental results further showed that both of these speaker-independent methods could even keep up with the speaker-dependent technique to a large extent. Our findings indicate that artificial bandwidth extension can be efficiently trained with speaker- and language-independent speech data without significant losses in speech intelligibility and quality.
Steinert, K.; Suhadi, S.; Schönle, M.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
in Proc. of DAGA'08 (invited talk), Dresden, Germany, Mar. 2008.
Speech quality evaluation of hands-free terminals is a complex task. Several aspects have to be taken into account such as the various conversational situations and a possibly nonlinear and time-variant system behavior. The lack of access to the internal signal processing of black-box systems complicates a separate assessment of the processed clean speech, echo, and noise. In this paper we present an objective evaluation of the performance of two hands-free systems in terms of echo attenuation and speech distortion during double-talk. Based on an earlier published signal separation method, we consider the processed echo and the processed clean speech relative to the respective unprocessed signal individually. Our findings are compared with the results of a subjective listening test.
Bauer, P.; Fingscheidt, T.:
Signalverarbeitung für die Verkehrsinformationstechnik
Contribution to the CeBit issue of ntz, vol. 2008, no. 2, 2008.
Telephony providers are preparing the introduction of wideband telephony, which offers considerably better speech quality. When interworking with narrowband terminals at the far end, an artificial extension of the speech bandwidth can noticeably improve the quality at the wideband terminal.
Hindelang, T.; Adrat, M.; Fingscheidt, T.; Heinen, S.:
Signalverarbeitung für die Verkehrsinformationstechnik
in European Transactions on Telecommunications, vol. 18, no. 8, pp. 851-858, Dec. 2007.