Tone Recognition of Continuous Speech of Standard Chinese Using Neural Network and Tone Nucleus Mode

Abstract

In this paper, we investigate the effectiveness of articulatory information for Mandarin tone modeling and recognition in a deep neural network – hidden Markov model (DNN-HMM) framework. Whereas conventional approaches build tone classifiers from prosodic evidence (e.g., F0, duration and energy), we propose performance enhancement techniques in three areas: (i) adding articulatory features (AFs) and acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), for tone modeling; (ii) adopting phone-dependent tone modeling; and (iii) using a tone-based extended recognition network (ERN) to reduce the tone search space. The first technique is feature-related: it explicitly employs the AFs as a form of tonal features and is implemented through a multi-stage procedure. The second technique is model-related: it extends tone modeling to phone-dependent units, so that each modeling unit (e.g., a tonal phone) carries not only tone information but also phone and articulatory information. The third technique is search-related: it uses a phone-dependent, tone-based extended recognition network as the search network. A series of comprehensive experiments is conducted using different input feature sets. The results demonstrate that (i) incorporating articulatory information boosts tone recognition accuracy, and (ii) the ERN attains the lowest tone error rate of 7.17%, a 56% relative error reduction from the prosody-only baseline error rate of 16.36%.
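The relative error reduction quoted in the abstract can be checked directly from the two tone error rates it reports. The short sketch below is only an arithmetic illustration; the function and variable names are hypothetical and do not come from the paper.

```python
def relative_error_reduction(baseline_ter: float, improved_ter: float) -> float:
    """Return the fraction of the baseline error removed by the improved system."""
    if baseline_ter <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_ter - improved_ter) / baseline_ter


# Tone error rates (in percent) as stated in the abstract.
baseline_ter = 16.36  # prosody-only baseline system
ern_ter = 7.17        # system using the tone-based ERN

rer = relative_error_reduction(baseline_ter, ern_ter)
print(f"relative error reduction: {rer:.1%}")  # prints "relative error reduction: 56.2%"
```

The result, about 56.2%, agrees with the rounded figure of 56% given in the abstract.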

Acknowledgments

This work is supported by the Beijing Wutong Innovation Platform of Beijing Language and Culture University (16PT05) and the BLCU support project for young researchers program (16YCX163) (Special Funds of Basic Research Costs for the National University). The second author was partially supported by a grant from the China Scholarship Council.

Author information

Authors and Affiliations

College of Information Sciences, Beijing Language and Culture University, Beijing, China
Ju Lin, Yingming Gao, Yanlu Xie & Jinsong Zhang

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Wei Li, Sabato Marco Siniscalchi & Chin-Hui Lee

Institute for Infocomm Research, Singapore, Singapore
Nancy F. Chen

Department of Telematics, Kore University of Enna, Enna, Italy
Sabato Marco Siniscalchi

Corresponding author

Correspondence to Jinsong Zhang.

About this article

Cite this article

Lin, J., Li, W., Gao, Y. et al. Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features Using Extended Recognition Networks. J Sign Process Syst 90, 1077–1087 (2018). https://doi.org/10.1007/s11265-018-1334-2

Received: 09 February 2017
Revised: 07 September 2017
Accepted: 24 January 2018
Published: 08 February 2018
Issue Date: July 2018
DOI : https://doi.org/10.1007/s11265-018-1334-2

Keywords

Articulatory features
MFCC
Posterior probabilities
Deep neural network
Mandarin tone recognition
Tone-based extended recognition network

Source: https://link.springer.com/article/10.1007/s11265-018-1334-2
