Tone Recognition of Continuous Speech of Standard Chinese Using Neural Network and Tone Nucleus Mode

Abstract

In this paper, we investigate the effectiveness of articulatory information for Mandarin tone modeling and recognition in a deep neural network-hidden Markov model (DNN-HMM) framework. Conventional approaches build tone classifiers from prosodic evidence (e.g., F0, duration and energy). Here we propose performance enhancement techniques in three areas: (i) adding articulatory features (AFs) and acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), for tone modeling; (ii) adopting phone-dependent tone modeling; and (iii) using a tone-based extended recognition network (ERN) to reduce the tone search space. The first technique is feature-related: it explicitly employs the AFs as a form of tonal features and is implemented through a multi-stage procedure. The second is model-related and extends directly to phone-dependent tone modeling, so that each modeling unit (e.g., a tonal phone) not only contains tone information but also integrates phone/articulatory information. The third is search-related and uses a phone-dependent, tone-based expanded search network. A series of comprehensive experiments is conducted using different input feature sets. The results demonstrate that (i) tone recognition accuracy is boosted by incorporating articulatory information, and (ii) the ERN attains the lowest tone error rate of 7.17%, a 56% relative error reduction from the prosody-only baseline system error of 16.36%.
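
The relative error reduction quoted in the abstract can be checked directly from the two reported tone error rates. The snippet below is a minimal sketch for that arithmetic only; it is not part of the paper, and the function name `relative_error_reduction` is an illustrative assumption.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative reduction (as a fraction) from baseline_error to new_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Tone error rates (in percent) as reported in the abstract.
baseline_ter = 16.36  # prosody-only baseline system
ern_ter = 7.17        # system using the tone-based ERN

reduction = relative_error_reduction(baseline_ter, ern_ter)
print(f"Relative error reduction: {reduction:.1%}")  # prints "Relative error reduction: 56.2%"
```

The result, (16.36 - 7.17) / 16.36 ≈ 0.562, agrees with the 56% figure stated in the abstract.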


Acknowledgments

This work is supported by Beijing Wutong Innovation Platform of Beijing Language and Culture University (16PT05) and BLCU support project for young researchers program (16YCX163) (Special Funds of Basic Research Costs for the National University). The second author was partially supported by a grant from the China Scholarship Council.

Author information

Corresponding author

Correspondence to Jinsong Zhang.

About this article

Cite this article

Lin, J., Li, W., Gao, Y. et al. Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features Using Extended Recognition Networks. J Sign Process Syst 90, 1077–1087 (2018). https://doi.org/10.1007/s11265-018-1334-2

  • DOI : https://doi.org/10.1007/s11265-018-1334-2

Keywords

  • Articulatory features
  • MFCC
  • Posterior probabilities
  • Deep neural network
  • Mandarin tone recognition
  • Tone-based extended recognition network

Source: https://link.springer.com/article/10.1007/s11265-018-1334-2
