2012, Issue (2): 358-375. doi: 10.1007/s11390-012-1228-x


Term-Dependent Confidence Normalisation for Out-of-Vocabulary Spoken Term Detection

Dong Wang1,3, Member, IEEE, Javier Tejedor1,2, Simon King1, Senior Member, IEEE and Joe Frankel1, Member, IEEE   

  1. Centre for Speech Technology Research, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9LW, U.K.;
  2. Human Computer Technology Laboratory (HCTLab), School of Computer Engineering and Telecommunication, University Autonomous of Madrid, Avenue Francisco Tomás y Valiente 11, 28049, Madrid, Spain;
  3. Nuance Communications, 1 Wayside Road, Burlington, MA 01803, U.S.A.
  • Received: 2011-03-23 Revised: 2011-12-01 Online: 2012-03-05 Published: 2012-03-05

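Estimating the confidence of hypothesised detections is a key component of spoken term detection (STD): the system relies on these confidence scores to decide whether a detection is reliable. Among the various confidence estimates, those based on speech recognition lattices are the most widely used. This approach takes the posterior probability of a detected term in the lattice as its confidence. Because it is theoretically sound and simple to compute, it is used in most detection systems. However, it has an obvious shortcoming: all detected terms are treated equally, regardless of their phonetic and linguistic differences. For example, some terms occur very frequently while others are relatively rare, yet lattice-based confidence estimation ignores this factor and applies the confidence of all terms to the detection decision without distinction. The problem is especially pronounced for out-of-vocabulary (OOV) terms, which tend to show more marked differences between terms. To address it, this paper proposes a term-dependent confidence normalisation technique. We first propose an evaluation-metric-oriented normalisation, which takes into account the properties of the metric used to evaluate detection performance (for STD, the Average Term Weighted Value, ATWV) and thereby compensates for differences in occurrence frequency among terms. Further study shows that this metric-oriented normalisation requires unbiased term posterior probabilities, whereas the lattice-based confidence is not exactly equal to the term posterior probability but is biased. The bias arises partly from the vocabulary and language-model constraints of the speech recogniser and partly from pruning during lattice generation, and it makes the metric-oriented normalisation inaccurate. To solve this, we first introduce a linear compensation, which applies a linear transform to the lattice-based confidence so that it is closer to the true term posterior probability. A further extension is non-linear compensation based on a discriminative model. This uses a non-linear discriminative model (such as a multi-layer perceptron, MLP) to map the lattice-based confidence to a term posterior probability, while incorporating various term-dependent features. Experiments were conducted on an English multi-party meeting corpus. The results show that the proposed techniques significantly improve STD performance, with particularly clear gains for phoneme-based systems and OOV term detection. The proposed techniques effectively improve the capability of term detection systems and provide a basis for extending STD applications; they are especially advantageous for large-scale systems with unrestricted vocabulary.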
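As an illustration of how an ATWV-oriented decision can compensate for term frequency, the sketch below computes a term-specific acceptance threshold in the style of the Term Specific Threshold (TST) approach. It is not the paper's own normalisation formula, which this page does not give. It assumes that each confidence score is an unbiased posterior, that the expected number of true occurrences of a term is the sum of its posteriors, and that the ATWV false-alarm weight is beta = 999.9 as in the NIST STD 2006 evaluation plan. The function names and the example data are hypothetical.

```python
# Sketch of an ATWV-driven, term-specific decision threshold (TST style).
# Assumptions (not taken from this page): scores are unbiased posteriors,
# N_true is estimated as the sum of posteriors, beta = 999.9 (NIST STD 2006).

BETA = 999.9


def term_specific_threshold(posteriors, speech_seconds, beta=BETA):
    """Posterior above which accepting a detection raises expected ATWV.

    A hit on a term is worth 1 / N_true and a false alarm costs
    beta / (T - N_true), so a detection with posterior p is accepted when
    p / N_true > (1 - p) * beta / (T - N_true), which rearranges to
    p > beta * N_true / (T + (beta - 1) * N_true).
    """
    n_true = sum(posteriors)  # expected number of true occurrences
    return beta * n_true / (speech_seconds + (beta - 1.0) * n_true)


def decide(detections, speech_seconds, beta=BETA):
    """Label each (term, posterior) pair with an accept/reject decision."""
    by_term = {}
    for term, p in detections:
        by_term.setdefault(term, []).append(p)
    thresholds = {
        term: term_specific_threshold(ps, speech_seconds, beta)
        for term, ps in by_term.items()
    }
    return [(term, p, p > thresholds[term]) for term, p in detections]


if __name__ == "__main__":
    # Hypothetical detections from one hour of speech.
    example = [("meeting", 0.9), ("meeting", 0.8), ("meeting", 0.7),
               ("edinburgh", 0.3)]
    for term, p, accepted in decide(example, speech_seconds=3600.0):
        print(term, p, accepted)
```

Under these assumptions a frequent term receives a higher threshold than a rare one, so a uniform threshold on raw lattice posteriors would tend to reject rare terms too readily.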

Abstract: An important component of a spoken term detection (STD) system involves estimating confidence measures of hypothesised detections. A potential problem of the widely used lattice-based confidence estimation, however, is that the confidence scores are treated uniformly for all search terms, regardless of how much they may differ in terms of phonetic or linguistic properties. This problem is particularly evident for out-of-vocabulary (OOV) terms, which tend to exhibit high intra-term diversity. To address the impact of term diversity on confidence measures, we propose in this work a term-dependent normalisation technique which compensates for term diversity in confidence estimation. We first derive an evaluation-metric-oriented normalisation that optimises the evaluation metric by compensating for the diverse occurrence rates among terms, and then propose a linear bias compensation and a discriminative compensation to deal with the bias problem that is inherent in lattice-based confidence measurement and from which the Term Specific Threshold (TST) approach suffers. We tested the proposed technique on speech data from the multi-party meeting domain with two state-of-the-art STD systems, based on phonemes and words respectively. The experimental results demonstrate that the confidence normalisation approach leads to a significant performance improvement in STD, particularly for OOV terms with phoneme-based systems.
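The linear bias compensation mentioned above can be pictured as mapping a lattice-based confidence c to a*c + b so that it sits closer to the true detection posterior. The abstract does not say how the two parameters are estimated. The sketch below assumes an ordinary least-squares fit against hit/false-alarm labels on a development set, with the result clipped to [0, 1]. The function names are hypothetical and the fitting procedure is an assumption, not the authors' method.

```python
# Sketch of a linear bias compensation for lattice-based confidence.
# Assumption (not stated on this page): a and b are fitted by ordinary
# least squares on labelled development detections (1 = hit, 0 = false alarm).


def fit_linear_compensation(confidences, labels):
    """Return (a, b) minimising sum((a * c + b - y) ** 2) over the data."""
    n = len(confidences)
    if n == 0 or n != len(labels):
        raise ValueError("need equally sized, non-empty inputs")
    mean_c = sum(confidences) / n
    mean_y = sum(labels) / n
    var_c = sum((c - mean_c) ** 2 for c in confidences)
    if var_c == 0.0:
        raise ValueError("confidences must not all be identical")
    cov = sum((c - mean_c) * (y - mean_y)
              for c, y in zip(confidences, labels))
    a = cov / var_c
    b = mean_y - a * mean_c
    return a, b


def compensate(confidence, a, b):
    """Apply the linear transform and clip so the result stays in [0, 1]."""
    return min(1.0, max(0.0, a * confidence + b))


if __name__ == "__main__":
    # Hypothetical development data: lattice confidences and hit labels.
    dev_conf = [0.2, 0.4, 0.6, 0.8]
    dev_hit = [0, 0, 1, 1]
    a, b = fit_linear_compensation(dev_conf, dev_hit)
    print([compensate(c, a, b) for c in dev_conf])
```

A discriminative compensation would replace the linear map with a trained non-linear model that also takes term-dependent features as input. That step is left out here because the feature set is not described on this page.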

Copyright © Editorial Office of Journal of Computer Science and Technology