2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information



Paper Detail

Paper ID: SPE-39.6
Paper Title: HYPOTHESIS STITCHER FOR END-TO-END SPEAKER-ATTRIBUTED ASR ON LONG-FORM MULTI-TALKER RECORDINGS
Authors: Xuankai Chang, Johns Hopkins University, United States; Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka, Microsoft Corporation, United States
Session: SPE-39: Speech Recognition 13: Acoustic Modeling 1
Location: Gather.Town
Session Time: Thursday, 10 June, 15:30 - 16:15
Presentation Time: Thursday, 10 June, 15:30 - 16:15
Presentation: Poster
Topic: Speech Processing: [SPE-GASR] General Topics in Speech Recognition
IEEE Xplore Open Preview: available in IEEE Xplore
Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification. The E2E SA-ASR model has shown significant improvements in speaker-attributed word error rate (SA-WER) for monaural overlapped speech consisting of various numbers of speakers. However, E2E models are known to suffer from degradation under training/testing condition mismatches. In particular, it has not yet been investigated whether the E2E SA-ASR model works well for very long recordings, i.e., recordings longer than those in the training data. In this paper, we first explore E2E SA-ASR for long-form multi-talker recordings while investigating a known decoding algorithm for long-form audio in single-speaker ASR. We then propose a novel method, called hypothesis stitcher, that takes multiple hypotheses from short-segmented audio and outputs a single fused hypothesis. We propose several variants of model architectures for the hypothesis stitcher and evaluate them against conventional decoding methods. In our evaluation with the LibriSpeech and LibriCSS corpora, we show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
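The core idea described in the abstract, decoding a long recording in short segments and then fusing the per-segment hypotheses into one transcript, can be illustrated with a toy sketch. This is not the paper's neural stitcher model; it is a naive overlap-removal baseline written for illustration only, and the data layout (lists of hypothetical (speaker, word) pairs) and the `overlap_words` parameter are assumptions.

```python
# Toy illustration (NOT the paper's hypothesis stitcher model):
# fuse per-segment speaker-attributed hypotheses into a single
# long-form transcript by dropping words repeated in the overlap
# between consecutive segments.

def stitch(segment_hyps, overlap_words=2):
    """Concatenate per-segment (speaker, word) hypotheses,
    removing leading words that duplicate the tail of the
    previously fused hypothesis (naive overlap handling)."""
    fused = []
    for hyp in segment_hyps:
        if fused:
            tail = fused[-overlap_words:]
            # Skip words already emitted at the end of the
            # previous segment's hypothesis.
            while hyp and hyp[0] in tail:
                hyp = hyp[1:]
        fused.extend(hyp)
    return fused

# Two overlapping segment hypotheses from a hypothetical decoder:
seg1 = [("spk1", "hello"), ("spk1", "there"), ("spk2", "good")]
seg2 = [("spk1", "there"), ("spk2", "good"), ("spk2", "morning")]
print(stitch([seg1, seg2]))
```

A learned stitcher, as the abstract describes, would replace this hand-written overlap rule with a model that consumes all segment hypotheses and emits the fused output directly.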