Low Resource Language Understanding in Voice Assistants

Friday, 05/20/2022 10:00am to 12:00pm

Zoom

PhD Dissertation Proposal Defense

Abstract:

Voice assistants such as Amazon Alexa, Apple Siri, and Google Assistant have become ubiquitous. They rely on spoken language understanding, which typically consists of an Automatic Speech Recognition (ASR) component and a Natural Language Understanding (NLU) component. ASR takes user speech as input and generates a text transcription. NLU takes the text transcription as input and generates a semantic parse that identifies the requested actions, called intents (play music, turn on lights, etc.), and any relevant entities, called slots (which song to play, which lights to turn on).
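
As a rough illustration of the intent and slot output described above, the following is a minimal sketch and not the system presented in the dissertation. The names `SemanticParse` and `toy_nlu`, the intent labels, and the slot keys are all hypothetical, and a few string rules stand in for a trained NLU model. The ASR step is assumed to have already produced the text transcription.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SemanticParse:
    """Hypothetical container for an NLU result: one intent plus its slots."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def toy_nlu(transcription: str) -> SemanticParse:
    """Rule-based stand-in for a trained NLU model (illustration only)."""
    text = transcription.lower().strip()
    if text.startswith("play "):
        # Everything after "play " is treated as the song slot.
        return SemanticParse("PlayMusic", {"song": text[len("play "):]})
    if text.startswith("turn on "):
        # Everything after "turn on " is treated as the device slot.
        return SemanticParse("TurnOnLights", {"device": text[len("turn on "):]})
    return SemanticParse("Unknown")


# Assumed ASR output; a real pipeline would obtain this from speech.
parse = toy_nlu("Turn on the kitchen lights")
print(parse.intent, parse.slots)
```

A real NLU component would learn this mapping from annotated utterances rather than from hand-written rules, which is where the training data requirements come from.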

These components require massive amounts of training data to achieve good performance. In this dissertation, I identify and explore various data-related challenges in improving language understanding in voice assistants, focusing on the NLU component and the pipelined ASR-NLU architecture.

I first present a state-of-the-art NLU system based on sequence-to-sequence neural models. It simplifies the traditional semantic parsing architecture while also handling complex user utterances that contain multiple nested intents and slots. This work serves as an anchor for the data-constrained work that follows.

Next, I present an architecture that completely replaces the pipelined ASR-NLU system with a fully end-to-end system. The system is jointly trained on multiple speech-to-text and text-to-text tasks, which allows for transfer learning and creates a shared representation for both speech and text. It outperforms previous pipelined and end-to-end systems, and it performs end-to-end semantic parsing on a new domain after training on only a few annotated text-to-text NLU examples.

Finally, I demonstrate how to train large sequence-to-sequence NLU systems from a handful of examples by using auxiliary tasks to pre-train various components of the system.

In upcoming work, I propose to explore the paradigm of universal semantic parsing, especially zero-shot domain adaptation. Zero-shot domain adaptation aims to parse utterances from a new domain using only documentary information about that domain, without any additional training data. I present initial results from this work and describe a research plan to address the remaining challenges.

Advisor: Andrew McCallum