AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
Attention is the core mechanism of today's most used architectures for
natural language processing and has been analyzed from many perspectives,
including its effectiveness for machine translation-related tasks. Among these
studies, attention has proven to be a useful source of information for gaining
insights into word alignment, even when the input text is replaced with
audio segments, as in the case of the speech translation (ST) task. In this
paper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that
exploits the attention information to generate source-target alignments that
guide the model during inference. Through experiments on the 8 language pairs
of MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art
SimulST policies applied to offline-trained models with gains in terms of BLEU
of 2 points and latency reductions ranging from 0.5s to 0.8s across the 8
languages.
Comment: Accepted at Interspeech 202
Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023
This paper describes FBK's participation in the Simultaneous Translation
and Automatic Subtitling tracks of the IWSLT 2023 Evaluation Campaign. Our
submission focused on the use of direct architectures to perform both tasks:
for the simultaneous one, we leveraged the knowledge already acquired by
offline-trained models and directly applied a policy to enable real-time
inference; for the subtitling one, we adapted the direct ST model to produce
well-formed subtitles and exploited the same architecture to produce timestamps
needed for the subtitle synchronization with audiovisual content. Our
English-German SimulST system shows a reduced computational-aware latency
compared to the one achieved by the top-ranked systems in the 2021 and 2022
rounds of the task, with gains of up to 3.5 BLEU. Our automatic subtitling
system outperforms the only existing solution based on a direct system by 3.7
and 1.7 SubER in English-German and English-Spanish respectively.
Comment: Published at IWSLT 202
Attention as a Guide for Simultaneous Speech Translation
The study of the attention mechanism has sparked interest in many fields,
such as language modeling and machine translation. Although its patterns have
been exploited to perform different tasks, from neural network understanding to
textual alignment, no previous work has analysed the encoder-decoder attention
behavior in speech translation (ST) nor used it to improve ST on a specific
task. In this paper, we fill this gap by proposing an attention-based policy
(EDAtt) for simultaneous ST (SimulST) that is motivated by an analysis of the
existing attention relations between audio input and textual output. Its goal
is to leverage the encoder-decoder attention scores to guide inference in real
time. Results on en->{de, es} show that the EDAtt policy achieves overall
better results compared to the SimulST state of the art, especially in terms of
computational-aware latency.
Comment: Accepted to ACL 202
SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation
This paper describes the FBK’s participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year’s submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used ‘off-the-shelf’ and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->German, Japanese, Chinese, and Czech->English), achieving acceptable or even better results compared to last year’s submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: https://github.com/hlt-mt/FBK-fairseq/
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, employed diversely for the three subtasks. In response to the acknowledged limitations associated with this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles, entirely eliminating any dependence on intermediate transcripts also for timestamp prediction. Experimental results, backed by manual evaluation, showcase our solution’s new state-of-the-art performance across multiple language pairs and diverse conditions.
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing work focusing solely on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
Despite its crucial role in research experiments, code correctness is often presumed solely based on the perceived quality of results. This assumption, however, comes with the risk of erroneous outcomes and, in turn, potentially misleading findings. To mitigate this risk, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We support our arguments with a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which can nonetheless lead to incorrect conclusions that potentially misguide future research. As countermeasures, we release pangoliNN, a library dedicated to testing neural models, and propose a Code-quality Checklist, with the goal of promoting coding best practices and improving software quality within the NLP community.
Joint Speech Translation and Named Entity Recognition
Modern automatic translation systems aim to place the human at the center by
providing contextual support and knowledge. In this context, a critical task is
enriching the output with information regarding the mentioned entities, which
is currently achieved by processing the generated translation with named entity
recognition (NER) and entity linking systems. In light of the recent promising
results shown by direct speech translation (ST) models and the known weaknesses
of cascades (error propagation and additional latency), in this paper we
propose multitask models that jointly perform ST and NER, and compare them with
a cascade baseline. The experimental results show that our models significantly
outperform the cascade on the NER task (by 0.4-1.0 F1), without degradation in
terms of translation quality, and with the same computational efficiency as a
plain direct ST model.
Comment: Accepted at INTERSPEECH 202
How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System?
Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker’s speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We: 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.