    Deep audio-visual speech recognition

    The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release two new datasets for audio-visual speech recognition: LRS2-BBC, consisting of thousands of natural sentences from British television; and LRS3-TED, consisting of hundreds of hours of TED and TEDx talks obtained from YouTube. The models that we train surpass the performance of all previous work on lip reading benchmark datasets by a significant margin.

    Audio-visual deep learning

    Human perception and learning are inherently multimodal: we interface with the world through multiple sensory streams, including vision, audition, touch, olfaction and taste. By contrast, automatic approaches for machine perception and learning have traditionally depended on single modalities, by processing, for instance, video, audio or speech separately. The goal of this thesis is instead to utilize the natural co-occurrence of audio and visual information in videos to learn useful tasks. The thesis is structured around four main themes: (i) lip reading and Audio-Visual Speech Recognition (AVSR); (ii) audio-visual speech enhancement and separation; (iii) audio-visual sound source localization and detection; (iv) sign language recognition. Lip reading is the ability to recognise speech by observing the speaker’s lip movements; it is a challenging task and has many important applications, including enabling speech-impaired individuals to better communicate. We build and improve on recent breakthroughs by exploring the use of Transformer-based architectures, proposing attention-based pooling mechanisms for representation aggregation, as well as using sub-word units instead of character tokenisation. These enhancements, combined with improvements to the training protocol, yield substantial performance boosts, resulting in state-of-the-art results on the challenging LRS2 and LRS3 datasets. Moreover, we develop a method for exploiting unlabelled speech video by distilling an Automatic Speech Recognition model into a lip-reading one. Finally, we show that it is possible to identify spoken language just by observing a speaker’s lip movements. Speech enhancement and separation increase the signal-to-noise ratio of noisy speech audio by filtering out interfering voices or background noise. Until recently, works in this area focused on solving the problem by using the audio modality alone.
    We first propose tackling this problem audio-visually by conditioning on each speaker’s lip movements. We then further improve this approach by making it robust to visual occlusions. Recent works have shown that it is possible to determine the spatial location of sound-making objects in video frames by exploiting correlations between the audio and video signals. We present a method to improve and extend these techniques, by grouping heat maps into distinct object representations that can be used for various downstream tasks, without the need for face detectors. The resulting method is entirely self-supervised and can be used to extend tasks such as active speaker detection and speech separation to new domains, e.g. videos of cartoons or puppets. We then propose a method that uses similar principles in order to train object detection models without relying on human annotation, by deriving all the necessary supervision from audio-visual correspondence cues. Finally, we consider the problem of automatic sign-language recognition, which to date remains unsolved, despite all the progress in related vision and natural language processing tasks. The main blocker is the scarcity of large-scale annotated sign-language datasets. We attempt to solve this problem by using sign-interpreted TV broadcast footage, combined with subtitles obtained from the corresponding audio speech. Towards this goal, we first train Transformer models to identify and temporally localize instances of signs in continuous signed videos, thus automatically generating thousands of annotations for a large sign vocabulary. We then directly tackle the problem of temporally aligning the asynchronous subtitles to the sign language footage.

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
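
The two counting schemes compared in the abstract above can be illustrated with a minimal sketch. It is not the study's actual procedure: the function name `cocitation_counts`, the parameter `max_authors`, and the toy data are all hypothetical, and the sketch makes the simplifying assumption that two authors are co-cited once per citing paper whenever both appear among the counted authors of its references (including co-authors of the same cited work).

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author pairs that are cited together by the same citing paper.

    citing_papers: list of reference lists; each reference is a list of
        author names in byline order (hypothetical toy format).
    max_authors: number of leading authors of each cited work to count
        (1 corresponds to first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            # Keep only the leading authors of each cited work.
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers, each with two references.
papers = [
    [["Smith", "Lee"], ["Jones", "Kim"]],
    [["Smith", "Kim"], ["Lee"]],
]

first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, the pair Kim and Smith never appears under first-author counting, because Kim is never a first author, but it does appear once the first five authors are counted.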

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
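
As a rough illustration of how the two named measures can behave differently on cocitation profiles, the sketch below compares the Pearson correlation and the cosine on made-up data. The abstract does not say why the Pearson correlation is considered unsatisfactory, so the property shown here (sensitivity to appended zero entries, i.e. authors cocited with neither of the two authors being compared) is an assumption chosen for illustration, not necessarily the paper's argument. The profiles and function names are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centred_x = [a - mean_x for a in x]
    centred_y = [b - mean_y for b in y]
    return cosine(centred_x, centred_y)


# Hypothetical cocitation profiles of two authors over four other authors.
x = [4, 0, 2, 1]
y = [0, 3, 1, 2]

# The same profiles after adding six authors cocited with neither.
x_padded = x + [0] * 6
y_padded = y + [0] * 6

print("cosine :", cosine(x, y), cosine(x_padded, y_padded))
print("pearson:", pearson(x, y), pearson(x_padded, y_padded))
```

In this sketch the cosine is the same for the original and padded profiles, while the Pearson correlation moves from strongly negative to near zero, because the appended zeros shift the means used for centring.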

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how much author rankings produced by different citation counting methods actually differ, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.

    Author Index

    Not provided

    Sub-word level lip reading with visual attention

    The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper, we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
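
For readers unfamiliar with the word error rate (WER) quoted above, here is a minimal sketch of its standard definition: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference, divided by the number of reference words. The function name and the example sentences are hypothetical, and the sketch assumes plain whitespace tokenisation; the paper's own evaluation may apply additional text normalisation that the abstract does not describe.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.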

    Speech recognition models are strong lip-readers

    In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, allowing a pre-trained ASR model to directly perform lip-reading. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. We achieve an impressive gain of 5.7 WER in the low data regime on the LRS3 benchmark over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos.