
    Robust recognition of human behaviour in challenging environments

    Novel techniques have been developed for the automatic recognition of human behaviour in challenging environments using information from visual and infra-red camera feeds. The techniques have been applied to two scenarios: recognising drivers' speech from lip movements, and recognising audience behaviour while watching a movie from facial features and body movements. The outcomes of the research in these two areas will be useful for improving the performance of voice recognition in automobiles for voice-based control, and for obtaining accurate movie interest ratings based on live audience response analysis.

    Facial feature detection for in-car environment

    Acoustically, vehicles are extremely noisy environments, and as a consequence audio-only in-car voice recognition systems perform very poorly. Because the visual modality is immune to acoustic noise, using visual lip information from the driver is seen as a viable strategy for circumventing this problem. However, implementing such an approach requires a system that can accurately locate and track the driver’s face and facial features in real time. In this paper we present such an approach using the Viola-Jones algorithm. Our results show that the Viola-Jones approach is a suitable method of locating and tracking the driver’s lips despite variability in illumination and head pose.

    Multiple cameras for audio-visual speech recognition in an automotive environment

    Audio-visual speech recognition, or the combination of visual lip-reading with traditional acoustic speech recognition, has been previously shown to provide a considerable improvement over acoustic-only approaches in noisy environments, such as that present in an automotive cabin. The research presented in this paper extends the established audio-visual speech recognition literature to show that further improvements in speech recognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visual speech recognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) are conducted on the four-camera AVICAR automotive audio-visual speech database. We study the relative contribution of the side and centrally oriented cameras in improving visual speech recognition accuracy. Finally, combination of the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy when compared to the acoustic-only approach in the noisiest conditions of the AVICAR database.
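
    As a rough illustration of how a multi-stream synchronous HMM can combine one audio stream with four visual streams, the sketch below scores a single state at a single frame as a weighted sum of per-stream log-likelihoods. This weighted-sum form is the usual multi-stream formulation; the abstract does not give the authors' weights or scores, so the function name and every number here are hypothetical.

```python
def combined_log_likelihood(stream_log_likelihoods, stream_weights):
    """Weighted sum of per-stream log-likelihoods for one HMM state and one frame."""
    if len(stream_log_likelihoods) != len(stream_weights):
        raise ValueError("one weight per stream is required")
    return sum(w * ll for w, ll in zip(stream_weights, stream_log_likelihoods))


# Hypothetical log-likelihoods for one state at one frame:
# one audio stream followed by four camera streams.
frame_scores = [-12.0, -8.0, -9.0, -7.5, -8.5]

# Hypothetical stream weights (assumed equal here); in practice they would be
# tuned, typically lowering the audio weight as acoustic noise increases.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]

score = combined_log_likelihood(frame_scores, weights)
print(score)
```

    Because all streams share the same state sequence in a synchronous model, this per-frame score can replace the single-stream emission score inside standard Viterbi decoding.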

    Visual front-end wars: Viola-Jones face detector vs Fourier Lucas-Kanade

    The performance of visual speech recognition (VSR) systems is significantly influenced by the accuracy of the visual front-end. The current state-of-the-art VSR systems use off-the-shelf face detectors such as Viola-Jones (VJ), which have limited reliability under changes in illumination and head pose. For a VSR system to perform well under these conditions, an accurate visual front-end is required. This is an important problem to be solved in many practical implementations of audio-visual speech recognition systems, for example in automotive environments for an efficient human-vehicle computer interface. In this paper, we re-examine the current state-of-the-art VSR by comparing off-the-shelf face detectors with the recently developed Fourier Lucas-Kanade (FLK) image alignment technique. A variety of image alignment and visual speech recognition experiments are performed on a clean dataset as well as on a challenging automotive audio-visual speech dataset. Our results indicate that the FLK image alignment technique can significantly outperform off-the-shelf face detectors, but requires frequent fine-tuning.

    Audio visual automatic speech recognition in vehicles

    Acoustically, car cabins are extremely noisy and as a consequence, existing audio-only speech recognition systems for voice-based control of vehicle functions, such as the GPS-based navigator, perform poorly. Audio-only speech recognition systems fail to make use of the visual modality of speech (e.g. lip movements). As the visual modality is immune to acoustic noise, utilising this visual information in conjunction with an audio-only speech recognition system has the potential to improve the accuracy of the system. The field of recognising speech using both auditory and visual inputs is known as Audio Visual Automatic Speech Recognition (AVASR). Research in the AVASR field has been ongoing for the past twenty-five years, with notable progress being made. However, the practical deployment of AVASR systems in a variety of real-world applications has not yet emerged. The main reason is that most research to date has neglected variabilities in the visual domain, such as illumination and viewpoint, in the design of the visual front-end of the AVASR system. In this paper we present an AVASR system in a real-world car environment using the AVICAR database [1], which is a publicly available in-car database, and we show that using visual speech in conjunction with the audio modality is a better approach to improving the robustness and effectiveness of voice-only recognition systems in car cabin environments.

    Can audio-visual speech recognition outperform acoustically enhanced speech recognition in automotive environment?

    The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, whether this technique can outperform speech recognition incorporating well-known acoustic enhancement techniques, such as spectral subtraction or multi-channel beamforming, is not known. This is an important question to answer, especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset, and the results show that synchronous HMM-based audio-visual fusion can outperform traditional single-channel as well as multi-channel acoustic speech enhancement techniques. We also show that further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.
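
    For readers unfamiliar with spectral subtraction, one of the acoustic enhancement baselines named above, the following is a minimal sketch of its basic magnitude-domain form for a single frame. It is not the configuration used in the paper: the function name, the spectral floor value, and the example magnitudes are all assumptions made for illustration.

```python
def spectral_subtract(noisy_magnitudes, noise_magnitudes, floor=0.01):
    """Subtract a noise magnitude estimate from one frame, bin by bin."""
    cleaned = []
    for noisy, noise in zip(noisy_magnitudes, noise_magnitudes):
        # Clamp to a small fraction of the noisy magnitude so that no bin
        # becomes negative when the noise estimate exceeds the observation.
        cleaned.append(max(noisy - noise, floor * noisy))
    return cleaned


# Hypothetical magnitude spectrum of one noisy frame, and a noise estimate
# that would normally be averaged over speech-free frames.
frame = [1.0, 0.5, 2.0, 0.25]
noise = [0.25, 0.75, 0.5, 0.25]

cleaned = spectral_subtract(frame, noise)
print(cleaned)
```

    The cleaned magnitudes would then be recombined with the noisy phase before feature extraction; that step is omitted here.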

    Cascading appearance-based features for visual voice activity detection

    The detection of voice activity is a challenging problem, especially when the level of acoustic noise is high. Most current approaches only utilise the audio signal, making them susceptible to acoustic noise. An obvious approach to overcome this is to use the visual modality. The current state-of-the-art visual feature extraction technique is one that uses a cascade of visual features (i.e. 2D-DCT, feature mean normalisation, interstep LDA). In this paper, we investigate the effectiveness of this technique for the task of visual voice activity detection (VAD), and analyse each stage of the cascade and quantify the relative improvement in performance gained by each successive stage. The experiments were conducted on the CUAVE database and our results highlight that the dynamics of the visual modality can be used to good effect to improve visual voice activity detection performance.
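
    The first two stages of the cascade named above can be sketched as follows: a 2D-DCT of each mouth-region frame, then subtraction of the per-utterance mean from every feature dimension. This is only an illustration under assumptions: the tiny 2x2 frames are hypothetical, all coefficients are kept rather than a selected subset, and the LDA stage is omitted.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                              * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def mean_normalise(frames):
    """Subtract each dimension's mean over the utterance from every frame."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


# Two hypothetical 2x2 grey-level mouth regions from one utterance.
frames = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 1.0], [3.0, 1.0]],
]

# Flatten each frame's DCT coefficients into one feature vector.
raw_features = [[c for row in dct_2d(f) for c in row] for f in frames]
features = mean_normalise(raw_features)
print(features)
```

    Mean normalisation removes the static, speaker-dependent part of the appearance, so what remains reflects how the mouth region changes over the utterance.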

    Fourier Lucas-Kanade algorithm

    In this paper we propose a framework for both gradient descent image and object alignment in the Fourier domain. Our method centers upon the classical Lucas & Kanade (LK) algorithm where we represent the source and template/model in the complex 2D Fourier domain rather than in the spatial 2D domain. We refer to our approach as the Fourier LK (FLK) algorithm. The FLK formulation is advantageous when one pre-processes the source image and template/model with a bank of filters (e.g. oriented edges, Gabor, etc.) as: (i) it can handle substantial illumination variations, (ii) the inefficient pre-processing filter bank step can be subsumed within the FLK algorithm as a sparse diagonal weighting matrix, (iii) unlike traditional LK the computational cost is invariant to the number of filters and as a result far more efficient, and (iv) this approach can be extended to the inverse compositional form of the LK algorithm where nearly all steps (including Fourier transform and filter bank pre-processing) can be pre-computed leading to an extremely efficient and robust approach to gradient descent image matching. Further, these computational savings translate to non-rigid object alignment tasks that are considered extensions of the LK algorithm such as those found in Active Appearance Models (AAMs).
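
    Points (ii) and (iii) rest on a standard identity: by the convolution theorem and Parseval's relation, the summed squared error over a bank of filtered signals equals a single weighted squared error in the Fourier domain, where the weights form a diagonal built from the filters' spectra. The sketch below checks this numerically in 1D with circular convolution. It illustrates only that identity, not the warp updates of the alignment itself, and the signals, filters, and function names are hypothetical.

```python
import cmath


def dft(signal):
    """Unnormalised discrete Fourier transform of a real or complex list."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_convolve(signal, kernel):
    """Circular convolution; the kernel is zero-padded to the signal length."""
    n = len(signal)
    return [sum(kernel[m] * signal[(t - m) % n] for m in range(n)) for t in range(n)]


def spatial_filtered_ssd(source, template, filters):
    """Sum over filters of the squared error between filtered source and template."""
    diff = [s - t for s, t in zip(source, template)]
    total = 0.0
    for g in filters:
        filtered = circular_convolve(diff, g)  # filtering is linear, so filter the difference
        total += sum(v * v for v in filtered)
    return total


def filter_bank_weights(filters):
    """Diagonal weighting: summed squared magnitude of every filter's spectrum."""
    spectra = [dft(g) for g in filters]
    n = len(filters[0])
    return [sum(abs(spec[k]) ** 2 for spec in spectra) for k in range(n)]


def fourier_weighted_ssd(source, template, weights):
    """The same error as one weighted sum over Fourier coefficients."""
    n = len(source)
    diff_hat = dft([s - t for s, t in zip(source, template)])
    return sum(weights[k] * abs(diff_hat[k]) ** 2 for k in range(n)) / n


# Hypothetical 1D "image" and template, plus two zero-padded filters
# (a difference filter and a smoothing filter).
source = [1.0, 2.0, 4.0, 3.0, 0.0, 1.0, 2.0, 2.0]
template = [1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 2.0, 1.0]
filters = [
    [1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0],
]

weights = filter_bank_weights(filters)  # computed once, whatever the number of filters
spatial = spatial_filtered_ssd(source, template, filters)
fourier = fourier_weighted_ssd(source, template, weights)
print(spatial, fourier)
```

    Once `weights` has been computed, evaluating the Fourier-domain error costs the same for two filters as for two hundred, which is the sense in which the cost is invariant to the size of the filter bank.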

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
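
    The difference between the two counting schemes can be made concrete with a small sketch. It assumes that two authors are co-cited once for every citing paper whose reference list contains works by both, and that co-authors of the same cited work also count as co-cited; the abstract does not specify either detail, and the author names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author pairs co-cited by the same citing paper.

    citing_papers: one entry per citing paper, each a list of cited works,
    each cited work an ordered list of author names.
    max_authors: how many leading authors of each cited work are considered
    (1 for first-author counting, 5 for first-five-author counting).
    """
    counts = Counter()
    for cited_works in citing_papers:
        cited_authors = set()
        for authors in cited_works:
            cited_authors.update(authors[:max_authors])
        # Each pair is counted at most once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers, each citing two works.
papers = [
    [["Ann", "Bob"], ["Cy"]],
    [["Ann"], ["Cy", "Bob"]],
]

first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author)
print(first_five)
```

    In this toy data, first-author counting never sees Bob, whereas first-five-author counting links him to both Ann and Cy, which shows how the choice of counting scheme changes the co-citation matrix that feeds the later mapping.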