Human vs Machine Spoofing
Listening test materials for "Human vs Machine Spoofing Detection on Wideband and Narrowband data." They include lists of the speech material selected from the SAS spoofing database and the listeners' responses. The main data file has been split into five smaller files (labelled "aa" to "ae") for ease of download.
Experiment materials for "Disfluencies in change detection in natural, vocoded and synthetic speech."
This dataset is associated with the DiSS paper "Disfluencies in change detection in natural, vocoded and synthetic speech." In this paper we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses. We extend this work by replicating earlier findings and exploring the effect of a discourse marker, "like", and of silent pauses. Furthermore, we report how the use of "unnatural" speech, namely synthetic and vocoded, affects change detection rates.
Listening test materials for "Evaluating comprehension of natural and synthetic conversational speech"
Current speech synthesis methods typically operate on isolated sentences and lack convincing prosody when generating longer segments of speech. Similarly, prevailing TTS evaluation paradigms, such as intelligibility (transcription word error rate) or MOS, only score sentences in isolation, even though overall comprehension is arguably more important for speech-based communication. In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, we investigated comprehension of natural and synthetic speech dialogues. Specifically, we tested listener comprehension on long segments of spontaneous and engaging conversational speech (three 10-minute radio interviews of comedians). Interviews were reproduced either as natural speech, synthesised from carefully prepared transcripts, or synthesised using durations from forced-alignment against the natural speech, all in a balanced design. Comprehension was measured using multiple choice questions. A significant difference was measured between the comprehension/retention of natural speech (74% correct responses) and synthetic speech with forced-aligned durations (61% correct responses). However, no significant difference was observed between natural and regular synthetic speech (70% correct responses). Effective evaluation of comprehension remains elusive. The dataset is described in the readme.txt file.
Experiment materials for "The temporal delay hypothesis: Natural, vocoded and synthetic speech."
Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short because listeners are either primed to prefer stimuli with filled pauses or, when they are not primed, prefer more fluent speech. Reaction time experiments from psycholinguistics may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, whether a filled pause (um), a silence or a tone. We reused the materials for natural speech, and extended them to vocoded and synthetic speech. The results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises filled pauses there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech.
Superseded - Human vs Machine Spoofing
This Item has been replaced. Please see Wester, M; Wu, Z; Yamagishi, J. (2015). Human vs Machine Spoofing, [dataset]. University of Edinburgh. https://doi.org/10.7488/ds/258
Artificial Personality
This dataset is associated with the paper “Artificial Personality and Disfluency” by Mirjam Wester, Matthew Aylett, Marcus Tomalin and Rasmus Dall published at Interspeech 2015, Dresden.
The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual's use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human "acted" filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice.
Superseded - Human vs Machine Spoofing
This Item has been replaced. Please see Wester, M; Wu, Z; Yamagishi, J. (2015). Human vs Machine Spoofing, [dataset]. University of Edinburgh. http://dx.doi.org/10.7488/ds/258. Wu, Zhizheng; Yamagishi, Junichi; Wester, Mirjam. (2015). Superseded - Human vs Machine Spoofing, [dataset]. http://dx.doi.org/10.7488/ds/257
SUPERSEDED - The Voice Conversion Challenge 2016
THIS VERSION HAS BEEN REPLACED DUE TO SOME OF THE FILES BEING CORRUPTED. PLEASE SEE THE NEW VERSION OF THIS DATASET AT https://doi.org/10.7488/ds/1575. The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. See also "The Voice Conversion Challenge, 2016: multidimensional scaling (MDS) listening test results" (DOI: 10.7488/ds/1504). Contents: .wav files in multiple subdirectories, 4 tab-delimited .txt files plus one .xlsx file outlining the variables contained in the .txt files.
Listening test materials for "Robust TTS duration modelling using DNNs"
See readme.txt. This data release contains listening test materials associated with the paper "Robust TTS duration modelling using DNNs", presented at ICASSP 2016 in Shanghai, China. Henter, Gustav Eje; Ronanki, Srikanth; Watts, Oliver; Wester, Mirjam; Wu, Zhizheng; King, Simon. (2016). Listening test materials for "Robust TTS duration modelling using DNNs", [dataset]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR). http://dx.doi.org/10.7488/ds/1317
VCC 2016
The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. For further information please see the accompanying paper "Interspeech2016_VC_challenge_description.pdf" included in this dataset. See also "The Voice Conversion Challenge, 2016: multidimensional scaling (MDS) listening test results" (DOI: 10.7488/ds/1504). Contents: .wav files in multiple subdirectories, 4 tab-delimited .txt files plus one .xlsx file outlining the variables contained in the .txt files.
