
Human vs Machine Spoofing

Authors: Yamagishi Junichi; Wester Mirjam; Wu Zhizheng
Publication date: 2015

Listening test materials for "Human vs Machine Spoofing Detection on Wideband and Narrowband data." They include lists of the speech material selected from the SAS spoofing database and the listeners' responses. The main data file has been split into five smaller files (labelled "aa" to "ae") for ease of download.

Edinburgh DataShare

Experiment materials for "Disfluencies in change detection in natural, vocoded and synthetic speech."

Authors: Corley Martin; Dall Rasmus; Wester Mirjam
Publication date: 2015

This dataset is associated with the DiSS paper "Disfluencies in change detection in natural, vocoded and synthetic speech." In this paper we investigate the effect of filled pauses, a discourse marker and silent pauses in a change detection experiment in natural, vocoded and synthetic speech. In natural speech, change detection has been found to increase in the presence of filled pauses. We extend this work by replicating earlier findings and exploring the effect of a discourse marker ("like") and of silent pauses. Furthermore, we report how the use of "unnatural" speech, namely synthetic and vocoded speech, affects change detection rates.

Edinburgh DataShare

Listening test materials for "Evaluating comprehension of natural and synthetic conversational speech"

Authors: Henter Gustav Eje; Watts Oliver; Wester Mirjam
Publication date: 2016

Current speech synthesis methods typically operate on isolated sentences and lack convincing prosody when generating longer segments of speech. Similarly, prevailing TTS evaluation paradigms, such as intelligibility (transcription word error rate) or MOS, only score sentences in isolation, even though overall comprehension is arguably more important for speech-based communication. In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, we investigated comprehension of natural and synthetic speech dialogues. Specifically, we tested listener comprehension on long segments of spontaneous and engaging conversational speech (three 10-minute radio interviews of comedians). Interviews were reproduced either as natural speech, synthesised from carefully prepared transcripts, or synthesised using durations from forced alignment against the natural speech, all in a balanced design. Comprehension was measured using multiple choice questions. A significant difference was measured between the comprehension/retention of natural speech (74% correct responses) and synthetic speech with forced-aligned durations (61% correct responses). However, no significant difference was observed between natural and regular synthetic speech (70% correct responses). Effective evaluation of comprehension remains elusive. The dataset is described in the readme.txt file.

Edinburgh DataShare

Experiment materials for "The temporal delay hypothesis: Natural, vocoded and synthetic speech."

Authors: Corley Martin; Dall Rasmus; Wester Mirjam
Publication date: 2015

Including disfluencies in synthetic speech is being explored as a way of making synthetic speech sound more natural and conversational. How to measure whether the resulting speech is actually more natural, however, is not straightforward. Conventional approaches to synthetic speech evaluation fall short, as listeners are either primed to prefer stimuli with filled pauses or, when not primed, prefer more fluent speech. Reaction time experiments from psycholinguistics may circumvent this issue. In this paper, we revisit one such reaction time experiment. For natural speech, delays in word onset were found to facilitate word recognition regardless of the type of delay, be it a filled pause (um), a silence or a tone. We reused the materials for natural speech and extended them to vocoded and synthetic speech. The results partially replicate previous findings. For natural and vocoded speech, if the delay is a silent pause, significant increases in the speed of word recognition are found. If the delay comprises filled pauses, there is a significant increase in reaction time for vocoded speech but not for natural speech. For synthetic speech, no clear effects of delay on word recognition are found. We hypothesise this is because it takes longer (requires more cognitive resources) to process synthetic speech than natural or vocoded speech.

Edinburgh DataShare

Superseded - Human vs Machine Spoofing

Authors: Yamagishi Junichi; Wu Zhizheng; Wester Mirjam
Publication date: 2015

This Item has been replaced. Please see Wester, M; Wu, Z; Yamagishi, J. (2015). Human vs Machine Spoofing, [dataset]. University of Edinburgh. https://doi.org/10.7488/ds/258

Edinburgh DataShare

Artificial Personality

Authors: Aylett Matthew; Tomalin Marcus; Wester Mirjam; Dall Rasmus
Publication date: 2015

This dataset is associated with the paper “Artificial Personality and Disfluency” by Mirjam Wester, Matthew Aylett, Marcus Tomalin and Rasmus Dall, published at Interspeech 2015, Dresden. The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual's use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human “acted” filler insertion. Perceived personality (as defined by the Big Five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice.

Edinburgh DataShare

Superseded - Human vs Machine Spoofing

Authors: Institute of Language Cognition and Computation; Euan MacDonald Centre for Motor Neuron Disease Research; Wester Mirjam; School of Informatics; Yamagishi Junichi; Wu Zhizheng
Publication date: 09/06/2015

This Item has been replaced. Please see Wester, M; Wu, Z; Yamagishi, J. (2015). Human vs Machine Spoofing, [dataset]. University of Edinburgh. http://dx.doi.org/10.7488/ds/258.
Wu, Zhizheng; Yamagishi, Junichi; Wester, Mirjam. (2015). Superseded - Human vs Machine Spoofing, [dataset]. http://dx.doi.org/10.7488/ds/257

Edinburgh Research Explorer

SUPERSEDED - The Voice Conversion Challenge 2016

Authors: Toda Tomoki; Saito Daisuke; Yamagishi Junichi; Chen Ling-Hui; Villavicencio Fernando; Wester Mirjam; Wu Zhizheng
Publication date: 2016

THIS VERSION HAS BEEN REPLACED DUE TO SOME OF THE FILES BEING CORRUPTED. PLEASE SEE THE NEW VERSION OF THIS DATASET AT https://doi.org/10.7488/ds/1575. The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. See also "The Voice Conversion Challenge, 2016: multidimensional scaling (MDS) listening test results" (DOI: 10.7488/ds/1504). The dataset contains .wav files in multiple subdirectories, 4 tab-delimited .txt files plus one .xlsx file outlining the variables contained in the .txt files.

Edinburgh DataShare

Listening test materials for "Robust TTS duration modelling using DNNs"

Authors: Institute of Language Cognition and Computation; Wester Mirjam; Ronanki Srikanth; School of Philosophy Psychology and Language Sciences; School of Informatics; Henter Gustav; King Simon; Watts Oliver; Wu Zhizheng
Publication date: 20/01/2016

See readme.txt. This data release contains listening test materials associated with the paper "Robust TTS duration modelling using DNNs", presented at ICASSP 2016 in Shanghai, China.
Henter, Gustav Eje; Ronanki, Srikanth; Watts, Oliver; Wester, Mirjam; Wu, Zhizheng; King, Simon. (2016). Listening test materials for "Robust TTS duration modelling using DNNs", [dataset]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR). http://dx.doi.org/10.7488/ds/1317

Edinburgh Research Explorer

VCC 2016

Authors: Toda Tomoki; Saito Daisuke; Yamagishi Junichi; Chen Ling-Hui; Villavicencio Fernando; Wester Mirjam; Wu Zhizheng
Publication date: 2016

The Voice Conversion Challenge (VCC) 2016, one of the special sessions at Interspeech 2016, deals with speaker identity conversion, referred to as Voice Conversion (VC). The task of the challenge was speaker conversion, i.e., to transform the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working in VC around the world developed their own VC systems for every combination of the source and target speakers, i.e., 25 systems in total, and generated voice samples converted by the developed systems. The objective of the VCC was to compare various VC techniques on identical training and evaluation speech data. The samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This dataset consists of the participants' VC submissions and the listening test results for naturalness and similarity. For further information please see the accompanying paper "Interspeech2016_VC_challenge_description.pdf" included in this dataset. See also "The Voice Conversion Challenge, 2016: multidimensional scaling (MDS) listening test results" (DOI: 10.7488/ds/1504). The dataset contains .wav files in multiple subdirectories, 4 tab-delimited .txt files plus one .xlsx file outlining the variables contained in the .txt files.

Edinburgh DataShare