1,218 research outputs found

    The Spoken Wikipedia Corpora

    The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.
    References: Timo Baumann, Arne Köhn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016. Arne Köhn, Florian Stegen and Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
    CLARIN metadata summary (CMDI-based): Title: The Spoken Wikipedia Corpora. Publication date: 2017. Data owner: Timo Baumann - Universität Hamburg. Contributors: Timo Baumann (author), Arne Köhn (author), Florian Stegen (author). Languages: English (eng), German (deu), Dutch (nld). Size: 5397 articles, 1005 hours. Segmentation units: other. Genre: encyclopedia. Modality: spoken.

    Mixed effect quantile and M-quantile regression for spatial data

    Observed data are frequently characterized by spatial dependence; that is, the observed values can be influenced by the "geographical" position. In such a context it is possible to assume that the values observed in a given area are similar to those recorded in neighboring areas. Such data are referred to as spatial data and are frequently met in epidemiological, environmental and social studies; for a discussion see Haining (1990). Spatial data can be multilevel, with samples being composed of lower level units (population, buildings) nested within higher level units (census tracts, municipalities, regions) in a geographical area. Green and Richardson (2002) proposed a general approach to modelling spatial data based on finite mixtures with spatial constraints, where the prior probabilities are modelled through a Markov Random Field (MRF) via a Potts representation (Kindermann and Snell, 1999, Strauss, 1977). This model was defined in a Bayesian context, assuming that the interaction parameter for the Potts model is fixed over the entire analyzed region. Geman and Geman (1984) have shown that this class process can be modelled by an MRF. By the Hammersley-Clifford theorem, modelling the process through an MRF is equivalent to using a Gibbs distribution for the membership vector. In other words, the spatial dependence between component indicators is captured by a Gibbs distribution, using a representation similar to the Potts model discussed by Strauss (1977). In this work, a Gibbs distribution with a component-specific intercept and a constant interaction parameter, as in Green and Richardson (2002), is proposed to model the effect of neighboring areas. This formulation allows for a parameter specific to each component and a constant spatial dependence over the whole area, extending to quantile and M-quantile regression the proposal of Alfò et al. (2009), who suggested letting both intercept and interaction parameters depend on the mixture component, allowing for different prior probabilities and varying strength of spatial dependence. In the current dissertation, we propose to adopt this prior distribution to define a finite mixture of quantile regression model (FMQRSP) and a finite mixture of M-quantile regression model (FMMQSP) for spatial data.

    Replication Data for: Efficient Application of Accelerator Cards for the Coupling Library preCICE

    This dataset contains all test case setup files and result files for the measurements presented in the Master's thesis titled "Efficient Application of Accelerator Cards for the Coupling Library preCICE" (Author: Timo Pierre Schrader). It also contains the version of preCICE used throughout the thesis. The thesis revolves around GPU acceleration of RBF data mapping in preCICE. See the README for more information on how to build and run the test cases.

    Robust Small Area Estimation Under Spatial Non-stationarity

    The effective use of spatial information in a regression-based approach to small area estimation is an important practical issue. One approach to account for geographic information is to extend the linear mixed model to allow for spatially correlated random area effects. An alternative is to include the spatial information through non-parametric mixed models. Another option is geographically weighted regression, where the model coefficients vary spatially across the geography of interest. Although these approaches are useful for estimating small area means efficiently under strict parametric assumptions, they can be sensitive to outliers. In this paper, we propose robust extensions of the geographically weighted empirical best linear unbiased predictor. In particular, we introduce robust projective and predictive estimators under spatial non-stationarity. Mean squared error estimation is performed by two analytic approaches that account for the spatial structure in the data. Model-based simulations show that the methodology proposed often leads to more efficient estimators. Furthermore, the analytic mean squared error estimators introduced have appealing properties in terms of stability and bias. Finally, we demonstrate in the application that the new methodology is a good choice for producing estimates for average rent prices of apartments in urban planning areas in Berlin.

    Estimating regional income indicators under transformations and access to limited population auxiliary information

    Spatially disaggregated income indicators are typically estimated by using model-based methods that assume access to auxiliary information from population micro-data. In many countries, such as Germany and the UK, population micro-data are not publicly available. In this work we propose small area methodology for when only aggregate population-level auxiliary information is available. We use data-driven transformations of the response to satisfy the parametric assumptions of the models used. In the absence of population micro-data, appropriate bias-corrections for small area prediction are needed. Under the approach we propose in this paper, aggregate statistics (means and covariances) and kernel density estimation are used to resolve the issue of not having access to population micro-data. We further explore the estimation of the mean squared error using the parametric bootstrap. Extensive model-based and design-based simulations are used to compare the proposed method to alternative methods. Finally, the proposed methodology is applied to the 2011 Socio-Economic Panel and aggregate census information from the same year to estimate the average income for 96 regional planning regions in Germany.

    Constructing sociodemographic indicators for national statistical institutes by using mobile phone data: estimating literacy rates in Senegal

    Modern systems of official statistics require the accurate and timely estimation of sociodemographic indicators for disaggregated geographical regions. Traditional data collection methods such as censuses or household surveys impose great financial and organizational burdens on national statistical institutes. The rise of new information and communication technologies offers promising sources to mitigate these shortcomings. We propose a unified approach for national statistical institutes in developing countries based on small area estimation that allows for the estimation of sociodemographic indicators by using mobile phone data. In particular, the methodology is applied to mobile phone data from Senegal for deriving subnational estimates of the share of illiterates disaggregated by gender. The estimates are used to identify hotspots of illiterates with a need for additional infrastructure or policy adjustments. Although we focus on literacy as a particular sociodemographic indicator, the approach proposed is applicable to indicators from national statistics in general.

    Smoothing and benchmarking for small area estimation

    Small area estimation is concerned with methodology for estimating population parameters associated with a geographic area defined by a cross-classification that may also include non-geographic dimensions. In this paper, we develop constrained estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates, and benchmarking constraints, requiring weighted means of estimates to agree across levels of aggregation. We develop methods for constrained estimation decision theoretically and discuss their geometric interpretation. The constrained estimators are the solutions to tractable optimisation problems and have closed-form solutions. Mean squared errors of the constrained estimators are calculated via bootstrapping. Our approach assumes the Bayes estimator exists and is applicable to any proposed model. In addition, we give special cases of our techniques under certain distributional assumptions. We illustrate the proposed methodology using web-scraped data on Berlin rents aggregated over areas to ensure privacy.

    Construction of regional consumer price indices using small area estimation

    Consumer Price Indices (CPI) are used in many ways by the government, businesses, and society in general. They can affect interest rates, tax allowances, wages, state benefits, and many other payments. The CPI is a fixed (national) basket index, where a range of goods and services is priced each month, and the expenditure shares on items in the basket are used to weight the price information together. The starting point for a regional price index should be a regional basket of goods and services. In the current poster, we derive regional baskets from the UK Living Costs and Food Survey (LCF), taking the products (COICOP classification) with the largest proportion of expenditures. As the sample size is naturally much smaller for regions, the accuracy of the direct estimates on the basket will be reduced. To overcome this problem, one possibility discussed in the poster is to pool multiple years of LCF data to increase the sample size. Another is to consider small area estimation approaches for the regional basket. Ideally, the small area estimates would be constrained to the overall expenditure total. Therefore, we assess some benchmarking approaches. Since the conceptual framework of CPI calculation does not differ much between the UK and Germany, the presented methodology can also be adapted for the calculation of regional CPIs for Germany.

    Domain prediction with grouped income data

    One popular small area estimation method for estimating poverty and inequality indicators is the empirical best predictor under the unit-level nested error regression model with a continuous dependent variable. However, parameter estimation is more challenging when the response variable is grouped due to data confidentiality concerns or concerns about survey response burden. The work in this paper proposes methodology that enables fitting a nested error regression model when the dependent variable is grouped. Model parameters are then used for small area prediction of finite population parameters of interest. Model fitting in the case of a grouped response variable is based on the use of a stochastic expectation–maximization algorithm. Since the stochastic expectation–maximization algorithm relies on the Gaussian assumptions of the unit-level error terms, adaptive transformations are incorporated for handling departures from normality. The estimation of the mean squared error of the small area parameters is facilitated by a parametric bootstrap that captures the additional uncertainty due to the grouping mechanism and the possible use of adaptive transformations. The empirical properties of the proposed methodology are assessed by using model-based simulations and its relevance is illustrated by estimating deprivation indicators for municipalities in the Mexican state of Chiapas