    Inference With Unequal Cluster Sizes

    No full text
    A random sample is not always of a fixed, a priori determined size. Examples include sequential sampling and stopping rules, missing data, and clusters with random size. In such cases there is often no complete sufficient statistic. Completeness means that any measurable function of a sufficient statistic that has zero expectation for every value of the parameter indexing the parametric model class is the zero function almost everywhere. A simple characterization of incompleteness is given for the exponential family in terms of the mapping between the sufficient statistic and the parameter, based upon the implicit function theorem. Essentially, this is a comparison of the dimension of the sufficient statistic to the length of the parameter vector. This results in an easily verifiable criterion for incompleteness, clear and simple to use even in complex settings, as is shown for missing data and clusters of random size. The analysis of hierarchical data that take the form of clusters with random size has received considerable attention in the literature. In this work, the focus was on clustered data with unequal cluster sizes, meaning that a joint model of outcome and sample size was not studied. The focus was also on samples that are very large in terms of number of clusters and/or members per cluster, on the one hand, and on very small samples (e.g., when studying rare diseases), on the other. Whereas maximum likelihood inference is straightforward in medium to large samples, in samples of the sizes considered here it may become prohibitive. Sample-splitting (Molenberghs, Verbeke, and Iddi, 2011) was proposed as a way to replace iterative optimization of a likelihood that does not admit an analytical solution with closed-form calculations. Pseudo-likelihood (Molenberghs et al., 2014), which consists of computing weighted averages over solutions obtained from subsamples created according to sample size, was used.
The statistical properties of this approach were then investigated. In a first attempt, the compound-symmetry variance structure was used to investigate this modelling framework. In a subsample with only clusters of the same size, closed-form solutions exist and other useful properties can be obtained. The operational characteristics were studied using simulations. It follows that the proposed non-iterative methods have a strong beneficial impact on computation time. Next, statistically and computationally efficient estimation in a hierarchical data setting with unequal cluster sizes and an AR(1) covariance structure was studied. As for the compound-symmetry model, the pseudo-likelihood and split-sample methods of Fieuws and Verbeke (2006) and Molenberghs, Verbeke, and Iddi (2011) were used. Maximum likelihood estimation for AR(1) requires numerical iteration when cluster sizes are unequal. A near-optimal non-iterative procedure was proposed. Results showed that the method is statistically nearly as efficient as maximum likelihood, while offering great savings in computation time. The odds ratio is a frequently used measure to investigate the association between binary variables. Often, such outcomes are measured across strata of different sizes. Mantel and Haenszel (1959) proposed estimators for a common odds ratio, taking into account the stratification. The most common one is among the best known and most used estimators in statistics. The setting studied by Mantel and Haenszel fits within this framework of sample-splitting and combining with proper weights. The Mantel-Haenszel estimator does not follow from optimality considerations, but nevertheless has properties similar to, and often better than, those of the optimal estimator. This was established by comparing it to the optimal estimator, whose existence was demonstrated in spite of the absence of complete sufficient statistics.
It is shown, via simulations, that the optimal estimator outperforms the Mantel-Haenszel estimator only in certain settings with huge sample sizes. Missing data are almost inevitable in correlated-data studies. For non-Gaussian outcomes with moderate to large sequences, direct-likelihood methods can involve complex, hard-to-manipulate likelihoods. Popular alternative approaches, like generalized estimating equations, that are frequently used to circumvent the computational complexity of full likelihood are less suitable when scientific interest, at least in part, is placed on the association structure; pseudo-likelihood methods are then a viable alternative. When the missing data are missing at random, Molenberghs et al. (2011) proposed a suite of corrections to the standard form of pseudo-likelihood, taking the form of singly and doubly robust estimators. They provided the basis, and exemplified it in insightful yet primarily illustrative examples. The important case of marginal models for hierarchical binary data was considered. Our doubly robust estimator is more convenient than the classical doubly robust estimators. The ideas are illustrated using a marginal model for a binary response, more specifically a Bahadur model.
A sample is not always of a fixed, predetermined size. Examples are sequential studies, missing data, and unbalanced hierarchical data. In such settings there is often no complete sufficient statistic. A simple characterization of completeness is formulated for the exponential family in terms of a comparison of the dimensions of the sufficient statistic and the parameter, based on the implicit function theorem. It is a simple and easily verifiable criterion, even for complex settings with missing data and unbalanced hierarchical data. Unbalanced hierarchical data have already been studied from various angles.
In this thesis the focus is on samples that are very large, i.e., with many clusters or many measurements per cluster, and on samples that are very small (studies of rare diseases). Determining the maximum likelihood estimator in medium-sized samples is quite feasible, but in the settings discussed here it can cause difficulties, such as the absence of closed-form analytical solutions, so that the likelihood function can only be optimized iteratively. Consequently, the sample was split into parts according to cluster size (Molenberghs, Verbeke, and Iddi, 2011). These subsamples are thereby balanced and do yield closed-form solutions. A pseudo-likelihood was used to combine the solutions from each subsample by means of weights. The properties of this methodology were examined in detail on balanced data following a compound-symmetry covariance structure. Its applicability was examined in a simulation study. It follows that this non-iterative method requires only a short computation time and is very precise. Next, this estimation method was examined further in an unbalanced hierarchical dataset with an autoregressive (AR(1)) covariance structure. Here too, the method is almost as efficient as maximum likelihood and the computation time is much lower. The odds ratio is a statistic that is frequently used to examine the association between binary variables. In this kind of setting too, the data can fall into groups of unequal size. The best known and most used estimator is the one devised by Mantel and Haenszel (1959). The estimator combines the odds ratios of subpopulations into a weighted estimator, but does not follow from optimality calculations. The Mantel-Haenszel estimator was compared with the optimal estimator.
From this it can be concluded that the Mantel-Haenszel estimator has very good properties. Only in settings with very large sample sizes will the optimal estimator do better than the Mantel-Haenszel estimator. Missing data occur very often in this kind of setting. For non-normally distributed data from a very large sample, the calculations of the likelihood function can become very complex. Generalized estimating equations are then a good alternative, but are less suitable if interest lies (in part) in the correlation structure of the data. Pseudo-likelihood functions are better suited here. When the missing data are missing at random, Molenberghs et al. (2011) made singly and doubly robust adjustments to the standard pseudo-likelihood function to allow correct inference. Whereas they laid the general foundation, this work focused on marginal models for hierarchical binary data. A Bahadur model was chosen here as the marginal model.
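
The Mantel-Haenszel common odds ratio discussed in the abstract above can be written out in a few lines. The following is a minimal sketch of the standard textbook formula, not code from the thesis: the function names, the `(a, b, c, d)` table layout, and the example counts are all assumptions made for illustration. The second function shows the same estimate as a weighted average of per-stratum odds ratios, which is the sense in which it fits a "split and combine with weights" scheme.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel estimate of a common odds ratio.

    Each stratum is a 2x2 table given as (a, b, c, d):
      a = exposed with outcome,   b = exposed without outcome,
      c = unexposed with outcome, d = unexposed without outcome.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("estimate undefined: sum of b*c/n is zero")
    return numerator / denominator


def weighted_average_of_stratum_odds_ratios(tables):
    """Same estimate written as a weighted mean of per-stratum odds ratios.

    Weights are w_k = b_k * c_k / n_k, so b_k * c_k > 0 is required
    in every stratum for this form to be computable.
    """
    weights = [b * c / (a + b + c + d) for a, b, c, d in tables]
    ratios = [(a * d) / (b * c) for a, b, c, d in tables]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


# Hypothetical strata of different sizes (illustrative numbers only).
strata = [(10, 20, 30, 40), (4, 6, 2, 8)]
print(mantel_haenszel_odds_ratio(strata))
print(weighted_average_of_stratum_odds_ratios(strata))
```

The two functions agree whenever every stratum has a finite odds ratio; the first form is the safer one to use in practice because it tolerates strata with a zero cell.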

    The role of frailty in shaping social contact patterns in Belgium, 2022-2023

    No full text
    Social contact data are essential for understanding the spread of respiratory infectious diseases and designing effective prevention strategies. However, many studies overlook the heterogeneity in mixing patterns among older age groups and individual frailty levels, assuming homogeneity across these sub-populations. This shortcoming may undermine non-pharmaceutical interventions by not targeting specific contact behaviours, potentially reducing their effectiveness in controlling disease. To address this gap, we conducted a contact survey in Flanders, Belgium (June 2022-June 2023). We collected data from 5995 participants (overall response rate of 19.34%) who recorded 31,375 contacts with distinct individuals. Within this cohort, 14.50% were classified as frail, and 46.85% were classified as non-frail. On average, participants reported 5.48 contacts daily, with a median of 4 contacts (IQR: 2-7). These contacts vary with participants' age and frailty levels, and are influenced by the locations of their interactions. Using the collected data, we reconstructed frailty-dependent contact matrices and developed a contact-based mathematical model that integrates participants' and contactees' frailty levels to investigate how frailty levels affect transmission dynamics. Incorporating frailty levels into the mathematical model substantially alters the shape of epidemic curves and peak incidences. These findings may help inform non-pharmaceutical interventions and indicate the potential benefit of similar data collection in other countries.
    Funding: Funding for this study [study number: 215366] was provided by GSK (GlaxoSmithKline). GSK was provided the opportunity to review a preliminary version of this publication for factual accuracy, but the authors are solely responsible for final content and interpretation.
    Acknowledgements: The authors gratefully acknowledge the IMI VITAL project for their valuable input and feedback during the development of the study protocol. We extend our sincere thanks to the Ipsos team for conducting the survey, collecting data, and facilitating the rapid progress of this study. We especially appreciate the exceptional project management support provided by Sarah Vercruysse. All important findings will be communicated to the IMI VITAL WP3
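
To make the idea of a frailty-dependent contact matrix in the abstract above concrete, here is a minimal sketch of the usual crude estimate: the mean number of reported contacts per participant, cross-classified by the participant's group and the contactee's group. It is an illustration under stated assumptions, not the authors' procedure; the data layout, the group labels, and the function name `contact_matrix` are hypothetical, and survey weights or reciprocity corrections are not included.

```python
from collections import Counter


def contact_matrix(participant_groups, contacts, groups):
    """Mean contacts per participant, by participant group and contactee group.

    participant_groups: dict mapping participant id -> group label
    contacts: iterable of (participant id, contactee group label)
    groups: ordered list of group labels (rows and columns of the matrix)
    """
    # Denominator: every participant counts, including those with no contacts.
    n_participants = Counter(participant_groups.values())
    totals = Counter()
    for participant_id, contactee_group in contacts:
        totals[(participant_groups[participant_id], contactee_group)] += 1
    return [
        [totals[(row, col)] / n_participants[row] if n_participants[row] else float("nan")
         for col in groups]
        for row in groups
    ]


# Hypothetical toy survey (not data from the study).
participant_groups = {"p1": "frail", "p2": "non-frail", "p3": "non-frail"}
contacts = [
    ("p1", "frail"), ("p1", "non-frail"),
    ("p2", "non-frail"), ("p2", "non-frail"),
    ("p3", "frail"),
]
matrix = contact_matrix(participant_groups, contacts, ["frail", "non-frail"])
for row in matrix:
    print(row)
```

Rows index the participant's group and columns the contactee's group, so the matrix is generally not symmetric.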

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
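
The two counting schemes compared in the abstract above differ only in how many authors of each cited work are credited. The sketch below is one plausible way to implement that, assuming each citing paper is represented as a list of cited works and each cited work as an ordered author list; the function name, the `max_authors` parameter, and the rule of counting a pair at most once per citing paper are illustrative assumptions, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, max_authors=1):
    """Count author co-citations over a set of citing papers.

    reference_lists: one entry per citing paper; each entry is a list of
        cited works, and each cited work is an ordered list of author names.
    max_authors: how many leading authors of a cited work are credited
        (1 = first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for cited_works in reference_lists:
        credited = set()
        for authors in cited_works:
            credited.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers (author names are placeholders).
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["B", "C"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author)
print(first_five)
```

With first-author counting, author "B" in the first paper and author "C" in the second are invisible, which is why the two schemes can produce different maps from the same references.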

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
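
As a small illustration of how the two best-known measures named above can disagree, the sketch below computes the cosine and the Pearson correlation for two cocitation profiles, then repeats the computation after appending entries in which both profiles are zero. The abstract does not say which argument the authors make against the Pearson correlation, so this is only one commonly noted behaviour, shown with made-up profiles and hypothetical function names.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Made-up cocitation profiles of two authors over the same set of other authors.
profile_1 = [3, 1]
profile_2 = [1, 3]

# The same profiles after adding four authors with whom neither is cocited.
padded_1 = profile_1 + [0, 0, 0, 0]
padded_2 = profile_2 + [0, 0, 0, 0]

print(cosine(profile_1, profile_2), cosine(padded_1, padded_2))
print(pearson(profile_1, profile_2), pearson(padded_1, padded_2))
```

In this example the cosine is unaffected by the added zeros, while the Pearson correlation changes sign, because mean-centring makes it depend on which other authors happen to be included in the analysis.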

    Optimal weighted estimation versus Cochran-Mantel-Haenszel

    No full text
    Sponsorship: Financial support from the IAP research network #P7/06 of the Belgian Government (Belgian Science Policy) is gratefully acknowledged. The research leading to these results has also received funding from the European Seventh Framework program FP7 2007-2013 under grant agreement Nr. 602552. We gratefully acknowledge support from the IWT-SBO ExaScience grant. Intego is funded on a regular basis by the Flemish Government (Ministry of Health and Welfare). Status: Published

    Dispelling the Myths Behind First-author Citation Counts

    Full text link
    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods

    Author Index

    No full text
    Not provided