Inference With Unequal Cluster Sizes
A random sample is not always of a fixed, a priori determined size. Examples include sequential sampling and stopping rules, missing data, and clusters with random size. Often
there then is no complete sufficient statistic. Completeness means that any measurable
function of a sufficient statistic that has zero expectation for every value of the parameter
indexing the parametric model class, is the zero function almost everywhere. A simple
characterization of incompleteness is given for the exponential family in terms of the mapping between the sufficient statistic and the parameter, based upon the implicit function
theorem. Essentially this is a comparison of the dimension of the sufficient statistic to
the length of the parameter vector. This results in an easily verifiable criterion for incompleteness, clear and simple to use, even for complex settings, as is shown for missing data
and clusters of random size.
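The verbal definition of completeness given above can be written compactly. The notation is an assumption introduced here for illustration: T denotes the sufficient statistic, g a measurable function, and \Theta the parameter space.

```latex
\[
  \mathbb{E}_{\theta}\bigl[g(T)\bigr] = 0 \quad \text{for all } \theta \in \Theta
  \;\Longrightarrow\;
  P_{\theta}\bigl(g(T) = 0\bigr) = 1 \quad \text{for all } \theta \in \Theta .
\]
```

Incompleteness then means that some g that is not almost surely zero nevertheless has zero expectation under every \theta.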
The analysis of hierarchical data that take the form of clusters with random size has
received considerable attention in the literature. In this work, the focus was on clustered
data with unequal cluster sizes, meaning that a joint model of outcome and sample size
was not studied. Also, the focus here was on samples that are very large in terms of
number of clusters and/or members per cluster, on the one hand, as well as on very small
samples (e.g., when studying rare diseases), on the other. Whereas maximum likelihood
inference is straightforward in medium to large samples, in samples of sizes considered here
it may become prohibitive. Sample-splitting (Molenberghs, Verbeke, and Iddi, 2011) was
proposed as a way to replace iterative optimization of a likelihood that does not admit an
analytical solution, with closed-form calculations. Pseudo-likelihood (Molenberghs et al.,
2014), consisting of computing weighted averages over solutions obtained from subsamples
created according to sample size, was used, and the statistical properties of this
approach were investigated. As a first step, the compound-symmetry covariance structure
was used to investigate this modelling framework. In a subsample containing only clusters of the same size, closed-form solutions exist and other useful properties can be derived.
The operational characteristics are studied using simulations. It follows that the proposed
non-iterative methods have a strong beneficial impact on computation time.
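The split-and-combine idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of the cited papers: only a common mean is estimated, the closed-form estimate within each balanced subsample is taken to be the grand mean, and the weights (the share of clusters of each size) are a hypothetical choice rather than the weights studied in the thesis. The function name `split_sample_mean` is likewise hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def split_sample_mean(clusters):
    """Estimate a common mean by splitting clusters by size and recombining."""
    # Group clusters by size so that every subsample is balanced.
    by_size = defaultdict(list)
    for cluster in clusters:
        by_size[len(cluster)].append(cluster)

    estimates, weights = [], []
    for group in by_size.values():
        # Closed-form estimate within a balanced subsample: the grand mean.
        estimates.append(fmean(x for cluster in group for x in cluster))
        # Illustrative weight (an assumption): share of clusters of this size.
        weights.append(len(group) / len(clusters))

    # Non-iterative combination: a weighted average of subsample estimates.
    return sum(w * e for w, e in zip(weights, estimates))
```

For example, `split_sample_mean([[0.0, 2.0], [10.0, 10.0, 10.0]])` averages the two subsample means with equal weight, since there is one cluster of each size.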
Next, statistically and computationally efficient estimation in a hierarchical data setting with unequal cluster sizes and an AR(1) covariance structure was studied. As for the
compound-symmetry model, the pseudo-likelihood and split-sample methods of Fieuws
and Verbeke (2006) and Molenberghs, Verbeke, and Iddi (2011) were used. Maximum
likelihood estimation for AR(1) requires numerical iteration when cluster sizes are unequal. A near optimal non-iterative procedure was proposed. Results showed that the
method is statistically nearly as efficient as maximum likelihood, but shows great savings
in computation time.
The odds ratio is a frequently used measure to investigate the association between
binary variables. Often, such outcomes are measured across strata of different sizes.
Mantel and Haenszel (1959) proposed estimators for a common odds ratio, taking into
account the stratification. The most common one is among the best known and most
used estimators in statistics.
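For reference, the Mantel-Haenszel common odds ratio can be computed as below. This is a minimal sketch: the function name and the input layout, one `(a, b, c, d)` tuple of cell counts per stratum, are assumptions made for illustration.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over stratified 2x2 tables.

    Each table is (a, b, c, d): the four cell counts of one stratum,
    with a and d on the main diagonal.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: sum of b*c/n is zero")
    return numerator / denominator
```

With a single stratum the estimator reduces to the usual cross-product ratio ad/bc. With several strata, each stratum's contribution is scaled by its size n.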
The setting studied by Mantel and Haenszel fits within this framework of sample-splitting and combining with proper weights. The Mantel-Haenszel estimator does
not follow from optimality considerations, but nevertheless has properties similar to, and
often better than, those of the optimal estimator. This was established by comparing it to the optimal
estimator, whose existence was demonstrated in spite of the absence of complete sufficient
statistics. It is shown, via simulations, that the optimal estimator outperforms the Mantel-Haenszel estimator only in certain settings with huge sample sizes.
Missing data is almost inevitable in correlated-data studies. For non-Gaussian outcomes with moderate to large sequences, direct-likelihood methods can involve complex,
hard-to-manipulate likelihoods. Popular alternative approaches, like generalized estimating equations, that are frequently used to circumvent the computational complexity of
full likelihood, are less suitable when scientific interest, at least in part, is placed on the
association structure; pseudo-likelihood methods are then a viable alternative. When the
missing data are missing at random, Molenberghs et al. (2011) proposed a suite of corrections to the standard form of pseudo-likelihood, taking the form of singly and doubly
robust estimators. They provided the basis, and exemplified it in insightful yet primarily
illustrative examples. The important case of marginal models for hierarchical binary data
was considered. Our doubly robust estimator is more convenient than the classical doubly
robust estimators. The ideas are illustrated using a marginal model for a binary response,
more specifically a Bahadur model.
A sample is not always of a fixed, predetermined size. Examples are sequential studies, missing data, and unbalanced hierarchical data. In such settings there is often no complete sufficient statistic. A simple characterization of completeness is formulated for the exponential family in terms of a comparison of the dimensions of the sufficient statistic and the parameter, based on the implicit function theorem. It is a simple and easily verifiable criterion, even for complex settings with missing data and unbalanced hierarchical data.
Unbalanced hierarchical data have already been studied from various angles. In this thesis the focus is on samples that are very large, i.e., with many clusters or many measurements per cluster, and on samples that are very small (studies of rare diseases). Determining the maximum likelihood estimator is quite feasible in medium-sized samples, but in the settings discussed here it can cause difficulties: there are no closed-form analytical solutions, and the likelihood function can only be optimized iteratively. Consequently, the sample was split into parts according to cluster size (Molenberghs, Verbeke, and Iddi, 2011). The resulting subsamples are balanced and do yield closed-form solutions. A pseudo-likelihood was used to combine the solutions from the subsamples by means of weights. The properties of this methodology were examined in detail on balanced data following a compound-symmetry covariance structure. Its applicability was examined in a simulation study. It follows that this non-iterative method requires only a short computation time and is very precise.
Next, this estimation method was examined further in an unbalanced hierarchical dataset with an autoregressive (AR(1)) covariance structure. Here too, the method is nearly as efficient as maximum likelihood, and the computation time is much lower.
The odds ratio is a statistic that is frequently used to examine the association between binary variables. In this kind of setting too, groupings of the data of unequal size can occur. The best known and most used estimator is the one devised by Mantel and Haenszel (1959).
The estimator combines the odds ratios of subpopulations into a weighted estimator, but does not follow from optimality calculations. The Mantel-Haenszel estimator was compared with the optimal estimator. It can be concluded that the Mantel-Haenszel estimator has very good properties. Only in settings with very large sample sizes will the optimal estimator outperform the Mantel-Haenszel estimator.
Missing data occur very often in this kind of setting. For non-normally distributed data from a very large sample, computations with the likelihood function can become very complex. Generalized estimating equations are then a good alternative, but are less suitable when interest lies (in part) in the correlation structure of the data. Pseudo-likelihood functions are better suited here. When the missing data are missing at random, Molenberghs et al. (2011) made singly and doubly robust adjustments to the standard pseudo-likelihood function so as to allow correct inference. Whereas they laid the general basis, this work focused on marginal models for hierarchical binary data. A Bahadur model was chosen here as the marginal model.
The role of frailty in shaping social contact patterns in Belgium, 2022-2023
Social contact data are essential for understanding the spread of respiratory infectious diseases and designing effective prevention strategies. However, many studies overlook the heterogeneity in mixing patterns among older age groups and across individual frailty levels, assuming homogeneity across these sub-populations. This shortcoming may undermine non-pharmaceutical interventions by not targeting specific contact behaviours, potentially reducing their effectiveness in controlling disease. To address this gap, we conducted a contact survey in Flanders, Belgium (June 2022-June 2023). We collected data from 5995 participants (overall response rate of 19.34%) who recorded 31,375 contacts with distinct individuals. Within this cohort, 14.50% were classified as frail, and 46.85% were classified as non-frail. On average, participants reported 5.48 contacts daily, with a median of 4 contacts (IQR: 2-7). These contacts vary with participants' age and frailty levels and are influenced by the locations of their interactions. Using the collected data, we reconstructed frailty-dependent contact matrices and developed a contact-based mathematical model that integrates participants' and contactees' frailty levels to investigate how frailty levels affect transmission dynamics. Incorporating frailty levels into the mathematical model substantially alters the shape of epidemic curves and peak incidences. Such findings might be useful for informing non-pharmaceutical interventions, indicating the potential benefit of similar data collection in different countries.
Funding
Funding for this study [study number: 215366] was provided by GSK (GlaxoSmithKline). GSK was provided the opportunity to review a preliminary version of this publication for factual accuracy, but the authors are solely responsible for final content and interpretation.
Acknowledgements
The authors gratefully acknowledge the IMI VITAL project for their valuable input and feedback during the development of the study protocol. We extend our sincere thanks to the Ipsos team for conducting the survey, collecting data, and facilitating the rapid progress of this study. We especially appreciate the exceptional project management support provided by Sarah Vercruysse. All important findings will be communicated to the IMI VITAL WP3
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
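The difference between the two counting schemes can be made concrete with a short sketch. The data layout (each citing paper is a list of cited works, each cited work an ordered list of author names), the function name, and the choice to count a pair at most once per citing paper are all assumptions made for illustration. The study's exact counting rules are not given in the abstract.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations, crediting the first `max_authors` authors.

    max_authors=1 is first-author counting; max_authors=5 credits the
    first five authors of every cited work.
    """
    counts = Counter()
    for cited_works in citing_papers:
        # Authors credited with at least one work cited by this paper.
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Each pair of credited authors is co-cited once by this paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts
```

Calling the function with `max_authors=1` and then with `max_authors=5` on the same citing papers shows how non-first authors enter the co-citation counts under the second scheme.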
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
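To make the comparison between the Pearson correlation and the cosine concrete, the sketch below computes both for two cocitation profiles. The profiles and function names are hypothetical. The property illustrated, that appending authors with zero cocitations to both profiles changes the Pearson correlation but not the cosine, is one point often raised in this methodological discussion and is not necessarily the argument made in the paper itself.

```python
from math import sqrt


def cosine(x, y):
    """Cosine of the angle between two cocitation profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation counts of two authors with four other authors.
x = [3, 1, 0, 2]
y = [1, 3, 2, 0]

# The same profiles after adding four authors cocited with neither.
x_padded = x + [0, 0, 0, 0]
y_padded = y + [0, 0, 0, 0]
```

Here `cosine(x, y)` and `cosine(x_padded, y_padded)` agree, whereas `pearson(x, y)` is negative and `pearson(x_padded, y_padded)` is positive, because the added zeros shift both means.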
Optimal weighted estimation versus Cochran-Mantel-Haenszel
Sponsorship: Financial support from the IAP research network #P7/06 of the Belgian Government (Belgian Science Policy) is gratefully acknowledged. The research leading to these results has also received funding from the European Seventh Framework program FP7 2007-2013 under grant agreement Nr. 602552. We gratefully acknowledge support from the IWT-SBO ExaScience grant. Intego is funded on a regular basis by the Flemish Government (Ministry of Health and Welfare).
status: Publishe
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued
use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation
counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more
sophisticated methods
