Recommended from our members
Analysis of the Harvard Computer Society Email Archives: An Exploration of Differential Privacy in Practice
This thesis provides a rudimentary introduction to differential privacy as a framework for modern data privacy, using the Harvard Computer Society email list archives as an investigative medium. The differentially private analysis of this dataset includes, but is not limited to, time series of list usage, email topic modeling, and sentiment analysis. OpenDP’s Python package for differential privacy is used extensively to execute computations, and its API is evaluated as a standalone programming framework. Novel differentially private graph algorithms are implemented and empirically assessed. Lastly, this thesis discusses a significant inherent challenge in balancing the contrasting aims of differential privacy and exploratory data analysis.
Differential Expressiveness: A Data-Centered Perspective on Algorithmic Bias
The interdisciplinary study of algorithmic fairness and bias has enjoyed a meteoric rise in popularity over the past several years, motivated in no small part by the increasingly influential impact of machine learning in many aspects of daily life. One part of this field examines the foundational issue of bias being present in the training data that is provided to an algorithm, seeking to develop ways to describe and mitigate this issue.
We propose a new and broad characterization of a kind of data bias that we call differential expressiveness (DE). We formulate DE as a quality of an individual feature in a dataset, conveying a condition where the values of the feature cannot be consistently interpreted across different individuals. Contextualizing our presentation with an overview of the development of algorithmic fairness, we give two mathematical interpretations of DE and explore how the interpretations relate to one another. In addition, we discuss a variety of case studies illustrating how we can use DE to interpret data bias in real-world examples. Finally, we explore how DE complements existing frameworks in the literature for modelling data bias.
OpenDP Programming Framework for Renyi Privacy Filters and Odometers
Data scientists work with large-scale sensitive data, which inevitably leads to privacy risks. Differential Privacy (DP) is a mathematical definition of privacy that aims to mitigate privacy risks inherent in data analysis and machine learning.
OpenDP, an open-source DP software tool, allows government, industry, and academic institutions to share sensitive data with researchers or the public while preserving privacy.
An active research question in the DP literature is how to bound the total privacy loss of a sequence of DP computations. Most known DP theorems and the OpenDP library require the privacy parameters of each computation to be fixed in advance. However, this prevents the design of privacy-preserving machine learning algorithms that change the privacy budget on the fly. For the adaptive parameter setting, privacy filters and odometers are two objects designed to model the total privacy loss.
In this paper, we extend the programming framework of the OpenDP library to handle DP composition under adaptive privacy budgets through Rényi Differential Privacy (RDP). To do so, we construct a Rényi filter and odometer and prove their privacy guarantees by generalizing RDP Adaptive Composition. To generalize our odometer results, we implement a constructor that converts any odometer to a filter. Our results allow for the real-world DP deployment of ML algorithms and interactive query interfaces that adaptively update the privacy budget.
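The relationship between an odometer and a filter can be illustrated with a minimal Python sketch. It is not the OpenDP API and not the authors' construction: the class names `RenyiOdometer` and `RenyiFilter`, their methods, and the budget values are hypothetical, and the sketch assumes that RDP parameters at one fixed order `alpha` simply add up under adaptive composition. The conversion to (ε, δ)-DP uses the standard RDP bound ε = ε_RDP + log(1/δ)/(α − 1).

```python
import math
from dataclasses import dataclass


@dataclass
class RenyiOdometer:
    """Hypothetical odometer: tracks the running RDP loss at one fixed order."""
    alpha: float          # fixed Renyi order, must be > 1
    spent: float = 0.0    # sum of RDP epsilons recorded so far

    def __post_init__(self):
        if self.alpha <= 1:
            raise ValueError("alpha must be greater than 1")

    def record(self, epsilon: float) -> None:
        """Add the RDP parameter of one more computation to the running total."""
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        self.spent += epsilon

    def approx_dp_epsilon(self, delta: float) -> float:
        """Standard RDP to (eps, delta)-DP conversion for the loss so far."""
        return self.spent + math.log(1.0 / delta) / (self.alpha - 1.0)


@dataclass
class RenyiFilter:
    """Hypothetical filter built from an odometer plus a fixed total budget."""
    odometer: RenyiOdometer
    budget: float         # maximum total RDP epsilon at the odometer's order

    def try_spend(self, epsilon: float) -> bool:
        """Admit the computation only if the total stays within the budget."""
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        if self.odometer.spent + epsilon > self.budget:
            return False  # refuse: this computation would exceed the budget
        self.odometer.record(epsilon)
        return True


# The analyst may choose each epsilon adaptively, after seeing earlier results.
example_filter = RenyiFilter(odometer=RenyiOdometer(alpha=2.0), budget=1.0)
admitted = [example_filter.try_spend(eps) for eps in (0.5, 0.25, 0.5)]
```

In this sketch the filter is just the odometer with a stopping rule attached, which mirrors the idea of a constructor that turns an odometer into a filter. The actual privacy proof for such a rule is the subject of the paper and is not reproduced here.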
In the Blink of an Eye: A Unified Theory for Feature Emergence in Generative Models
Generative models, which produce samples of data such as text or images, are transforming the way we interact with technology. However, they often fail quickly in problematic and unintuitive ways. For example, a language model given a software engineering problem suddenly switched from coding to searching for pictures of Yellowstone National Park, and these rapid shifts in behavior have been observed in reasoning traces and hacks. This phenomenon is not unique to language models: in image generation models, key features of the final output, like objects in the background or the color, are also decided in narrow “critical windows” of the generation process.
While critical windows for a particular type of image generation model called diffusion have been studied at length by statistical physicists, existing theory relies on the specifics of diffusion and strong assumptions on the distribution of model generations. In this thesis, we develop a unifying framework for critical windows that shows that they emerge generically when the sampler specializes to a sub-population of the distribution it models. Drawing on tools from information theory, machine learning, high-dimensional probability theory, and statistical physics, our theory improves upon previous work by using rigorous mathematical tools and is agnostic to the underlying model type or distribution, applying to both language models and diffusion. The key insight of our approach is to exploit the powerful formalism for generative models of stochastic localization, which has roots as a proof technique in probability theory.
We apply our consolidated theory to different examples of critical windows in theoretical and empirical contexts. We provide a novel interpretation of the all-or-nothing phase transition in statistical inference as a critical window and use our framework to explain different failure modes of language models. Finally, we validate our predictions empirically for real-world models and demonstrate that critical windows have applications towards improving the safety, privacy, and fairness of generative models.
Concurrent Composition of Interactive Mechanisms with Adaptive Privacy-loss Parameters
Over recent decades, the prevalence of data analysis algorithms has made privacy an increasingly pressing concern. Differential privacy is one framework for providing privacy guarantees for the analysis of sensitive data. A persistent research direction in differential privacy is providing theoretical support for the variety of ways a data analyst can interact with a dataset, so that practical implementations can have provable privacy guarantees.
This thesis is concerned with the setting in which an analyst interacting with a set of differentially private mechanisms is interested both in adaptively interleaving queries between mechanisms and in creating new ones. Previous work has provided provable guarantees for the sequential composition of non-interactive mechanisms with adaptive privacy-loss parameters, and for the concurrent composition of interactive mechanisms with pre-fixed privacy-loss parameters, but no work has addressed the setting in which both the interaction and the privacy-loss parameters can be chosen adaptively. Hence, we study the concurrent composition of interactive differentially private mechanisms with adaptively chosen privacy-loss parameters. We provide formulations of privacy filters and odometers, specialized interactive mechanisms that allow for concurrent and adaptive composition and also support privacy-loss tracking. We prove that every valid privacy filter and odometer for non-interactive mechanisms extends to the concurrent composition of interactive mechanisms if privacy loss is measured using (ε, δ)-DP, f-DP, or Rényi DP of fixed order.
Our results offer strong theoretical foundations for enabling full adaptivity in composing differentially private interactive mechanisms, showing that concurrency does not affect the privacy guarantees. This thesis allows simplifications of existing code repositories and also widens the range of scenarios in which differentially private mechanisms can be applied with robust privacy guarantees.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
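The difference between the two counting schemes can be made concrete with a small Python sketch. It assumes one possible convention (each pair of credited authors is counted once per citing paper) and uses made-up author labels; the function name `cocitation_counts` and the parameter `max_authors` are hypothetical, and the study's exact counting rules may differ.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a collection of citing papers.

    citing_papers: list of reference lists; each reference is an ordered
                   list of author names of one cited work.
    max_authors:   how many leading authors of each cited work are credited
                   (1 = traditional first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Collect the authors credited by this citing paper.
        credited = set()
        for authors in references:
            credited.update(authors[:max_authors])
        # Each unordered pair of credited authors is co-cited once per paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts


# Two citing papers, each with two cited works (illustrative labels only).
papers = [
    [["A", "B"], ["C"]],
    [["C", "D"], ["B", "A"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting only the pairs (A, C) and (B, C) appear, each once. Crediting up to five authors brings D into the picture and raises the counts for pairs such as (A, B), which is the kind of denser data the comparison in the abstract is about.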
Finding Simple Models of Complex Objects: From Regularity Lemmas to Algorithmic Fairness
In this thesis, we study connections between the recent literature on multi-group fairness for prediction algorithms and previous well-known results in graph theory, computational complexity, additive combinatorics, information theory, and cryptography. Our starting point is the definitions of multiaccuracy and multicalibration, which have established themselves as mathematical measures of algorithmic fairness. Multicalibration guarantees accurate (calibrated) predictions for every subpopulation that can be identified within a specified class of computations, whereas multiaccuracy is a strictly weaker notion which only guarantees accuracy on average.
The task of building multiaccurate predictors is closely related to the well-known regularity lemma, which is an older result in computational complexity. This is a central theorem that has many important implications in different areas, including the weak Szemerédi regularity lemma in graph theory, Impagliazzo’s Hardcore Lemma in complexity theory, the Dense Model Theorem in additive combinatorics, computational analogues of entropy in information theory, and weaker notions of zero-knowledge in cryptography. The relationship between multiaccuracy and the regularity lemma thus implies that a multiaccurate predictor can be used to prove all of these fundamental theorems. By formalizing this observation, we then ask: If we start with a multicalibrated predictor instead, what strengthened and more general versions of these fundamental theorems do we obtain? Through the lens of multi-group fairness, we are able to cast the notion of multicalibration back into the realm of complexity theory and obtain stronger and more general versions of Impagliazzo’s Hardcore Lemma, characterizations of pseudoentropy, and the Dense Model Theorem. Moreover, along the way, we present a unified approach to all these fundamental theorems.
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
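One mathematical property that separates the two measures can be checked with a short Python sketch. It is offered as an illustration only and is not claimed to be the article's argument: the profile values are made up, and the helper names `cosine` and `pearson` are hypothetical. Appending entries where both authors have zero cocitations leaves the cosine unchanged but shifts the means and therefore the Pearson correlation.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length numeric profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centered profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Made-up cocitation profiles of two authors with four other authors.
x = [3, 1, 0, 2]
y = [1, 3, 2, 0]

# The same profiles after adding four authors cocited with neither of them.
x_padded = x + [0, 0, 0, 0]
y_padded = y + [0, 0, 0, 0]

cosine_before, cosine_after = cosine(x, y), cosine(x_padded, y_padded)
pearson_before, pearson_after = pearson(x, y), pearson(x_padded, y_padded)
```

For these values the cosine is the same before and after padding, while the Pearson correlation moves from negative to positive, so the conclusion about the two authors depends on how many unrelated authors happen to be included in the analysis.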
