1,721,170 research outputs found

    Approximating the Sparsest Cut

    No full text

    How to train your instruction-following text encoder without labeling

    No full text
    Text embedding models encode the semantic content of natural language inputs into fixed-length vectors. Contrastive learning has been the go-to training strategy to ensure semantically similar vectors are closely positioned in the embedding space. While successful, this training recipe requires large amounts of labeled training data in order to cover diverse domains. In addition, current text embedding models cannot adhere to user instructions when encoding inputs. In this thesis, we provide the first attempt to build text embedding models that can (1) adhere to user instructions, and (2) generalize without domain-specific annotated data. We leverage the recent success of generative large language models (LLMs), which exhibit strong domain generalization and rich latent knowledge. Transferring these properties to text encoders can enrich their contextualized representations and allow for instruction-controlled representations. In this work, we rely on the state-space model (SSM) parametrization to achieve such goals. SSMs are defined through ordinary differential equations with respect to the state vector, capturing the dynamics of an information system over time. State vectors are a fixed-length compression of the system’s past trajectory. We show that, empirically, state vectors in learned, discretized SSMs still preserve this information-rich property in language modeling tasks, and can be applied off-the-shelf to perform text embedding tasks. The instruction-following ability of pretrained generative LMs allows the state vectors to be sensitive to user intentions and can successfully compress different information, given different prompts

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    No full text
    Supervised learning has long been considered an empirically successful and theoretically well motivated paradigm in machine learning which continues to be an active area of research. On the other hand, recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically “similar” data points and “negative samples,” the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. In contrast with supervised learning, such methods lack a strong theoretical grounding and are thus not as well understood. This paper uses the term {\em contrastive} learning for such algorithms and presents a theoretical framework to analyze them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This minimal framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce labeled sample complexity on downstream tasks. To support the theory we conduct controlled experiments in both the text and image domains using function classes of practical interest. We hope that such theoretical frameworks can, in the future, promote a principled study of unsupervised learning methods

    Sparse and Efficient Transfer Learning via Winning Lottery Tickets

    No full text
    In this thesis, we extend the Lottery Ticket Hypothesis of Frankle & Carbin (ICLR `19) to a variety of transfer learning problems. We identify sparse, trainable sub-networks that can be found on a source dataset and transferred to a variety of down-stream tasks. Our results show that sparse sub-networks with approximately 85-95% of weights removed exceed the accuracy of the original network when transferred to other tasks. We experimentally show that a sparse representation learned by a deep convolutional network trained on CIFAR-10 can be transferred to SmallNORB and FashionMNIST in a number of realistic settings. In addition, we show the existence of the first sparse, trainable sub-networks for natural language tasks; in particular, we show that BERT with up to 81.5% of parameters removed can reach the original test accuracy for the CoNLL-2003 Named Entity Recognition task

    Predicting Netflix Movie Ratings using a Topic Modeling Algorithm

    No full text
    Latent factor models and matrix factorization algorithms were some of the most successful stand-alone algorithms used for predicting movie ratings in the Netflix Prize. To address the sparsity in the movie rating training set, many matrix factorization algorithms train only on the observed ratings and use regularization to avoid overfitting. Topic modeling algorithms must also be able to handle high sparsity. Given a collection of documents, the purpose of topic modeling is to discover the high-level thematic structure that best explains the collection of documents as a whole. In the same way, we might hope that given a collection of movie ratings, we can uncover the high-level movie genres that best explain the collection of movie ratings as a whole. Mathematically, topic modeling can be interpreted as recovering the first factor in a matrix factorization, subject to some constraints. By this view, perhaps a topic modeling algorithm can be the first step in a matrix factorization algorithm that predicts Netflix movie ratings. In this thesis, we develop a three-step algorithm for predicting movie ratings using a matrix factorization of the form M = AW: first we obtain a collection of genres using a topic modeling algorithm, then we generate a suitable A matrix from the collection of genres, and finally we use the A matrix to get the W matrix

    On Provable Algorithms for Topic Modeling

    No full text

    MCMC algorithms for sampling from multimodal and changing distributions

    No full text
    The problem of sampling from a probability distribution is a fundamental problem in Bayesian statistics and machine learning, with applications throughout the sciences. One common algorithmic framework for solving this problem is Markov Chain Monte Carlo (MCMC). However, a large gap exists between simple settings where MCMC has been proven to work, and complex settings arising in practice. In this thesis, I make progress towards closing this gap, focusing on two hurdles in particular. In Chapter 2, I consider the problem of sampling from *multimodal* distributions. Many distributions arising in practice, from simple mixture models to deep generative models, are multimodal, so any Markov chain which makes local moves will get stuck in one mode. Although a variety of temperature heuristics are used to address this problem, their theoretical guarantees are not well-understood even for simple multimodal distributions. I analyze an algorithm combining Langevin diffusion with *simulated tempering*, a heuristic which speeds up mixing by transitioning between different temperatures of the distribution. I develop a general method to prove mixing time using ``soft decompositions'' of Markov processes, and use it to prove rapid mixing for (polynomial) mixtures of log-concave distributions. In Chapter 3, I address the problem of sampling from the distributions pt(x)ek=0tfk(x)p_t(x) \propto e^{-\sum_{k=0}^t f_k(x)} for each epoch 1tT1\le t\le T in an *online* manner, given a sequence of (convex) functions f0,,fTf_0,\ldots, f_T. This problem arises in large-scale Bayesian inference (for instance, online logistic regression) where instead of obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. All previous results for this problem imply a bound on the number of gradient evaluations at each epoch tt that grows at least linearly in tt. For this problem, I show that a certain variance-reduced SGLD (stochastic gradient Langevin dynamics) algorithm solves the online sampling problem with fixed TV-error ϵ\epsilon with an *almost constant* number of gradient evaluations per epoch
    corecore