
    Hierarchical Dirichlet scaling process

    We present the hierarchical Dirichlet scaling process (HDSP), a Bayesian nonparametric mixed membership model. The HDSP generalizes the hierarchical Dirichlet process to model the correlation structure between metadata in the corpus and mixture components. We construct the HDSP based on the normalized gamma representation of the Dirichlet process, and this construction allows incorporating a scaling function that controls the membership probabilities of the mixture components. We develop two scaling methods to demonstrate that different modeling assumptions can be expressed in the HDSP. We also derive the corresponding approximate posterior inference algorithms using variational Bayes. Through experiments on datasets of newswire, medical journal articles, conference proceedings, and product reviews, we show that the HDSP achieves better predictive performance than labeled LDA, partially labeled LDA, and the author topic model, and better negative review classification performance than the supervised topic model and SVM.
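The normalized gamma construction mentioned in this abstract can be illustrated with a small finite-dimensional sketch. It relies on the standard fact that normalizing independent gamma draws with a common scale yields a Dirichlet sample; multiplying each draw by a per-component scale before normalizing then shifts the membership probabilities. The function name `scaled_normalized_gamma`, its parameters, and the fixed `scales` list are assumptions made for illustration only. This is not the authors' model or inference code, and the HDSP's actual scaling function (which depends on metadata) is not reproduced here.

```python
import random


def scaled_normalized_gamma(base_weights, scales, concentration=1.0, seed=0):
    """Draw membership probabilities by normalizing scaled gamma variables.

    Hypothetical helper for illustration. With all scales equal, the result
    is a sample from Dirichlet(concentration * base_weights). Unequal scales
    up-weight or down-weight individual components before normalization.
    """
    rng = random.Random(seed)
    # One gamma draw per component: shape = concentration * base weight, scale = 1.
    draws = [
        s * rng.gammavariate(concentration * b, 1.0)
        for b, s in zip(base_weights, scales)
    ]
    total = sum(draws)
    # Normalizing turns the positive draws into a probability vector.
    return [d / total for d in draws]


if __name__ == "__main__":
    base = [0.5, 0.3, 0.2]
    unscaled = scaled_normalized_gamma(base, [1.0, 1.0, 1.0], concentration=5.0, seed=1)
    boosted = scaled_normalized_gamma(base, [4.0, 1.0, 1.0], concentration=5.0, seed=1)
    print(unscaled)
    print(boosted)
```

Because both calls use the same seed, they share the same underlying gamma draws, so the only difference is the scale applied to the first component, whose probability rises in the second call.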

    Topical interest & degree of involvement of bilingual editors in Wikipedia

    Language reveals a lot of information about its speakers. Speakers of one language usually share common cultural habits or regional characteristics, and their similarities become more obvious in contexts where multiple languages are in use. We focus on studying bilingual users of Wikipedia, one of the largest multilingual user-generated content platforms. In Wikipedia, we can observe these patterns in the English edition, where users of multiple languages come together to express their thoughts and interests in the common language of English. To understand the specific topics edited by bilingual users, we analyze them in terms of revision counts, topics, and country names. We find that bilingual users are generally interested in more local topics, and that their language is closely related to their topics. We also observe that topical diversity decreases as the proportion of English edits grows, with edits concentrating more on topics related to countries and cultures.

    Pythonpad

    We propose Pythonpad, an open-source JavaScript library that supports web-based Python programming exercises. Unlike other standalone web-based programming tools, Pythonpad can be easily integrated into other websites. Although it runs learners' Python code in client-side web browsers, Pythonpad supports a file system, building and importing external modules, and many essential built-in Python libraries to teach basic programming concepts in CS1 classes. © 2021 Owner/Author

    CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course

    We introduce CS1QA, a dataset for code-based question answering in the programming education domain. CS1QA consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, and 17,698 unannotated chat data with code. Each question is accompanied by the student’s code and the portion of the code relevant to answering the question. We carefully design the annotation process to construct CS1QA, and analyze the collected dataset in detail. The tasks for CS1QA are to predict the question type, to select the relevant code snippet given the question and the code, and to retrieve an answer from the annotated corpus. Results for the experiments on several baseline models are reported and thoroughly analyzed. The tasks for CS1QA challenge models to understand both the code and natural language. This unique dataset can be used as a benchmark for source code comprehension and question answering in the educational setting.

    Conversation model fine-tuning for classifying client utterances in counseling dialogues

    The recent surge of text-based online counseling applications enables us to collect and analyze interactions between counselors and clients. A dataset of those interactions can be used to learn to automatically classify client utterances into categories that help counselors diagnose client status and predict counseling outcomes. With proper anonymization, we collect counselor-client dialogues, define meaningful categories of client utterances with professional counselors, and develop a novel neural network model for classifying the client utterances. The central idea of our model, ConvMFiT, is a pre-trained conversation model which consists of a general language model built from an out-of-domain corpus and two role-specific language models built from unlabeled in-domain dialogues. The classification result shows that ConvMFiT outperforms state-of-the-art comparison models. Further, the attention weights in the learned model confirm that the model finds the expected linguistic patterns for each category.