Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)
In this dissertation, we introduce a novel text representation method intended mainly for text classification. The representation is initially based on a variety of closeness relationships between pairs of words in text passages across the entire corpus. It then serves as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented according to their contexts and the goal of the learning task. This differs from traditional representation methods, in which all documents are represented solely by their constituent words, in isolation from the goal they are represented for. We believe that choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general space with fewer dimensions can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data to be interpreted in a more conceptual space, rather than one containing only the scattered words that occur in texts. It aims to extract knowledge tailored to the classification task by automatically creating a lightweight ontological hierarchy of representations. In the last step, we train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task over a stack of representations and the granular conceptual elements of its different levels (a hierarchical mapping instead of a linear mapping over a vector). Finally, the entire algorithm is applied to a variety of general text classification tasks, and its performance is evaluated in comparison with well-known algorithms.
Towards a Privacy Preserving Framework for Publishing Longitudinal Data
Recent advances in information technology have enabled public organizations and corporations to collect and store huge amounts of individuals' data in data repositories. Such data are powerful sources of information about an individual's life such as interests, activities, and finances. Corporations can employ data mining and knowledge discovery techniques to extract useful knowledge and interesting patterns from large repositories of individuals' data. The extracted knowledge can be exploited to improve strategic decision making, enhance business performance, and improve services. However, person-specific data often contain sensitive information about individuals and publishing such data poses potential privacy risks. To deal with these privacy issues, data must be anonymized so that no sensitive information about individuals can be disclosed from published data while distortion is minimized to ensure usefulness of data in practice. In this thesis, we address privacy concerns in publishing longitudinal data. A data set is longitudinal if it contains information of the same observation or event about individuals collected at several points in time. For instance, the data set of multiple visits of patients of a hospital over a period of time is longitudinal. Due to temporal correlations among the events of each record, potential background knowledge of adversaries about an individual in the context of longitudinal data has specific characteristics. None of the previous anonymization techniques can effectively protect longitudinal data against an adversary with such knowledge. In this thesis we identify the potential privacy threats on longitudinal data and propose a novel framework of anonymization algorithms in a way that protects individuals' privacy against both identity disclosure and attribute disclosure, and preserves data utility. 
In particular, we propose two privacy models, (K,C)^P-privacy and (K,C)-privacy, and for each of these models we propose efficient algorithms for anonymizing longitudinal data. An extensive experimental study demonstrates that our proposed framework can effectively and efficiently anonymize longitudinal data.
Learning relational cliches with contextual generalization.
Inductive logic programming (ILP) is concerned with the problem of inducing concepts represented as logic programs (or Horn clauses) from examples. Top-down inductive learners such as FOIL (Quinlan 1990; Cameron-Jones et al. 1993) learn Horn clauses adding one literal at a time using a hill-climbing search. These learners suffer from local plateaus, where the selection of a conjunction of literals, rather than a single literal, would improve the accuracy of the clause. The problem becomes the search for combinations of literals rather than just single literals. A mechanism to search efficiently through the space of combinations of literals is needed. The FOCL system (Pazzani et al. 1991) solved this problem by giving the concept learner hand-made "relational cliches" which are combinations of literals to consider while learning. The problem is that these cliches are hard to derive and often specific to a domain. So, it would be desirable to learn them automatically. As a part of this thesis, an inductive learner called CLUSE (Cliches Learned and USEd) has been developed that learns combinations of literals called relational cliches. The underlying idea is to learn cliches from examples of a concept and to use them with a hill-climbing learner to escape local plateaus. Cliches are learned from a concept in one domain and used to learn concepts within the same domain as well as across domains. Assuming that cliches are learned and used in the same domain, literals used to express different concepts overlap. Consequently cliches learned from one concept should provide appropriate lookahead to learn concepts in the same domain. On the other hand, these cliches probably have few literals in common with concepts across domains, hence the need for more general cliches. 
To solve this, CLUSE learns two kinds of cliches: Domain Dependent Cliches, expressed as a conjunction of literals specific to a domain, and Domain Independent Cliches, in which literals have variable predicate symbols. CLUSE is a bottom-up inductive relational learner based on Relative Least General Generalization (RLGG). To remedy the inefficiency and overgeneralization problems of RLGG, a modified version of RLGG has been developed that exploits the context in which LGG is applied. The modified RLGG is called Contextual Least General Generalization (CLGG). Empirical experiments with CLUSE reveal that cliches learned with CLUSE provide appropriate lookahead to escape local plateaus of a hill-climbing learner both within and across domains. For the purpose of the evaluation, FOIL has been extended to learn concepts with or without cliches. In two domains of application, cliches have proven to be useful. One domain is the real-life application of defining structures for the finite element method (FEM). The other is the synthetic domain of blocks, which offers a wide variety of problems (or concepts). Other domains of application, such as drug design, text categorization, and detecting traffic problems, are also discussed.
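The abstract builds on least general generalization (LGG) but does not spell it out. As a rough illustration only, the sketch below computes the plain LGG (anti-unification) of two first-order terms. It is not RLGG, CLGG, or any part of CLUSE; the term encoding (tuples for compound terms, strings for constants) and the variable names `V0`, `V1`, ... are assumptions made for this example.

```python
def lgg(s, t, table=None):
    """Least general generalization of two terms.

    A compound term is a tuple (functor, arg1, ..., argN); a constant is a string.
    Each distinct pair of mismatching subterms is replaced by its own variable,
    and the same pair always maps to the same variable.
    """
    if table is None:
        table = {}
    if s == t:
        return s
    same_shape = (
        isinstance(s, tuple) and isinstance(t, tuple)
        and s[0] == t[0] and len(s) == len(t)
    )
    if same_shape:
        # Same functor and arity: generalize argument by argument.
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"  # fresh variable (hypothetical naming)
    return table[(s, t)]


print(lgg(("on", "a", "b"), ("on", "c", "b")))
print(lgg(("p", ("f", "a"), "a"), ("p", ("f", "b"), "b")))
```

The first call keeps the shared constant `b` and replaces the mismatching pair `a`/`c` with a variable. The second reuses one variable for both occurrences of the pair `a`/`b`, which is what makes the result the least general common generalization rather than an arbitrary one.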
Task Oriented Privacy-preserving (TOP) Technologies Using Automatic Feature Selection
A large amount of digital information collected and stored in datasets creates vast opportunities for knowledge discovery and data mining. These datasets, however, may contain sensitive information about individuals and, therefore, it is imperative to ensure that their privacy is protected.
Most research in the area of privacy preserving data publishing makes no assumptions about the analysis task that will be applied to the dataset. In many domains, such as healthcare and finance, however, it is possible to identify the analysis task beforehand. Incorporating such knowledge of the ultimate analysis task may improve the quality of the anonymized data while protecting the privacy of individuals. Furthermore, the existing research that does consider the ultimate analysis task (e.g., classification) is not suitable for high-dimensional data.
We show that automatic feature selection (which is a well-known dimensionality reduction technique) can be utilized in order to consider both aspects of privacy and utility simultaneously. In doing so, we show that feature selection can enhance existing privacy preserving techniques addressing k-anonymity and differential privacy and protect privacy while reducing the amount of modifications applied to the dataset; hence, in most of the cases achieving higher utility.
We consider incorporating the concept of privacy-by-design within the feature selection process. We propose techniques that turn filter-based and wrapper-based feature selection into privacy-aware processes. To this end, we build a layer of privacy on top of the regular feature selection process and obtain a privacy preserving feature selection that is guided not only by accuracy but also by the amount of protected private information.
In addition to considering privacy after feature selection, we introduce a framework for a privacy-aware feature selection evaluation measure. That is, we incorporate privacy during feature selection and obtain a list of candidate privacy-aware attribute subsets that consider (and satisfy) both efficacy and privacy requirements simultaneously.
Finally, we propose a multi-dimensional, privacy-aware evaluation function which incorporates efficacy, privacy, and dimensionality weights and enables the data holder to obtain the best attribute subset according to its preferences.
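To make the link between feature selection and k-anonymity concrete, here is a minimal sketch, not the thesis's method: it checks k-anonymity by grouping records on a chosen attribute subset and shows that dropping an attribute can satisfy the requirement without modifying any values. The records, attribute names, and helper functions are invented for illustration.

```python
from collections import Counter


def smallest_group(records, attributes):
    """Size of the smallest group of records sharing the same values on `attributes`."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values())


def is_k_anonymous(records, attributes, k):
    """True if every combination of attribute values occurs at least k times."""
    return smallest_group(records, attributes) >= k


# Toy data (hypothetical): each dict is one individual's quasi-identifiers.
records = [
    {"age": "30-39", "zip": "K1A", "sex": "F"},
    {"age": "30-39", "zip": "K1A", "sex": "M"},
    {"age": "30-39", "zip": "K1A", "sex": "F"},
    {"age": "40-49", "zip": "K2B", "sex": "M"},
    {"age": "40-49", "zip": "K2B", "sex": "M"},
]

print(is_k_anonymous(records, ["age", "zip", "sex"], k=2))  # one record is unique
print(is_k_anonymous(records, ["age", "zip"], k=2))         # fewer attributes, larger groups
```

With all three attributes, one record forms a group of size one, so the data would need generalization or suppression to reach 2-anonymity. Restricting publication to `age` and `zip` already yields groups of three and two. Whether the dropped attribute matters for the analysis task is exactly the trade-off a privacy-aware selection measure would have to weigh.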
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
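The two counting schemes can be sketched as follows. This is an illustrative reading of the abstract, not the study's actual procedure: the data layout, the function name `cocitation_counts`, and the choice to count each author pair at most once per citing paper (including co-authors of the same cited work) are assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often pairs of authors are cited together.

    citing_papers: one reference list per citing paper; each reference is the
        ordered author list of a cited work.
    max_authors: number of leading authors of each cited work to credit
        (1 reproduces first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()  # authors credited as cited by this paper
        for authors in references:
            cited.update(authors[:max_authors])
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Toy data (hypothetical): two citing papers and their reference lists.
papers = [
    [["Smith", "Lee"], ["Garcia", "Chen", "Patel"]],
    [["Smith", "Lee"], ["Chen"]],
]

first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(len(first_only), len(first_five))
```

Under first-author counting, Lee never appears in any pair. With up to five authors credited, Lee is co-cited with Smith and Chen in both papers, so the same citations yield a denser set of author pairs.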
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
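The abstract does not state which objections to the Pearson correlation the authors raise. As a hedged illustration of one well-known difference between the two measures, the sketch below compares them on two made-up cocitation profiles before and after adding authors with whom neither profile has any cocitations. The vectors are invented, and the two lesser-known measures mentioned in the abstract are not reproduced here.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two cocitation profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation between two cocitation profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical cocitation counts of two authors with four other authors.
x = [3, 1, 0, 2]
y = [1, 3, 2, 0]

# The same profiles after adding 20 authors cocited with neither of them.
padded_x = x + [0] * 20
padded_y = y + [0] * 20

print(cosine(x, y), cosine(padded_x, padded_y))
print(pearson(x, y), pearson(padded_x, padded_y))
```

The cosine is unchanged by the added zeros, whereas the Pearson correlation moves from negative to positive because the zeros shift both means. Whether this sensitivity is desirable depends on how the set of authors under study is chosen.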
