A Novel Boolean Kernels Family for Categorical Data
Kernel-based classifiers, such as SVMs, are considered state-of-the-art algorithms and are widely used in many classification tasks. However, methods of this kind are hard to interpret, and for this reason they are often regarded as black-box models. In this paper, we propose a new family of Boolean kernels for categorical data in which features correspond to propositional formulas applied to the input variables. The idea is to create human-readable features that ease the extraction of interpretation rules directly from the embedding space. Experiments on artificial and benchmark datasets show the effectiveness of the proposed family of kernels with respect to established ones, such as RBF, in terms of classification accuracy.
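To make the idea of features as propositional formulas concrete, here is a minimal sketch of one simple Boolean kernel, the monotone conjunctive kernel, on binary (e.g. one-hot encoded categorical) vectors. It is an illustration only and is not claimed to be the kernel family proposed in the paper; the function names and the example vectors are assumptions chosen for this sketch. Each feature is a conjunction of d distinct input variables, so the kernel counts the conjunctions that are true in both inputs.

```python
from itertools import combinations
from math import comb


def monotone_conjunctive_kernel(x, z, d):
    """Count the conjunctions of d distinct variables that are true in both x and z.

    x and z are equal-length sequences of 0/1 values. Any such conjunction
    must use only variables active in both vectors, so the count is
    C(<x, z>, d).
    """
    shared = sum(a & b for a, b in zip(x, z))  # dot product of binary vectors
    return comb(shared, d)


def explicit_conjunction_features(x, d):
    """Explicit embedding: one feature per d-subset of variables.

    A feature is 1 when every variable in the subset is true in x.
    Used here only to check the closed form on small inputs.
    """
    return [int(all(x[i] for i in idx)) for idx in combinations(range(len(x)), d)]


# Hypothetical example vectors (not taken from the paper).
x = [1, 1, 0, 1]
z = [1, 1, 1, 0]

closed_form = monotone_conjunctive_kernel(x, z, 2)
explicit = sum(
    p * q
    for p, q in zip(explicit_conjunction_features(x, 2),
                    explicit_conjunction_features(z, 2))
)
print(closed_form, explicit)
```

Because every feature is a readable formula such as "x0 AND x1", a feature with a large weight in the learned model can be read directly as a candidate rule, which is the kind of interpretability the abstract refers to.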
A Heuristic for Multi-attribute Vehicle Routing Problems in Express Freight Transportation
Learning representations for biomedical named entity recognition
Biomedical Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize and categorize different types of entities in biomedical documents. Recently, the literature has shown effective methods based on combinations of Machine Learning algorithms and Natural Language Processing techniques. However, a critical issue in such applications is the choice of the data representation. Generic and abstract word embeddings can easily be used to train a learning algorithm without prior knowledge of the domain. On the other hand, dedicated hand-crafted features are expensive to define, but they may represent the specific problem better. In this work, an extensive experimental assessment is carried out in which different representations are analyzed. Then, a general framework that learns the representation by combining general and domain-specific features is proposed and evaluated, with empirical results on the CRAFT corpus.
Automatic Detection of Cross-language Verbal Deception
The assessment of how a deceptive message is produced in different languages has received little attention, with the majority of studies focused on the English language. Moreover, there is no agreement about the stability of linguistic clues of deceit across different languages. In this paper, we address this issue by analysing both theory-driven linguistic markers of deception (cognitive load hypothesis) and standard text categorisation features. After compiling a multilingual corpus of both honest and deceitful first-person opinions regarding five different topics, we assessed the cross-language applicability of four different feature sets in within-topic, cross-topic and cross-language binary classification experiments. Results showed promising classification performance in all three experiments, with few exceptions. Interestingly, linguistic markers of deceit linked to the cognitive load hypothesis exhibited the same trend in the two languages under investigation, and the cross-language evaluation highlighted their usefulness in spotting deceit across languages.
Learning adaptive representations for entity recognition in the biomedical domain
Background
Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step in these applications is the choice of the representation used to describe the data. Several representations have been proposed in the literature. Some of them rely on strong knowledge of the domain and consist of features manually defined by domain experts. These representations usually describe the problem well, but they require a lot of human effort and annotated data. On the other hand, general-purpose representations such as word embeddings do not require human domain knowledge, but they may be too general for a specific task.
Results
This paper investigates methods to learn the best representation directly from data, by combining several knowledge-based representations and word embeddings. Two mechanisms are considered for performing the combination: neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score.
Conclusions
Our experiments show that the principled combination of general, domain-specific, word-level, and character-level representations improves the performance of entity recognition. We also discuss the contribution of each representation to the final solution.
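As a rough illustration of how representations can be combined at the kernel level, the sketch below builds one Gram matrix per representation and mixes them with non-negative weights that sum to 1. This is only a sketch under stated assumptions: the weights are fixed by hand here, whereas a Multiple Kernel Learning algorithm would learn them from data, and the helper names, the linear kernel, and the toy feature vectors are all hypothetical rather than taken from the paper.

```python
def linear_kernel(a, b):
    """Plain dot product between two feature vectors."""
    return sum(p * q for p, q in zip(a, b))


def gram_matrix(examples):
    """Gram matrix of the linear kernel over a list of feature vectors."""
    return [[linear_kernel(a, b) for b in examples] for a in examples]


def combine_kernels(grams, weights):
    """Convex combination of Gram matrices of identical shape.

    The weights are hand-picked in this sketch; in Multiple Kernel
    Learning they would be optimised on the training data.
    """
    if len(grams) != len(weights):
        raise ValueError("need exactly one weight per Gram matrix")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(grams[0])
    return [
        [sum(w * g[i][j] for g, w in zip(grams, weights)) for j in range(n)]
        for i in range(n)
    ]


# Two toy representations of the same two tokens (hypothetical values):
# a 2-dimensional word embedding and a 1-dimensional gazetteer flag.
word_embeddings = [[1.0, 0.0], [0.0, 1.0]]
gazetteer_flags = [[1.0], [1.0]]

combined = combine_kernels(
    [gram_matrix(word_embeddings), gram_matrix(gazetteer_flags)],
    [0.5, 0.5],
)
print(combined)
```

The combined matrix can be passed to any kernel classifier that accepts a precomputed kernel, and the size of each weight gives a direct indication of how much each representation contributes.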
