1,721,062 research outputs found
Demonstrating non-inferiority of easy interpretable methods for insolvency prediction
Insolvency prediction and credit rating are challenging tasks used to evaluate commercial enterprises based on qualitative and quantitative attributes. One way to approach this task is machine learning whereby hypotheses are trained on sample data. The advantages are the automatization of the process obviating the need of human knowledge and thus, its high level objectivity. Nevertheless, this approach does not claim to be perfect as it does not completely replace human knowledge. Hence, these hypotheses are intended to be used as decision support for financial experts and thus, offer an advantage over black box hypotheses. We demonstrate how easily interpretable white box hypotheses (Decision Trees (DTs) and Disjunctive Normal Forms (DNFs)) are not inferior to more difficult interpretable gray box hypotheses (Random Forests (RFs) and Artificial Neural Networks (ANNs)) or even non-interpretable black box hypotheses (Support Vector Machines (SVMs)). We calculate DTs by means of Quinlans famous C4.5 algorithm. For DNFs, we developed an algorithm we call threshold heuristic. In our case study, a database with financial statements of 5152 enterprises is used to evaluate the performance in insolvency prediction for all classifiers. A common problem in insolvency prediction is extremely imbalanced data because naturally there are very few insolvent enterprises. Therefore we apply an asymmetric bagging method which increases the performance with extremely imbalanced data sets. In our case study, interpretable hypotheses perform better than other hypotheses. DTs have a better alpha-error, whereas DNFs outperform DTs with respect to the beta-error. We compared both hypothesis classes and exemplary hypotheses. Both are interpretable in different ways leaving the choice to preference. This leads to the conclusion that interpretable threshold based methods are appropriate for classification problems in finance. In this domain, they are not inferior to more sophisticated methods like SVMs. (C) 2015 Elsevier Ltd. All rights reserved
Parity graph-driven read-once branching programs and an exponential lower bound for integer multiplication
AbstractBranching programs are a well-established computation model for Boolean functions, especially read-once branching programs have been studied intensively. Exponential lower bounds for read-once branching programs are known for a long time. On the other hand, the problem of proving superpolynomial lower bounds for parity read-once branching programs is still open. In this paper restricted parity read-once branching programs are considered and an exponential lower bound on the size of the so-called well-structured parity graph-driven read-once branching programs for integer multiplication is proven. This is the first strongly exponential lower bound on the size of a parity nonoblivious read-once branching program model for an explicitly defined Boolean function. In addition, more insight into the structure of integer multiplication is yielded
Interpretable Multiclass Models for Corporate Credit Rating Capable of Expressing Doubt
Corporate credit rating is a process to classify commercial enterprises based on their creditworthiness. Machine learning algorithms can construct classification models, but in general they do not tend to be 100% accurate. Since they can be used as decision support for experts, interpretable models are desirable. Unfortunately, interpretable models are provided by only few machine learners. Furthermore, credit rating often is a multiclass problem with more than two rating classes. Due to this fact, multiclass classification is often achieved via meta-algorithms using multiple binary learners. However, most state-of-the-art meta-algorithms destroy the interpretability of binary models. In this study, we present Thresholder, a binary interpretable threshold-based disjunctive normal form (DNF) learning algorithm in addition to modifications of popular multiclass meta-algorithms which maintain the interpretability of our binary classifier. Furthermore, we present an approach to express doubt in the decision of our model. Performance and model size are compared with other interpretable approaches for learning DNFs (RIPPER) and decision trees (C4.5) as well as non-interpretable models like random forests, artificial neural networks, and support vector machines. We evaluate their performances on three real-life data sets divided into three rating classes.In this case study all threshold-based and interpretable models perform equallywell and significantly better than other methods. Our new Thresholder algorithm builds the smallest models while its performance is as good as the best methods of our case study. Furthermore, Thresholder marks many potential misclassifications in advance with a doubt label without increasing the classification error
Scipio: Using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species
Background: For many types of analyses, data about gene structure and locations of non-coding regions of genes are required. Although a vast amount of genomic sequence data is available, precise annotation of genes is lacking behind. Finding the corresponding gene of a given protein sequence by means of conventional tools is error prone, and cannot be completed without manual inspection, which is time consuming and requires considerable experience. Results: Scipio is a tool based on the alignment program BLAT to determine the precise gene structure given a protein sequence and a genome sequence. It identifies intron-exon borders and splice sites and is able to cope with sequencing errors and genes spanning several contigs in genomes that have not yet been assembled to supercontigs or chromosomes. Instead of producing a set of hits with varying confidence, Scipio gives the user a coherent summary of locations on the genome that code for the query protein. The output contains information about discrepancies that may result from sequencing errors. Scipio has also successfully been used to find homologous genes in closely related species. Scipio was tested with 979 protein queries against 16 arthropod genomes ( intra species search). For cross- species annotation, Scipio was used to annotate 40 genes from Homo sapiens in the primates Pongo pygmaeus abelii and Callithrix jacchus. The prediction quality of Scipio was tested in a comparative study against that of BLAT and the well established program Exonerate. Conclusion: Scipio is able to precisely map a protein query onto a genome. Even in cases when there are many sequencing errors, or when incomplete genome assemblies lead to hits that stretch across multiple target sequences, it very often provides the user with the correct determination of intron-exon borders and splice sites, showing an improved prediction accuracy compared to BLAT and Exonerate. Apart from being able to find genes in the genome that encode the query protein, Scipio can also be used to annotate genes in closely related species
Protein function prediction in genomes: Critical assessment of coiled-coil predictions based on protein structure data
Coiled-coil regions were among the first protein motifs described structurally and theoretically. The beauty and simplicity of the motif gives hope to detecting coiled-coil regions with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Base (PDB), down to each amino acid and its secondary structure. Apart from the thirtyfold difference in number of predicted coiled-coils the tools strongly vary in their predictions, across structures and within structures. The evaluation of the false discovery rate and Matthews correlation coefficient, a widely used performance metric for imbalanced data sets, suggests that the tested tools have only limited applicability for large data sets. Coiled-coil predictions strongly impact the functional characterization of proteins, are used for functional genome annotation, and should therefore be supported and validated by additional information
Combining features in a graphical model to predict protein binding sites
Large efforts have been made in classifying residues as binding sites in proteins using machine learning methods. The prediction task can be translated into the computational challenge of assigning each residue the label binding site or non-binding site. Observational data comes from various possibly highly correlated sources. It includes the structure of the protein but not the structure of the complex. The model class of conditional random fields (CRFs) has previously successfully been used for protein binding site prediction. Here, a new CRF-approach is presented that models the dependencies of residues using a general graphical structure defined as a neighborhood graph and thus our model makes fewer independence assumptions on the labels than sequential labeling approaches. A novel node feature change in free energy is introduced into the model, which is then denoted by F-CRF. Parameters are trained with an online large-margin algorithm. Using the standard feature class relative accessible surface area alone, the general graph-structure CRF already achieves higher prediction accuracy than the linear chain CRF of Li et al. F-CRF performs significantly better on a large range of false positive rates than the support-vector-machine-based program PresCont of Zellner et al. on a homodimer set containing 128 chains. F-CRF has a broader scope than PresCont since it is not constrained to protein subgroups and requires no multiple sequence alignment. The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training method. Proteins 2015; 83:844-852. (c) 2015 Wiley Periodicals, Inc
On approximation by circle plus-OBDDs
We show that for every V-OBDD of quasipolynomial size there is a circle plus-OBDD of quasipolynornial size computing the same function up to a small fraction of the inputs. We also prove that the final step in the approximation cannot be improved and discuss possibilities of proving non-approximability results and connections to lower bounds on matrix rigidity. (c) 2006 Elsevier B.V. All rights reserved
Gene prediction with a hidden Markov model and a new intron submodel
The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences with an unknown number of genes in them. Here existing programs tend to predict many false exons. We have developed a new program, AUGUSTUS, for the ab initio prediction of protein coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model, a new model for a short region directly upstream of the donor splice site model that takes the reading frame into account and apply a method that allows better GC-content dependent parameter estimation. AUGUSTUS predicts on longer sequences far more human and drosophila genes accurately than the ab initio gene prediction programs we compared it with, while at the same time being more specific. A web interface for AUGUSTUS and the executable program are located a
A Generalized Model of PAC Learning and its Applicability
We combine a new data model, where the random classification is subjected to rather weak
restrictions which in turn are based on the Mammen−Tsybakov [E. Mammen and A.B. Tsybakov,
Ann. Statis. 27 (1999) 1808–1829; A.B. Tsybakov,
Ann. Statis. 32 (2004) 135–166.] small margin conditions,
and the statistical query (SQ) model due to Kearns [M.J. Kearns, J. ACM
45 (1998) 983–1006] to what we refer to as PAC + SQ model. We generalize the class
conditional constant noise (CCCN) model introduced by Decatur [S.E. Decatur, in
ICML ’97: Proc. of the Fourteenth Int. Conf. on Machine Learn. Morgan
Kaufmann Publishers Inc. San Francisco, CA, USA (1997) 83–91] to the noise model
orthogonal to a set of query functions. We show that every polynomial time PAC + SQ learning algorithm can be
efficiently simulated provided that the random noise rate is orthogonal to the query
functions used by the algorithm given the target concept. Furthermore, we extend the
constant-partition classification noise (CPCN) model due to Decatur [S.E. Decatur, in
ICML ’97: Proc. of the Fourteenth Int. Conf. on Machine Learn. Morgan
Kaufmann Publishers Inc. San Francisco, CA, USA (1997) 83–91] to what we call the
constant-partition piecewise orthogonal (CPPO) noise model. We show how statistical
queries can be simulated in the CPPO scenario, given the partition is known to the
learner. We show how to practically use PAC +
SQ simulators in the noise model orthogonal to the query space by
presenting two examples from bioinformatics and software engineering. This way, we
demonstrate that our new noise model is realistic
Calculation and optimization of thresholds for sets of software metrics
In this article, we present a novel algorithmic method for the calculation of thresholds for a metric set. To this aim, machine learning and data mining techniques are utilized. We define a data-driven methodology that can be used for efficiency optimization of existing metric sets, for the simplification of complex classification models, and for the calculation of thresholds for a metric set in an environment where no metric set yet exists. The methodology is independent of the metric set and therefore also independent of any language, paradigm or abstraction level. In four case studies performed on large-scale open-source software metric sets for C functions, C+ +, C# methods and Java classes are optimized and the methodology is validated
- …
