1,721,232 research outputs found

    Embedding genetic improvement into programming languages

    No full text
    We present a vision of genetic improvement firmly embedded in, and supported by, programming languages. Genetic improvement has already been envisioned as the next compiler, which would take human wrifien programs as input and return versions optimised with respect to various objectives. As an intermediate stage, or perhaps to complement the fully automated vision, we imagine genetic improvement processes that are hinted at and directed by humans but understood and undertaken by programming languages and their runtimes, via interactions through the source code. We examine existing similar ideas and examine the benefits of embedding them within programming languages

    Empirical Evaluation of Fault Localisation Using Code and Change Metrics

    No full text
    Fault localisation aims to reduce the debugging efforts of human developers by highlighting the program elements that are suspected to be the root cause of the observed failure. Spectrum Based Fault Localisation (SBFL), a coverage based approach, has been widely studied in many researches as a promising localisation technique. Recently, however, it has been proven that SBFL techniques have reached the limit of further improvement. To overcome the limitation, we extend SBFL with code and change metrics that have been mainly studied in defect prediction, such as size, age, and churn. FLUCCS, our fault learn-to-rank localisation technique, employs both existing SBFL formulas and these metrics as input. We investigate the effect of employing code and change metrics for fault localisation using four different learn-to-rank techniques: Genetic Programming, Gaussian Process Modelling, Support Vector Machine, and Random Forest. We evaluate the performance of FLUCCS with 386 real world faults collected from Defects4J repository. The results show that FLUCCS with code and change metrics places 144 faults at the top and 304 faults within the top ten. This is a significant improvement over the state-of-art SBFL formulas, which can locate 65 and 212 faults at the top and within the top ten, respectively. We also investigate the feasibility of cross-project transfer learning of fault localisation. The results show that, while there exist project-specific properties that can be exploited for better localisation per project, ranking models learnt from one project can be applied to others without significant loss of effectiveness.

    FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases

    No full text
    The performance of many Fault Localisation (FL) techniques directly depends on the quality of the used test suites. Consequently, it is extremely useful to be able to precisely measure how much diagnostic power each test case can introduce when added to a test suite used for FL. Such a measure can help us not only to prioritise and select test cases to be used for FL, but also to effectively augment test suites that are too weak to be used with FL techniques. We propose FDG, a new measure of Fault Diagnosability Gain for individual test cases. The design of FDG is based on our analysis of existing metrics that are designed to prioritise test cases for better FL. Unlike other metrics, FDG exploits the ongoing FL results to emphasise the parts of the program for which more information is needed. Our evaluation of FDG with Defects4J shows that it can successfully help the augmentation of test suites for better FL. When given only a few failing test cases (2.3 test cases on average), FDG can effectively augment the given test suite by prioritising the test cases generated automatically by EvoSuite: the augmentation can improve the acc@1 and acc@10 of the FL results by 11.6x and 2.2x on average, after requiring only ten human judgements on the correctness of the assertions EvoSuite generates.Comment: 13 pages, 6 figures (to be published in ISSTA'22

    Evaluating Surprise Adequacy for Deep Learning System Testing

    No full text
    The rapid adoption of Deep Learning (DL) systems in safety critical domains such as medical imaging and autonomous driving urgently calls for ways to test their correctness and robustness. Borrowing from the concept of test adequacy in traditional software testing, existing work on testing of DL systems initially investigated DL systems from structural point of view, leading to a number of coverage metrics. Our lack of understanding of the internal mechanism of Deep Neural Networks (DNNs), however, means that coverage metrics defined on the Boolean dichotomy of coverage are hard to intuitively interpret and understand. We propose the degree of out-of-distribution-ness of a given input as its adequacy for testing: the more surprising a given input is to the DNN under test, the more likely the system will show unexpected behavior for the input. We develop the concept of surprise into a test adequacy criterion, called Surprise Adequacy (SA). Intuitively, SA measures the difference in the behavior of the DNN for the given input and its behavior for the training data. We posit that a good test input should be sufficiently, but not overtly, surprising compared to the training dataset. This article evaluates SA using a range of DL systems from simple image classifiers to autonomous driving car platforms, as well as both small and large data benchmarks ranging from MNIST to ImageNet. The results show that the SA value of an input can be a reliable predictor of the correctness of the mode behavior. We also show that SA can be used to detect adversarial examples, and also be efficiently computed against large training dataset such as ImageNet using sampling

    Towards Autonomous Testing Agents via Conversational Large Language Models

    No full text
    Software testing is an important part of the development cycle, yet it requires specialized expertise and substantial developer effort to adequately test software. Recent discoveries of the capabilities of large language models (LLMs) suggest that they can be used as automated testing assistants, and thus provide helpful information and even drive the testing process. To highlight the potential of this technology, we present a taxonomy of LLM-based testing agents based on their level of autonomy, and describe how a greater level of autonomy can benefit developers in practice. An example use of LLMs as a testing assistant is provided to demonstrate how a conversational framework for testing can help developers. This also highlights how the often criticized 'hallucination' of LLMs can be beneficial for testing. We identify other tangible benefits that LLM-driven testing agents can bestow, and also discuss potential limitations

    The Inversive Relationship Between Bugs and Patches: An Empirical Study

    No full text
    Software bugs 1 pose an ever-present concern for developers, and patching such bugs requires a considerable amount of costs through complex operations. In contrast, introducing bugs can be an effortless job, in that even a simple mutation can easily break the Program Under Test (PUT). Existing research has considered these two opposed activities largely separately, either trying to automatically generate realistic patches to help developers, or to find realistic bugs to simulate and prevent future defects. Despite the fundamental differences between them, however, we hypothesise that they do not syntactically differ from each other when considered simply as code changes. To examine this assumption systematically, we investigate the relationship between patches and buggy commits, both generated manually and automatically, using a clustering and pattern analysis. A large scale empirical evaluation reveals that up to 70% of patches and faults can be clustered together based on the similarity between their lexical patterns; further, 44% of the code changes can be abstracted into the identical change patterns. Moreover, we investigate whether code mutation tools can be used as Automated Program Repair (APR) tools, and APR tools as code mutation tools. In both cases, the inverted use of mutation and APR tools can perform surprisingly well, or even better, when compared to their original, intended uses. For example, 89% of patches found by SequenceR, a deep learning based APR tool, can also be found by its inversion, i.e., a model trained with faults and not patches. Similarly, real fault coupling study of mutants reveals that TBar, a template based APR tool, can generate 14% and 3% more fault couplings than traditional mutation tools, PIT and Major respectively, when used as a mutation tool. Our findings suggest that the valid scope of mining code changes for either mutation or APR can be wider than previously thought

    Arachne: Search Based Repair of Deep Neural Networks

    Full text link
    The rapid and widespread adoption of Deep Neural Networks (DNNs) has called for ways to test their behaviour, and many testing approaches have successfully revealed misbehaviour of DNNs. However, it is relatively unclear what one can do to correct such behaviour after revelation, as retraining involves costly data collection and does not guarantee to fix the underlying issue. This paper introduces Arachne, a novel program repair technique for DNNs, which directly repairs DNNs using their input-output pairs as a specification. Arachne localises neural weights on which it can generate effective patches and uses Differential Evolution to optimise the localised weights and correct the misbehaviour. An empirical study using different benchmarks shows that Arachne can fix specific misclassifications of a DNN without reducing general accuracy significantly. On average, patches generated by Arachne generalise to 61.3% of unseen misbehaviour, whereas those by a state-of-the-art DNN repair technique generalise only to 10.2% and sometimes to none while taking tens of times more than Arachne. We also show that Arachne can address fairness issues by debiasing a gender classification model. Finally, we successfully apply Arachne to a text sentiment model to show that it generalises beyond Convolutional Neural Networks

    Why train-and-select when you can use them all?

    No full text
    Learn-to-rank techniques have been successfully applied to fault localisation to produce ranking models that place faulty program elements at or near the top. Genetic Programming has been successfully used as a learning mechanism to produce highly effective ranking models for fault localisation. However, the inherent stochastic nature of GP forces its users to learn multiple ranking models and choose the best performing one for the actual use. This train-and-select approach means that the absolute majority of the computational resources that go into the evolution of ranking models are eventually wasted. We introduce Ensemble Model for Fault Localisation (EMF), which is a learn-to-rank fault localisation technique that utilises all trained models to improve the accuracy of localisation even further. EMF ranks program elements using a lightweight, voting-based ensemble of ranking models. We evaluate EMF using 389 real-world faults in Defects4J benchmark. EMF can place 30.1% more faults at the top when compared to the best performing individual model from the train-and-select approach. We also apply Genetic Algorithm (GA) to construct the best performing ensemble. Compared to naively using all ranking models, GA generated ensembles can localise further 9.2% more faults at the top on average
    corecore