
    Testing the consistency of multimachine databases for physical studies of regression

    The investigation of various aspects of tokamak physics is performed with a combination of experiments carried out in different machines, to improve the statistical basis of the results and to cover a sufficiently wide region of the operational space. Therefore, in recent decades, various multimachine databases have been built to address general and specific physical questions, particularly related to the extrapolation of present results to the next generation of devices. In this paper, a methodology of analysis is presented to assess whether a multimachine data set is sufficiently coherent to substantiate the conclusions that are expected to be derived from it. A series of statistical and information theoretical tools have been refined to address the consistency of the data provided by the different devices. The developed techniques allow determination of whether it is reasonable to expect that the physics is the same in the various devices and/or that the entries do not present unacceptable bias. To exemplify the potential of the proposed approach, a systematic analysis of the ITPA database of the confinement time has been performed, using both dimensional and dimensionless quantities. The results obtained strongly suggest that better care should be taken in ensuring the coherence of data obtained from different experiments on different devices.

    A Model Falsification Approach to Learning in Non-Stationary Environments for Experimental Design

    The application of data driven machine learning and advanced statistical tools to complex physics experiments, such as Magnetic Confinement Nuclear Fusion, can be problematic, due to the varying conditions of the systems to be studied. In particular, new experiments have to be planned in unexplored regions of the operational space. As a consequence, care must be taken because the input quantities used to train and test the performance of the analysis tools are not necessarily sampled from the same probability distribution as in the final applications. The regressors and dependent variables cannot therefore be assumed to satisfy the i.i.d. (independent and identically distributed) hypothesis, and learning has to take place under non-stationary conditions. In the present paper, a new data driven methodology is proposed to guide planning of experiments, to explore the operational space and to optimise performance. The approach is based on the falsification of existing models. Symbolic Regression via Genetic Programming is applied to the available data to identify a set of candidate models, using the method of the Pareto Frontier. The confidence intervals for the predictions of such models are then used to find the best region of the parameter space for their falsification, where the next set of experiments can be most profitably carried out. Extensive numerical tests and applications to the scaling laws in Tokamaks prove the viability of the proposed methodology.
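    The Pareto Frontier step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each candidate model is summarised by a complexity score and a fitting error, both to be minimised; the function name `pareto_front`, the tuple layout, and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (name, complexity, error): hypothetical summary of one candidate model.
Model = Tuple[str, float, float]


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates.

    A model is dominated when another model is at least as good on both
    criteria (complexity and error) and strictly better on at least one.
    """
    front = []
    for name, cplx, err in models:
        dominated = any(
            (c2 <= cplx and e2 <= err) and (c2 < cplx or e2 < err)
            for _, c2, e2 in models
        )
        if not dominated:
            front.append((name, cplx, err))
    return front


# Illustrative values only: model "c" is more complex and less accurate
# than model "b", so it does not belong to the frontier.
candidates = [("a", 1, 0.9), ("b", 2, 0.5), ("c", 3, 0.6), ("d", 4, 0.2)]
front_names = [name for name, _, _ in pareto_front(candidates)]
```

    The quadratic scan is adequate for the small candidate sets that symbolic regression typically returns; sorting by complexity first would be the usual refinement for larger sets.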

    Adaptive learning for disruption prediction in non-stationary conditions

    For many years, machine learning tools have proved to be very powerful disruption predictors in tokamaks. On the other hand, the vast majority of the techniques deployed assume that the input data is independent and is sampled from exactly the same probability distribution for the training set, the test set and the final real time deployment. This hypothesis certainly does not hold in practice, since the experimental programmes evolve quite rapidly, typically resulting in ageing of the predictors and consequent suboptimal performance. This paper describes various adaptive training strategies that have been tested to maintain the performance of disruption predictors in non-stationary conditions. The proposed approaches have been implemented using new ensembles of classifiers, explicitly developed for the present application. The improvements in performance are unquestionable and, given the difficulties encountered so far in translating predictors from one device to another, the proposed adaptive methods, which start from scratch, can therefore be considered a useful option in the arsenal of alternatives envisaged for the next generation of devices, particularly at the very beginning of their operation.

    A systemic approach to classification for knowledge discovery with applications to the identification of boundary equations in complex systems

    Classification, which means discrimination between examples belonging to different classes, is a fundamental aspect of most scientific and engineering activities. Machine Learning (ML) tools have proved to perform very well in this task, in the sense that they can achieve very high success rates. However, both "realism" and interpretability of their models are low, leading to modest increases of knowledge and limited applicability, particularly in applications related to nonlinear and complex systems. In this paper, a methodology is described which, by applying ML tools directly to the data, allows formulating new scientific models that describe the actual "physics" determining the boundary between the classes. The proposed technique consists of a stack of different ML tools, each one applied to a specific subtask of the scientific analysis; all together they form a system, which combines all the major strands of machine learning, from rule based classifiers and Bayesian statistics to genetic programming and symbolic manipulation. To take into account the error bars of the measurements generating the data, an essential aspect of scientific inference, the novel concept of the Geodesic Distance on Gaussian manifolds is adopted. The properties of the methodology have been investigated with a series of systematic numerical tests for different types of classification problems. The potential of the approach to handle real data has been tested with various experimental databases, built using measurements collected in the investigations of complex systems. The obtained results indicate that the proposed method makes it possible to find physically meaningful mathematical equations, which reflect the actual phenomena under study. The developed techniques therefore constitute a very useful information processing system to bridge the gap between data, machine learning models and scientific theories.

    Stacking of predictors for the automatic classification of disruption types to optimize the control logic

    Nowadays, disruption predictors based on machine learning techniques can perform well, but they typically do not provide any information about the type of disruption and cannot predict the time remaining before the current quench. On the other hand, the automatic identification of the disruption type is a crucial aspect required to optimize the remedial actions and a prerequisite to forecasting the time left for intervening. In this work, a stack of machine learning tools is applied to the task of automatic classification of the disruption types. The strategy is implemented from scratch and is completely adaptive; the predictors start operating after the first disruption and update their own models, following the evolution of the experimental program, without any human intervention. Moreover, they are designed to implement a form of transfer learning, in the sense that they autonomously identify the most important disruption classes, generating new ones when necessary. The results obtained are very encouraging in terms of both prediction performance and classification accuracy. On the other hand, regarding the narrowing of the warning times, some progress has been achieved, but new techniques will have to be devised to obtain fully satisfactory properties.

    Geodesic Distance on Gaussian Manifolds to Reduce the Statistical Errors in the Investigation of Complex Systems

    In recent years the reputation of medical, economic, and scientific expertise has been strongly damaged by a series of false predictions and contradictory studies. The lax application of statistical principles has certainly contributed to the uncertainty and loss of confidence in the sciences. Various assumptions, generally held as valid in statistical treatments, have proved their limits. In particular, it has been clear for some time that even slight departures from normality and homoscedasticity can significantly affect classic significance tests. Robust statistical methods have been developed, which can provide much more reliable estimates. On the other hand, they do not address an additional problem typical of the natural sciences, whose data are often the output of delicate measurements. The data can therefore not only be sampled from a non-normal pdf but also be affected by significant levels of additive Gaussian noise of various amplitudes. To tackle this additional source of uncertainty, this paper shows how already developed robust statistical tools can be usefully complemented with the Geodesic Distance on Gaussian Manifolds. This metric is conceptually more appropriate and practically more effective in handling noise of Gaussian distribution than the traditional Euclidean distance. The results of a series of systematic numerical tests show the advantages of the proposed approach in all the main aspects of statistical inference, from measures of location and scale to size effects and hypothesis testing. Particularly relevant is the reduction of as much as 35% in Type II errors, proving the important improvement in power obtained by applying the methods proposed in the paper. It is worth emphasizing that the proposed approach provides a general framework, in which noise of different statistical distributions can also be dealt with.
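    The abstract above does not state the formula for the distance, so the following is only a sketch under an assumption: that the metric is the standard Fisher-Rao geodesic distance between two univariate Gaussians, which has a well-known closed form. Each measurement is represented by its value `mu` and its error bar `sigma`; the function name is hypothetical.

```python
import math


def gaussian_geodesic_distance(mu1: float, sigma1: float,
                               mu2: float, sigma2: float) -> float:
    """Fisher-Rao geodesic distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Assumption: this closed form for univariate Gaussians is the metric
    meant by "Geodesic Distance on Gaussian Manifolds".
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))


# Same gap between the central values, different error bars: the pair of
# precise measurements ends up farther apart than the pair of noisy ones.
d_precise = gaussian_geodesic_distance(0.0, 0.1, 1.0, 0.1)
d_noisy = gaussian_geodesic_distance(0.0, 1.0, 1.0, 1.0)
```

    Unlike the Euclidean distance, which would treat both pairs identically, this metric weights the separation of the central values by the measurement uncertainty.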

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type, which counts only the first author of each cited work, and a non-traditional type, which takes into account the first five authors of each cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
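    The two counting schemes compared above can be sketched as follows. This is an illustration rather than the study's procedure: the data layout (each citing paper as a list of cited works, each cited work as an ordered author list), the function name, and the rule that a pair of authors is counted at most once per citing paper are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

CitedWork = List[str]            # ordered author list of one cited work
CitingPaper = List[CitedWork]    # reference list of one citing paper


def cocitation_counts(papers: List[CitingPaper],
                      max_authors: int = 1) -> Dict[Tuple[str, str], int]:
    """Count author pairs that appear together in a reference list.

    max_authors=1 mimics first-author counting; max_authors=5 mimics the
    first-five-authors variant. Assumption: each pair is counted once per
    citing paper, and co-authors of the same cited work count as co-cited.
    """
    counts: Counter = Counter()
    for refs in papers:
        authors = {a for work in refs for a in work[:max_authors]}
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return dict(counts)


# Hypothetical toy data: two citing papers, two cited works each.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["C", "D"]],
]
first_author_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

    On this toy input the first-author scheme sees only the pair (A, C), while the inclusive scheme also links the secondary authors B and D to the others.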

    Quantifying total influence between variables with information theoretic and machine learning techniques

    The increasingly sophisticated investigations of complex systems require more robust estimates of the correlations between the measured quantities. The traditional Pearson correlation coefficient is easy to calculate but sensitive only to linear correlations. The total influence between quantities is, therefore, often expressed in terms of the mutual information, which also takes into account the nonlinear effects but is not normalized. To compare data from different experiments, the information quality ratio is, therefore, in many cases easier to interpret. On the other hand, both mutual information and information quality ratio are never negative and, therefore, cannot provide information about the sign of the influence between quantities. Moreover, they require an accurate determination of the probability distribution functions of the variables involved. As the quality and amount of data available are not always sufficient to grant an accurate estimation of the probability distribution functions, it has been investigated whether neural computational tools can help and complement the aforementioned indicators. Specific encoders and autoencoders have been developed for the task of determining the total correlation between quantities related by a functional dependence, including information about the sign of their mutual influence. Both their accuracy and computational efficiency have been addressed in detail, with extensive numerical tests using synthetic data. A careful analysis of the robustness against noise has also been performed. The neural computational tools typically outperform the traditional indicators in practically every respect.
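    The contrast between the three traditional indicators named above can be made concrete with a small sketch. It assumes simple equal-width histogram estimates of the probability distributions and takes the information quality ratio to be the mutual information divided by the joint entropy; the function names and the bin count are hypothetical choices, not the paper's.

```python
import math
from collections import Counter
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (sensitive to linear dependence only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def _bin(values: Sequence[float], bins: int) -> List[int]:
    """Assign each value to one of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]


def _entropy(labels: Sequence) -> float:
    """Shannon entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, bins: int = 8) -> float:
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    bx, by = _bin(x, bins), _bin(y, bins)
    return _entropy(bx) + _entropy(by) - _entropy(list(zip(bx, by)))


def information_quality_ratio(x, y, bins: int = 8) -> float:
    """I(X;Y) / H(X,Y): mutual information normalised by the joint entropy."""
    bx, by = _bin(x, bins), _bin(y, bins)
    h_joint = _entropy(list(zip(bx, by)))
    if h_joint == 0.0:
        return 0.0
    return (_entropy(bx) + _entropy(by) - h_joint) / h_joint


# A purely nonlinear, symmetric dependence: y = x^2 on [-1, 1].
xs = [i / 100 for i in range(-100, 101)]
ys = [v * v for v in xs]
r = pearson(xs, ys)
mi = mutual_information(xs, ys)
iqr = information_quality_ratio(xs, ys)
```

    For the quadratic example the Pearson coefficient is close to zero even though the mutual information is clearly positive. Neither information-theoretic quantity carries a sign, which is the limitation the abstract attributes to them.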