Topic modelling and text classification models for applications within EFSA
This report presents an overview of topic modelling and classification models in relation to four case studies in the EFSA project OC/EFSA/AMU/2020/02. As adequate document embeddings have a positive influence on the effectiveness of topic modelling as well as text classification, a wide range of options for word and document embeddings is discussed. It was found that a multitude of increasingly complex embeddings are readily available for off-the-shelf use. However, as they are trained on large but mostly general text corpora, their utility for domain-specific text varies. Fine-tuning or creating document embeddings from scratch is only feasible in the presence of enough data and has an associated computational cost. For some domains (like scientific articles), pretrained embeddings are available. For topic modelling, we discuss standard techniques like non-negative matrix factorization and latent Dirichlet allocation as well as more recent methods based on clustering of document embeddings like Top2Vec and BERTopic. For text classification, we consider hierarchical text classification approaches combined with established techniques for text classification via document embeddings. We propose a selection of techniques for each of the case studies, justifying their choice, and present a plan for evaluation. Finally, we discuss our findings after having implemented and validated the selected techniques.
Discovering structure in semi-structured data
Excerpt of introduction: Unfortunately, in spite of the above mentioned advantages, the presence of a schema is not mandatory and many XML documents are not accompanied by one. For instance, in a recent study, Barbosa et al. have shown that approximately half of the XML documents available on the web do not refer to a schema. In another study, we have noted that about two-thirds of XSDs gathered from schema repositories and from the web are not valid with respect to the W3C XML Schema specification, rendering them essentially useless for immediate application (see Chapter 6). A similar observation was made by Sahuguet concerning DTDs. Based on the lack of schemas in practice, it is essential to devise algorithms that can infer a schema for a given collection of XML documents when none, or no syntactically correct one, is present. This is also acknowledged by Florescu who emphasizes that in the context of data integration: “We need to extract good-quality schemas automatically from existing data and perform incremental maintenance of the generated schemas.” It should be noted that even when a schema is already available, there are situations where inference can be useful. One such situation is schema cleaning: sometimes a schema is too general with respect to the XML data that it is supposed to describe. In that case, it can be advantageous to infer a new schema based solely on the data at hand.... In general, schema inference can be used to restrict schemas to a relevant subset of data needed by the application at hand, thereby facilitating difficult tasks like schema matching and data integration. Indeed, as argued by Hinkelman [Hin05], industry-level standards are too loosely defined in general, which can result in XML schemas where many business structures are formally specified as being optional.... Based on the above observations, it is hence essential to devise algorithms that can automatically infer a DTD or XSD from a given corpus of XML documents...
Trustworthy Artificial Intelligence Methods for Image Analysis and Benchmarking of Neural Network Interpretability
The focus of this thesis is on Artificial Intelligence (AI) and on how AI can be presented
to people in a way that is more explainable, intuitive and trustworthy.
The field of Artificial intelligence is vast and encompasses many subdomains concerned
with how machines and software think and act and how to build intelligent
entities [3]. Broadly speaking, AI deals with a machine’s ability to learn and acquire
knowledge, reason, solve problems and apply and adapt what is learned to new situations.
Several definitions of AI have been used. Some try to define AI in terms of
how humanlike it is vs how rational it is on the one hand, and on the other in how
it acts vs how it thinks. For example, for a machine to pass the Turing test, it needs
to convincingly act humanly. The cognitive modelling approach tries to get machines
to think humanly and deals more with cognitive science and psychology. Another
approach is to make the machine think rationally, by enforcing logic and inference
rules. This is different still from a machine that acts rationally, where the machine
needs to do the right or optimal thing. This latter approach is the most common one
as it is more easily captured in mathematical formulation, for example optimizing
a utility or loss function. Over the years, the field of AI has continued to
grow, and numerous subfields now fall under its umbrella. Throughout the
history of modern AI many different methods have been used and proposed, from
artificial neurons [4], Hebbian learning [5], and reasoning as search and heuristics [6],
and later expanded to include logic programming such as Prolog [7], genetic programs
[8], and expert systems (of which Dendral [9] is often considered the first one). In
the 1980s, hidden Markov models [10] became more popular, and Bayesian networks
[11, 12] followed suit, leading to machine learning, Big Data, deep learning, and now
large language models [3]. Presently, AI permeates our daily lives in various forms,
from video games and selfie filters to personalised video recommendations and medical
devices, even extending to autonomous vehicles. Large language models, such as
the ones used in chatbots and smart assistants, are used in content generation for
news articles, product descriptions, and academic theses. Such models are currently
garnering significant attention and have become the next major breakthrough in AI.
In this thesis, the focus will mainly be on Deep Neural Networks (DNN) and
their applications. The smallest building block of a neural network is the model of
a neuron, which was first introduced by McCulloch and Pitts [4] and was inspired
by the function of biological neurons. Consider a very high-level abstraction of a
neuron: through dendrites and receptors, the neuron receives stimuli, and when the
stimuli reach a threshold, the neuron fires an electrical signal through the axon.
The mathematical equivalent of such a neuron is a non-linear element with inputs
$x_i$ multiplied by weights $w_i$. After adding a bias term $b$, the weighted sum is passed through a
non-linear function $f$, also called the activation function. Originally, a neuron was a
binary classifier, and the non-linearity used was the Heaviside step function or sign
function. However, a variety of other functions have been used, such as the sigmoid
and tanh. In current neural networks, the rectified linear unit (ReLU) is the most
widely used. A single neuron is only able to learn linearly separable concepts. By
combining several neurons in a layer and stacking several layers so that the units in
one layer are fully connected to all the units in the previous layer (so that the
output $h_l$ of layer $l$ is $h_l = f_l(W_l h_{l-1})$), we can build a multilayer perceptron (MLP)
that can distinguish non-linearly separable data. This type of feed-forward network is
a basic neural network, and can already achieve remarkable results. Hornik et al.
[13] showed that an MLP with as few as one hidden layer is a universal approximator,
meaning it can approximate any measurable function to any degree of accuracy,
given enough hidden units. The networks are trained to optimize a loss function that
quantifies the performance and serves as a proxy for the real objective. The gradients
of this loss function w.r.t. the model’s weights can be efficiently calculated using the
back-propagation algorithm [14]. Gradients are propagated layer by layer using the
chain rule. The weights are then updated based on these gradients using gradient
descent or variants thereof. Neural networks are discussed further in the
next section, section 1.1.
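The neuron model, the layer-wise forward pass and the gradient descent update described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the thesis: the function names, the hand-picked XOR weights, the cross-entropy loss, the learning rate and the number of epochs are all assumptions chosen for the example. The layer here also includes the bias term from the neuron description, which the compact layer formula above leaves out.

```python
import math

def heaviside(z):
    return 1.0 if z >= 0 else 0.0

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f):
    """One artificial neuron: f(sum_i w_i * x_i + b)."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def layer(h_prev, W, b, f):
    """Fully connected layer: every unit sees all outputs of the previous layer."""
    return [neuron(h_prev, w_row, b_i, f) for w_row, b_i in zip(W, b)]

def xor_mlp(x):
    """One-hidden-layer MLP for XOR with hand-picked (not learned) weights."""
    hidden = layer(x, W=[[1.0, 1.0], [1.0, 1.0]], b=[0.0, -1.0], f=relu)
    return layer(hidden, W=[[1.0, -2.0]], b=[0.0], f=lambda z: z)[0]

inputs = ([0, 0], [0, 1], [1, 0], [1, 1])

# AND is linearly separable, so a single step-function neuron can represent it.
and_outputs = [neuron(x, w=[1.0, 1.0], b=-1.5, f=heaviside) for x in inputs]  # [0, 0, 0, 1]

# XOR is not linearly separable; one hidden layer is enough to represent it.
xor_outputs = [xor_mlp(x) for x in inputs]  # [0, 1, 1, 0]

def train_neuron(data, lr=0.5, epochs=2000):
    """Full-batch gradient descent on one sigmoid neuron (assumed cross-entropy loss)."""
    w, b, losses = [0.0, 0.0], 0.0, []
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0, 0.0], 0.0, 0.0
        for x, t in data:
            y = neuron(x, w, b, sigmoid)
            loss += -(t * math.log(y) + (1 - t) * math.log(1 - y))
            delta = y - t  # chain rule: dLoss/dz for sigmoid with cross-entropy
            grad_w = [g + delta * xi for g, xi in zip(grad_w, x)]
            grad_b += delta
        # Step against the mean gradient.
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
        losses.append(loss / len(data))
    return w, b, losses

and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b, losses = train_neuron(and_data)
```

The training loop covers only the single-layer case, where the chain rule has one step; in a deeper network the same `delta` would be propagated backwards through each layer in turn.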
The term deep in Deep Learning refers to the utilisation of a larger number of
consecutive layers in these networks. By adding more layers on top of each other,
each layer is able to learn increasingly complex and meaningful features, enabling the
model to more easily grasp complex interactions in the data and simplify the modelling
of complex functions. In this way, Deep Learning does a form of automated feature
engineering by learning relevant features directly from the raw data, potentially capturing
more intricate patterns and relationships that may not be easily identifiable
or feasible with handcrafted features, especially for computer vision. However, fully
connected layers have the disadvantage that they contain a huge number of learnable
weights, making these models computationally expensive and prone to overfitting.
Therefore, in computer vision applications, convolutional layers are used. The neurons
in these layers are locally connected to a window of the input as opposed to a
fully connected layer. The window will then slide over the whole input to process it.
This reduces the number of weights in the layer and introduces a useful inductive bias:
pixels that are spatially close are processed together, leveraging the spatial structure
of images. Alongside convolutional layers, convolutional neural networks (CNNs)
use pooling layers to reduce the spatial resolution of the features, which further reduces
the number of weights needed. Still, these models can contain millions of parameters,
making it an impossible task to comprehend the function of every single one. They
are considered black-box models, as we often do not know which features exactly have
been learned or how the decisions are being made.
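The sliding window and the pooling step can be illustrated with a small plain-Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in the thesis: a single channel, stride 1, no padding, non-overlapping max pooling, and a hand-picked toy image and kernel. As in most deep learning libraries, the "convolution" is computed as a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

# Toy 5x5 image: dark left half, bright right half.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to a dark-to-bright vertical edge

features = conv2d(image, edge_kernel)  # 4x4 map, every row is [0, 2, 0, 0]
pooled = max_pool(features)            # [[2, 0], [2, 0]]

# Weight count for mapping the 5x5 input to the 4x4 output:
conv_weights = len(edge_kernel) * len(edge_kernel[0])  # 4, shared across all positions
dense_weights = (5 * 5) * (4 * 4)                      # 400 for a fully connected layer
```

The same four kernel weights are reused at every position, which is where the reduction from 400 to 4 weights comes from in this toy case.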
CNNs had been successfully applied in real-world applications before, but it took
until 2011 for them to really take off, as computational power became more
readily available due to efficient GPU implementations. In 2011, the model by Cireșan
et al. [15] started winning image competitions, and in 2012, the AlexNet architecture
[16] won the ImageNet Large Scale Visual Recognition Challenge. CNNs have since
delivered state-of-the-art performance on many computer vision tasks. Additional
background is given in section 1.2.
While AI applications have become ubiquitous over the past few years and will
undoubtedly become an even more prevalent part of our lives in the years to come, the
black-box nature of AI models causes friction in AI uptake, especially where transparency,
accountability and interpretability are critical. One example is healthcare,
where life-and-death decisions are being made. Detecting a disease at an
early phase is critical to prevent disease progression and massively improve patient
outcomes. A wrong diagnosis can lead to harmful or fatal consequences. In autonomous
vehicles, a wrong detection or a missed traffic sign can cause a fatal crash.
In finance, explanations are needed to assess risks, to facilitate decision making and
for regulatory compliance. Moreover, a “right to explanation” is mandated
by the GDPR [17, Articles 13-15, 22]. No matter the field, no model is perfect, and
mistakes will happen. Without a reasonable explanation of the decisions made, it is
difficult for people to trust the AI and justify its use.
Explanations are not just necessary to justify the decisions and predictions being
made. They are also essential to debug the model in several ways. There are various
ways in which biases can end up in the model, such as biased or skewed training
data or algorithmic bias, and these need to be rooted out. By having the model explain
predictions, it becomes possible to debug the biases and take action. Explanations
help investigate the errors made by the model, allowing developers to understand the
underlying causes or to detect known failure modes. They also make it easier to monitor
the performance over time, as a perfectly working system may start to misbehave
due to distribution shifts. Furthermore, they allow us to uncover new relations
in the data previously unknown to domain experts, allowing them to formulate new
hypotheses and create new knowledge. For these reasons, Explainable AI (XAI) is a
rapidly evolving field, with new papers published at a fast pace. In section 1.3, we
will expand more on several commonly used XAI methods. Part II discusses feature
attribution methods, a group of XAI methods specifically used for image classification
models. These chapters cover a basic introduction of simple and common XAI methods,
and the specific feature attribution methods used in the papers presented. We
focus on methods that are local and model-centric, i.e. they explain a specific sample
for a specific model. This is of course only a subset of possible XAI methods. A taxonomy
of the different methods can be found in [18]. For an overview of the state-of-the-art
methods, we refer to Minh et al. [19] and Linardatos et al. [20].
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
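The difference between the two counting schemes can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function name, the toy data and the author labels are hypothetical, and the sketch assumes that each unordered author pair is counted once per citing paper and that co-authors of the same cited work also count as co-cited. The study may define these details differently.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors):
    """Count author pairs cited together, using the first `max_authors` authors per cited work."""
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper (assumption).
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: each citing paper is a list of cited works,
# and each cited work is an ordered author list.
citing_papers = [
    [["A", "B"], ["C"]],
    [["A", "D"], ["C", "B"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)  # only the pair (A, C)
first_five = cocitation_counts(citing_papers, max_authors=5)    # also reaches B and D
```

In this toy case, first-author counting sees only the pair (A, C), while counting the first five authors also produces pairs involving B and D, who never appear as first authors.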
