LaHAR: Latent Human Activity Recognition using LDA

Author: Staab Steffen
Boukhers Zeyd
Wete Danniene
Publication date: 09/02/2021

Fractal approach for determining the optimal number of topics in the field of topic modeling

Author: Koltcov Sergej
Staab Steffen
Boukhers Zeyd
Ignatenko Vera
Publication date: 29/03/2019

In this paper we apply multifractal formalism to the analysis of the statistical behaviour of topic models under a varying number of topics. Our analysis reveals the existence of two self-similar regions and one transition region in the density-of-states function, which depends on the number of topics. Since a function that can be expressed through the density-of-states was previously used successfully to determine the optimal number of topics, we test the applicability of the density-of-states function for the same purpose. We provide numerical results for three topic models (PLSA, ARTM, and LDA Gibbs sampling) on two marked-up collections containing texts in two different languages. Our experiments show that the "true" number of topics, as determined by the human mark-up, occurs in the transition region.

Southampton (e-Prints Soton)

An end-to-end approach for extracting and segmenting high-variance references from PDF documents

Author: Boukhers Zeyd
Staab Steffen
Ambhore Shriharsh
Publication date: 01/09/2019

This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its capability to discover highly varying references, mainly in terms of content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline that consists of sequential phases. Rather, it learns the different characteristics of references to be used in a coherent scheme that reduces error accumulation by following a probabilistic approach. Contrary to conventional references, the mention of information sources in some publications, such as those of social science, is not subject to the same specifications, such as being located in a unique reference section. Therefore, the proposed method aims to extract references with highly varying characteristics by relaxing the restrictions of existing methods. Additionally, we present in this paper a new challenging dataset of annotated references in German social science publications. The main purpose of this work is to serve the indexation of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods in terms of both extraction and segmentation is evaluated on different datasets, including the German social science set.

Southampton (e-Prints Soton)

Crossref

Analyzing the influence of hyper-parameters and regularizers of topic modeling in terms of Renyi entropy

Author: Boukhers Zeyd
Staab Steffen
Koltcov Sergei
Ignatenko Vera
Publication date: 30/03/2020

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. We also find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the topic number optimum, an effect that is not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models that need further research.

Multidisciplinary Digital Publishing Institute

Southampton (e-Prints Soton)

Whois? Deep Author Name Disambiguation Using Bibliographic Data

Author: Asundi Nagaraj Bahubali
Boukhers Zeyd
Publication date: 2023

As the number of authors is increasing exponentially over years, the number of authors sharing the same names is increasing proportionally. This makes it challenging to assign newly published papers to their adequate authors. Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries. This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use a collection from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, which is represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.
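The first step named in the abstract, grouping authors who share a last name and a first-name initial, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `blocking_key` and `group_by_key`, the sample names, and the assumption that names are written as "First [Middle] Last" are all hypothetical, and the neural model that identifies the author within each group is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def blocking_key(name: str) -> Tuple[str, str]:
    """Return (last name, first initial), both lower-cased.

    Assumes the name is written as 'First [Middle] Last' (an assumption
    made for this sketch, not a DBLP convention).
    """
    parts = name.strip().lower().split()
    if not parts:
        raise ValueError("empty author name")
    return parts[-1], parts[0][0]


def group_by_key(names: Iterable[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group author names that share the same blocking key."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)


# Hypothetical sample names, used only to show the grouping.
authors = ["Wei Wang", "Wen Wang", "Steffen Staab", "S. Staab", "Zeyd Boukhers"]
groups = group_by_key(authors)
for key, members in groups.items():
    print(key, members)
```

Each resulting group is only a candidate set: names in the same group may still belong to different people, which is the part the abstract assigns to the co-author and title representations.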

Fraunhofer-Publica

Whois? Deep Author Name Disambiguation using Bibliographic Data

Author: Bahubali Nagaraj Asundi
Boukhers Zeyd
Publication date: 24/07/2022

As the number of authors is increasing exponentially over years, the number of authors sharing the same names is increasing proportionally. This makes it challenging to assign newly published papers to their adequate authors. Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries. This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use a collection from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, which is represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset. Comment: Accepted for publication @ TPDL202

arXiv.org e-Print Archive

Deep author name disambiguation using DBLP data

Author: Asundi Nagaraj Bahubali
Boukhers Zeyd
Publication date: 04/05/2023

In the academic world, the number of scientists grows every year and so does the number of authors sharing the same names. Consequently, it is challenging to assign newly published papers to their respective authors. Therefore, author name ambiguity is considered a critical open problem in digital libraries. This paper proposes an author name disambiguation approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use data collected from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.

Crossref

Fraunhofer-Publica

EconStor (ZBW Kiel)

MESD: Metadata Extraction from Scholarly Documents - A Shared Task Overview

Author: Boukhers Zeyd
Yang Cong
Publication date: 2025

This paper presents an overview of the Metadata Extraction from Scholarly Documents (MESD) shared task, which was designed to address the challenge of extracting structured metadata (e.g. Title, Author, Abstract) from scientific publications. The task aimed to promote the development of techniques for making scholarly data more Findable, Accessible, Interoperable, and Reusable (FAIR) by improving metadata extraction from PDF documents. We describe the task design and the creation of two complementary datasets: (1) the S2ORC_Exp500v1 dataset consisting of 500 training samples, 100 validation samples, and 100 test samples with text-based annotations, and (2) the SSOAR Multidisciplinary Vision Dataset (SSOARGMVD) containing more than 8000 documents with bounding box annotations suitable for computer vision approaches. We discuss potential directions for future research in metadata extraction from scholarly documents, highlighting the opportunities presented by these new resources.

Fraunhofer-Publica

Evaluation Scheme of FAIRness in Scholarly Data

Author: Boukhers Zeyd
Publication date: 01/10/2022

My talk @ Open Citation Workshop (OCW) 2022.
Links:
FAIRcookbook: https://faircookbook.elixir-europe.org/content/home.html
FAIR Evaluator: https://gitlab.fit.fraunhofer.de/abu.ibne.bayazid/fairevaluator
FDO Conference: https://fairdo.org

ZENODO

3D trajectory extraction from 2D videos for human activity analysis

Author: Boukhers Zeyd
Publication date: 01/01/2017

CERN Document Server