Search CORE

1,721,188 research outputs found

Ego-network segmentation via (weighted) Jaccard median

Author: Zhong H. (Haodi)
Conte Alessio
Pissis Solon P.
Conte A. (Alessio)
Loukides Grigorios; id_orcid
Loukides G. (Grigorios)
Zhong Haodi
Loukides Grigorios
Pissis S. (Solon)
Pissis Solon P.; id_orcid
Publication venue
Publication date: 01/01/2024
Field of study

An ego-network is a graph representing the interactions of a node (ego) with its neighbors and the interactions among those neighbors. A sequence of ego-networks having the same ego can thus model the evolution of these interactions over time. We introduce the problem of segmenting a sequence of ego-networks into

k

segments, for any given integer

k

. Each segment is represented by a summary network, and the goal is to minimize the total loss of representing

k

segments by

k

summaries. The problem allows partitioning the sequence into homogeneous segments with respect to the activities or properties of the ego (e.g., to identify time periods when a user acquired different circles of friends in a social network) and to compactly represent each segment with a summary. The main challenge is to construct a summary that represents a collection of ego-networks with minimum loss. To address this challenge, we employ Jaccard Median (JM), a well-known NP-hard problem for summarizing sets, for which, however, no effective and efficient algorithms are known. We develop a series of algorithms for JM offering different effectiveness/efficiency trade-offs: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact and approximation polynomial-time algorithms for minimizing an upper bound of the objective function of JM; and (III) efficient heuristics for JM, which are based on an effective scoring scheme and one of them also on sketching. We also study a generalization of the segmentation problem, in which there may be multiple edges between a pair of nodes in an ego-network. To tackle this problem, we develop a series of algorithms, based on a more general problem than JM, called Weighted Jaccard Median WJM: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact algorithms for minimizing an upper bound of the objective function of WJM; and (III) efficient heuristics, based on the percentiles of edge multiplicities and one of them also on divide-and-conquer. By building upon the above results, we design algorithms for segmenting a sequence of ego-networks. Experiments with 10 real datasets and with synthetic datasets show that our algorithms produce optimal or near-optimal solutions to JM or to WJM, and that they substantially outperform state-of-the-art methods which can be employed for ego-network segmentation

VU Research Portal

CWI's Institutional Repository

Archivio della Ricerca - Università di Pisa

King's Research Portal

Erratum to: Circular sequence comparison: algorithms and applications

Author: Retha Ahmad
Roberto Grossi
Pissis Solon P.
Nadia Pisanti
Iliopoulos Costas S.
Vayani Fatima
GROSSI ROBERTO
Mercas Robert
PISANTI NADIA
Robert Mercas
Pissis Solon P
Iliopoulos Costas S
Roberto Grossi
Ahmad Retha
Fatima Vayani
Solon P. Pissis
Costas S. Iliopoulos
Publication venue
Publication date: 01/01/2016
Field of study

[This corrects the article DOI: 10.1186/s13015-016-0076-6.]

Crossref

Springer - Publisher Connector

Archivio della Ricerca - Università di Pisa

King's Research Portal

Alignment-free sequence comparison using absent words

Author: Pissis Solon P.
Solon P. Pissis (7168868)
Robert Mercas (2835212)
Charalampopoulos Panagiotis
Charalampopoulos Panagiotis; id_orcid
Pissis Solon P.; id_orcid
Crochemore Maxime
Fici Gabriele
Panagiotis Charalampopoulos (7169039)
Maxime Crochemore (3476969)
Gabriele Fici (7169042)
Mercaş Robert
Publication venue
Publication date: 01/01/2018
Field of study

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word x of length n, computes the largest integer for which all factors of x of that length occur in some minimal absent word of x in time and space O(n). Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight

Loughborough University Institutional Repository

King's Research Portal

Archivio istituzionale della ricerca - Università di Palermo

On the separation of topology-free rank inequalities for the max stable set problem

Author: Coniglio Stefano
Publication venue
Publication date: 01/01/2017
Field of study

In the context of finding the largest stable set of a graph, rank inequalities prescribe that a stable set can contain, from any induced subgraph of the original graph, at most as many vertices as the stability number of the former. Although these inequalities subsume many of the valid inequalities known for the problem, their exact separation has only been investigated in few special cases obtained by restricting the induced subgraph to a specific topology. In this work, we propose a different approach in which, rather than imposing topological restrictions on the induced subgraph, we assume the right-hand side of the inequality to be fixed to a given (but arbitrary) constant. We then study the arising separation problem, which corresponds to the problem of finding a maximum weight subgraph with a bounded stability number. After proving its hardness and giving some insights on its polyhedral structure, we propose an exact branch-and-cut method for its solution. Computational results show that the separation of topology-free rank inequalities with a fixed right-hand side yields a substantial improvement over the bound provided by the fractional clique polytope (which is obtained with rank inequalities where the induced subgraph is restricted to a clique), often better than that obtained with Lovász’s Theta function via semidefinite programming

Aisberg (Università degli Studi di Bergamo)

On breaking truss-based and core-based communities

We introduce the general problem of identifying a smallest edge subset of a given graph whose deletion makes the graph community-free. We consider this problem under two community notions that have attracted significant attention: k-truss and k-core. We also introduce a problem variant where the identified subset contains edges incident to a given set of nodes and ensures that these nodes are not contained in any community: k-truss or k-core, in our case. These problems are directly applicable in social networks: The identified edges can be hidden by users or sanitized from the output graph; or in communication networks: the identified edges correspond to vital network connections. We present a series of theoretical and practical results. On the theoretical side, we show through non-trivial reductions that the problems we introduce are NP-hard and, in fact, hard to approximate. For the k-truss-based problems, we also show exact exponential-time algorithms, as well as a non-trivial lower bound on the size of an optimal solution. On the practical side, we develop a series of heuristics that are sped up by efficient data structures that we propose for updating the truss or core decomposition under edge deletions. In addition, we develop an algorithm to compute the lower bound. Extensive experiments on 11 real-world and synthetic graphs show that our heuristics are effective, outperforming natural baselines, and also efficient (up to two orders of magnitude faster than a natural baseline), thanks to our data structures. Furthermore, we present a case study on a co-authorship network and experiments showing that the removal of edges identified by our heuristics does not substantially affect the clustering structure of the input graph

VU Research Portal

University of Birmingham Research Portal

CWI's Institutional Repository

Archivio della Ricerca - Università di Pisa

King's Research Portal

Pangenome comparison via ED strings

Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings. Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets. Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.</p

VU Research Portal

CWI's Institutional Repository

Directory of Open Access Journals

Archivio della Ricerca - Università di Pisa

Bilevel programming approaches to the computation of optimistic and pessimistic single-leader-multi-follower equilibria

Author: Gatti Nicola
Coniglio Stefano
Publication venue
Publication date: 01/01/2017
Field of study

We study the problem of computing an equilibrium in leader-follower games with a single leader and multiple followers where, after the leader’s commitment to a mixed strategy, the followers play simultaneously in a noncooperative way, reaching a Nash equilibrium. We tackle the problem from a bilevel programming perspective. Since, given the leader’s strategy, the followers’ subgame may admit multiple Nash equilibria, we consider the cases where the followers play either the best (optimistic) or the worst (pessimistic) Nash equilibrium in terms of the leader’s utility. For the optimistic case, we propose three formulations which cast the problem into a single level mixed- integer nonconvex program. For the pessimistic case, which, as we show, may admit a supremum but not a maximum, we develop an ad hoc branch-and-bound algorithm. Computational results are reported and illustrated

Aisberg (Università degli Studi di Bergamo)

Substring Complexity in Sublinear Space

Author: Pissis Solon P.
Gabriele Fici
Paweł Gawrychowski
Bernardini Giulia
Pissis S. (Solon)
Bernardini G. (Giulia)
Gawrychowski P. (Paweł)
Fici Gabriele
Giulia Bernardini
Solon P. Pissis
Gawrychowski Paweł
Fici G. (Gabriele)
Pissis Solon P.; id_orcid
Publication venue
Publication date: 01/01/2023
Field of study

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size

z

of the Lempel-Ziv parse or the number

r

of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size

\gamma

of a smallest string attractor. Let

T

be a string of length

n

. A string attractor of

T

is a set of positions of

T

capturing the occurrences of all the substrings of

T

. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing

\gamma

is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function

S_T(k)

counting the number of distinct substrings of length

k

T

, also known as the substring complexity of

T

. This new measure is defined as

\delta= \sup\{S_T(k)/k, k\geq 1\}

and lower bounds all the relevant ad hoc measures previously considered. In particular,

\delta\leq \gamma

always holds and

\delta

can be computed in

\mathcal{O}(n)

time using

\Theta(n)

working space. Kociumaka et al. showed that one can construct an

\mathcal{O}(\delta \log \frac{n}{\delta})

-sized representation of

T

supporting efficient direct access and efficient pattern matching queries on

T

. Given that for highly compressible strings,

\delta

is significantly smaller than

n

, it is natural to pose the following question: Can we compute

\delta

efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute

\delta

\mathcal{O}(\frac{n^3\log b}{b^2})

time using

\mathcal{O}(b)

space, for any

b\in[1,n]

, in the comparison model; or

\tilde{\mathcal{O}}(n^2/b)

time using

\tilde{\mathcal{O}}(b)

space, for any

b\in[\sqrt{n},n]

, in the word RAM model.Comment: Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv requirement

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

AIR Universita degli studi di Milano

CWI's Institutional Repository

DROPS Dagstuhl Research Online Publication Server

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Author: Pissis Solon P.
Retha Ahmad
Pissis Solon P.; id_orcid
Publication venue
Publication date: 01/01/2018
Field of study

An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2+N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N * ceil[m/w]) with pre-processing time and space O(m * ceil[m/w]), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N * ceil[M/w]) with pre-processing time and space O(M * ceil[M/w]), which is prohibitive in practice. We present a new on-line O(N * ceil[M/w])-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population's variants are considered

DROPS Dagstuhl Research Online Publication Server

King's Research Portal

A*PA2: Up to 19× Faster Exact Global Alignment

Author: Groot Koerkamp Ragnar
Koerkamp Ragnar Groot
Publication venue
Publication date: 01/01/2024
Field of study

ISSN:1868-896

ETHzürich Repository for Publications and Research Data

DROPS Dagstuhl Research Online Publication Server