1,721,234 research outputs found
Ego-network segmentation via (weighted) Jaccard median
An ego-network is a graph representing the interactions of a node (ego) with its neighbors and the interactions among those neighbors. A sequence of ego-networks having the same ego can thus model the evolution of these interactions over time. We introduce the problem of segmenting a sequence of ego-networks into segments, for any given integer . Each segment is represented by a summary network, and the goal is to minimize the total loss of representing segments by summaries. The problem allows partitioning the sequence into homogeneous segments with respect to the activities or properties of the ego (e.g., to identify time periods when a user acquired different circles of friends in a social network) and to compactly represent each segment with a summary. The main challenge is to construct a summary that represents a collection of ego-networks with minimum loss. To address this challenge, we employ Jaccard Median (JM), a well-known NP-hard problem for summarizing sets, for which, however, no effective and efficient algorithms are known. We develop a series of algorithms for JM offering different effectiveness/efficiency trade-offs: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact and approximation polynomial-time algorithms for minimizing an upper bound of the objective function of JM; and (III) efficient heuristics for JM, which are based on an effective scoring scheme and one of them also on sketching. We also study a generalization of the segmentation problem, in which there may be multiple edges between a pair of nodes in an ego-network. To tackle this problem, we develop a series of algorithms, based on a more general problem than JM, called Weighted Jaccard Median WJM: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact algorithms for minimizing an upper bound of the objective function of WJM; and (III) efficient heuristics, based on the percentiles of edge multiplicities and one of them also on divide-and-conquer. By building upon the above results, we design algorithms for segmenting a sequence of ego-networks. Experiments with 10 real datasets and with synthetic datasets show that our algorithms produce optimal or near-optimal solutions to JM or to WJM, and that they substantially outperform state-of-the-art methods which can be employed for ego-network segmentation
Alignment-free sequence comparison using absent words
Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word x of length n, computes the largest integer for which all factors of x of that length occur in some minimal absent word of x in time and space O(n). Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight
Erratum to: Circular sequence comparison: algorithms and applications
[This corrects the article DOI: 10.1186/s13015-016-0076-6.]
On the separation of topology-free rank inequalities for the max stable set problem
In the context of finding the largest stable set of a graph, rank inequalities prescribe that a stable set can contain, from any induced subgraph of the original graph, at most as many vertices as the stability number of the former. Although these inequalities subsume many of the valid inequalities known for the problem, their exact separation has only been investigated in few special cases obtained by restricting the induced subgraph to a specific topology. In this work, we propose a different approach in which, rather than imposing topological
restrictions on the induced subgraph, we assume the right-hand side of the inequality to be fixed to a given (but arbitrary) constant. We then study the arising separation problem, which corresponds to the problem of finding a maximum weight subgraph with a bounded stability number. After proving its hardness and giving some insights on its polyhedral structure, we propose an exact branch-and-cut method for its solution. Computational results show that the separation of topology-free rank inequalities with a fixed right-hand side yields a substantial improvement over the bound provided by the fractional clique polytope (which is obtained with rank inequalities where the induced subgraph is restricted to a clique), often better than that obtained with Lovász’s Theta function via semidefinite programming
On breaking truss-based and core-based communities
We introduce the general problem of identifying a smallest edge subset of a given graph whose deletion makes the graph community-free. We consider this problem under two community notions that have attracted significant attention: k-truss and k-core. We also introduce a problem variant where the identified subset contains edges incident to a given set of nodes and ensures that these nodes are not contained in any community: k-truss or k-core, in our case. These problems are directly applicable in social networks: The identified edges can be hidden by users or sanitized from the output graph; or in communication networks: the identified edges correspond to vital network connections. We present a series of theoretical and practical results. On the theoretical side, we show through non-trivial reductions that the problems we introduce are NP-hard and, in fact, hard to approximate. For the k-truss-based problems, we also show exact exponential-time algorithms, as well as a non-trivial lower bound on the size of an optimal solution. On the practical side, we develop a series of heuristics that are sped up by efficient data structures that we propose for updating the truss or core decomposition under edge deletions. In addition, we develop an algorithm to compute the lower bound. Extensive experiments on 11 real-world and synthetic graphs show that our heuristics are effective, outperforming natural baselines, and also efficient (up to two orders of magnitude faster than a natural baseline), thanks to our data structures. Furthermore, we present a case study on a co-authorship network and experiments showing that the removal of edges identified by our heuristics does not substantially affect the clustering structure of the input graph
Pangenome comparison via ED strings
Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings. Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets. Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.</p
Bilevel programming approaches to the computation of optimistic and pessimistic single-leader-multi-follower equilibria
We study the problem of computing an equilibrium in leader-follower games with a single leader
and multiple followers where, after the leader’s commitment to a mixed strategy, the followers
play simultaneously in a noncooperative way, reaching a Nash equilibrium. We tackle the problem
from a bilevel programming perspective. Since, given the leader’s strategy, the followers’ subgame
may admit multiple Nash equilibria, we consider the cases where the followers play either the best
(optimistic) or the worst (pessimistic) Nash equilibrium in terms of the leader’s utility. For the
optimistic case, we propose three formulations which cast the problem into a single level mixed-
integer nonconvex program. For the pessimistic case, which, as we show, may admit a supremum
but not a maximum, we develop an ad hoc branch-and-bound algorithm. Computational results
are reported and illustrated
Approximate Pattern Matching on Elastic-Degenerate Text
A pan-genome is a group of closely-related genomes that are meant to be analyzed jointly or to be used as a reference. Expecting an increasing rate of genome production, due to the advent of rapid and cheap ‘next-generation’ sequencing technologies, pan-genomes should ideally take over the role of linear reference genomes in comparative genomics: a fundamental challenge in then to find efficient methods that can map reads of a newly sequenced genome to some representation of a pan-genome. A convenient way to represent a multiple alignment of several closely-related genomes is through an elastic-degenerate string, that is a sequence of n sets of strings of total length N: the issue is then to find all matches of a pattern of length m in a text of this kind. There exists an O(nm² + N)-time algorithm to solve the exact version of this problem (i.e. the pattern is required to be identical to some substring of the text) on-line. The first goal of this work is to generalize the previous problem, switching from exact to inexact matching, under both the Hamming and the edit distance models: the switch is essential to deal with real data, that are always affected by a certain error rate. Two efficient on-line algorithms are presented: the first one runs in O(kmG + kN)-time and O(m)-space, where G is the total number of strings in the elastic-degenerate text and k is the maximum Hamming distance allowed; the other deals with the edit distance and requires time O(k²mG + kN) and space O(m). A final section is devoted to study a method for comparing two pan-genomes. It is first demonstrated that it is NP-hard to compare two elastic-degenerate strings, thus degenerate (not elastic) strings are introduced in order to make the problem treatable. A linear-time algorithm for comparing two degenerate strings is then presented and used as a subroutine for detecting palindromes in such a structure, which is a highly relevant problem in molecular biology
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size of the Lempel-Ziv parse or the
number of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size of a smallest string attractor. Let be a string of
length . A string attractor of is a set of positions of capturing
the occurrences of all the substrings of . Unfortunately, Kempa and Prezza
[STOC 2018] showed that computing is NP-hard. Kociumaka et al. [LATIN
2020] considered a new measure of compressibility that is based on the function
counting the number of distinct substrings of length of , also
known as the substring complexity of . This new measure is defined as
and lower bounds all the relevant ad hoc
measures previously considered. In particular, always holds
and can be computed in time using working
space. Kociumaka et al. showed that one can construct an -sized representation of supporting efficient direct
access and efficient pattern matching queries on . Given that for highly
compressible strings, is significantly smaller than , it is natural
to pose the following question: Can we compute efficiently using
sublinear working space?
We address this algorithmic challenge by showing the following bounds to
compute : time using
space, for any , in the comparison model; or
time using space, for any
, in the word RAM model.Comment: Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv
requirement
- …
