1,721,188 research outputs found

    Ego-network segmentation via (weighted) Jaccard median

    Full text link
    An ego-network is a graph representing the interactions of a node (ego) with its neighbors and the interactions among those neighbors. A sequence of ego-networks having the same ego can thus model the evolution of these interactions over time. We introduce the problem of segmenting a sequence of ego-networks into kk segments, for any given integer kk. Each segment is represented by a summary network, and the goal is to minimize the total loss of representing kk segments by kk summaries. The problem allows partitioning the sequence into homogeneous segments with respect to the activities or properties of the ego (e.g., to identify time periods when a user acquired different circles of friends in a social network) and to compactly represent each segment with a summary. The main challenge is to construct a summary that represents a collection of ego-networks with minimum loss. To address this challenge, we employ Jaccard Median (JM), a well-known NP-hard problem for summarizing sets, for which, however, no effective and efficient algorithms are known. We develop a series of algorithms for JM offering different effectiveness/efficiency trade-offs: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact and approximation polynomial-time algorithms for minimizing an upper bound of the objective function of JM; and (III) efficient heuristics for JM, which are based on an effective scoring scheme and one of them also on sketching. We also study a generalization of the segmentation problem, in which there may be multiple edges between a pair of nodes in an ego-network. To tackle this problem, we develop a series of algorithms, based on a more general problem than JM, called Weighted Jaccard Median WJM: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact algorithms for minimizing an upper bound of the objective function of WJM; and (III) efficient heuristics, based on the percentiles of edge multiplicities and one of them also on divide-and-conquer. By building upon the above results, we design algorithms for segmenting a sequence of ego-networks. Experiments with 10 real datasets and with synthetic datasets show that our algorithms produce optimal or near-optimal solutions to JM or to WJM, and that they substantially outperform state-of-the-art methods which can be employed for ego-network segmentation

    Alignment-free sequence comparison using absent words

    Full text link
    Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word x of length n, computes the largest integer for which all factors of x of that length occur in some minimal absent word of x in time and space O(n). Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight

    On the separation of topology-free rank inequalities for the max stable set problem

    Full text link
    In the context of finding the largest stable set of a graph, rank inequalities prescribe that a stable set can contain, from any induced subgraph of the original graph, at most as many vertices as the stability number of the former. Although these inequalities subsume many of the valid inequalities known for the problem, their exact separation has only been investigated in few special cases obtained by restricting the induced subgraph to a specific topology. In this work, we propose a different approach in which, rather than imposing topological restrictions on the induced subgraph, we assume the right-hand side of the inequality to be fixed to a given (but arbitrary) constant. We then study the arising separation problem, which corresponds to the problem of finding a maximum weight subgraph with a bounded stability number. After proving its hardness and giving some insights on its polyhedral structure, we propose an exact branch-and-cut method for its solution. Computational results show that the separation of topology-free rank inequalities with a fixed right-hand side yields a substantial improvement over the bound provided by the fractional clique polytope (which is obtained with rank inequalities where the induced subgraph is restricted to a clique), often better than that obtained with Lovász’s Theta function via semidefinite programming

    On breaking truss-based and core-based communities

    Full text link
    We introduce the general problem of identifying a smallest edge subset of a given graph whose deletion makes the graph community-free. We consider this problem under two community notions that have attracted significant attention: k-truss and k-core. We also introduce a problem variant where the identified subset contains edges incident to a given set of nodes and ensures that these nodes are not contained in any community: k-truss or k-core, in our case. These problems are directly applicable in social networks: The identified edges can be hidden by users or sanitized from the output graph; or in communication networks: the identified edges correspond to vital network connections. We present a series of theoretical and practical results. On the theoretical side, we show through non-trivial reductions that the problems we introduce are NP-hard and, in fact, hard to approximate. For the k-truss-based problems, we also show exact exponential-time algorithms, as well as a non-trivial lower bound on the size of an optimal solution. On the practical side, we develop a series of heuristics that are sped up by efficient data structures that we propose for updating the truss or core decomposition under edge deletions. In addition, we develop an algorithm to compute the lower bound. Extensive experiments on 11 real-world and synthetic graphs show that our heuristics are effective, outperforming natural baselines, and also efficient (up to two orders of magnitude faster than a natural baseline), thanks to our data structures. Furthermore, we present a case study on a co-authorship network and experiments showing that the removal of edges identified by our heuristics does not substantially affect the clustering structure of the input graph

    Pangenome comparison via ED strings

    Full text link
    Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings. Results: We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets. Discussion: As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.</p

    Bilevel programming approaches to the computation of optimistic and pessimistic single-leader-multi-follower equilibria

    Full text link
    We study the problem of computing an equilibrium in leader-follower games with a single leader and multiple followers where, after the leader’s commitment to a mixed strategy, the followers play simultaneously in a noncooperative way, reaching a Nash equilibrium. We tackle the problem from a bilevel programming perspective. Since, given the leader’s strategy, the followers’ subgame may admit multiple Nash equilibria, we consider the cases where the followers play either the best (optimistic) or the worst (pessimistic) Nash equilibrium in terms of the leader’s utility. For the optimistic case, we propose three formulations which cast the problem into a single level mixed- integer nonconvex program. For the pessimistic case, which, as we show, may admit a supremum but not a maximum, we develop an ad hoc branch-and-bound algorithm. Computational results are reported and illustrated

    Substring Complexity in Sublinear Space

    Full text link
    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size zz of the Lempel-Ziv parse or the number rr of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ\gamma of a smallest string attractor. Let TT be a string of length nn. A string attractor of TT is a set of positions of TT capturing the occurrences of all the substrings of TT. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ\gamma is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function ST(k)S_T(k) counting the number of distinct substrings of length kk of TT, also known as the substring complexity of TT. This new measure is defined as δ=sup{ST(k)/k,k1}\delta= \sup\{S_T(k)/k, k\geq 1\} and lower bounds all the relevant ad hoc measures previously considered. In particular, δγ\delta\leq \gamma always holds and δ\delta can be computed in O(n)\mathcal{O}(n) time using Θ(n)\Theta(n) working space. Kociumaka et al. showed that one can construct an O(δlognδ)\mathcal{O}(\delta \log \frac{n}{\delta})-sized representation of TT supporting efficient direct access and efficient pattern matching queries on TT. Given that for highly compressible strings, δ\delta is significantly smaller than nn, it is natural to pose the following question: Can we compute δ\delta efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute δ\delta: O(n3logbb2)\mathcal{O}(\frac{n^3\log b}{b^2}) time using O(b)\mathcal{O}(b) space, for any b[1,n]b\in[1,n], in the comparison model; or O~(n2/b)\tilde{\mathcal{O}}(n^2/b) time using O~(b)\tilde{\mathcal{O}}(b) space, for any b[n,n]b\in[\sqrt{n},n], in the word RAM model.Comment: Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv requirement

    Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

    Full text link
    An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2+N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N * ceil[m/w]) with pre-processing time and space O(m * ceil[m/w]), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N * ceil[M/w]) with pre-processing time and space O(M * ceil[M/w]), which is prohibitive in practice. We present a new on-line O(N * ceil[M/w])-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population's variants are considered
    corecore