1,721,058 research outputs found
Standards opportunities around data-bearing Web pages
The evolving Web has seen ever-growing use of structured data, thanks to the way it enhances information authoring, querying, visualization and sharing. To date, however, most structured data authoring and management tools have been oriented towards programmers and Web developers. End users have been left behind, unable to leverage structured data for information management and communication as well as professionals. In this paper, I will argue that many of the benefits of structured data management can be provided to end users as well. I will describe an approach and tools that allow end users to define their own schemas (without knowing what a schema is), manage data and author (not program) interactive Web visualizations of that data using the Web tools with which they are already familiar, such as plain Web pages, blogs, wikis and WYSIWYG document editors. I will describe our experience deploying these tools and some lessons relevant to their future evolution
A near-linear time algorithm for constructing a cactus representation of minimum cuts
We present an Õ(m) (near-linear) time Monte Carlo algorithm for constructing the cactus data structure, a useful representation of all the global minimum edge cuts of an undirected graph. Our algorithm represents a fundamental improvement over the best previous (quadratic time) algorithms: because there can be quadratically many min-cuts, our algorithm must avoid looking at all min-cuts during the construction, but nonetheless builds a data structure representing them all. Our result closes the gap between the (near-linear) time required to find a single min-cut and that for (implicitly) finding all the min-cuts
A Nearly Optimal Oracle for Avoiding Failed Vertices and Edges
We present an improved oracle for the distance sensitivity problem. The goal is to preprocess a directed graph G = (V,E) with non-negative edge weights to answer queries of the form: what is the length of the shortest path from x to y that does not go through some failed vertex or edge f. The previous best algorithm produces an oracle of size ~O(n[superscript 2]) that has an O(1) query time, and an ~O(nn[superscript 2]√m) construction time. It was a randomized Monte Carlo algorithm that worked with high probability. Our oracle also has a constant query time and an ~O(n[superscript 2]) space requirement, but it has an improved construction time of ~O(mn), and it is deterministic. Note that O(1) query, O(n[superscript 2]) space, and O(mn) construction time is also the best known bound (up to logarithmic factors) for the simpler problem of finding all pairs shortest paths in a weighted, directed graph. Thus, barring improved solutions to the all pairs shortest path problem, our oracle is optimal up to logarithmic factors
Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems
Crowdsourcing systems, in which numerous tasks are electronically distributed to numerous “information pieceworkers,” have emerged as an effective paradigm for human-powered solving of large-scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all such systems must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in an appropriate manner, e.g., majority voting.
In this paper, we consider a general model of such crowdsourcing tasks and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm, inspired by belief propagation and low-rank matrix approximation, significantly outperforms majority voting and, in fact, is optimal through comparison to an oracle that knows the reliability of every worker. Further, we compare our approach with a more general class of algorithms that can dynamically assign tasks. By adaptively deciding which questions to ask to the next set of arriving workers, one might hope to reduce uncertainty more efficiently. We show that, perhaps surprisingly, the minimum price necessary to achieve a target reliability scales in the same manner under both adaptive and nonadaptive scenarios. Hence, our nonadaptive approach is order optimal under both scenarios. This strongly relies on the fact that workers are fleeting and cannot be exploited. Therefore, architecturally, our results suggest that building a reliable worker-reputation system is essential to fully harnessing the potential of adaptive designs.National Science Foundation (U.S.) (Grant 1117381)National Science Foundation (U.S.) (EMT project)United States. Air Force Office of Scientific Research (Complex Networks project)United States. Army Research Office (Multidisciplinary University Research Initiative Award 58153-MA-MUR
Randomized Approximation Schemes for Cuts and Flows in Capacitated Graphs
We describe random sampling techniques for approximately solving problems that involve cuts and flows in graphs. We give a near-linear-time randomized combinatorial construction that transforms any graph on n vertices into an O(n log n)-edge graph on the same vertices whose cuts have approximately the same value as the original graph's. In this new graph, for example, we can run the [~ over O](m[superscript 3/2])-time maximum flow algorithm of Goldberg and Rao to find an s-t minimum cut in [~ over O](m[superscript 3/2]) time. This corresponds to a (1 + ε)-times minimum s-t cut in the original graph. A related approach leads to a randomized divide-and-conquer algorithm producing an approximately maximum flow in [~ over O](m√n) time. Our algorithm can also be used to improve the running time of sparsest cut approximation algorithms from [~ over O](mn) to [~ over O](n[superscript 2]) and to accelerate several other recent cut and flow algorithms. Our algorithms are based on a general theorem analyzing the concentration of random graphs' cut values near their expectations. Our work draws only on elementary probability and graph theory.National Science Foundation (U.S.
Fast Augmenting Paths by Random Sampling from Residual Graphs
Consider an n-vertex, m-edge, undirected graph with integral capacities and max-flow value v. We give a new [~ over O](m + nv)-time maximum flow algorithm. After assigning certain special sampling probabilities to edges in [~ over O](m)$ time, our algorithm is very simple: repeatedly find an augmenting path in a random sample of edges from the residual graph. Breaking from past work, we demonstrate that we can benefit by random sampling from directed (residual) graphs. We also slightly improve an algorithm for approximating flows of arbitrary value, finding a flow of value (1 - ε) times the maximum in [~ over O](m√n/ε) time.National Science Foundation (U.S.
Analysis of student engagement in an online annotation system in the context of a flipped introductory physics class
We discuss student participation in an online social annotation forum over two semesters of a flipped, introductory physics course at Harvard University. We find that students who engage in high-level discussion online, especially by providing answers to their peers' questions, make more gains in conceptual understanding than students who do not. This is true regardless of students' physics background. We find that we can steer online interaction towards more productive and engaging discussion by seeding the discussion and managing the size of the sections. Seeded sections produce higher quality annotations and a greater proportion of generative threads than unseeded sections. Larger sections produce longer threads; however, beyond a certain section size, the quality of the discussion decreases
Random contractions and sampling for hypergraph and hedge connectivity
We initiate the study of hedge connectivity of undirected graphs, motivated by dependent edge failures in real-world networks. In this model, edges are partitioned into groups called hedges that fail together. The hedge connectivity of a graph is the minimum number of hedges whose removal disconnects the graph. We give a polynomial-time approximation scheme and a quasi-polynomial exact algorithm for hedge connectivity. This provides strong evidence that the hedge connectivity problem is tractable, which contrasts with prior work that established the intractability of the corresponding s−t min-cut problem. Our techniques also yield new combinatorial and algorithmic results in hypergraph connectivity. Next, we study the behavior of hedge graphs under uniform random sampling of hedges. We show that unlike graphs, all cuts in the sample do not converge to their expected value in hedge graphs. Nevertheless, the min-cut of the sample does indeed concentrate around the expected value of the original min-cut. This leads to a sharp threshold on hedge survival probabilities for graph disconnection. To the best of our knowledge, this is the first network reliability analysis under dependent edge failures
Efficient crowdsourcing for multi-class labeling
Crowdsourcing systems like Amazon's Mechanical Turk have emerged as an effective large-scale human-powered platform for performing tasks in domains such as image classification, data entry, recommendation, and proofreading. Since workers are low-paid (a few cents per task) and tasks performed are monotonous, the answers obtained are noisy and hence unreliable. To obtain reliable estimates, it is essential to utilize appropriate inference algorithms (e.g. Majority voting) coupled with structured redundancy through task assignment. Our goal is to obtain the best possible trade-off between reliability and redundancy. In this paper, we consider a general probabilistic model for noisy observations for crowd-sourcing systems and pose the problem of minimizing the total price (i.e. redundancy) that must be paid to achieve a target overall reliability. Concretely, we show that it is possible to obtain an answer to each task correctly with probability 1-ε as long as the redundancy per task is O((K/q) log (K/ε)), where each task can have any of the distinct answers equally likely, q is the crowd-quality parameter that is defined through a probabilistic model. Further, effectively this is the best possible redundancy-accuracy trade-off any system design can achieve. Such a single-parameter crisp characterization of the (order-)optimal trade-off between redundancy and reliability has various useful operational consequences. Further, we analyze the robustness of our approach in the presence of adversarial workers and provide a bound on their influence on the redundancy-accuracy trade-off.
Unlike recent prior work [GKM11, KOS11, KOS11], our result applies to non-binary (i.e. K>2) tasks. In effect, we utilize algorithms for binary tasks (with inhomogeneous error model unlike that in [GKM11, KOS11, KOS11]) as key subroutine to obtain answers for K-ary tasks. Technically, the algorithm is based on low-rank approximation of weighted adjacency matrix for a random regular bipartite graph, weighted according to the answers provided by the workers.National Science Foundation (U.S.
Expressive Query Construction through Direct Manipulation of Nested Relational Results
Despite extensive research on visual query systems, the standard way to interact with relational databases remains to be through SQL queries and tailored form interfaces. We consider three requirements to be essential to a successful alternative: (1) query specification through direct manipulation of results, (2) the ability to view and modify any part of the current query without departing from the direct manipulation interface, and (3) SQL-like expressiveness. This paper presents the first visual query system to meet all three requirements in a single design. By directly manipulating nested relational results, and using spreadsheet idioms such as formulas and filters, the user can express a relationally complete set of query operators plus calculation, aggregation, outer joins, sorting, and nesting, while always remaining able to track and modify the state of the complete query. Our prototype gives the user an experience of responsive, incremental query building while pushing all actual query processing to the database layer. We evaluate our system with formative and controlled user studies on 28 spreadsheet users; the controlled study shows our system significantly outperforming Microsoft Access on the System Usability Scale
- …
