1,721,092 research outputs found

    Human Activity Analytics Based on Mobility and Social Media Data

    Full text link
    The development of social networks such as Twitter, Facebook and Google+ allow users to share their beliefs, feelings, or observations with their circles of friends. Based on these data, a range of applications and techniques has been developed, targeting to provide a better quality of life to the users. Nevertheless, the quality of results of the geolocationaware applications is signicantly restricted due to the tiny percentage of the social media data that is geotagged ( 2% for Twitter). Hence, increasing this percentage is an important and challenging problem. Moreover, information extracted from social media data can be complemented by the analysis of mobile phone usage data, in order to provide further insights on human activity patterns. In this thesis, we present a novel method for analyzing and geolocalizing non-geotagged Twitter posts. The proposed method is the rst to do so at the ne-grain of city neighborhoods, while being both eective and time ecient. Our method is based on the extraction of representative keywords for each candidate location,as well as the analysis of the tweet volume time series. We also describe a system built on top of our method, which geolocalizes tweets and allows users to visually examine the results and their evolution over time. Our system allows the user to get a better idea of how the activity of a particular location changes, which the most important keywords are, as well as to geolocalize individual tweets of interest. Moreover, we study the activity and mobility characteristics of the users that post geotagged tweets and compared the mobility of users who attended the event with a random set of users. Interestingly, the results of this analysis indicate that a very small number of users (i.e., less than 35 users in this study) is able to represent the mobility patterns present in the entire dataset. Finally, we study the call activity and mobility patterns, clustering the observed behaviors that exhibited similar characteristics, and characterizing the anomalous behaviors. We analyzed a Call Detail Record (CDR) dataset, containing (aggregated) information on the calls among mobile phones. Employing density-based algorithms and statistical analysis, we developed a framework that identies abnormal locations, as well abnormal time intervals. The results of this work can be used for early identication of exceptional situations, monitoring the eects of important events in urban and transportation planning, and others

    Large Scale Aggregated Sentiment Analytics

    No full text
    In the past years we have witnessed Sentiment Analytics becoming increasingly popular topic in Information Retrieval, which has established itself as a promising direction of research. With the rapid growth of the user-generated content represented in blogs, forums, social networks and micro-blogs, it became a useful tool for social studies, market analysis and reputation management, since it made possible capturing sentiments and opinions at a large scale and with the ever-growing precision. Sentiment Analytics came a long way from product review mining to full-fledged multi-dimensional analysis of social sentiment, exposing people attitude towards any topic aggregated along different dimensions, such as time and demographics. The novelty of our work is that it approaches Sentiment Analytics from the perspective of Data Mining, addressing some important problems which fall out of the scope of Opinion Mining. We develop a framework for Large Scale Aggregated Sentiment Analytics, which allows to capture and quantify important changes in aggregated sentiments or in their dynamics, evaluate demographical aspects of these changes, and explain the underlying events and mechanisms which drive them. The first component of our framework is Contradiction Analysis, which studies diverse opinions and their interaction, and allows tracking the quality of aggregated sentiment or detecting interesting sentiment differences. Targeting large scale applications, we develop a sentiment contradiction measure based on the statistical properties of sentiment and allowing efficient computation from aggregated sentiments. Another important component of our framework addresses the problem of monitoring and explaining temporal sentiment variations. Along this direction, we propose novel time series correlation methods tailored specifically for large scale analysis of sentiments aggregated over users demographics. Our methods help to identify interesting correlation patterns between demographic groups and thus better understand the demographical aspect of sentiment dynamics. We bring another interesting dimension to the problem of sentiment evolution by studying the joint dynamics of sentiments and news, uncovering the importance of news events and assessing their impact on sentiments. We propose a novel and universal way of modeling different media and their dynamics, which aims to describe the information propagation in news- and social media. Finally, we propose and evaluate an updateable method of sentiment aggregation and retrieval, which preserves important properties of aggregated sentiments and also supports scalability and performance requirements of our applications

    Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings

    No full text
    Sequential data streams describe a variety of real life processes: from sensor readings of natural phenomena to robotics, moving trajectories and network monitoring scenarios. An item in a sequential data stream often depends on its previous values, subsequent items being strongly correlated. In this thesis we address the problem of extracting the most significant sequential patterns from a data stream, with applications to real-time data summarization and classification and estimating generative models of the data. The first contribution of this thesis is the notion of Conditional Heavy Hitters, which describes the items that are frequent conditionally – that is, within the context of their parent item. Conditional Heavy Hitters are useful in a variety of applications in sensor monitoring, analysis, Markov chain modeling, and more. We develop algorithms for efficient detection of Conditional Heavy Hitters depending on the characteristics of the data, and provide analytical quality guarantees for their performance. We also study the behavior of the proposed algorithms for different types of data and demonstrate the efficacy of our methods by experimental evaluation on several synthetic and real-world datasets. The second contribution of the thesis is the extension of Conditional Heavy Hitters to patterns of variable order, which we formalize in the notion of Variable Order Conditional Heavy Hitters. The significance of the variable order patterns is measured in terms of high conditional and joint probability and their difference from the independent case in terms of statistical significance. The approximate online solution in the variable order case exploits lossless compression approaches. Facing the tradeoff between memory usage and accuracy of the pattern extraction, we introduce several online space pruning strategies and study their quality guarantees. The strategies can be chosen depending on the estimation objectives, such as maximizing the precision or recall of extracted significant patterns. The efficiency of our approach is experimentally evaluated on three real datasets. The last contribution of the thesis is related to the prediction quality of the classical and sequential classification algorithms under varying levels of label noise. We present the "Sigmoid Rule" Framework, which allows choosing the most appropriate learning algorithm depending on the properties of the data. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, hinting at how the inherent properties of a dataset affect learning. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature

    Modeling and Querying Data Series and Data Streams with Uncertainty

    No full text
    Many real applications consume data that is intrinsically uncertain and error-prone. An uncertain data series is a series whose point values are uncertain. An uncertain data stream is a data stream whose tuples are existentially uncertain and/or have an uncertain value. Typical sources of uncertainty in data series and data streams include sensor data, data synopses, privacy-preserving transformations and forecasting models. In this thesis, we focus on the following three problems: (1) the formulation and the evaluation of similarity search queries in uncertain data series; (2) the evaluation of nearest neighbor search queries in uncertain data series; (3) the adaptation of sliding windows in uncertain data stream processing to accommodate existential and value uncertainty. We demonstrate experimentally that the correlation among neighboring time-stamps in data series can be leveraged to increase the accuracy of the results. We further show that the "possible world" semantics can be used as underlying uncertainty model to formulate nearest neighbor queries that can be evaluated efficiently. Finally, we discuss the relation between existential and value uncertainty in data stream applications, and verify experimentally our proposal of uncertain sliding windows

    Indexing for Very Large Data Series Collections

    Full text link
    Data series are a prevalent data type that has attracted lots of interest in recent years. Specifically, there has been an explosive interest towards the analysis of large volumes of data series in many different domains. This is both in businesses (e.g., in mobile applications) and in sciences e.g., in biology). In several time-critical scenarios, analysts need to be able to explore these data as soon as they become available, which is not currently possible for very large data series collections. In this thesis, we present the first adaptive indexing mechanism, specifically tailored to solve the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The contents and the resolution of the index are purely driven by query patterns; the more queries that arrive, the more data series are indexed and at a higher resolution. Adaptive indexing significantly outperforms previous solutions, gracefully handling large data series collections, reducing the data to query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), our method has already answered 3 * 10^5 queries. At the same time, we present novel algorithms for both full indexing of data series collections, as well as for efficient exact query answering. Our algorithms perform efficient skip-sequential scans of the data, avoiding the need of costly random accesses on the disk. Moreover, up to this point very little attention has been paid to properly evaluating data series index structures, with most previous work relying solely on randomly selected data series to use as queries (with/without adding noise). In this thesis, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating a query workload. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, effectively evaluating and comparing data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. Finally, apart from ad hoc data exploration, we also investigate methods for the systematic analysis of very large data series collections, supporting business intelligence applications. We present techniques, which borrow ideas from Strategic Management, for a goal-oriented analysis of large collections of performance indicator data series. Such algorithms can additionally be sped up through the use of the index structures presented in this work

    Online Prediction Threshold Optimization Under Semi-deferred Labelling

    Full text link
    In supermarket loyalty campaigns, shoppers collect stamps to redeem limited-time luxury products. Having an accurate prediction of which shoppers will eventually redeem is crucial to effective execution. With the ultimate goal of changing shopper behavior, it is important to ensure an adequate number of rewards and to be able to steer promising shoppers into joining the campaign and redeeming a reward. If information from previous campaigns is available, a prediction model can be built to predict the redemption probability, possibly also adapting the prediction threshold to determine predicted the label. During a running campaign, we only know a subset of the labels of the positive class (the so-far redeemers), and have no access to the labels of any example of the negative class (non-redeemers at the end of the campaign). The majority of the examples during the campaign do not have a label yet (shoppers that could still redeem but have not done so yet). This is a semi-deferred labelling setting and our goal is to improve the prediction quality using this limited information. Existing work on predicting (semi-deferred) labels either focuses on positive-unlabelled learning, which does not use existing models, or updates models after the prediction is made by assigning expected labels using unsupervised learning models. In this paper we present a framework for Online Prediction threshold optimization Under Semi-deferred labelling (OPUS). Our framework does not change the existing model, but instead adapts the prediction threshold that decides which probability is required for a positive label, based on the semi-deferred labels we already know. We apply OPUS to two real-world datasets: a supermarket with two campaigns and over 160 000 shoppers.</p

    Data reduction in data warehouses

    Full text link
    Much research has been devoted to the efficient computation of relational aggregations. In this paper we consider the inverse problem, that of deriving (approximately) the original data from the aggregates. We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data ruining purposes. We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions. Subsequently, we consider the above problem in a space constrained environment, where only a subset of the required aggregates can be stored. More specifically, we wish to select a set of aggregates of interest, subject to a constraint on the total space occupied by these aggregates. The objective is to maximize the total benefit, which is a function of the number and importance of the queries that we can estimate given the selected aggregates. For this problem, which as we show is NP-hard, there are no known polynomial approximation schemes, and we propose several algorithms for solving it. We explore the use of greedy and randomized techniques as well as clustering based approaches. The solutions presented herein are generic and can be applied to other problem domains as well. We explore the properties and special characteristics of the above techniques with an experimental evaluation. Our results illustrate the behavior of the algorithms under different settings, and highlight the benefits of each approach. Based on our analysis, we present worst case scenarios for the algorithms. This offers insight into the operation of the algorithms, and provides a practical guide for selecting among the proposed techniques.Ph.D

    Time Series Management Systems: A 2022 Survey

    Full text link
    Enormous amounts of time series are being collected in many different domains. These include, but are not limited to, aviation, computing, energy, finance, logistics, and medicine. However, general-purpose Database Management Systems (DBMSs) are not optimized for times series management and thus significantly limit the amount of time series that can be efficiently stored and analyzed. As a remedy, specialized Time Series Management Systems (TSMSs) have been developed. This chapter, provides a thorough survey and classification of TSMSs that are developed through academic or industrial research and documented through peer-reviewed papers. To document their design and novel contributions, a summary of each system is provided. The systems are primarily classified based on their architecture. In addition, the systems are classified based on: when and why each system was developed, how it can be deployed, how mature its implementation is, how scalable it is, how it processes time series, what interfaces it provides, the type of approximation it supports, how low latency it can achieve, how it stores time series, and the types of queries it supports. The chapter concludes with a collection of open research problems based on the limitations of the surveyed systems

    Enabling access to and exploration of information graphs

    Full text link
    Exploratory search is the new frontier of information consumption as it goes well beyond simple \emph{lookups}. Information repositories are ubiquitous and grow larger every day, and automated search systems help users find information in such collections. To extract knowledge from these repositories, the common ``query lookup'' retrieval paradigm accepts a set of specifications (the query) that describes the objects of interest and then collects such objects. Yet, the query lookup retrieval paradigms commonly in use are no more sufficient to support complex information needs, as they can only provide candidate starting points, but do not help the user in expanding their knowledge. To ease access and consumption of rich information repositories, we address the crucial problem of data exploration. Exploratory tasks match the natural need for finding answers to open-ended information needs within an unfamiliar environment. In particular, in this dissertation, we focus on enabling access to and exploration of rich information graphs. Within businesses, organizations, and among researchers, data is produced in many forms, large volumes, and different contexts. As a consequence of this heterogeneity, many applications find more useful modelling their datasets with the graph model, where information is represented with entities (nodes) and relationships (edges). Those are the data graphs, the graph databases, the knowledge graphs, or more generally information graphs. The richness of their schema and of their content makes it challenging for users to express appropriate queries and retrieve the desired results. Hence, to allow an effective exploration of a graph, we require: (i) an expressive \emph{query paradigm}, (ii) an intuitive \emph{query mechanism}, and (iii) an appropriate \emph{storage and query processing system}. In this work, we address these three requirements. An exploratory query should be simple enough to avoid complicate declarative languages (such as SQL or SPARQL), and at the same time, it should retain the flexibility and expressiveness of such languages. For this reason, with respect to the query paradigm, we introduce the notion of \emph{exemplar queries} and propose extensions to handle multiple incomplete examples. An exemplar query is a query method in which the user, or the analyst, circumvents query languages by using examples as input. In particular, the solution we design allows flexible matching in the case of incomplete or partially specified examples. Moreover, to enable this query paradigm, there is the need for interactive systems that implement an incremental query-constructions mechanism and interactive explorations. To address this need, we study algorithms and implementations based on pseudo-relevance feedback for \emph{exemplar query suggestion}, along with an in-depth study of their effectiveness. Finally, as there exist many graph databases, high heterogeneity can be observed in the functionalities and performances of these systems. We provide an exhaustive evaluation methodology and a comprehensive study of the existing systems that allow to understand their capabilities and limitations. In particular, we design a novel micro-benchmarking framework for the assessment of the functionalities of some graph databases among the most prominent in the area and provide detailed insights on their performance

    Advanced Query Paradigms for the Novice User

    No full text
    Query answering is one of the most important processes in search systems, for it connects users to the information stored in data sources. A query is a set of specifications or constraints that the user provides to describe the objects of interest. As such, answering a query means retrieving those objects from the data source that match the user constraints. An answer from a search system may not fully satisfy the user. This happens if the answer does not contain the required object or it contains a number of irrelevant results. Commonly, the user does not know how to describe the query and ends up with one overly generic or specific or, even worse, she is not even aware of the correct conditions to describe the expected results. These problems are particularly evident when the database is interrogated by a novice user who, by definition, does not have sufficient technological skills to understand complicated query languages, or simply gives up if the system does not respond properly or timely. In this dissertation, we focus on three common problems in the broad query answering process to help novice users find the correct answers when the system does not provide sufficient support. First, we look at the empty answer problem, where the user provides a very specific query for which no answer exists in the database. In particular, we concentrate on interactive approaches for novice users. We end up with a rich probabilistic framework that includes user preferences and smoothly guides the user towards the most likely answers by means of simple yes/no questions on the query conditions to be discarded. Second, we analyze the information overload problem, that is complementary to the first. In this case the user provides a too generic query that returns a large set of potentially irrelevant results. We tackle the problem in structured databases, and more specifically, labeled graphs. The solution we propose returns a set of refinements (i.e., more specific queries) of the input query that, once executed, covers all the initial results. Third, we propose and study a completely novel query paradigm that assumes that the user is not able to describe the query conditions to retrieve the objects of interest. In this regard, we introduce exemplar queries, that allow the user to specify a single element in the result set and let the system infer the others. We provide clear semantics and a solution that works in large knowledge graphs. Finally, we validate the solutions for the three problems both in terms of theoretical results and experimental evaluation and we prove that the proposed methods efficiently scale up to very large real datasets
    corecore