RonPub -- Research Online Publishing
Compile-Time Query Optimization for Big Data Analytics
Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. 
DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
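The abstract mentions pushing predicates down before joins. As a rough illustration of that general transformation (not DIQL's optimizer, and written in Python rather than Scala), the sketch below compares filtering after a join with filtering before it. The helper names `join`, `join_then_filter`, and `filter_then_join`, and the sample records, are hypothetical. The rewrite is only valid under the stated assumption that the predicate reads columns of the left input alone.

```python
def join(left, right, key):
    """Naive hash join of two lists of dicts on a shared key column."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def join_then_filter(left, right, key, pred):
    """Unoptimized plan: build the full join, then discard rows."""
    return [row for row in join(left, right, key) if pred(row)]


def filter_then_join(left, right, key, pred):
    """Pushed-down plan: assumes pred only reads columns of `left`."""
    return join([l for l in left if pred(l)], right, key)


# Hypothetical sample data for illustration only.
users = [{"uid": 1, "age": 30}, {"uid": 2, "age": 17}]
orders = [
    {"uid": 1, "item": "a"},
    {"uid": 2, "item": "b"},
    {"uid": 1, "item": "c"},
]


def is_adult(row):
    return row["age"] >= 18


slow = join_then_filter(users, orders, "uid", is_adult)
fast = filter_then_join(users, orders, "uid", is_adult)
```

Both plans return the same rows, but the pushed-down plan joins fewer left-hand records, which is the benefit such a transformation aims for.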
Distributed Data-Gathering and -Processing in Smart Cities: An Information-Centric Approach
Technological advancements, along with the proliferation of smart and connected devices (things), have motivated the exploration of smart cities aimed at improving quality of life, economic growth, and efficient resource utilization. Some recent initiatives defined a smart city network as the interconnection of the existing independent and heterogeneous networks and the infrastructure. However, considering the heterogeneity of the devices, communication technologies, network protocols, and platforms, the interoperability of these networks is a challenge requiring more attention. In this paper, we propose the design of a novel Information-Centric Smart City architecture (iSmart), focusing on the demands of future applications, such as efficient machine-to-machine communication, low-latency computation offloading, large data communication requirements, and advanced security. In designing iSmart, we use the Named-Data Networking (NDN) architecture as the underlying communication substrate to promote semantics-based communication and achieve seamless compute/data sharing.
IoT Data Imputation with Incremental Multiple Linear Regression
In this paper, we address the problem of missing data imputation in the IoT domain. More specifically, we propose an Incremental Space-Time-based model (ISTM) for repairing missing values in real-time IoT data streams. ISTM is based on Incremental Multiple Linear Regression, which processes data as follows: upon data arrival, ISTM updates the model by re-reading the intermediary data matrix instead of accessing all historical information. If a missing value is detected, ISTM provides an estimate for it based on recent historical data and the observations of the sensors neighboring the affected one. Experiments conducted with real traffic data show the performance of ISTM in comparison with known techniques.
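The abstract does not give ISTM's exact formulation, so the following is only a minimal sketch of the general idea of incremental multiple linear regression: the model keeps running sums (here the matrices X^T X and X^T y, standing in for an "intermediary data matrix") and never revisits past observations. The class name, the tiny ridge term, and the choice of features (a sensor's previous reading and one neighbor's reading) are assumptions made for illustration.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


class IncrementalLinearRegression:
    """Least-squares fit kept up to date from running sums (hypothetical helper)."""

    def __init__(self, n_features, ridge=1e-9):
        d = n_features + 1  # one extra slot for the intercept
        # A accumulates X^T X (plus a tiny ridge for stability); b accumulates X^T y.
        self.A = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d

    def update(self, x, y):
        """Fold one observation into the running sums; no history is stored."""
        row = [1.0] + list(x)
        for i, xi in enumerate(row):
            self.b[i] += xi * y
            for j, xj in enumerate(row):
                self.A[i][j] += xi * xj

    def predict(self, x):
        """Estimate a (possibly missing) value from the current coefficients."""
        w = solve(self.A, self.b)
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


# Assumed features: (previous reading of the sensor, current reading of a neighbor).
model = IncrementalLinearRegression(n_features=2)
for x, y in [((0, 0), 2), ((1, 0), 5), ((0, 1), 1), ((1, 1), 4), ((2, 3), 5)]:
    model.update(x, y)

estimate = model.predict((4, 1))  # imputed value for a missing reading
```

The synthetic observations follow y = 2 + 3*x1 - x2, so the estimate for (4, 1) should come out close to 13. Each update costs a fixed amount of work that depends only on the number of features, not on the length of the stream.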
Word Embeddings for Wine Recommender Systems Using Vocabularies of Experts and Consumers
This vision paper proposes an approach that uses the most advanced word embedding techniques to bridge the gap between the discourses of experts and non-experts, and more specifically the terminologies used by the two communities. Word embeddings make it possible to find equivalent terms between experts and non-experts, by approximating the similarity between words or by revealing hidden semantic relations. Thus, these controlled vocabularies with these new semantic enrichments are exploited in a hybrid recommendation system incorporating content-based ontology and keyword-based ontology to obtain relevant wine recommendations regardless of the level of expertise of the end user. The major aim is to find a non-expert vocabulary from semantic rules to enrich the knowledge of the ontology and improve the indexing of the items (i.e. wine) and the recommendation process.
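Finding equivalent terms through embedding similarity can be sketched with cosine similarity over word vectors. The three-dimensional vectors and the vocabulary below are invented for illustration and do not come from the paper; a real system would load vectors from a trained embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_expert_term(consumer_vec, expert_vectors):
    """Return the expert term whose vector is most similar to the consumer term."""
    return max(expert_vectors, key=lambda term: cosine(consumer_vec, expert_vectors[term]))


# Toy, hand-made embeddings (assumptions, not real model output).
EXPERT = {
    "tannic": [0.9, 0.1, 0.0],
    "oaky": [0.1, 0.9, 0.2],
    "mineral": [0.0, 0.2, 0.9],
}
CONSUMER = {
    "dry mouthfeel": [0.8, 0.2, 0.1],
    "woody": [0.2, 0.8, 0.1],
}

mapping = {term: closest_expert_term(vec, EXPERT) for term, vec in CONSUMER.items()}
```

With these toy vectors, "woody" maps to "oaky" and "dry mouthfeel" maps to "tannic", which is the kind of expert/non-expert term alignment the abstract describes.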
FICLONE: Improving DBpedia Spotlight Using Named Entity Recognition and Collective Disambiguation
In this paper we present FICLONE, which aims to improve the performance of DBpedia Spotlight, not only for the task of semantic annotation (SA), but also for the sub-task of named entity disambiguation (NED). To achieve this aim, first we enhance the spotting phase by combining a named entity recognition system (Stanford NER) with the results of DBpedia Spotlight. Second, we improve the disambiguation phase by using coreference resolution and exploiting a lexicon that associates a list of potential Wikipedia entities with surface forms. Finally, to select the correct entity among the candidates found for one mention, FICLONE relies on collective disambiguation, an approach that has proved successful in many other annotators, and that takes into consideration the other mentions in the text. Our experiments show that FICLONE not only substantially improves the performance of DBpedia Spotlight for the NED sub-task but also generally outperforms other state-of-the-art systems. For the SA sub-task, FICLONE also outperforms DBpedia Spotlight on the dataset provided by the DBpedia Spotlight team.
Query Rewriting by Contract under Privacy Constraints
In this paper we show how Query Rewriting rules and Containment checks of aggregate queries can be combined with Contract-based programming techniques. Based on the combination of both worlds, we are able to find new Query Rewriting rules for queries containing aggregate constraints. These rules can either be used to improve the overall system performance or, in our use case, to implement a privacy-aware way to process queries. By integrating them in our PArADISE framework, we can now process and rewrite all types of OLAP queries, including complex aggregate functions and group-by extensions. In our framework, we use the whole network structure, from data-producing sensors up to cloud computers, to automatically deploy an edge computing subnetwork. On each edge node, so-called fragment queries of a genuine query are executed to filter and to aggregate data on resource-restricted sensor nodes. As a result of integrating Contract-based programming approaches, we are now able to not only process less data but also to produce less data in the result. Thus, the privacy principle of data minimization is accomplished.
Modelling Patterns in Continuous Streams of Data
The untapped source of information, extracted from the increasing number of sensors, can be explored to improve and optimize several systems. Yet, hand in hand with this growth goes the increasing difficulty of managing and organizing all this new information. The lack of a standard context representation scheme is one of the main struggles in this research area. Conventional methods for extracting knowledge from data rely on a standard representation or a priori relation, which may not be feasible for IoT and M2M scenarios. With this in mind, we propose a stream characterization model in order to provide the foundations for a novel stream similarity metric. Complementing previous work on context organization, we aim to provide an automatic stream organizational model without enforcing specific representations. In this paper we extend our work on stream characterization and devise a novel similarity method.
The Design of a Gamification Algorithm in a Music Practice Application
Keeping track of pupils' progress across different instruments and lessons, and what they are meant to be practicing, can be challenging. The typical solution is to use a book in which teachers write notes and pupils record practice. This can, however, easily be lost or become illegible. Furthermore, music education and self-directed practice is one area of education that is not widely gamified, with gamification describing a technique that drives specific human behaviors, motivates users, and has proven success in influencing learning. An application could therefore be created to respond to these needs by recording and tracking music practice whilst also gamifying student learning. An algorithm that accommodates these requirements is presented in this paper.
Webpage Ranking Analysis of Various Search Engines with Special Focus on Country-Specific Search
In order to attract many visitors to their own website, it is extremely important for website developers that their webpage is one of the best ranked webpages of search engines. As a rule, search engine operators do not disclose their exact ranking algorithm, so that website developers usually have only vague ideas about which measures have particularly positive influences on the webpage ranking. Conversely, we ask the question: "What are the properties of the best ranked webpages?" For this purpose, we perform a detailed analysis, in which we compare the properties of the best ranked webpages with those of the worse ranked webpages. Furthermore, we compare country-specific differences.
Halo Effect Contamination in Assessments of Web Interface Design
This paper relies on findings and theory from both the human-computer interaction and cognitive psychology literatures in order to inquire into the extent to which the halo effect contaminates web interface design assessments. As a human cognitive bias, the halo effect manifests itself when a judge's evaluations of an entity's individual characteristics are negatively or positively distorted by the judge's overall affect toward the entity being judged. These distortions and halo-induced delusions have substantial negative implications for rational decision-making and the ability to objectively evaluate businesses, technologies, or other humans, and should hence be a critical consideration for both managers and organizations alike. Here we inquire into the halo effect using a controlled, randomized experiment involving more than 1,200 research subjects. Subjects' preexisting affective states were activated using polarizing issues including abortion rights, immigration policy, and gun control laws. Subjects were then asked to evaluate specific interface characteristics of six different types of websites, the textual content of which either supported or contradicted their preexisting affective beliefs. Comparing subject responses to objective control evaluations revealed strong evidence of halo effect contamination in assessments of web interface design, particularly among men. In light of the results, a theoretical framework integrating elements from cognitive and evolutionary psychology is proposed to explain the origins and purpose of the halo effect.