Predicting User Personality from Public Perceptions on Social Media
Personality distinctively characterises an individual and profoundly influences behaviour. Social media offer the virtual community an unprecedented opportunity to generate content and share aspects of their lives, which often reflect their personalities. Interest in using deep learning to infer traits from digital footprints has grown recently; however, very little work has explored the sentiment information conveyed. The present study therefore used a computational approach to classify personality from social media by gauging the public perceptions that underlie the factors encompassing personality traits.
In the research reported in this thesis, a Sentiment-based Personality Detection system was developed to infer traits from short texts based on the ‘Big Five’ personality dimensions. We followed the spirit of the Neural Network Language Model (NNLM) by using a unified model that combines a Recurrent Neural Network, namely Long Short-Term Memory (LSTM), with a Convolutional Neural Network (CNN). The proposed system has three stages. It begins with sentiment classification, grouping short messages harvested online into three categories: positive, negative, and nonpartisan. Next, Global Vectors (GloVe) are employed to build vectorial word representations; this step aims to add external knowledge to the short texts. CNN and LSTM are then applied during the learning process. Finally, we trained each variant of the models to compute prediction scores across the five traits. The experimental study indicated the effectiveness of our system.
As part of our investigation, a case study was carried out which employed the proposed system. We chose Uber, a renowned global ride-hailing company, as the subject of our examination. The study was set up to investigate the correlation between personality traits and opinion polarities. The results support prior findings that persons with the same traits tend to express sentiments in similar ways.
Information Extraction from TV Series Scripts for Uptake Prediction
The script of a movie, or of an episode of a television series, describes the setting, the storyline, and the scene changes. It also details the movement, actions, non-oral expression, and dialogues of the characters.
The script is assessed by potential investors. If it is considered to be qualified, a decision is made to arrange funds and other resources to create the real product, i.e. a movie or a television series. This action of approving the project is known as green-lighting.
Many studies have been conducted on building models to predict the success of movies. However, the majority of these studies exploit factors which only become known after the decision of green-lighting, or after the release of the products. Only a few studies have focused on predictive models based on pre-greenlighting factors, which are available before the decision of green-lighting.
In comparison, there are even fewer models that forecast the performance of television series using pre-greenlighting factors.
This study aims to extract features from scripts of pilot episodes, which are the first episodes of television series. These features will be exploited to construct predictive models for uptake of the television series.
Three data sources were employed: IMDb, the OpenSubtitles2016 corpus, and television series scripts retrieved from multiple websites. The scripts were parsed and their structures were analysed. Subsequently, features were extracted and data matrices were generated. These features and data matrices were used with classification algorithms to train and construct predictive models, whose output was then used to predict uptake. However, the results were not as compelling as expected.
The present research was compared with previous studies on the same topic. The evaluation results are discussed, and suggestions for future work are given.
A Novel E-mail Reply Approach for E-mail Management System
This project describes a novel intelligent E-mail reply system built with information retrieval and information generation techniques. Realising the different functions with machine learning and deep learning algorithms presents several difficulties. For example, publicly available raw training datasets cannot meet the functional requirements of the model, and information generation models cannot handle long-text prediction because of limitations of the algorithms. The Term Frequency-Inverse Document Frequency (TF-IDF) model is one of the most widely used feature extraction methods in information retrieval because of its simple algorithm and excellent performance. The Document to Vector (Doc2Vec) model is an extension of Word to Vector (Word2Vec) that trains a document index jointly with the word vectors; it has achieved good results in determining the relationships between words within a document, as well as the correlation between different documents. The Gated Recurrent Unit (GRU) model, an advanced form of recurrent neural network (RNN), plays an increasingly important role in natural language processing (NLP). It uses deep neural networks to predict and generate information rather than extracting existing information. We use these three algorithms to train and implement our models after heavily processing our training data. Experimental results show that a hybrid model combining the GRU information generation model as the base with sentence-to-vector embedding (Sent2Vec) is a practicable method for long-text prediction. Finally, an intelligent E-mail reply system is implemented in our experiment, and the three models are compared through subjective human evaluation.
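To make the retrieval side of such a system concrete, the following is a minimal sketch of TF-IDF matching: an incoming message is compared with previously seen e-mails, and the reply attached to the closest one is returned. It is not the thesis implementation. The function names, the whitespace tokenizer, the smoothed IDF variant, and the sample messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def vectorize(tokens, idf):
    # Term frequency (relative count) times inverse document frequency.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def build_tfidf(docs):
    # docs is a list of token lists; returns one sparse vector per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (one of several common variants) so that terms occurring
    # in every document still carry a small weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [vectorize(doc, idf) for doc in docs], idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_reply(query, past_emails, past_replies):
    # Retrieve the reply whose original e-mail is most similar to the query.
    vectors, idf = build_tfidf([tokenize(e) for e in past_emails])
    q = vectorize(tokenize(query), idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return past_replies[best], scores[best]


# Hypothetical sample data.
emails = [
    "when is the project meeting scheduled",
    "please send the invoice for march",
    "can you reset my account password",
]
replies = ["The meeting is on Monday.", "Invoice attached.", "Password reset link sent."]
reply, score = best_reply("could you send me the march invoice", emails, replies)
print(reply)
```

A retrieval model of this kind can only return replies that already exist, which is the limitation the generation-based GRU model is meant to address.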
Novel Methods for Distributed and Privacy-Preserving Data Stream Mining
The growing number of “big” datasets presents many opportunities for data mining, but also raises a variety of new challenges. Datasets may take the form of continuous streams with constantly changing patterns, they may be too widely distributed to be centralised for analysis at a single location, or they may contain sensitive values that data owners are not willing to share due to privacy concerns. Much past research has considered these issues individually, but few existing methods can address combinations of these properties. Therefore, this research develops methods for distributed and privacy-preserving data stream mining: a novel Hierarchical Distributed Stream Miner (HDSM) that learns relationships between the features of separate streams with minimal data transmission to central locations, and two data perturbation methods for privacy-preserving stream mining based on the combination of random projection, random translation, and additive noise. Experimental evaluation of HDSM demonstrates significant improvements in classification accuracy over existing distributed stream mining approaches while minimising data transmission and computational costs. HDSM’s ability to dynamically trade off accuracy against these costs is also demonstrated. Variations of the known input-output Maximum A Posteriori (MAP) attack are developed to experimentally evaluate the data perturbation methods, and the proposed composite methods are shown to achieve a better trade-off between privacy and model accuracy than random projection alone. Finally, an approach is described for combining HDSM with data perturbation to achieve distributed privacy-preserving stream mining.
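The combination of random projection, random translation, and additive noise can be illustrated with a short sketch. This is only one plausible reading of the description above, not the thesis's actual perturbation methods. The Gaussian projection matrix, its 1/sqrt(k) scaling, the uniform translation range, the noise level, and all function names are assumptions.

```python
import random


def make_perturbation(d, k, seed=0):
    # Draw a secret k x d Gaussian projection matrix R and a translation
    # vector t. Both would be kept private by the data owner.
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    t = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    return R, t


def perturb(x, R, t, noise_sd=0.05, rng=random):
    # y = R x + t + noise, applied to one stream record x of length d.
    projected = [sum(r * v for r, v in zip(row, x)) for row in R]
    return [p + ti + rng.gauss(0.0, noise_sd) for p, ti in zip(projected, t)]


# Hypothetical record with four features, released as three perturbed values.
R, t = make_perturbation(d=4, k=3, seed=1)
released = perturb([1.0, 2.0, 3.0, 4.0], R, t, noise_sd=0.05, rng=random.Random(2))
print(released)
```

In this sketch the projection approximately preserves distances between records, while the translation and noise are intended to make it harder to recover the original values from known input-output pairs.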
Performance Evaluation and Extension of Cachejoin in a Real-Life Environment
Active, or real-time, data warehousing is becoming very popular in the business intelligence domain. Building a real-time or active data warehouse requires online processing of a stream of end users’ transactions against disk-based master data. This is also called semi-stream data processing. Fundamentally, semi-stream processing joins incoming stream data (transactional data) with slowly retrieved, disk-based master data by using an effective join operator. Typically, this join operator works with a limited amount of main memory, which cannot hold the entire disk-based master data. A number of semi-stream join algorithms have recently been proposed in the literature. Most of these algorithms have been tested using synthetic datasets, while only a few have been tested using real-life datasets, so it is worth examining how they behave in a real environment. Because each semi-stream join performs differently under different characteristics of the stream data, it is important to select an appropriate semi-stream join based on those characteristics. These join algorithms also use different strategies to access the disk-based master data, e.g. an index (clustered or non-clustered) or no index.
Based on an intensive literature review, in this thesis we select a well-known semi-stream join, CACHEJOIN (Cache Join), and implement it at MITRE 10 NZ, one of the leading home improvement and hardware retail stores. We study the behavior and performance of the algorithm under two different datasets (a synthetic dataset and the MITRE 10 NZ dataset). Our performance study shows that CACHEJOIN performs on the MITRE 10 NZ dataset very close to how it performs on the synthetic dataset.
As an extension of our work, we find that the MITRE 10 NZ incoming stream data (transactional data) needs to be joined with two tables in the disk-based master data. The first join is performed with the product table (sc) using stock_code as the join attribute, while the second join is performed with the customer table (cs_person) using account_code as the join attribute. This gives us an opportunity to extend our existing CACHEJOIN to a two-stage join. The stream tuples move to the second stage as soon as they complete the first stage. The performance of the two-stage join is studied against the normal CACHEJOIN using the MITRE 10 NZ dataset. After analyzing the performance, we are confident that the extended CACHEJOIN performs reasonably well in the MITRE 10 NZ real environment.
As future work, we plan to explore the two-stage join further by trying different semi-stream joins to find the best join combinations, and to explore parallelization by running two parallel nodes to handle the future growth of MITRE 10 NZ transactional data.
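The two-stage join described above can be sketched as follows. This is a simplified stand-in and not the CACHEJOIN algorithm itself: each stage places a small least-recently-used cache in front of a dictionary that plays the role of the disk-based master table. Only the join attributes stock_code and account_code come from the abstract. The class, the function, the cache policy, and the sample rows are hypothetical.

```python
from collections import OrderedDict


class CachedLookup:
    """Master-data lookup with a small LRU cache in front of a slow store."""

    def __init__(self, master, capacity):
        self.master = master        # stands in for the disk-based table
        self.capacity = capacity    # memory budget, in number of rows
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)         # mark as recently used
            return self.cache[key]
        self.disk_reads += 1                    # simulated disk access
        row = self.master.get(key)
        if row is not None:
            self.cache[key] = row
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return row


def two_stage_join(stream, products, customers):
    # Stage 1 joins on stock_code; matching tuples move straight to stage 2,
    # which joins on account_code.
    for tup in stream:
        product = products.get(tup["stock_code"])
        if product is None:
            continue
        customer = customers.get(tup["account_code"])
        if customer is None:
            continue
        yield {**tup, **product, **customer}


# Hypothetical master data and transaction stream.
products = CachedLookup({"S1": {"product_name": "Hammer"}}, capacity=100)
customers = CachedLookup({"A1": {"customer_name": "Ann"}}, capacity=100)
stream = [
    {"stock_code": "S1", "account_code": "A1", "qty": 2},
    {"stock_code": "S1", "account_code": "A1", "qty": 1},
    {"stock_code": "S9", "account_code": "A1", "qty": 5},
]
joined = list(two_stage_join(stream, products, customers))
print(len(joined), products.disk_reads, customers.disk_reads)
```

Counting simulated disk reads per stage gives a rough way to see how skew in the stream (repeated stock or account codes) reduces master-data access, which is the kind of behavior a real-data evaluation would measure.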
Personalised Taste Profiling in Short-Text Microblogs
The objective of this thesis is to develop diverse and user-representative methods for taste profiling of short-text microblog users. The proposed methods are based entirely on the disseminated content, the social network structure, and their variations over time. Inferring user interests and subsequently formulating taste profiles is pertinent to personalizing content recommendations for micro-blogging services, as well as to identifying users with similar preferences. The methods are broadly divided into two categories: i) short-text analytics methods (Part I, Chapter 3) and ii) user interest identification and quantification (taste profiling) over time (Part II, Chapters 4, 5 and 6).
With the method proposed in Part I, it is possible to accurately extract knowledge from short texts, which is usually difficult because of the unconventional language used on such platforms. As a case study, a semi-supervised modelling framework based on tweet metadata is proposed for extracting better topical representations of short texts. In the findings, topical vectors from semantically relevant long texts made shorter and otherwise noisy texts more interpretable. The resulting models produced better topical classification results than similar approaches.
The methods in Part II largely support the detection of user interests and the subsequent modelling of taste profiles. As case studies, several approaches were proposed for identifying and quantifying the interests of short-text microblog users. A neural network-based approach was proposed for computing user interest in a specific topic, as part of a process to identify relevant users for a follow-back feature in certain domains. In addition, a soft clustering method was proposed to identify user interests across several topics and the level of interest in each. Lastly, the time dependency of interest decay and gain in such microblogs was modelled. This mirrors a conventional short-text microblogging platform, where content is volatile and depends, for example, on the prevailing news at the time. Twitter was used as the testing platform for the proposed approaches, mainly because of its popularity, its API accessibility, and the temporal dynamism of its overall network structure. This research is fundamental to services, content recommendations and audience measurement.
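The idea of interest decay and gain over time can be illustrated with a small sketch. The abstract does not state the functional form used in the thesis, so the exponential decay with a fixed half-life below is an assumption, as are the function names, the seven-day default, and the sample events.

```python
import math


def interest_score(events, now, half_life_days=7.0):
    # events: list of (time_in_days, weight) interactions with one topic.
    # Each interaction adds interest that halves every half_life_days.
    lam = math.log(2) / half_life_days
    return sum(w * math.exp(-lam * (now - t)) for t, w in events if t <= now)


def taste_profile(topic_events, now, half_life_days=7.0):
    # Normalise decayed scores so the profile sums to 1 across topics.
    raw = {
        topic: interest_score(events, now, half_life_days)
        for topic, events in topic_events.items()
    }
    total = sum(raw.values())
    return {topic: (s / total if total else 0.0) for topic, s in raw.items()}


# Hypothetical user: one sports post a week ago, one news post today.
profile = taste_profile({"sports": [(0.0, 1.0)], "news": [(7.0, 1.0)]}, now=7.0)
print(profile)
```

With these settings the week-old sports post counts half as much as today's news post, so the profile leans towards news until the user posts about sports again.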
