Discriminative feature learning for multimodal classification
The purpose of this thesis is to tackle two related topics: multimodal classification and objective functions to improve the discriminative power of features.
First, I worked on image and text classification tasks and performed many experiments to show the effectiveness of different approaches available in the literature.
Then, I introduced a novel methodology that classifies multimodal documents with single-modal classifiers by merging textual and visual information into images, along with a novel loss function that improves separability between samples of a dataset.
Results show that exploiting multimodal data improves performance on classification tasks compared with traditional single-modality methods.
Moreover, the introduced GIT loss function enhances the discriminative power of features, lowering intra-class distance and raising inter-class distance between samples of a multiclass dataset.
Using convolutional neural networks for content extraction from online flyers
The rise of online shopping has hurt physical retailers, which struggle to persuade customers to buy products in physical stores rather than online. Marketing flyers are a great means of increasing the visibility of physical retailers, but the unstructured offers appearing in those documents cannot be easily compared with similar online deals, making it hard for a customer to understand whether it is more convenient to order a product online or to buy it from the physical shop. In this work we tackle this problem by introducing a content extraction algorithm that automatically extracts structured data from flyers. Unlike competing approaches that mainly focus on textual content or simply analyze font type, color and text positioning, we propose a new approach that uses Convolutional Neural Networks to classify words extracted from flyers typically used in marketing materials to attract the attention of readers towards specific deals. We obtained good results and a high degree of language and genre independence.
Embedded Textual Content for Document Image Classification with Convolutional Neural Networks
Hand written characters recognition via deep metric learning
Deep metric learning plays an important role in measuring similarity through distance metrics among arbitrary groups of data. The MNIST dataset is typically used to measure similarity; however, it has few seemingly similar classes, making it less effective for deep metric learning methods. In this paper, we created a new handwritten dataset named Urdu-Characters with a set of classes suitable for deep metric learning. With this work, we compare the performance of two state-of-the-art deep metric learning methods, namely Siamese and Triplet networks. We show that a Triplet network is more powerful than a Siamese network. In addition, we show that the performance of a Triplet or Siamese network can be improved using more powerful underlying Convolutional Neural Network architectures.
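As a rough illustration of how the two families of methods differ, the sketch below shows one common textbook form of the contrastive loss (often used with Siamese networks) and of the triplet loss. It is not taken from the paper: the function names, the margin value and the use of Euclidean distance are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, same_class, margin=1.0):
    """Pairwise loss: pull same-class pairs together, push
    different-class pairs at least `margin` apart (assumed form)."""
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: the anchor should be closer to the positive than
    to the negative by at least `margin` (assumed form)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The pairwise loss judges each pair in isolation, while the triplet loss only constrains the relative ordering of distances around an anchor.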
Aiding intra-text representations with visual context for multimodal named entity recognition
With the massive explosion of social media platforms such as Twitter and Instagram, people share billions of multimedia posts every day, containing images and text. Typically, text in these posts is short, informal and noisy, leading to ambiguities which can be resolved using images. In this paper we explore the text-centric Named Entity Recognition task on these multimedia posts. We propose an end-to-end model which learns a joint representation of a text and an image. Our model extends the multi-dimensional self-attention technique so that the image helps to enhance the relationships between words. Experiments show that our model is capable of capturing both textual and visual contexts with greater accuracy, achieving state-of-the-art results on the Twitter multimodal Named Entity Recognition dataset.
A query and product suggestion method for price comparison search engines
In this paper we propose a query suggestion method for price comparison search engines. Query suggestion techniques are used for generating alternative queries to facilitate web users in information seeking; in this specific domain, suggestions provided to web users need to be properly generated, taking into account that the suggested products must still be available for sale. We propose a novel approach based on a slight variant of classical query-URL graphs: the query-product click-through bipartite graph. Information extracted both from search engine logs and from specific domain features is exploited to build the graph, and one of the advantages of this model is that such a graph can be used to suggest not only related queries but also related products. Concepts used in the proposed method are not restricted to our context but are used in many other major e-commerce and search engine websites. We tested the model on several challenging datasets and compared it with a recent query suggestion approach specifically designed for price comparison engines. Our solution outperforms the competing approach, achieving higher results in terms of relevance of the provided suggestions and coverage rates on top-8 suggestions.
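To make the idea of a query-product click-through bipartite graph concrete, here is a minimal sketch. It is not the authors' method: the log format, the function names `build_graph` and `suggest`, and the scoring rule (shared click counts) are hypothetical choices for illustration. The availability filter reflects the requirement that suggested products must still be for sale.

```python
from collections import defaultdict

def build_graph(click_log):
    """Build both sides of the bipartite graph from (query, product)
    click pairs; edge weights are click counts."""
    query_to_products = defaultdict(lambda: defaultdict(int))
    product_to_queries = defaultdict(lambda: defaultdict(int))
    for query, product in click_log:
        query_to_products[query][product] += 1
        product_to_queries[product][query] += 1
    return query_to_products, product_to_queries

def suggest(query, q2p, p2q, available, top_k=8):
    """Return (related queries, related products) for `query`,
    considering only products that are still available."""
    query_scores = defaultdict(int)
    product_scores = defaultdict(int)
    for product, clicks in q2p.get(query, {}).items():
        if product not in available:
            continue  # skip products no longer for sale
        product_scores[product] += clicks
        for other, other_clicks in p2q[product].items():
            if other != query:
                # Assumed scoring: overlap of clicks on a shared product.
                query_scores[other] += min(clicks, other_clicks)

    def rank(scores):
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ordered][:top_k]

    return rank(query_scores), rank(product_scores)
```

Because both queries and products are nodes of the same graph, a single traversal yields both kinds of suggestion.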
Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals
We propose a novel deep training algorithm for joint representation of audio and visual information which consists of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging the class centers, which eliminates the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchmark audio-visual dataset, on a multitude of tasks including cross-modal verification, cross-modal matching and cross-modal retrieval. State-of-the-art performance is achieved on cross-modal verification and matching, while comparable results are observed on the remaining applications. Our experiments demonstrate the effectiveness of the technique for cross-modal biometric applications.
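The role of class centers can be illustrated with a simplified center-style loss, in which every embedding, regardless of modality, is pulled towards the center of its class, so no pairs or triplets need to be formed. This is only a sketch under assumptions: the paper's actual loss is not reproduced here, and computing centers as plain per-class means (rather than learning them during training) is a simplification.

```python
def class_centers(embeddings, labels):
    """Compute each class center as the mean of its embeddings
    (a simplification; centers are usually learned during training)."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def center_loss(embeddings, labels, centers):
    """Half the mean squared distance of each embedding to its class center."""
    total = 0.0
    for vec, label in zip(embeddings, labels):
        total += sum((v - c) ** 2 for v, c in zip(vec, centers[label]))
    return 0.5 * total / len(embeddings)
```

Each sample is compared only with its own class center, so the cost grows linearly with the number of samples instead of with the number of pairs or triplets.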
Do cross modal systems leverage semantic relationships?
Current cross modal retrieval systems are evaluated using the R@K measure, which does not leverage semantic relationships but strictly follows the manually marked image-text query pairs. Therefore, current systems do not generalize well to unseen data in the wild. To handle this, we propose a new measure, SemanticMap, to evaluate the performance of cross modal systems. Our proposed measure evaluates the semantic similarity between the image and text representations in the latent embedding space. We also propose a novel cross modal retrieval system using a single stream network for bidirectional retrieval. The proposed system is based on a deep neural network trained using extended center loss, minimizing the distance of image and text descriptions in the latent space from the class centers. In our system, the text descriptions are also encoded as images, which enables us to use a single stream network for both text and images. To the best of our knowledge, our work is the first of its kind in terms of employing a single stream network for cross modal retrieval systems. The proposed system is evaluated on two publicly available datasets, MSCOCO and Flickr30K, and has shown comparable results to the current state-of-the-art methods.
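For reference, R@K (Recall at K) can be sketched as the fraction of queries whose manually marked match appears among the top K retrieved items. The snippet below assumes one ground-truth item per query; the function name and the data layout are hypothetical. The proposed SemanticMap measure is not sketched, since its definition is not given in the abstract.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose marked match is in the top-k results.

    ranked_results: dict mapping query id -> list of item ids, best first.
    ground_truth:   dict mapping query id -> the single marked item id.
    """
    hits = sum(
        1
        for query, ranked in ranked_results.items()
        if ground_truth[query] in ranked[:k]
    )
    return hits / len(ranked_results)
```

A retrieved item that is semantically close to the query but is not the marked pair counts as a miss under this measure, which is the limitation the abstract points to.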
