AdOCTeRA: Adaptive Optimization Constraints for improved Text-guided Retrieval of Apartments
Nowadays, it is common for workers to relocate to new countries while seeking better job opportunities, or to live as digital nomads. While doing so, they face the problem of finding a new place to call home, requiring them to trust online advertisements or to physically visit the apartment. Recently, the research community investigated the possibility of performing the search in the Metaverse, hence reducing time and costs related to traveling and limiting carbon emissions. The methods available are based on state-of-the-art cross-modal retrieval techniques, which learn a joint embedding space by mapping each apartment close to its own description. However, these methodologies push all the other pairs far away in the embedding space. In this paper, we identify this decision as a limitation, since different apartments are likely to share many aspects. To overcome it, we propose AdOCTeRA, which automatically separates the apartments into three classes (very similar, slightly similar, and dissimilar) and defines adaptive optimization constraints for each of them. We validate our methodology on a large dataset of more than 6000 apartments, obtaining considerable relative improvements over the previous state-of-the-art (+3.8% R@5 and +7.3% R@10), and consistent improvements over the baseline across all the experiments. The source code is available at https://github.com/aliabdari/AdOCTeRA
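The idea of class-dependent constraints can be illustrated with a minimal sketch. This is not the AdOCTeRA implementation (see the linked repository for that): the thresholds, the margin values, and the function names below are all hypothetical, and a plain hinge loss stands in for whatever objective the authors actually use. The only point shown is that a non-matching apartment judged very similar is pushed away less strongly than a dissimilar one.

```python
# Hypothetical margins: the more similar the other apartment, the weaker the push.
MARGINS = {"very_similar": 0.0, "slightly_similar": 0.1, "dissimilar": 0.2}


def similarity_class(relevance, high=0.7, low=0.3):
    """Map a relevance score in [0, 1] to one of three classes.

    The thresholds `high` and `low` are assumptions made for illustration.
    """
    if relevance >= high:
        return "very_similar"
    if relevance >= low:
        return "slightly_similar"
    return "dissimilar"


def adaptive_hinge_loss(sim_pos, sim_neg, relevance):
    """Hinge loss whose margin depends on the class of the non-matching item.

    sim_pos:   similarity between a description and its own apartment
    sim_neg:   similarity between the description and another apartment
    relevance: how similar that other apartment is to the matching one
    """
    margin = MARGINS[similarity_class(relevance)]
    return max(0.0, margin + sim_neg - sim_pos)
```

With these assumed values, a very similar apartment scoring 0.75 against a positive at 0.8 incurs no loss, while a dissimilar apartment with the same score is penalized.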
Data Augmentation Techniques for the Video Question Answering Task
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, and the interaction between them in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, because of its potential impact on many different fields, such as social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.
Learning Video Retrieval Models with Relevance-Aware Online Mining
Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called negatives. This approach assumes that only the video and caption pairs in the dataset are valid, but different captions - positives - may also describe its visual contents, hence some of them may be wrongly penalized. To address this shortcoming, we propose the Relevance-Aware Negatives and Positives mining (RANP) which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives. We explore the influence of these techniques on two video-text datasets: EPIC-Kitchens-100 and MSR-VTT. By using the proposed techniques, we achieve considerable improvements in terms of nDCG and mAP, leading to state-of-the-art results, e.g. +5.3% nDCG and +3.0% mAP on EPIC-Kitchens-100. We share code and pretrained models at https://github.com/aranciokov/ranp
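A rough sketch of relevance-aware mining follows. It is not the RANP code (available at the linked repository): the Jaccard word overlap is an assumed stand-in for a semantic relevance function, and the two thresholds and all function names are hypothetical. It only shows the general pattern of promoting highly relevant captions to extra positives and keeping clearly unrelated captions as negatives.

```python
def relevance(caption_a, caption_b):
    """Jaccard overlap of the word sets of two captions.

    An assumed stand-in for a semantic relevance function.
    """
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_candidates(anchor_caption, other_captions, pos_thr=0.6, neg_thr=0.2):
    """Split the other captions into extra positives and safe negatives.

    Captions whose relevance falls between the two (hypothetical) thresholds
    are ambiguous and are left out of both lists.
    """
    extra_positives, negatives = [], []
    for caption in other_captions:
        score = relevance(anchor_caption, caption)
        if score >= pos_thr:
            extra_positives.append(caption)
        elif score <= neg_thr:
            negatives.append(caption)
    return extra_positives, negatives
```

For the anchor "cut the onion", the caption "cut the red onion" would be treated as an additional positive, "open fridge" as a negative, and "cut the carrot" as ambiguous.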
Improving semantic video retrieval models by training with a relevance-aware online mining strategy
To retrieve a video via a multimedia search engine, a textual query is usually created by the user and then used to perform the search. Recent state-of-the-art cross-modal retrieval methods learn a joint text–video embedding space by using contrastive loss functions, which maximize the similarity of positive pairs while decreasing that of the negative pairs. Although the choice of these pairs is fundamental for the construction of the joint embedding space, the selection procedure is usually driven by the relationships found within the dataset: a positive pair is commonly formed by a video and its own caption, whereas unrelated video-caption pairs represent the negative ones. We hypothesize that this choice results in a retrieval system with limited semantics understanding, as the standard training procedure requires the system to discriminate between groundtruth and negative even though there is no difference in their semantics. Therefore, differently from the previous approaches, in this paper we propose a novel strategy for the selection of both positive and negative pairs which takes into account both the annotations and the semantic contents of the captions. By doing so, the selected negatives do not share semantic concepts with the positive pair anymore, and it is also possible to discover new positives within the dataset. Based on our hypothesis, we provide a novel design of two popular contrastive loss functions, and explore their effectiveness on four heterogeneous state-of-the-art approaches. The extensive experimental analysis conducted on four datasets, EPIC-Kitchens-100, MSR-VTT, MSVD, and Charades, validates the effectiveness of the proposed strategy, observing, e.g., more than +20% nDCG on EPIC-Kitchens-100. Furthermore, these results are corroborated with qualitative evidence both supporting our hypothesis and explaining why the proposed strategy effectively overcomes it
Metaverse Retrieval: Finding the Best Metaverse Environment via Language
In recent years, the metaverse has sparked an increasing interest across the globe and is projected to reach a market size of more than $1000B by 2030. This is due to its many potential applications in highly heterogeneous fields, such as entertainment and multimedia consumption, training, and industry. This new technology raises many research challenges since, as opposed to the more traditional scene understanding, metaverse scenarios contain additional multimedia content, such as movies in virtual cinemas and operas in digital theaters, which greatly influence the relevance of the metaverse to a user query. For instance, if a user is looking for Impressionist exhibitions in a virtual museum, only the museums that showcase exhibitions featuring various Impressionist painters should be considered relevant. In this paper, we introduce the novel problem of text-to-metaverse retrieval, which proposes the challenging objective of ranking a list of metaverse scenarios based on a given textual query. To the best of our knowledge, this represents the first step towards understanding and automating cross-modal tasks dealing with metaverses. Since no public datasets contain these important multimedia contents inside the scenes, we also collect and annotate a dataset which serves as a proof-of-concept for the problem. To establish the foundation for it, we implement and analyze several solutions based on deep learning, whereas to promote transparency and reproducibility, we will publicly release their source code and the collected data.
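The ranking objective described in this abstract can be sketched as follows, assuming that the query and each scenario have already been encoded into vectors of the same size. The encoders themselves are out of scope here, and the function names and toy embeddings are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_scenarios(query_emb, scenario_embs):
    """Return scenario ids sorted by decreasing similarity to the query.

    scenario_embs maps a scenario id to its (assumed precomputed) embedding.
    """
    return sorted(
        scenario_embs,
        key=lambda sid: cosine(query_emb, scenario_embs[sid]),
        reverse=True,
    )


# Toy example with made-up 2-D embeddings.
ranking = rank_scenarios(
    [1.0, 0.0],
    {"museum": [0.9, 0.1], "cinema": [0.1, 0.9], "theater": [0.5, 0.5]},
)
```

In this toy example the scenario whose embedding points in the direction of the query is ranked first.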
Video question answering supported by a multi-task learning objective
Video Question Answering (VideoQA) concerns the realization of models able to analyze a video, and produce a meaningful answer to visual content-related questions. To encode the given question, word embedding techniques are used to compute a representation of the tokens suitable for neural networks. Yet almost all the works in the literature use the same technique, although recent advancements in NLP brought better solutions. This lack of analysis is a major shortcoming. To address it, in this paper we present a twofold contribution about this inquiry and its relation with question encoding. First of all, we integrate four of the most popular word embedding techniques in three recent VideoQA architectures, and investigate how they influence the performance on two public datasets: EgoVQA and PororoQA. Thanks to the learning process, we show that embeddings carry question type-dependent characteristics. Secondly, to leverage this result, we propose a simple yet effective multi-task learning protocol which uses an auxiliary task defined on the question types. By using the proposed learning strategy, significant improvements are observed in most of the combinations of network architecture and embedding under analysis
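A multi-task objective with an auxiliary question-type task can be sketched as a weighted sum of two classification losses. The abstract does not state the exact formulation, so the cross-entropy terms, the weight `aux_weight`, and the function names below are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def multitask_loss(answer_logits, answer_idx, qtype_logits, qtype_idx, aux_weight=0.5):
    """Main answer loss plus a weighted auxiliary question-type loss.

    aux_weight is a hypothetical hyperparameter balancing the two tasks.
    """
    main = cross_entropy(answer_logits, answer_idx)
    auxiliary = cross_entropy(qtype_logits, qtype_idx)
    return main + aux_weight * auxiliary
```

Setting `aux_weight` to zero recovers ordinary single-task training, which makes the auxiliary term easy to ablate.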
A Language-Based Solution to Enable Metaverse Retrieval
The Metaverse is becoming increasingly attractive, with millions of users accessing the many available virtual worlds. However, how do users find the one Metaverse which best fits their current interests? So far, the search process is mostly done by word of mouth, or by advertisement on technology-oriented websites. The lack of search engines similar to those available for other multimedia formats (e.g., YouTube for videos) is a limitation: it is often cumbersome to find a Metaverse based on some specific interests using the available methods, and it is difficult to discover user-created ones which lack strong advertisement. To address this limitation, we first propose to use language to naturally describe the desired contents of the Metaverse a user wishes to find. Second, we highlight that, unlike more conventional 3D scenes, Metaverse scenarios represent a more complex data format since they often contain one or more types of multimedia which influence the relevance of the scenario itself to a user query. Therefore, in this work, we create a novel task, called Text-to-Metaverse retrieval, which aims at modeling these aspects while also taking the cross-modal relations with the textual data into account. Since we are the first ones to tackle this problem, we also collect a dataset of 33000 Metaverses, each of which consists of a 3D scene enriched with multimedia content. Finally, we design and implement a deep learning framework based on contrastive learning, resulting in a thorough experimental setup.
FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests
Nowadays, many people frequently have to search for new accommodation options. Searching for a suitable apartment is a time-consuming process, especially because visiting them is often mandatory to assess the truthfulness of the advertisements found on the Web. While this process could be alleviated by visiting the apartments in the metaverse, the Web-based recommendation platforms are not suitable for the task. To address this shortcoming, in this paper, we define a new problem called text-to-apartment recommendation, which requires ranking the apartments based on their relevance to a textual query expressing the user's interests. To tackle this problem, we introduce FArMARe, a multi-task approach that supports cross-modal contrastive training with a furniture-aware objective. Since public datasets related to indoor scenes do not contain detailed descriptions of the furniture, we collect and annotate a dataset comprising more than 6000 apartments. A thorough experimentation with three different methods and two raw feature extraction procedures reveals the effectiveness of FArMARe in dealing with the problem at hand
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
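The difference between the two counting schemes can be made concrete with a small sketch. The data layout and function name are hypothetical, and the sketch uses a simplified binary count in which a pair of authors is counted once per citing paper. The study's exact counting rules may differ.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often two authors are cited together by the same paper.

    citing_papers: list of reference lists; each reference is an ordered
                   list of author names (first author first).
    max_authors:   1 reproduces first-author counting; 5 considers the
                   first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Two citing papers with made-up author labels.
papers = [
    [["A", "B"], ["C"]],
    [["D", "B"], ["C", "E"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, authors B and C are never co-cited under first-author counting, but are co-cited twice once non-first authors are taken into account.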
