TAMformer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction
A system for automatic detection and recognition of advertising trademarks in sports videos
In this technical demonstration we show the current version of our trademark detection and recognition system, developed in collaboration with a sport marketing firm to evaluate the visibility of advertising trademarks in broadcast sporting events. We propose a semi-automatic system for detecting and retrieving trademark appearances in sports videos. A human annotator supervises the results of the automatic annotation through an interface that shows the time and position of the detected trademarks. For this reason the system aims for high recall, so that the supervisor can safely skip the parts of the video that have been marked as not containing a trademark, which speeds up the review.
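The recall-first design described above can be illustrated with a small sketch. It is not the authors' system: the per-segment scores, the labels, and both function names are hypothetical, and the idea shown is only that the decision threshold is set as high as possible while still keeping every annotated trademark segment.

```python
def recall_at_threshold(scores, labels, threshold):
    """Fraction of true trademark segments whose score reaches the threshold."""
    positives = [s for s, is_trademark in zip(scores, labels) if is_trademark]
    if not positives:
        return 1.0  # nothing to miss
    return sum(s >= threshold for s in positives) / len(positives)


def highest_threshold_for_recall(scores, labels, target_recall):
    """Largest observed score that, used as threshold, still meets the recall target."""
    for candidate in sorted(set(scores), reverse=True):
        if recall_at_threshold(scores, labels, candidate) >= target_recall:
            return candidate
    return min(scores)


# Hypothetical detector scores for six video segments and their ground truth.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]

threshold = highest_threshold_for_recall(scores, labels, target_recall=1.0)
# Segments scoring below the threshold are the ones a supervisor could skip.
skippable = [s for s in scores if s < threshold]
print(threshold, skippable)
```

With a recall target of 1.0 the false positive at 0.6 is kept for review, which is the trade-off the abstract accepts: the supervisor checks a few extra segments rather than missing a trademark.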
Towards Polyp Counting in Full-Procedure Colonoscopy Videos
Automated colonoscopy reporting holds great potential for enhancing quality control and improving cost-effectiveness of colonoscopy procedures. A major challenge lies in the automated identification, tracking, and re-association (ReID) of polyp tracklets across full-procedure colonoscopy videos. This is essential for precise polyp counting and enables automated computation of key quality metrics, such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). However, polyp ReID is challenging due to variations in polyp appearance, frequent disappearance from the field of view, and occlusions. In this work, we leverage the REAL-Colon dataset, the first open-access dataset providing full-procedure videos, to define tasks, data splits and metrics for the problem of automatically counting polyps in full-procedure videos, establishing an open-access framework. We re-implement previously proposed SimCLR-based methods for learning representations of polyp tracklets, both single-frame and multi-view, and adapt them to the polyp counting task. We then propose an Affinity Propagation-based clustering method to further improve ReID based on these learned representations, ultimately enhancing polyp counting. Our approach achieves state-of-the-art performance, with a polyp fragmentation rate of 6.30 and a false positive rate (FPR) below 5% on the REAL-Colon dataset. We release code at https://github.com/lparolari/towards-polyp-counting
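As a rough illustration of how counting feeds the quality metrics named above, the sketch below uses the usual definitions: ADR as the share of procedures with at least one adenoma, and PPC as the mean number of polyps per procedure. The cluster IDs, the per-procedure records, and the function names are hypothetical and are not taken from the released code.

```python
def count_polyps(tracklet_cluster_ids):
    """Each ReID cluster is treated as one physical polyp."""
    return len(set(tracklet_cluster_ids))


def adenoma_detection_rate(procedures):
    """Share of procedures in which at least one adenoma was found."""
    with_adenoma = sum(1 for p in procedures if p["adenomas"] > 0)
    return with_adenoma / len(procedures)


def polyps_per_colonoscopy(procedures):
    """Mean number of polyps found per procedure."""
    return sum(p["polyps"] for p in procedures) / len(procedures)


# Hypothetical output of a ReID step: five tracklets grouped into two polyps.
first_video_polyps = count_polyps([0, 0, 1, 1, 1])

# Hypothetical per-procedure summaries.
procedures = [
    {"polyps": first_video_polyps, "adenomas": 1},
    {"polyps": 0, "adenomas": 0},
    {"polyps": 3, "adenomas": 2},
    {"polyps": 1, "adenomas": 0},
]

print(adenoma_detection_rate(procedures))   # share of procedures with an adenoma
print(polyps_per_colonoscopy(procedures))   # mean polyps per procedure
```

The sketch makes the dependency explicit: if ReID splits one polyp into several clusters, `count_polyps` over-counts and PPC is inflated, which is why fragmentation matters for these metrics.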
DSP-ST: Dynamic Structural Prior Spatiotemporal Graph Attention Networks for Traffic Speed Prediction
Accurate traffic forecasting is essential for intelligent transportation system management and control. Due to the highly complex spatiotemporal (ST) correlation of real-world road networks, dynamic and long-term traffic prediction presents many challenges. We propose a traffic speed prediction model based on dynamic structural prior (DSP) ST graph attention networks. We provide a structural prior graph, namely, dual graph convolution, which combines spatial and contextual subgraphs to enable the discovery of the non-Euclidean spatial correlation and potential contextual similarity of road networks. Moreover, to dynamically extract the ST correlation, this article employs a multihead self-attention temporal convolution module to capture the temporal correlation and a graph attention convolution module to extract the spatial correlation. The prediction output is generated by stacking multiple ST blocks. Experimental results on two real-world traffic datasets demonstrate that DSP-ST outperforms existing mainstream baselines and can serve as a reference for traffic management departments.
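The temporal module mentioned above builds on self-attention. The following is a minimal single-head sketch of scaled dot-product self-attention over time steps, with identity projections (queries, keys, and values are all the input itself). It is an assumption-laden simplification: the multihead structure, learned weights, and temporal convolution of DSP-ST are not reproduced, and the input features are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head self-attention where Q = K = V = sequence (no learned weights)."""
    dim = len(sequence[0])
    output = []
    for query in sequence:
        # Similarity of this time step to every time step, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        # Weighted average of all time steps.
        output.append(
            [sum(w * step[i] for w, step in zip(weights, sequence)) for i in range(dim)]
        )
    return output


# Hypothetical normalized features for three consecutive time steps.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(steps)
print(attended)
```

Each output row is a convex combination of the input rows, so time steps with similar features contribute more to one another than dissimilar ones.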
Aligning and linking entity mentions in image, text, and knowledge base
A picture is worth a thousand words, the adage reads. However, pictures cannot replace words in terms of their ability to efficiently convey clear, (mostly) unambiguous and concise knowledge. Images and text, indeed, reveal different and complementary information that, if combined, will result in more information than the sum of that contained in each single medium. The combination of visual and textual information can be obtained by linking the entities mentioned in the text with those shown in the pictures. To further integrate this with the agent’s background knowledge, an additional step is necessary: either finding the entities in the agent's knowledge base that correspond to those mentioned in the text or shown in the picture, or extending the knowledge base with the newly discovered entities. We call this complex task Visual-Textual-Knowledge Entity Linking (VTKEL). In this article, after providing a precise definition of the VTKEL task, we present two datasets called VTKEL1k* and VTKEL30k. These datasets consist of images and corresponding captions, in which the image and textual mentions are both annotated with the corresponding entities typed according to the YAGO ontology. The datasets can be used for training and evaluating algorithms for the VTKEL task. Next, we introduce a baseline algorithm called VT-LinKEr (Visual-Textual-Knowledge Entity Linker) for solving the VTKEL task, and we evaluate its performance on both datasets. We then contribute a supervised algorithm called ViTKan (Visual-Textual-Knowledge Alignment Network), which we trained using feature data from the VTKEL1k* dataset. The experimental results on the VTKEL1k* and VTKEL30k datasets show that ViTKan substantially outperforms the baseline algorithm.
VT-LINKER: Visual-Textual-Knowledge Entity Linker
“A picture is worth a thousand words”, the adage reads. However, pictures cannot replace words in terms of their ability to efficiently convey clear, (mostly) unambiguous and concise knowledge. Images and text, indeed, reveal different and complementary information that, if combined, result in more information than the sum of that contained in each single medium. The combination of visual and textual information can be obtained by linking the entities mentioned in the text with those shown in the pictures. To further integrate this with the agent's background knowledge, an additional step is necessary: either finding the entities in the agent's knowledge base that correspond to those mentioned in the text or shown in the picture, or extending the knowledge base with the newly discovered entities. We call this complex task Visual-Textual-Knowledge Entity Linking (VTKEL). In this paper, we precisely define the VTKEL task and present two datasets composed of 1k and 30k pictures, annotated with visual and textual entities and linked to the YAGO ontology. We then develop the first unsupervised algorithm for solving the VTKEL task. The evaluation of the algorithm shows promising results on both the 1k and 30k VTKEL datasets.
Social and Scene-Aware Trajectory Prediction in Crowded Spaces
Mimicking the human ability to forecast future positions or interpret complex interactions in urban scenarios, such as streets, shopping malls or squares, is essential to develop socially compliant robots or self-driving cars. Autonomous systems may benefit from anticipating human motion to avoid collisions or to behave naturally alongside people. To foresee plausible trajectories, we construct an LSTM (long short-term memory)-based model considering three fundamental factors: people interactions, past observations in terms of previously crossed areas, and semantics of the surrounding space. Our model encompasses several pooling mechanisms to join the above elements, defining multiple tensors, namely social, navigation and semantic tensors. The network is tested in unstructured environments where complex paths emerge according to both internal (intentions) and external (other people, inaccessible areas) motivations. As demonstrated, modeling paths without awareness of social interactions or context information is insufficient to correctly predict future positions. Experimental results corroborate the effectiveness of the proposed framework in comparison to LSTM-based models for human path prediction.
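To give a concrete sense of what a pooling mechanism over neighbouring people can look like, here is a simplified occupancy-grid sketch: it counts neighbours in a small grid centred on one pedestrian. This is an illustrative stand-in, not the paper's social, navigation, or semantic tensors; the function name, grid size, cell size, and coordinates are all assumptions.

```python
def social_occupancy_grid(center, neighbors, cell_size=1.0, grid_size=4):
    """Count neighbours falling in each cell of a grid centred on one pedestrian."""
    half_extent = grid_size * cell_size / 2
    grid = [[0] * grid_size for _ in range(grid_size)]
    center_x, center_y = center
    for x, y in neighbors:
        dx, dy = x - center_x, y - center_y
        # Ignore people outside the square neighbourhood.
        if -half_extent <= dx < half_extent and -half_extent <= dy < half_extent:
            col = int((dx + half_extent) // cell_size)
            row = int((dy + half_extent) // cell_size)
            grid[row][col] += 1
    return grid


# Hypothetical positions in metres: two nearby people and one far away.
grid = social_occupancy_grid(
    center=(0.0, 0.0),
    neighbors=[(0.5, 0.5), (-1.5, -1.5), (5.0, 5.0)],
)
for row in grid:
    print(row)
```

A learned model would typically pool hidden states rather than raw counts, but the spatial binning step is the same idea: nearby people are summarized in a fixed-size structure regardless of how many are present.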
Automatic trademark detection and recognition in sport videos
In this paper we describe a system for automatic detection and recognition of trademarks in sports videos. We propose a compact representation of trademarks based on SIFT feature points and a matching algorithm to robustly detect and retrieve trademarks in a variety of different sports video types. Trademark localization is performed through robust clustering of matched feature points in the video frame. A supervised machine learning approach is used to automatically adapt the similarity threshold used to assess the trademark matches. Experimental results are provided, along with an analysis of the precision and recall. Results show that our proposed technique is efficient and effectively detects and classifies trademarks
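The localization step above, clustering matched feature points in a frame, can be approximated with a very simple robust scheme. The sketch below is a stand-in and not the paper's clustering algorithm: it takes the median of the matched point coordinates as a centre, drops points much farther away than the median distance, and returns the bounding box of the rest. The function name, the factor `k`, and the point coordinates are hypothetical.

```python
import math
from statistics import median


def localize(points, k=3.0):
    """Bounding box (x_min, y_min, x_max, y_max) of matched points near the median centre."""
    center_x = median(p[0] for p in points)
    center_y = median(p[1] for p in points)
    distances = [math.hypot(x - center_x, y - center_y) for x, y in points]
    # Points farther than k times the median distance are treated as spurious matches.
    limit = k * max(median(distances), 1e-9)
    inliers = [p for p, d in zip(points, distances) if d <= limit]
    xs = [p[0] for p in inliers]
    ys = [p[1] for p in inliers]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical matched keypoints: five on a logo, one spurious match elsewhere.
matches = [(100, 50), (104, 52), (98, 49), (102, 55), (101, 51), (400, 300)]
box = localize(matches)
print(box)
```

The spurious match at (400, 300) is excluded, so the box stays tight around the logo instead of spanning most of the frame.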
