Association for the Advancement of Artificial Intelligence: AAAI Publications
OpenVIS: Open-vocabulary Video Instance Segmentation
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with an open-world mask proposal network, which a contrastive instance margin loss encourages to propose class-agnostic masks for all potential instances. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame’s instance tracking token. The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art performance on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
ProPose: Probabilistic 3D Human Pose Estimation with Instance-Level Distribution and Normalizing Flow
3D Human Pose Estimation (HPE) is a one-to-many problem by nature, making it challenging to estimate an accurate 3D pose from a single 2D pose. Some prior works have attempted to tackle this problem with a conditional generative network. They generate 3D poses from a given 2D pose with noise sampled from a standard Gaussian distribution, whereas the depth distribution depends on each posture and is more complex than the standard Gaussian distribution. This mismatch may lead to inaccurate distribution learning. In this paper, we propose a probabilistic framework called ProPose to address this issue. ProPose employs a Pose Instance-Level Gaussian Distribution (PILGD), derived from 3D pose-based self-representation learning, to obtain a reliable distribution that captures the pose-dependent depth distribution. To access this PILGD, we utilize a normalizing flow, which learns a mapping function between the PILGD and a 2D Pose-Adaptive Gaussian Distribution (PAGD). This converts the problem of directly estimating 3D poses from 2D poses into a mapping problem between PILGD and PAGD using a normalizing flow. Extensive experiments show the advantages of utilizing the PILGD and PAGD. ProPose achieves performance comparable to previous state-of-the-art probabilistic methods in a multi-hypothesis setting. Notably, ProPose in a single-hypothesis setting demonstrates generalization ability comparable to existing state-of-the-art deterministic methods.
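The idea of mapping one Gaussian onto another with an invertible function can be illustrated with a deliberately tiny sketch. It is not the ProPose architecture: it uses a single 1-D affine layer, and the names `AffineFlow` and `gaussian_logpdf` as well as the example means and standard deviations are hypothetical. It only shows the change-of-variables rule that normalizing flows rely on, log p_Y(y) = log p_X(f^{-1}(y)) - log|df/dx|.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


class AffineFlow:
    """Invertible map y = scale * x + shift that carries N(src) onto N(dst).

    Hypothetical illustration only: a real flow stacks many learned layers.
    """

    def __init__(self, src_mean, src_std, dst_mean, dst_std):
        self.src_mean = src_mean
        self.src_std = src_std
        self.scale = dst_std / src_std
        self.shift = dst_mean - self.scale * src_mean

    def forward(self, x):
        return self.scale * x + self.shift

    def inverse(self, y):
        return (y - self.shift) / self.scale

    def log_abs_det(self):
        # Log of |dy/dx|, constant for an affine map.
        return math.log(abs(self.scale))

    def log_prob(self, y):
        # Change of variables: density of y under the pushed-forward source.
        x = self.inverse(y)
        return gaussian_logpdf(x, self.src_mean, self.src_std) - self.log_abs_det()
```

For example, `AffineFlow(0.0, 1.0, 2.0, 0.5).log_prob(y)` agrees with `gaussian_logpdf(y, 2.0, 0.5)`, because an affine map of a Gaussian is again a Gaussian.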
Boosting Segment Anything Model Towards Open-Vocabulary Learning
The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics.
In this paper, we present Sambor to seamlessly integrate SAM with an open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs such as category names or reference expressions.
Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and inject comprehensive semantic information for recognition.
In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects.
Consequently, Sambor enables the open-vocabulary detector to equally focus on generalizing both localization and classification sub-tasks.
Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods.
We hope this work serves as a meaningful step toward enabling SAM to recognize diverse object categories and toward advancing open-vocabulary learning with the support of vision foundation models.
Prompt Tuning In a Compact Attribute Space
Prompt tuning (PT) has emerged as a key to unlocking the power of visual-language models like CLIP for various downstream tasks. Predominant approaches learn a small set of task-relevant soft prompts by solving an image-class matching problem. Nevertheless, by optimizing merely with respect to class names, they struggle to learn high-performing prompts that capture the fine-grained, diverse characteristics of each class, and they tend to overfit the potentially biased distribution of base classes. In this work, we propose PTinCAS to tackle prompt tuning in a compact attribute space, driven by the premise that attributes offer detailed class interpretations and can facilitate transfer across related categories. Specifically, PTinCAS is grounded in two innovative designs. First, we create a compact attribute space by properly prompting large language models to generate factual descriptions about categories, which are subsequently clustered to form a concise attribute vocabulary. Second, we leverage attributes as a source of supervision in PT to transfer the inherent common-sense knowledge in attributes to soft prompts. An object-aware visual prompting mechanism is developed to highlight intended regions in the original image, which guides the model towards learning visual attributes associated with object regions rather than the background. We show that PTinCAS not only improves few-shot generalizability compared to existing PT methods, but also provides some level of inherent explainability that helps us understand why a class name is determined based on the attributes activated in an image.
Identity-Text Video Corpus Grounding
Video corpus grounding (VCG), which aims to retrieve relevant video moments from a video corpus, has attracted significant attention in the multimedia research community. However, the existing VCG setting primarily focuses on matching textual descriptions with videos and ignores the distinct visual identities in the videos, resulting in inaccurate understanding of video content and degraded retrieval performance.
To address this limitation, we introduce a novel task, Identity-Text Video Corpus Grounding (ITVCG), which simultaneously utilizes textual descriptions and visual identities as queries. As such, ITVCG enables more accurate video corpus grounding with visual identities and gives users more flexible options to locate relevant frames, based either on textual descriptions alone or on textual descriptions together with visual identities.
To evaluate the novel ITVCG task, we propose the TVR-IT dataset, comprising 463 identity images from 6 TV shows, with 68,840 out of 72,840 queries containing at least one identity image. Furthermore, we propose Video-Locator, the first model designed for the ITVCG task. Video-Locator integrates video-identity-text alignment and multi-modal fine-grained fusion components, enabling a video large language model (Video LLM) to jointly understand textual descriptions, visual identities, and videos. Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG.
CLIP-RestoreX: Restore Image Structure and Perception in Exposure Correction
Exposure correction aims to adjust the exposure of under- or over-exposed images to enhance their overall visual quality. The core challenge of this task lies in faithfully restoring both structural and perceptual information. In this work, we present a novel exposure correction method, referred to as CLIP-RestoreX, that leverages structural and perceptual priors from CLIP to tackle exposure correction. Specifically, CLIP-RestoreX performs exposure correction by aligning the CLIP-based structural and perceptual features of the impaired image with those of its ground-truth image. To better restore the damaged structural and perceptual information, we further design a frequency-domain feature enhancement diffusion model, in which we utilize the global nature of the Fourier transform to help reveal potential relationships within the features. We conduct extensive experiments on several benchmark datasets. The results demonstrate that the proposed CLIP-RestoreX outperforms state-of-the-art exposure correction methods.
High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion
Despite recent progress, existing frame interpolation methods still struggle with processing extremely high-resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for high-resolution frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high-resolution output. However, contrary to prior work on cascaded diffusion models, which perform diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model to solve both frame interpolation (the base model’s task) and spatial upsampling, which saves training cost as well. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, on which HiFI significantly outperforms other baselines.
FlexiTex: Enhancing Texture Generation via Visual Guidance
Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models.
However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns.
To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture.
The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details.
To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining global semantic consistency.
Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
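To make the notion of a direction prompt derived from a camera pose concrete, here is a minimal sketch. It is an assumption-laden illustration, not the FlexiTex module: the function name `direction_prompt`, the angle thresholds, and the view phrases are all hypothetical, chosen in the spirit of view-dependent prompting commonly used with text-to-image diffusion models.

```python
def direction_prompt(base_prompt, azimuth_deg, elevation_deg=0.0):
    """Append a view phrase to a text prompt based on the camera pose.

    Hypothetical thresholds: elevation above 60 degrees counts as a top view;
    otherwise the azimuth selects front (within 45 degrees of 0),
    back (within 45 degrees of 180), or side.
    """
    if elevation_deg > 60.0:
        view = "top view"
    else:
        azimuth = azimuth_deg % 360.0  # normalize to [0, 360)
        if azimuth < 45.0 or azimuth >= 315.0:
            view = "front view"
        elif 135.0 <= azimuth < 225.0:
            view = "back view"
        else:
            view = "side view"
    return f"{base_prompt}, {view}"
```

Calling `direction_prompt("a wooden chair", 180)` would tag the prompt as a back view, so a renderer looking at the rear of the object is less likely to paint a second front face there, which is the failure usually called the Janus problem.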
Weighted Poisson-disk Resampling on Large-Scale Point Clouds
For large-scale point cloud processing, resampling plays an important role in controlling the number and density of points while preserving geometric consistency. However, current methods cannot balance these different requirements. With large-scale point clouds in particular, classical methods often suffer from decreased efficiency and accuracy. To address these issues, we propose a weighted Poisson-disk (WPD) resampling method that improves the usability and efficiency of such processing. We first design an initial Poisson resampling with a voxel-based estimation strategy, which estimates a more accurate Poisson-disk radius while maintaining high efficiency. Then, we design a weighted tangent smoothing step to further optimize the Voronoi diagram for each point. At the same time, sharp features are detected and kept in the optimized results, which retain the isotropic property. Finally, we obtain a resampled copy of the original point cloud with the specified point number, uniform density, and high-quality geometric consistency. Experiments show that our method significantly improves the performance of large-scale point cloud resampling for different applications and provides a highly practical solution.
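The basic Poisson-disk constraint that such a pipeline starts from can be sketched as follows. This is a plain, unweighted greedy subsampler with a fixed, user-supplied radius, and it is not the WPD method: it omits the voxel-based radius estimation, the weighted tangent smoothing, and the sharp-feature handling. The function name `poisson_disk_resample` and its parameters are hypothetical. A voxel hash grid with edge length equal to the radius keeps each neighbour query limited to the 27 surrounding voxels.

```python
import math
import random


def poisson_disk_resample(points, radius, seed=0):
    """Greedy Poisson-disk subsampling of 3-D points.

    A point is kept only if no previously kept point lies closer than
    `radius`. Points are visited in a seeded random order.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)

    grid = {}   # voxel index -> list of kept points in that voxel
    kept = []
    for idx in order:
        p = points[idx]
        vx, vy, vz = (math.floor(c / radius) for c in p)
        conflict = False
        # Any point closer than `radius` must sit in an adjacent voxel.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for q in grid.get((vx + dx, vy + dy, vz + dz), ()):
                        if math.dist(p, q) < radius:
                            conflict = True
                            break
                    if conflict:
                        break
                if conflict:
                    break
            if conflict:
                break
        if not conflict:
            grid.setdefault((vx, vy, vz), []).append(p)
            kept.append(p)
    return kept
```

In this sketch the output size is controlled only indirectly through `radius`; hitting a specified point number, as the abstract describes, would require estimating the radius from the target count, which is left out here.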
Optimizing Human Pose Estimation Through Focused Human and Joint Regions
Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, and surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining the model to focus solely on the regions centered on the target person’s body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.