Association for the Advancement of Artificial Intelligence: AAAI Publications

26155 research outputs found

OpenVIS: Open-vocabulary Video Instance Segmentation

Author: Guo Pinxue
Huang Hao
He Peiyang
Liu Xuefeng
Xiao Tianjun
Zhang Wenqiang
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with an open-world mask proposal network, which a contrastive instance margin loss encourages to propose class-agnostic masks for all potential instances. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by the training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame’s instance tracking token. The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
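As a rough illustration of how an instance token could support open-vocabulary classification, the sketch below matches a token against text embeddings of arbitrary category names by cosine similarity, in the general style of CLIP. The abstract does not describe InstCLIP's internals, so the function names (`cosine_similarity`, `classify_instance`) and the toy three-dimensional vectors are assumptions made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_instance(instance_token, text_embeddings):
    """Return the category whose text embedding is closest to the token.

    text_embeddings maps a category name to its embedding. Because the
    vocabulary is just a dictionary, categories can be added at inference
    time without retraining, which is the open-vocabulary property.
    """
    scores = {
        name: cosine_similarity(instance_token, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, not produced by any real model).
TEXT_EMBEDDINGS = {
    "dog": [1.0, 0.1, 0.0],
    "skateboard": [0.0, 1.0, 0.2],
    "kite": [0.1, 0.0, 1.0],
}

label, scores = classify_instance([0.1, 0.9, 0.3], TEXT_EMBEDDINGS)
```

In a real system the token and the text embeddings would come from trained image and text encoders; only the matching step is shown here.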

ProPose: Probabilistic 3D Human Pose Estimation with Instance-Level Distribution and Normalizing Flow

Author: Han Jumin
Kim Jun-Hee
Lee Seong-Whan
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

3D Human Pose Estimation (HPE) is a one-to-many problem by nature, making it challenging to estimate an accurate 3D pose from a single 2D pose. Some prior works have attempted to tackle this problem by using a conditional generative network. They generate 3D poses from a given 2D pose with noise sampled from a standard Gaussian distribution, whereas the depth distribution depends on each posture and is more complex than the standard Gaussian distribution. This may lead to inaccurate distribution learning. In this paper, we propose a probabilistic framework called ProPose to address this issue. ProPose employs a Pose Instance-Level Gaussian Distribution (PILGD), derived from 3D pose-based self-representation learning, to obtain a reliable distribution that captures the pose-dependent depth distribution. To access this PILGD, we utilize a normalizing flow, which learns a mapping function between the PILGD and a 2D Pose-Adaptive Gaussian Distribution (PAGD). This converts the problem of directly estimating 3D poses from 2D poses into a mapping problem between PILGD and PAGD using a normalizing flow. Extensive experiments show the advantages of utilizing the PILGD and PAGD. ProPose achieves performance comparable to previous state-of-the-art probabilistic methods in a multi-hypothesis setting. Notably, ProPose in a single-hypothesis setting demonstrates generalization ability comparable to existing state-of-the-art deterministic methods.
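To make the idea of a flow that maps one Gaussian to another concrete, here is a minimal one-dimensional sketch. It is not the ProPose model: the flow is a single closed-form affine map rather than a learned network, and the two Gaussians are arbitrary stand-ins for the PILGD and PAGD named in the abstract. All class and function names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


class AffineFlow:
    """Invertible map y = scale * x + shift, the simplest normalizing flow."""

    def __init__(self, scale, shift):
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, x):
        return self.scale * x + self.shift

    def inverse(self, y):
        return (y - self.shift) / self.scale

    def log_abs_det_jacobian(self):
        # The Jacobian of an affine map is the constant `scale`.
        return math.log(abs(self.scale))


def flow_between_gaussians(src_mean, src_std, dst_mean, dst_std):
    """Affine flow that pushes N(src_mean, src_std^2) onto N(dst_mean, dst_std^2)."""
    scale = dst_std / src_std
    shift = dst_mean - scale * src_mean
    return AffineFlow(scale, shift)


def pushed_logpdf(flow, y, src_mean, src_std):
    """Change of variables: log p_Y(y) = log p_X(f^-1(y)) - log|det J_f|."""
    x = flow.inverse(y)
    return gaussian_logpdf(x, src_mean, src_std) - flow.log_abs_det_jacobian()


# Source N(0, 1) mapped onto target N(2, 0.5^2); both are illustrative values.
flow = flow_between_gaussians(0.0, 1.0, 2.0, 0.5)
```

The density obtained by pushing the source through the flow agrees with the target density, which is the property a learned flow is trained to approximate when no closed form exists.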

Boosting Segment Anything Model Towards Open-Vocabulary Learning

Author: Han Xumeng
Wei Longhui
Yu Xuehui
Dou Zhiyang
He Xin
Wang Kuiran
Sun Yingfei
Han Zhenjun
Tian Qi
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in its inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with an open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs such as category names or reference expressions. Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and to inject comprehensive semantic information for recognition. In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects. Consequently, Sambor enables the open-vocabulary detector to focus equally on generalizing both the localization and classification sub-tasks. Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods. We hope this work serves as a meaningful step towards enabling SAM to recognize diverse object categories and towards advancing open-vocabulary learning with the support of vision foundation models.

Prompt Tuning In a Compact Attribute Space

Author: Hou Shiyu
Zhou Tianfei
Zhang Shuai
Yuan Ye
Wang Guoren
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Prompt tuning (PT) has emerged as a key to unlocking the power of visual-language models like CLIP for various downstream tasks. Predominant approaches learn a small set of task-relevant soft prompts by solving an image-class matching problem. Nevertheless, by optimizing merely with respect to class names, they face challenges in learning high-performing prompts capable of capturing the fine-grained, diverse characteristics of each class, and tend to overfit the potentially biased distribution of base classes. In this work, we propose PTinCAS to tackle prompt tuning in a compact attribute space, driven by the premise that attributes offer detailed class interpretations and can facilitate transfer across related categories. In particular, PTinCAS is grounded in two innovative designs. First, we create a compact attribute space by properly prompting large language models to generate factual descriptions about categories, which are subsequently clustered to form a concise attribute vocabulary. Second, we leverage attributes as a source of supervision in PT to transfer the inherent commonsense knowledge in attributes to soft prompts. An object-aware visual prompting mechanism is developed to highlight intended regions in the original image, which guides the model towards learning visual attributes associated with object regions rather than the background. We show that PTinCAS not only improves few-shot generalizability compared to existing PT methods, but also provides some level of inherent explainability that helps us understand why a class name is determined based on the attributes activated in an image.

Identity-Text Video Corpus Grounding

Author: Huang Bin
Wang Xin
Chen Hong
Chen Houlun
Wu Yaofei
Zhu Wenwu
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Video corpus grounding (VCG), which aims to retrieve relevant video moments from a video corpus, has attracted significant attention in the multimedia research community. However, the existing VCG setting primarily focuses on matching textual descriptions with videos and ignores the distinct visual identities in the videos, resulting in an inaccurate understanding of video content and degraded retrieval performance. To address this limitation, we introduce a novel task, Identity-Text Video Corpus Grounding (ITVCG), which simultaneously utilizes textual descriptions and visual identities as queries. As such, ITVCG enables more accurate video corpus grounding with visual identities, and gives users more flexible options to locate relevant frames based on either textual descriptions alone or textual descriptions together with visual identities. To conduct evaluations on the novel ITVCG task, we propose the TVR-IT dataset, comprising 463 identity images from 6 TV shows, with 68,840 out of 72,840 queries containing at least one identity image. Furthermore, we propose Video-Locator, the first model designed for the ITVCG task. Our proposed Video-Locator integrates video-identity-text alignment and multi-modal fine-grained fusion components, enabling a video large language model (Video LLM) to jointly understand textual descriptions, visual identities, and videos. Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG.

CLIP-RestoreX: Restore Image Structure and Perception in Exposure Correction

Author: Huang Xiang
Zhang Qing
Hu Jian-Fang
Zheng Wei-Shi
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Exposure correction aims to adjust the exposure of an under- or over-exposed image to enhance its overall visual quality. The core challenge of this task lies in faithfully restoring both structural and perceptual information. In this work, we present a novel exposure correction method, referred to as CLIP-RestoreX, that leverages structural and perceptual priors from CLIP to tackle exposure correction. Specifically, CLIP-RestoreX performs exposure correction by aligning the CLIP-based structural and perceptual features of the impaired image with those of its ground-truth image. To better restore the damaged structural and perceptual information, we further design a frequency-domain based feature enhancement diffusion model, where we utilize the globality of the Fourier transform to help reveal the potential relationships within the features. We conduct extensive experiments on several benchmark datasets. The results demonstrate that the proposed CLIP-RestoreX outperforms state-of-the-art exposure correction methods.

High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Author: Hur Junhwa
Herrmann Charles
Saxena Saurabh
Kontkanen Janne
Lai Wei-Sheng
Shih Yichang
Rubinstein Michael
Fleet David J.
Sun Deqing
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Despite recent progress, existing frame interpolation methods still struggle with processing extremely high resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patch-based cascaded pixel diffusion model for high resolution frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high resolution output. However, contrary to prior work on cascaded diffusion models which perform diffusion on increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model to solve both frame interpolation (the base model’s task) and spatial up-sampling, which also saves training cost. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, and HiFI significantly outperforms other baselines.
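The memory argument for patch-based processing can be sketched as follows: a large image is split into fixed-size tiles, each tile is passed through a model on its own, and the results are stitched back together, so the model only ever sees inputs of bounded size. This is a simplified illustration, not the HiFI pipeline. The abstract does not say how patches are laid out or blended, so non-overlapping tiles are an assumption, and `fn` is a hypothetical stand-in for the model.

```python
def split_into_patches(image, patch):
    """Yield (row, col, tile) for non-overlapping tiles of a 2-D list.

    Tiles are at most patch x patch; tiles on the right and bottom edges
    are smaller when the image size is not a multiple of `patch`.
    """
    if patch <= 0:
        raise ValueError("patch size must be positive")
    height, width = len(image), len(image[0])
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            yield r, c, [row[c:c + patch] for row in image[r:r + patch]]


def process_by_patches(image, patch, fn):
    """Apply `fn` to each tile independently and stitch the outputs.

    `fn` must return a tile of the same shape as its input, so peak
    working-set size depends on `patch` and not on the full image size.
    """
    height, width = len(image), len(image[0])
    out = [[None] * width for _ in range(height)]
    for r, c, tile in split_into_patches(image, patch):
        result = fn(tile)
        for i, row in enumerate(result):
            out[r + i][c:c + len(row)] = row
    return out


def brighten(tile):
    """Toy per-tile operation standing in for a model call."""
    return [[value + 1 for value in row] for row in tile]


# A 5 x 7 toy image processed in tiles of at most 3 x 3.
image = [[r * 7 + c for c in range(7)] for r in range(5)]
result = process_by_patches(image, 3, brighten)
```

A purely local operation such as `brighten` gives the same result tiled or untiled. A real model would need context from outside each tile, which is where a coarse prior solution from an earlier cascade stage would come in.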

FlexiTex: Enhancing Texture Generation via Visual Guidance

Author: Jiang Dadong
Yang Xianghui
Zhao Zibo
Zhang Sheng
Yu Jiaao
Lai Zeqiang
Yang Shaoxiong
Guo Chunchao
Zhou Xiaobo
Ke Zhihui
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which causes texture generation methods to produce blurry or inconsistent patterns. To tackle this, we present FlexiTex, which embeds rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining global semantic consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.

Weighted Poisson-disk Resampling on Large-Scale Point Clouds

Author: Jiao Xianhe
Lv Chenlei
Zhao Junli
Yi Ran
Wen Yu-Hui
Pan Zhenkuan
Wu Zhongke
Liu Yong-Jin
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

For large-scale point cloud processing, resampling plays an important role in controlling the point number and density while preserving geometric consistency. However, current methods cannot balance these different requirements. Particularly with large-scale point clouds, classical methods often struggle with decreased efficiency and accuracy. To address such issues, we propose a weighted Poisson-disk (WPD) resampling method to improve the usability and efficiency of the processing. We first design an initial Poisson resampling with a voxel-based estimation strategy. It is able to estimate a more accurate radius of the Poisson-disk while maintaining high efficiency. Then, we design a weighted tangent smoothing step to further optimize the Voronoi diagram for each point. At the same time, sharp features are detected and kept in the optimized results with isotropic property. Finally, we obtain a resampled copy of the original point cloud with the specified point number, uniform density, and high-quality geometric consistency. Experiments show that our method significantly improves the performance of large-scale point cloud resampling for different applications, and provides a highly practical solution.
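For readers unfamiliar with Poisson-disk resampling, the sketch below shows the plain, unweighted version on a 3-D point cloud: points are visited in order and kept only if no already-kept point lies within a given radius, with a voxel hash grid making each neighbor check cheap. It does not implement the weighted method described above. The radius estimation, tangent smoothing, and sharp-feature steps are omitted, and the function names and the fixed radius are assumptions for illustration.

```python
import itertools
import math
import random

# Offsets of the 27 voxels (a cell and its neighbors) that can contain a
# point closer than one cell width.
NEIGHBOR_OFFSETS = list(itertools.product((-1, 0, 1), repeat=3))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def has_close_neighbor(point, key, grid, radius_sq):
    """True if a kept point lies strictly within the radius of `point`."""
    for dx, dy, dz in NEIGHBOR_OFFSETS:
        cell = (key[0] + dx, key[1] + dy, key[2] + dz)
        for other in grid.get(cell, ()):
            if squared_distance(point, other) < radius_sq:
                return True
    return False


def poisson_disk_subsample(points, radius):
    """Greedy Poisson-disk subsampling of 3-D points.

    The voxel size equals the radius, so any point closer than `radius`
    must fall in one of the 27 neighboring voxels.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    radius_sq = radius * radius
    grid = {}  # voxel index -> kept points in that voxel
    kept = []
    for point in points:
        key = tuple(math.floor(coord / radius) for coord in point)
        if not has_close_neighbor(point, key, grid, radius_sq):
            kept.append(point)
            grid.setdefault(key, []).append(point)
    return kept


# Toy cloud: 2000 uniform random points in the unit cube.
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(2000)]
sample = poisson_disk_subsample(cloud, 0.1)
```

The result satisfies the two defining properties of a Poisson-disk sample over the input: no two kept points are closer than the radius, and every input point is within the radius of some kept point. Hitting a specified output point count, as the abstract describes, would additionally require choosing the radius to match that count.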

Optimizing Human Pose Estimation Through Focused Human and Joint Regions

Author: Jiao Yingying
Wang Zhigang
Liu Zhenguang
Fan Shaojing
Wu Sifan
Wu Zheqi
Xu Zhuoyue
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 11/04/2025

Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, and surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining the model to focus solely on the regions centered at the target person's body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
