Association for the Advancement of Artificial Intelligence: AAAI Publications
GaitCycFormer: Leveraging Gait Cycles and Transformers for Gait Emotion Recognition
Gait Emotion Recognition (GER) is an emerging task within Human Emotion Recognition. Skeleton-based GER requires discriminative spatial and temporal features. However, current methods primarily focus on capturing spatial topology information but fail to effectively learn temporal features from long-distance frames. Moreover, these methods are mostly sensitive to the order of sampled sequences, resulting in significant accuracy drops when sequences are randomly sampled. To obtain a more robust and comprehensive spatial-temporal representation of gait, we introduce the Graph-Transformer architecture into GER for the first time, proposing a novel framework named GaitCycFormer. Specifically, we design a Cycle Position Encoding (CPE) based on the gait cycle, which explicitly segments any gait sequence into more manageable periodic units, to enhance temporal feature modeling. Additionally, we incorporate a bi-level Transformer, consisting of an Intra-cycle Transformer and an Inter-cycle Transformer, to capture local temporal information within each gait cycle and global temporal information between gait cycles, respectively. Experiments demonstrate that GaitCycFormer achieves state-of-the-art performance on popular datasets and proves to be more reliable and robust.
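The abstract does not specify how the Cycle Position Encoding is computed, so the following is only a minimal sketch of the underlying idea of indexing frames by gait cycle. It assumes the cycle start frames are already known (how they are detected is not stated in the abstract), and the function name `cycle_positions` and its arguments are hypothetical. Each frame receives a cycle index and a normalized phase within its cycle, which a position encoding could then embed.

```python
import numpy as np

def cycle_positions(num_frames, cycle_starts):
    """Assign each frame a (cycle index, phase in [0, 1)) pair.

    Assumption: cycle_starts is strictly increasing, begins at frame 0,
    and comes from some external gait-cycle detector.
    """
    starts = np.asarray(cycle_starts)
    frames = np.arange(num_frames)
    # Index of the last cycle start that is <= the frame number.
    cycle_id = np.searchsorted(starts, frames, side="right") - 1
    # Each cycle ends where the next one starts; the last ends at num_frames.
    ends = np.append(starts[1:], num_frames)
    lengths = ends - starts
    # Fraction of the cycle already elapsed at this frame.
    phase = (frames - starts[cycle_id]) / lengths[cycle_id]
    return cycle_id, phase

cycle_id, phase = cycle_positions(6, [0, 4])
print(cycle_id)  # [0 0 0 0 1 1]
print(phase)
```

Using a phase rather than a raw frame offset makes cycles of different lengths comparable, which is one plausible reason to encode positions per cycle.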
DiMSOD: A Diffusion-Based Framework for Multi-Modal Salient Object Detection
Multi-modal salient object detection (SOD) through the integration of additional data such as depth or thermal information has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (Depth), and RGB-T (Thermal) images are tackled separately. However, without intricate cross-modal fusion strategies, such approaches struggle to effectively integrate multi-modal information, often resulting in poorly defined object edges or overconfident, inaccurate predictions.
Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task utilizing diffusion models.
We introduce DiMSOD, which enables the concurrent use of local controls (depth maps, thermal maps) and global controls (original images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it only requires fine-tuning our newly introduced modules on top of the existing Stable Diffusion, which not only reduces the fine-tuning cost, making it more viable for practical use, but also enhances the integration of multi-modal conditional controls. Specifically, we have developed modules including SOD-ControlNet, Feature Adaptive Network (FAN), and Feature Injection Attention Network (FIAN) to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous well-established methods.
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using the superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Please see the videos in the supplementary materials for a better view.
Partial Point Cloud Registration with Multi-view 2D Image Learning
Learning representations from large amounts of 2D image data has shown promising performance, yet very few works apply these representations to point cloud registration. In this paper, we explore how to leverage 2D information to assist point cloud registration, and propose IAPReg, an Image-Assisted Partial 3D point cloud Registration framework that uses multi-view images generated from the input point cloud. The aim is to enrich 3D information with 2D knowledge and to leverage that 2D knowledge to assist with point cloud registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using well-established models. To fuse the information learned from the 2D and 3D modalities, an inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method to reduce the impact of inaccurate correspondences on registration. However, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, where the 2D knowledge is employed to provide weight information for the correspondences. Extensive experiments show that our method outperforms the state-of-the-art method without additional 2D training data.
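For readers unfamiliar with the weighted SVD step mentioned in this abstract, here is a minimal sketch of the standard weighted rigid alignment (the weighted Kabsch procedure) in NumPy. It is not IAPReg's 2D-weighted variant: the weights here are simply an input array, whereas the paper derives them from 2D knowledge in a way the abstract does not detail. The function name `weighted_svd_registration` is an assumption for illustration.

```python
import numpy as np

def weighted_svd_registration(src, dst, weights):
    """Estimate R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding points.
    weights:  (N,) non-negative correspondence weights (assumed given).
    """
    w = weights / weights.sum()
    # Weighted centroids of both point sets.
    src_centroid = (w[:, None] * src).sum(axis=0)
    dst_centroid = (w[:, None] * dst).sum(axis=0)
    src_c = src - src_centroid
    dst_c = dst - dst_centroid
    # Weighted cross-covariance matrix: sum_i w_i * src_i * dst_i^T.
    H = (w[:, None] * src_c).T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t

# Usage: recover a known rotation about the z-axis and a translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true
R_est, t_est = weighted_svd_registration(src, dst, rng.uniform(0.1, 1.0, 20))
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

Low weights shrink a correspondence's contribution to both the centroids and the cross-covariance, which is why good weights reduce the influence of inaccurate matches.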
NightReID: A Large-Scale Nighttime Person Re-Identification Benchmark
Person re-identification (Re-ID) is crucial for intelligent surveillance systems, facilitating the identification of individuals across multiple camera views. While significant advancements have been made for daytime scenarios, ensuring reliable Re-ID performance during nighttime remains a significant challenge. Given the cost and limited accessibility of infrared cameras, we investigate a critical question: Can RGB cameras be effectively utilized for accurate Re-ID during nighttime? To address this, we introduce NightReID, a large-scale RGB Re-ID dataset collected from a real-world nighttime surveillance system. NightReID includes 1,500 identities and over 53,000 images, capturing diverse scenes with complex lighting and adverse weather conditions. This rich dataset provides a valuable benchmark for advancing nighttime Re-ID research. Moreover, we propose the Enhancement, Denoising, and Alignment (EDA) framework with two novel modules to enhance nighttime Re-ID performance. First, an unsupervised Image Enhancement and Denoising (IED) method is designed to improve the quality of nighttime images, preserving critical details while removing noise without requiring paired ground truth. Second, we introduce Data Distribution Alignment (DDA) through statistical priors, aligning the distributions between pre-training data and nighttime data to mitigate domain shift. Extensive experiments on multiple nighttime Re-ID datasets demonstrate the significance of NightReID and validate the efficacy, flexibility, and applicability of the EDA framework.
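The abstract does not say which statistics DDA aligns or how. As a loosely related illustration only, the sketch below matches per-channel mean and standard deviation of a batch of images to target statistics, which is one simple way to reduce a distribution gap between two image domains. The function name `align_channel_statistics`, the target values, and the choice of first- and second-order statistics are all assumptions, not the paper's method.

```python
import numpy as np

def align_channel_statistics(images, target_mean, target_std, eps=1e-6):
    """Shift and scale each channel so the batch matches target statistics.

    images: (N, H, W, C) float array (for example, nighttime images).
    target_mean, target_std: (C,) statistics of the reference domain
    (hypothetically, those of the pre-training data).
    """
    # Per-channel statistics over the whole batch and all pixels.
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    # Standardize, then re-scale to the target distribution's moments.
    return (images - mean) / (std + eps) * target_std + target_mean

rng = np.random.default_rng(0)
night = rng.uniform(0.0, 0.3, size=(4, 8, 8, 3))   # dark synthetic images
target_mean = np.array([0.485, 0.456, 0.406])       # illustrative values
target_std = np.array([0.229, 0.224, 0.225])
aligned = align_channel_statistics(night, target_mean, target_std)
print(aligned.mean(axis=(0, 1, 2)))
```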
PointCFormer: A Relation-Based Progressive Feature Extraction Network for Point Cloud Completion
Point cloud completion aims to reconstruct the complete 3D shape from incomplete point clouds, and it is crucial for tasks such as 3D object detection and segmentation. Despite the continuous advances in point cloud analysis techniques, feature extraction methods are still confronted with apparent limitations. The sparse sampling of point clouds, used as inputs in most methods, often results in a certain loss of global structure information. Meanwhile, traditional local feature extraction methods usually struggle to capture intricate geometric details. To overcome these drawbacks, we introduce PointCFormer, a transformer framework optimized for robust global retention and precise local detail capture in point cloud completion. This framework offers several key advantages. First, we propose a relation-based local feature extraction method to perceive delicate local geometric characteristics. This approach establishes a fine-grained relationship metric between the target point and its k-nearest neighbors, quantifying each neighboring point's contribution to the target point's local features. Second, we introduce a progressive feature extractor that integrates our local feature perception method with self-attention. Starting with a denser sampling of points as input, it iteratively queries long-distance global dependencies and local neighborhood relationships. This extractor maintains enhanced global structure and refined local details without generating substantial computational overhead. Additionally, we develop a correction module after generating point proxies in the latent space to reintroduce denser information from the input points, enhancing the representation capability of the point proxies. PointCFormer demonstrates state-of-the-art performance on several widely used benchmarks.
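To make the idea of weighting each of a point's k-nearest neighbors concrete, here is a minimal sketch using a brute-force neighbor search and a softmax over negative distances. The relationship metric in PointCFormer is learned and is not described in the abstract, so the distance-based softmax below is a stand-in assumption, and the function name `knn_relation_weights` is hypothetical.

```python
import numpy as np

def knn_relation_weights(points, k):
    """Return k-nearest-neighbor indices and a weight per neighbor.

    points: (N, 3) array. Weights are a softmax over negative Euclidean
    distances (an assumed stand-in for a learned relation metric), so
    closer neighbors contribute more and each row sums to 1.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (N, N) pairwise distances
    np.fill_diagonal(dist, np.inf)                # exclude the point itself
    idx = np.argsort(dist, axis=1)[:, :k]         # nearest first
    nd = np.take_along_axis(dist, idx, axis=1)    # (N, k) neighbor distances
    logits = -nd
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return idx, w

pts = np.array([[0.0, 0, 0], [1.0, 0, 0], [3.0, 0, 0], [7.0, 0, 0]])
idx, w = knn_relation_weights(pts, k=2)
print(idx[0], w[0])  # neighbors of the first point, nearest first
```

The brute-force distance matrix costs O(N^2) memory, so a spatial index would be the usual choice for large clouds.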
GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expressions
Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework, GoHD, designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules. Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.
A Lottery Ticket Hypothesis Approach with Sparse Fine-tuning and MAE for Image Forgery Detection and Localization
The rise in sophisticated image forgery techniques, driven by advancements in image editing and generation, has posed new security challenges. Traditional methods, designed for specific tampering artifacts, struggle with out-of-distribution image forgery detection. In this paper, we propose a shift in paradigm, placing greater emphasis on the universal characteristics of authentic images, as opposed to solely focusing on specific forgery signals. We introduce an enhancement to the Masked Autoencoder (MAE), aptly termed the Forgery MAE (FMAE). This modification retains the inherent characteristics of natural images while integrating multi-source forgery information. Our implementation involves applying the lottery ticket hypothesis during pre-training to identify forgery-sensitive parameters, followed by their sparse fine-tuning to target the forgery detection and localization task. Concurrently, we develop a "mixture of experts" noise extractor to compile multi-source forgery data. Our FMAE effectively extracts forgery features and shows strong resilience against unseen forgeries. Extensive experiments across multiple datasets confirm our method's superior accuracy and generalization capability over existing techniques.
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Video inpainting is a crucial task with diverse applications, including fine-grained video editing, video recovery, and video dewatermarking. However, most existing video inpainting methods primarily focus on visual content completion while neglecting text information. There are only a limited number of text-guided video inpainting techniques, and these techniques struggle with maintaining visual quality and exhibit poor semantic representation capabilities. In this paper, we introduce CoCoCo, a text-guided video inpainting diffusion framework. To address the aforementioned challenges, we enhance both the training data and model structure. Specifically, we devise an instance-aware region selection strategy for masked area sampling and develop a novel motion block that incorporates efficient 3D full attention and textual cross attention. Additionally, our CoCoCo framework can be seamlessly integrated with various personalized text-to-image diffusion models through a delicate training-free transfer mechanism. Comprehensive experiments demonstrate that CoCoCo can create high-quality visual content with enhanced temporal consistency, improved text controllability, and better compatibility with personalized image models.
Optimal Classification Trees for Continuous Feature Data Using Dynamic Programming with Branch-and-Bound
Computing an optimal classification tree that provably maximizes training performance within a given size limit is NP-hard, and in practice, most state-of-the-art methods do not scale beyond computing optimal trees of depth three. Therefore, most methods rely on a coarse binarization of continuous features to maintain scalability. We propose a novel algorithm that optimizes trees directly on the continuous feature data using dynamic programming with branch-and-bound. We develop new pruning techniques that eliminate many sub-optimal splits in the search when they are similar to previously computed splits, and we provide an efficient subroutine for computing optimal depth-two trees. Our experiments demonstrate that these techniques improve runtime by one or more orders of magnitude over state-of-the-art optimal methods and improve test accuracy by 5% over greedy heuristics.
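To illustrate what optimizing directly on continuous feature data means at the smallest scale, the sketch below finds the optimal single threshold split (a depth-one tree) on one continuous feature by scanning the sorted values. This is a brute-force baseline for illustration, not the paper's dynamic programming with branch-and-bound or its depth-two subroutine. It assumes binary labels in {0, 1}, and the function name `best_threshold_split` is hypothetical.

```python
import numpy as np

def best_threshold_split(x, y):
    """Return (threshold, misclassifications) of the best single split.

    x: (N,) continuous feature values; y: (N,) labels in {0, 1}.
    Each side of the split predicts its majority label. A threshold of
    None means no split beats predicting the overall majority.
    """
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    total_pos = int(ys.sum())
    best_err = min(total_pos, n - total_pos)   # error without any split
    best_thr = None
    left_pos = 0
    for i in range(1, n):
        left_pos += int(ys[i - 1])
        # Only gaps between distinct values yield a different partition.
        if xs[i] == xs[i - 1]:
            continue
        right_pos = total_pos - left_pos
        err = min(left_pos, i - left_pos) + min(right_pos, (n - i) - right_pos)
        if err < best_err:
            best_err = err
            best_thr = (xs[i - 1] + xs[i]) / 2  # midpoint between neighbors
    return best_thr, best_err

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold_split(x, y))  # threshold 3.5 with 0 errors
```

After sorting, the scan is linear because class counts are updated incrementally; the number of candidate thresholds still grows with the number of distinct values, which is the scalability pressure that motivates binarization in other methods.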