Paper title Authors Description
PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem
KAUST
Filippo Aleotti
Niantic Spatial
Jamie Watson
Niantic Spatial
Zawar Qureshi
Niantic Spatial
Abdelrahman Eldesokey
KAUST
Peter Wonka
KAUST
Gabriel Brostow
UCL
Sara Vicente
Niantic Spatial
Guillermo Garcia-Hernando
Niantic Spatial
Abstract
We introduce the task of Language-Guided Object Placement in Real 3D Scenes. Given a 3D reconstructed point-cloud scene, a 3D asset, and a natural-language instruction, the goal is to place the asset so that the instruction is satisfied. The task demands tackling four intertwined challenges: (a) one-to-many ambiguity in valid placements; (b) precise geometric and physical reasoning; (c) joint understanding across the scene, the asset, and language; and (d) robustness to noisy point clouds with no privileged metadata at test time. The first three challenges mirror the complexities of synthetic scene generation, while the metadata-free, noisy-scan scenario is inherited from language-guided 3D visual grounding. We inaugurate this task by introducing a benchmark and evaluation protocol, releasing a dataset for training multi-modal large language models (MLLMs), and establishing a first nontrivial baseline. We believe this challenging setup and benchmark will provide a foundation for evaluating and advancing MLLMs in 3D understanding.
NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals
Jiro Abe
Visual Intelligence Research Laboratories, NEC Corporation
Gaku Nakano
Visual Intelligence Research Laboratories, NEC Corporation
Kazumine Ogura
Visual Intelligence Research Laboratories, NEC Corporation
Abstract
We propose NormalLoc, a novel visual localization method for estimating the 6-DoF pose of a camera using textureless 3D models. Existing methods often rely on color or texture information, limiting their applicability in scenarios where such information is unavailable. NormalLoc addresses this limitation by using rendered normal images, generated from the surface normals of 3D models, to establish a training scheme for both global descriptor computation and matching. This approach enables robust visual localization even when geometric detail is limited. Experimental results demonstrate that NormalLoc achieves state-of-the-art performance for visual localization on textureless 3D models, particularly in such low-detail scenarios.
UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents
Harsh Agrawal
Apple
Eldon Schoop
Apple
Xinlei Pan
Apple
Anuj Mahajan
Apple
Ari Seff
Apple
Di Feng
Apple
Ruijia Cheng
Apple
Andres Romero Mier Y Teran
Apple
Esteban Gomez
Apple
Abhishek Sundararajan
Apple
Forrest Huang
Apple
Amanda Swearngin
Apple
Mohana Prasad Sathya Moorthy
Apple
Jeff Nichols
Apple
Alexander Toshev
Apple
Abstract
We build a comprehensive online evaluation benchmark for language-conditioned multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities, including multi-step planning, visual perception, action grounding, and the use of memory or external knowledge. We also highlight important factors, such as statefulness, safety, and evaluation complexity, that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.
UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
Zixiang Ai
Wangxuan Institute of Computer Technology, Peking University
Zhenyu Cui
Wangxuan Institute of Computer Technology, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Jiahuan Zhou
Wangxuan Institute of Computer Technology, Peking University
Abstract
Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), a common issue in real scenarios caused by casual object occlusions and imperfect data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between point cloud enhancement and downstream tasks, these methods fail to work across diverse real-world domains. In addition, the conflicting objectives of the denoising and completion tasks further limit the ability of such ensemble paradigms to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter that adapts to noisy points through predicted rectification vector prompts, effectively filtering noise while preserving the intricate geometric features essential for accurate analysis. Subsequently, we incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating robustness and adaptability. Finally, a Shape-Aware Unit module is employed to efficiently unify and capture the filtered geometric features for downstream point cloud analysis. Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at https://github.com/zhoujiahuan1991/ICCV2025-UPP.
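As a rough illustration of the rectification-prompt idea described above, the following PyTorch sketch predicts a per-point offset ("rectification vector") with a small MLP and adds it to the noisy coordinates. Module names, layer sizes, and the plain MLP encoder are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a point-level "rectification prompt"
# predicted by a small shared MLP and added to noisy input coordinates.
import torch
import torch.nn as nn

class RectificationPrompter(nn.Module):
    """Predicts a per-point rectification vector from per-point features."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.offset_head = nn.Linear(feat_dim, 3)  # rectification vector per point

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) noisy coordinates
        offsets = self.offset_head(self.encoder(points))   # (B, N, 3)
        return points + offsets                             # rectified points

if __name__ == "__main__":
    noisy = torch.randn(2, 1024, 3)            # two noisy point clouds
    rectified = RectificationPrompter()(noisy)
    print(rectified.shape)                      # torch.Size([2, 1024, 3])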
Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
Hiroyasu Akada
Max Planck Institute for Informatics, SIC
Jian Wang
Max Planck Institute for Informatics, SIC
Vladislav Golyanik
Max Planck Institute for Informatics, SIC
Christian Theobalt
Max Planck Institute for Informatics, SIC
Abstract
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and only option for some tasks, such as hand tracking, it remains unclear whether the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, for example when HMD users tilt their heads upward, a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods, due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. We also introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to frontal-only placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE).
Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Giacomo D'Amicantonio
Eindhoven University of Technology
Snehashis Majhi
INRIA
Quan Kong
Woven by Toyota
Lorenzo Garattoni
Woven by Toyota
Gianpiero Francesca
Woven by Toyota
François Brémond
INRIA
Egor Bondarev
Eindhoven University of Technology
Abstract
Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
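To make the temporal-Gaussian idea concrete, the sketch below builds soft frame-level targets by splatting 1D Gaussians centred on the highest-scoring frames. The top-k selection, the fixed width, and the max-combination are illustrative assumptions rather than the paper's exact loss.

# Illustrative sketch of 1D "temporal Gaussian splatting": frame-level pseudo
# targets are built by splatting Gaussians centred on the top-scoring frames.
import torch

def temporal_gaussian_targets(scores: torch.Tensor, k: int = 3, sigma: float = 4.0):
    # scores: (T,) per-frame anomaly scores from a weakly supervised model
    T = scores.numel()
    centers = scores.topk(k).indices.float()             # (k,) most anomalous frames
    t = torch.arange(T, dtype=torch.float32)             # (T,)
    # Splat one Gaussian per centre and take the per-frame maximum.
    gaussians = torch.exp(-0.5 * ((t[None, :] - centers[:, None]) / sigma) ** 2)
    return gaussians.max(dim=0).values                   # (T,) soft targets in [0, 1]

if __name__ == "__main__":
    raw = torch.rand(64)
    targets = temporal_gaussian_targets(raw)
    print(targets.shape, targets.max().item())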
MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
Pei An
Huazhong University of Science and Technology
Jiaqi Yang
Northwestern Polytechnical University
Muyao Peng
Huazhong University of Science and Technology
You Yang
Huazhong University of Science and Technology
Qiong Liu
Huazhong University of Science and Technology
Xiaolin Wu
Southwest Jiaotong University
Liangliang Nan
Delft University of Technology
Abstract
Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. Recently, the differentiable perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing projective constraints on 2D-3D correspondences. However, differentiable PnP is highly sensitive to noise and outliers in the predicted correspondences, which hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP to noise and outliers in correspondences, we propose an approximate blind PnP-based correspondence learning approach. To mitigate the high computational cost of blind PnP, we reformulate it as a more tractable problem: minimizing the Chamfer distance between learned 2D and 3D keypoints, referred to as MinCD-PnP. To effectively solve MinCD-PnP, we introduce a lightweight multi-task learning module, MinCD-Net, which can be easily integrated into existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio and registration recall in both cross-scene and cross-dataset settings. The source code is available at https://github.com/anpei96/mincdpnp-demo.
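A minimal sketch of the MinCD objective as described above: project learned 3D keypoints with the camera intrinsics and measure a symmetric Chamfer distance to learned 2D keypoints. The tensor shapes, pinhole projection, and random inputs are assumptions for illustration only.

# Hedged sketch of a Chamfer-distance objective between projected 3D keypoints
# and 2D keypoints (not the authors' exact formulation).
import torch

def project(pts3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # pts3d: (N, 3) points in the camera frame; K: (3, 3) intrinsics
    uvw = pts3d @ K.T                       # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]         # (N, 2) pixel coordinates

def chamfer_2d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (N, 2), b: (M, 2); symmetric nearest-neighbour Chamfer distance
    d = torch.cdist(a, b)                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

if __name__ == "__main__":
    K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    kpts3d = torch.rand(128, 3) * torch.tensor([2., 2., 5.]) + torch.tensor([-1., -1., 1.])
    kpts2d = torch.rand(96, 2) * torch.tensor([640., 480.])
    loss = chamfer_2d(project(kpts3d, K), kpts2d)
    print(loss.item())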
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
Dimitrije Antić
University of Amsterdam
Georgios Paschalidis
University of Amsterdam
Shashank Tripathi
Max Planck Institute for Intelligent Systems
Theo Gevers
University of Amsterdam
Sai Kumar Dwivedi
Max Planck Institute for Intelligent Systems
Dimitrios Tzionas
University of Amsterdam
Abstract
Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle to generalize to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. This has three key innovations: First, it uses a learned category-specific and morphable signed-distance-function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search on the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild. Code is available at https://anticdimi.github.io/sdfit.
Zero-shot Inexact CAD Model Alignment from a Single Image
Pattaramanee Arsomngern
VISTEC
Sasikarn Khwanmuang
VISTEC
Matthias Nießner
Technical University of Munich
Supasorn Suwajanakorn
VISTEC
Abstract
One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no scene-level pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensures multi-view consistency and overcomes the symmetry ambiguities inherent in foundation features, using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.
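For intuition, a self-supervised triplet objective of the kind mentioned above could pull together features of the same surface point seen from two views and push apart features of a symmetric counterpart. The sampling scheme, embedding size, and margin below are illustrative assumptions.

# Minimal sketch of a symmetry-aware triplet objective on foundation features.
import torch
import torch.nn.functional as F

def symmetry_triplet_loss(anchor, positive, negative, margin: float = 0.2):
    # anchor/positive: features of the same 3D point from two views
    # negative: features of a symmetric counterpart point; all (B, D)
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

if __name__ == "__main__":
    a, p, n = (torch.randn(32, 384) for _ in range(3))
    print(symmetry_triplet_loss(a, p, n).item())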
Less is More: Improving Motion Diffusion Models with Sparse Keyframes
Jinseok Bae
Dept. of Electrical and Computer Engineering, Seoul National University
Inwoo Hwang
Dept. of Electrical and Computer Engineering, Seoul National University
Young-Yoon Lee
Roblox
Ziyu Guo
CSE, The Chinese University of Hong Kong
Joseph Liu
Roblox
Yizhak Ben-Shabat
Roblox
Young Min Kim
Dept. of Electrical and Computer Engineering, Seoul National University
Mubbasir Kapadia
Roblox
Abstract
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
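As a toy illustration of keyframe masking, the sketch below keeps a sparse set of keyframes and linearly interpolates the missing frames between them; the paper's interpolation and dynamic mask refinement are more involved, so the shapes and the linear blend here are assumptions.

# Illustrative sketch: reconstruct dense motion from sparse keyframes by
# linear interpolation between consecutive keyframes.
import torch

def interpolate_from_keyframes(motion: torch.Tensor, key_idx: torch.Tensor) -> torch.Tensor:
    # motion: (T, D) dense pose features; key_idx: sorted keyframe indices incl. 0 and T-1
    out = torch.empty_like(motion)
    for a, b in zip(key_idx[:-1].tolist(), key_idx[1:].tolist()):
        w = torch.linspace(0.0, 1.0, b - a + 1).unsqueeze(1)   # (span+1, 1) blend weights
        out[a:b + 1] = (1 - w) * motion[a] + w * motion[b]      # linear blend of keyframes
    return out

if __name__ == "__main__":
    dense = torch.randn(60, 135)                    # e.g. 60 frames of pose features
    keys = torch.tensor([0, 14, 29, 44, 59])        # sparse keyframe indices
    recon = interpolate_from_keyframes(dense, keys)
    print(recon.shape)                              # torch.Size([60, 135])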
EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation
Jong-Hyeon Baek
Chungnam National University
Jiwon Oh
Chungnam National University
Yeong Jun Koh
Chungnam National University
Abstract
Video Object Segmentation (VOS) in low-light scenarios remains highly challenging due to significant texture loss and severe noise, which often lead to unreliable image feature generation and degraded segmentation performance. To address this issue, we propose EVOLVE, a novel multi-modal framework that integrates event-guided deformable feature transfer and dual-memory refinement for low-light VOS. EVOLVE addresses spatial misalignment between frames and improves object representation by utilizing event-driven cues. The event-guided deformable feature transfer (EDFT) module enhances feature alignment through event-driven deformable convolutions, where offsets derived from event features enable motion-aware spatial adjustments, leading to more precise propagation of object features in reference frames. Furthermore, the dual-memory object transformer (DMOT) iteratively refines object representations by maintaining and updating both image-based and event-based memory representations. Through its memory refinement module (MRM), DMOT selectively enhances relevant object features while suppressing background noise, resulting in stable and temporally coherent segmentation results. Extensive experiments on low-light VOS benchmarks demonstrate that EVOLVE achieves state-of-the-art segmentation performance, surpassing both event-based and image-based VOS methods in accuracy and computational efficiency. Code is available at https://github.com/whdgusdl48/EVOLVE.
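To illustrate event-guided deformable feature transfer, the sketch below predicts deformable-convolution offsets from event features and applies them to image features via torchvision's deform_conv2d. Channel sizes and the single-layer design are assumptions, not the EDFT module itself.

# Hedged sketch: offsets for a deformable convolution over image features are
# predicted from event features (motion-aware spatial adjustment).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class EventGuidedDeformTransfer(nn.Module):
    def __init__(self, img_ch: int = 64, evt_ch: int = 32, k: int = 3):
        super().__init__()
        self.offset_pred = nn.Conv2d(evt_ch, 2 * k * k, kernel_size=3, padding=1)
        self.weight = nn.Parameter(torch.randn(img_ch, img_ch, k, k) * 0.01)
        self.k = k

    def forward(self, img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(evt_feat)                  # (B, 2*k*k, H, W)
        return deform_conv2d(img_feat, offsets, self.weight, padding=self.k // 2)

if __name__ == "__main__":
    img = torch.randn(1, 64, 48, 64)
    evt = torch.randn(1, 32, 48, 64)
    out = EventGuidedDeformTransfer()(img, evt)
    print(out.shape)                                          # torch.Size([1, 64, 48, 64])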
FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
Yunpeng Bai
The University of Texas at Austin
Qixing Huang
The University of Texas at Austin
Abstract
Monocular Depth Estimation (MDE) is a fundamental 3D vision problem with numerous applications such as 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust and generalizable MDE remains challenging due to limited real-world labeled data and distribution gaps between synthetic datasets and real data. Existing methods often struggle on real-world test data, exhibiting low efficiency, reduced accuracy, and a lack of detail. To address these issues, we propose an efficient MDE approach named FiffDepth. The key feature of FiffDepth is its use of diffusion priors. It transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. FiffDepth preserves key generative features and integrates the strong generalization capabilities of models like DINOv2. Through benchmark evaluations, we demonstrate that FiffDepth achieves exceptional accuracy, stability, and fine-grained detail, offering significant improvements in MDE performance against state-of-the-art MDE approaches. The paper's source code is available at https://yunpeng1998.github.io/FiffDepth/.
RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion
Geonho Bang
Seoul National University
Minjae Seong
Hanyang University
Jisong Kim
Hanyang University
Geunju Baek
Seoul National University
Daye Oh
Hyundai Motor Company
Junhyung Kim
Hyundai Motor Company
Junho Koh
Hyundai Motor Company
Jun Won Choi
Seoul National University
Abstract
Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
Peijun Bao
Nanyang Technological University
Chenqi Kong
Nanyang Technological University
Siyuan Yang
Nanyang Technological University
Zihao Shao
Peking University
Xinghao Jiang
Shanghai Jiaotong University
Boon Poh Ng
Nanyang Technological University
Meng Hwa Er
Nanyang Technological University
Alex Kot
Nanyang Technological University
Abstract
Given a natural language query, temporal video grounding aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected in a scalable manner with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Group.
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
Vladislav Bargatin
Lomonosov Moscow State University
Egor Chistov
Lomonosov Moscow State University
Alexander Yakovenko
MSU Institute for Artificial Intelligence
Dmitriy Vatolin
MSU Institute for Artificial Intelligence
Abstract
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at: https://github.com/msu-video-group/memfof.
Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Luca Bartolomei
University of Bologna
Enrico Mannocci
University of Bologna
Fabio Tosi
University of Bologna
Matteo Poggi
University of Bologna
Stefano Mattoccia
University of Bologna
Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup that is even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either using a vanilla one such as Depth Anything v2 (DAv2) or deriving from it a novel recurrent architecture, to infer depth from monocular event cameras. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
What If: Understanding Motion Through Sparse Interactions
Stefan Andreas Baumann
CompVis @ LMU Munich
Nick Stracke
CompVis @ LMU Munich
Timy Phan
CompVis @ LMU Munich
Björn Ommer
CompVis @ LMU Munich
Abstract
Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed 'pokes'. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned on strongly out-of-distribution tasks, such as synthetic datasets, enabling significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, which further demonstrates the versatility of our FPT. Code and models are publicly available at compvis.github.io/flow-poke-transformer.
ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling
Radu Beche
Technical University of Cluj-Napoca
Sergiu Nedevschi
Technical University of Cluj-Napoca
Abstract
The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The data and code are available on the project page: rdbch.github.com/claravid.
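For context, delentropy is commonly computed as the Shannon entropy of the joint distribution of image gradients. The sketch below follows that common formulation (Larkin, 2016) with NumPy; the bin count, the 1/2 factor, and the grayscale input are assumptions, and the paper's exact DSP may differ.

# Hedged sketch of a delentropy-style complexity measure: entropy of the joint
# gradient histogram ("deldensity") of a grayscale image.
import numpy as np

def delentropy(gray: np.ndarray, bins: int = 256) -> float:
    # gray: 2D float image in [0, 1]
    fx = np.gradient(gray, axis=1)
    fy = np.gradient(gray, axis=0)
    hist, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins)
    p = hist / hist.sum()                         # joint gradient density
    p = p[p > 0]
    return -0.5 * float(np.sum(p * np.log2(p)))    # delentropy in bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.full((256, 256), 0.5)                # trivial scene -> low complexity
    textured = rng.random((256, 256))              # cluttered scene -> higher complexity
    print(delentropy(flat), delentropy(textured))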
FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
Yasser Benigmim
Inria
Mohammad Fahes
Inria
Tuan-Hung Vu
Inria
Andrei Bursuc
Inria
Raoul de Charette
Inria
Abstract
In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., a photo of a <class>, a sketch of a <class>). We investigate the impact of templates for OVSS and find that, for each class, there exist single-template classifiers, which we refer to as class-experts, that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS.
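A minimal sketch of the entropy-based selection idea: for each class, score every single-template classifier by the average prediction entropy of the unlabeled samples it assigns to that class, and keep the lowest-entropy template as the class-expert. The shapes and the per-class averaging below are illustrative assumptions, not FLOSS's exact procedure.

# Hedged sketch: pick one "class-expert" template per class by lowest entropy.
import torch

def select_class_experts(logits_per_template: torch.Tensor) -> torch.Tensor:
    # logits_per_template: (T, N, C) = templates x unlabeled samples x classes
    probs = logits_per_template.softmax(dim=-1)                     # (T, N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)       # (T, N)
    preds = probs.argmax(-1)                                        # (T, N)
    T, N, C = probs.shape
    experts = torch.empty(C, dtype=torch.long)
    for c in range(C):
        per_template = torch.stack([
            entropy[t][preds[t] == c].mean() if (preds[t] == c).any()
            else torch.tensor(float("inf"))
            for t in range(T)])
        experts[c] = per_template.argmin()       # best (lowest-entropy) template for class c
    return experts                               # (C,) template index per class

if __name__ == "__main__":
    logits = torch.randn(8, 500, 19)             # e.g. 8 templates, 500 samples, 19 classes
    print(select_class_experts(logits))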
AstroLoc: Robust Space to Ground Image Localizer
Gabriele Berton
Politecnico di Torino
Alex Stoken
Amazon
Carlo Masone
Politecnico di Torino
Abstract
Thousands of photos of Earth are taken every day by astronauts from the International Space Station. Localizing these photos, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, the goal is to find its most similar match among a large database of geotagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly-labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two objective functions: pairing astronaut photos with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography through unsupervised mining. AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, reaching a recall@100 consistently over 99% for existing datasets. Moreover, without fine-tuning, AstroLoc provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
Scene Coordinate Reconstruction Priors
Wenjing Bian
University of Oxford
Axel Barroso-Laguna
Niantic Spatial
Tommaso Cavallari
Niantic Spatial
Victor Adrian Prisacariu
University of Oxford
Eric Brachmann
Niantic Spatial
Abstract
Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints, SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets, our priors help learn better scene representations, resulting in more coherent scene point clouds, higher registration rates, and better camera poses, with a positive effect on downstream tasks such as novel view synthesis and camera relocalization.
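As a rough illustration of a simple depth prior of the kind mentioned above, the sketch below transforms predicted scene coordinates into the camera frame and penalizes depths outside a plausible range during SCR training. The soft range and the hinge penalty are illustrative choices, not the paper's priors.

# Hedged sketch: a simple depth-range prior on predicted scene coordinates.
import torch

def depth_prior_loss(scene_coords, cam_T_world, d_min=0.1, d_max=10.0):
    # scene_coords: (N, 3) predicted world-space points for one training image
    # cam_T_world: (4, 4) pose transforming world coordinates into the camera frame
    h = torch.cat([scene_coords, torch.ones_like(scene_coords[:, :1])], dim=1)  # (N, 4)
    depth = (h @ cam_T_world.T)[:, 2]                  # z in the camera frame
    too_close = (d_min - depth).clamp_min(0.0)
    too_far = (depth - d_max).clamp_min(0.0)
    return (too_close + too_far).mean()                # zero inside the plausible range

if __name__ == "__main__":
    coords = torch.randn(2048, 3) * 5.0
    pose = torch.eye(4)
    print(depth_prior_loss(coords, pose).item())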
Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
Lin Bie
Tsinghua University
Siqi Li
Tsinghua University
Yifan Feng
Tsinghua University
Yue Gao
Tsinghua University
Abstract
Monocular depth estimation (MDE) is a fundamental problem in computer vision with wide-ranging applications in various downstream tasks. While multi-scale features are perceptually critical for MDE, existing transformer-based approaches have yet to leverage them explicitly. To address this limitation, we propose a hypergraph-based multi-scale representation fusion framework, Hyper-Depth. The proposed Hyper-Depth incorporates two key components: a semantic consistency enhancement (SCE) module and a geometric consistency constraint (GCC) module. The SCE module, designed based on hypergraph convolution, aggregates global information and enhances the representation of multi-scale patch features. Meanwhile, the GCC module provides geometric guidance to reduce over-fitting errors caused by excessive reliance on local features. In addition, we introduce a correlation-based conditional random fields (C-CRFs) module as the decoder to filter correlated patches and compute attention weights more effectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving improvements of 6.21% and 3.32% on the main metric RMSE, respectively. Furthermore, zero-shot evaluations on the nuScenes and SUN-RGBD datasets validate the generalizability of our method.
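For reference, the SCE module builds on hypergraph convolution; the sketch below implements the textbook spectral operator (in the spirit of HGNN) over patch tokens, with a randomly generated incidence matrix for illustration. This is the standard operator, not the paper's exact module.

# Minimal sketch of a standard hypergraph convolution:
#   X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta   (uniform hyperedge weights)
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node (patch) features; H: (N, E) binary node-hyperedge incidence
        Dv = H.sum(dim=1).clamp_min(1.0)              # node degrees (N,)
        De = H.sum(dim=0).clamp_min(1.0)              # hyperedge degrees (E,)
        Dv_inv_sqrt = Dv.pow(-0.5)
        x = Dv_inv_sqrt.unsqueeze(1) * x
        x = H.T @ x                                   # aggregate nodes -> hyperedges
        x = x / De.unsqueeze(1)
        x = H @ x                                     # scatter hyperedges -> nodes
        x = Dv_inv_sqrt.unsqueeze(1) * x
        return self.theta(x)

if __name__ == "__main__":
    feats = torch.randn(196, 64)                      # e.g. 14x14 patch tokens
    H = (torch.rand(196, 32) > 0.9).float()           # random incidence for illustration
    print(HypergraphConv(64, 64)(feats, H).shape)     # torch.Size([196, 64])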
RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis
Hugo Blanc
Mines Paris, PSL University
Jean-Emmanuel Deschaud
Mines Paris, PSL University
Alexis Paljic
Mines Paris, PSL University
Abstract
RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. The associated code is available at: github.com/hugobl1/raygaussx.
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Simon Boeder
Robert Bosch GmbH
Fabian Gigengack
Robert Bosch GmbH
Benjamin Risse
University of Münster
Abstract
Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the need for expensive 3D convolutions used with inefficient voxel-based representations that predominantly represent empty 3D spaces. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than current SOTA.
Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
Sarosij Bose
University of California, Riverside
Arindam Dutta
University of California, Riverside
Sayak Nag
University of California, Riverside
Junge Zhang
University of California, Riverside
Jiachen Li
University of California, Riverside
Konstantinos Karydis
University of California, Riverside
Amit K. Roy-Chowdhury
University of California, Riverside
Abstract
Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior, in the form of a pretrained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTIv2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods. The project page is available at: https://github.com/UCR-Vision-andLearning-Group/UAR-Scenes
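A minimal sketch of the per-pixel uncertainty idea described above: compute the entropy of per-pixel semantic class probabilities for a generated view and keep only the most confident pixels. The quantile threshold and input shapes are illustrative assumptions, not the paper's module.

# Hedged sketch: entropy-based confidence mask over a generated view.
import torch

def confidence_mask(sem_logits: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    # sem_logits: (C, H, W) per-pixel semantic logits for one generated view
    probs = sem_logits.softmax(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=0)    # (H, W)
    thresh = torch.quantile(entropy.flatten(), keep_ratio)
    return entropy <= thresh                                        # True = confident pixel

if __name__ == "__main__":
    logits = torch.randn(19, 120, 160)
    mask = confidence_mask(logits)
    print(mask.float().mean().item())      # roughly 0.7 of pixels kept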
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
Mohamed El Amine Boudjoghra
Technical University of Munich
Ivan Laptev
MBZUAI
Angela Dai
Technical University of Munich
Abstract
With the fast pace of 3D capture technology and the resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage the reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation, ScanEdit outperforms the state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions. Our code is available at aminebdj.github.io/scanedit
Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
Pierre-André Brousseau
Université de Montréal
Sébastien Roy
Université de Montréal
Abstract
Absolute depth estimation from a single-camera sequence of images is a relevant task, given that mobile machines increasingly rely on vision to navigate. Deep learning for stereo matching has been demonstrated to improve performance for stereo-rectified depth estimation, but these methods require straightforward left-right camera setups. This work proposes to introduce deep stereo matching to two views of a monocular image sequence obtained from a camera in motion in a static scene. This paper introduces a novel and principled spherical epipolar rectification model, which handles all camera motions. This rectification model is differentiable and allows self-supervised deep stereo matching algorithms to compute disparity and recover depth, given a known camera pose. This paper also introduces a spherical crop operation which limits the rectified image size and allows for competitive absolute depth estimation performance. This results in a spherical rectification model that is demonstrated to provide metric depth and compete favorably with current state-of-the-art monocular depth estimators. The code is available at https://gitlab.com/labv3d/spherical-stereo.git.
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
Leonard Bruns
KTH Royal Institute of Technology
Axel Barroso-Laguna
Niantic Spatial
Tommaso Cavallari
Niantic Spatial
Áron Monszpart
Third Dimension AI
Sowmya Munukutla
Niantic Spatial
Victor Adrian Prisacariu
University of Oxford
Eric Brachmann
Niantic Spatial
Abstract
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
Elena Bueno-Benito
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Mariella Dimiccoli
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Abstract
Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions about action ordering and can decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, limiting the effectiveness of feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework with a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, by integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
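For readers unfamiliar with the OT machinery involved, the sketch below solves a plain entropic OT problem (Sinkhorn iterations) to obtain a soft coupling between frame embeddings and action-label embeddings, from which pseudo-labels can be read off. The cost matrix, uniform marginals, and temperature are illustrative; CLOT's structured and temporally consistent variants are not reproduced here.

# Minimal entropic-OT (Sinkhorn) sketch of a frame-to-action coupling.
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    # cost: (T, K) frame-to-action-label cost; returns a soft coupling (T, K)
    T, K = cost.shape
    mu = torch.full((T,), 1.0 / T)         # uniform mass over frames
    nu = torch.full((K,), 1.0 / K)         # uniform mass over action labels
    Kmat = torch.exp(-cost / eps)
    u = torch.ones(T)
    for _ in range(iters):
        v = nu / (Kmat.T @ u)
        u = mu / (Kmat @ v)
    return u.unsqueeze(1) * Kmat * v.unsqueeze(0)

if __name__ == "__main__":
    frame_emb = torch.nn.functional.normalize(torch.randn(300, 128), dim=1)
    action_emb = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
    coupling = sinkhorn(1.0 - frame_emb @ action_emb.T)
    pseudo_labels = coupling.argmax(dim=1)   # (300,) one action label per frame
    print(coupling.sum().item(), pseudo_labels.shape)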
Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection
Marvin Burges
TU Wien
Philipe Ambrozio Dias
Oak Ridge National Laboratory
Carson Woody
Oak Ridge National Laboratory
Sarah Walters
Oak Ridge National Laboratory
Dalton Lunga
Oak Ridge National Laboratory
Abstract
Object detection in remote sensing demands extensive, high-quality annotations, a process that is both labor-intensive and time-consuming. In this work, we introduce a real-time active learning and semi-automated labeling framework that leverages foundation models to streamline dataset annotation for object detection in remote sensing imagery. For example, by integrating a Segment Anything Model (SAM), our approach generates mask-based bounding boxes that serve as the basis for dual sampling: (a) uncertainty estimation to pinpoint challenging samples, and (b) diversity assessment to ensure broad data coverage. Furthermore, our Dynamic Box Switching Module (DBS) addresses the well-known cold-start problem for object detection models by replacing their suboptimal initial predictions with SAM-derived masks, thereby enhancing early-stage localization accuracy. Extensive evaluations on multiple remote sensing datasets, along with a real-world user study, demonstrate that our framework not only reduces annotation effort but also significantly boosts detection performance compared to traditional active learning sampling methods. The code for training and the user interface is available at https://github.com/mburgescvl/ICCV_AL4FM.
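As a rough illustration of dual sampling, the sketch below ranks unlabeled tiles by detector uncertainty and enforces diversity by selecting at most one tile per feature cluster. The uncertainty score, k-means clustering, and budget are assumptions for illustration, not the paper's sampling strategy.

# Illustrative sketch: uncertainty + diversity selection of tiles to annotate.
import numpy as np
from sklearn.cluster import KMeans

def dual_sampling(confidences, embeddings, budget: int = 10):
    # confidences: (N,) max detection confidence per unlabeled tile
    # embeddings: (N, D) image-level feature embeddings of the tiles
    uncertainty = 1.0 - np.asarray(confidences)          # higher = harder tile
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(embeddings)
    picked = []
    for c in range(budget):                              # one tile per cluster
        members = np.where(clusters == c)[0]
        picked.append(members[np.argmax(uncertainty[members])])
    return sorted(picked)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.random(200)
    feats = rng.normal(size=(200, 64))
    print(dual_sampling(conf, feats, budget=5))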
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM
Yannick Burkhardt
Technical University of Munich
Simon Schaefer
Technical University of Munich
Stefan Leutenegger
Technical University of Munich
Abstract
Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state of the art in event-based SLAM by a wide margin. Source code is available at ethz-mrl.github.io/SuperEvent.
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Kwon Byung-Ki
POSTECH
Qi Dai
Microsoft Research Asia
Lee Hyoseok
POSTECH
Chong Luo
Microsoft Research Asia
Tae-Hyun Oh
KAIST
Abstract
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, namely, adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, by simply controlling the timesteps of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation.
Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection
Xinhao Cai
Nanjing University of Science and Technology
Qiuxia Lai
Communication University of China
Gensheng Pei
Nanjing University of Science and Technology
Xiangbo Shu
Nanjing University of Science and Technology
Yazhou Yao
State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery
Wenguan Wang
Zhejiang University
Zhinan Yu
Nanjing University of Science and Technology
Bo Du
School of Computer Science, Wuhan University
Abstract
In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. The key of GDCC lies in the inherent duality between the two tasks, where L2I takes all object boxes and labels as input conditions to generate images, and OD maps images back to these layout conditions. Specifically, in GDCC, L2I generation is guided by a layout translation cycle loss, ensuring that the layouts used to generate images align with those predicted from the synthesized images. Similarly, OD benefits from an image translation cycle loss, which enforces consistency between the synthesized images fed into the detector and those generated from predicted layouts. While current L2I and OD tasks benefit from large-scale annotated layout-image pairs, our GDCC enables more efficient use of auto-synthesized data, thereby further enhancing data efficiency. It is worth noting that our GDCC framework is computationally efficient thanks to the perturbative single-step sampling strategy and a priority timestep re-sampling strategy during training. Besides, GDCC preserves the architectures of L2I, OD models, and the generation pipeline within the framework, thus maintaining the original inference speed. Extensive experiments demonstrate that GDCC significantly improves the controllability of diffusion models and the accuracy of object detectors.
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
Yihan Cao
National University of Defense Technology
Zheng Qin
Defense Innovation Institute, Academy of Military Sciences
Jiazhao Zhang
Peking University
Qin Zou
Wuhan University
Zhinan Yu
National University of Defense Technology
Bo Du
Wuhan University
Shuzhen Liu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav, by at least a relative 14% over the state of the art.
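To illustrate the finite-state-machine framing, the sketch below defines a few cognitive states, restricts legal transitions, and lets a (stubbed) LLM propose the next state from a serialized cognitive-map summary. The state names are loosely paraphrased from the abstract's "exploration to identification" range, and ask_llm is a placeholder, not the paper's prompting scheme.

# Hedged sketch: an FSM over cognitive states with LLM-proposed transitions.
from enum import Enum, auto

class CogState(Enum):
    BROAD_EXPLORATION = auto()
    CONTEXTUAL_SEARCH = auto()
    CANDIDATE_VERIFICATION = auto()
    TARGET_IDENTIFICATION = auto()

ALLOWED = {
    CogState.BROAD_EXPLORATION: {CogState.BROAD_EXPLORATION, CogState.CONTEXTUAL_SEARCH},
    CogState.CONTEXTUAL_SEARCH: {CogState.CONTEXTUAL_SEARCH, CogState.CANDIDATE_VERIFICATION},
    CogState.CANDIDATE_VERIFICATION: {CogState.CONTEXTUAL_SEARCH, CogState.TARGET_IDENTIFICATION},
    CogState.TARGET_IDENTIFICATION: {CogState.TARGET_IDENTIFICATION},
}

def ask_llm(state: CogState, cognitive_map_summary: str) -> CogState:
    # Stub: a real system would prompt an LLM with the current state and a
    # serialized cognitive map, then parse the proposed next state.
    return CogState.CONTEXTUAL_SEARCH if "sofa" in cognitive_map_summary else state

def step(state: CogState, cognitive_map_summary: str) -> CogState:
    proposal = ask_llm(state, cognitive_map_summary)
    return proposal if proposal in ALLOWED[state] else state   # reject illegal transitions

if __name__ == "__main__":
    s = CogState.BROAD_EXPLORATION
    s = step(s, "observed: sofa, table; frontier: hallway")
    print(s)   # CogState.CONTEXTUAL_SEARCH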
IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
Zhe Cao
Beijing Institute of Technology
Jin Zhang
Beijing Institute of Technology
Ruiheng Zhang
Beijing Institute of Technology
Abstract
Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, they rely on synthetic infrared images generated through style transfer from visible images, which limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from the visible to the infrared domain by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
MotionCtrl: A Real-time Controllable Vision-Language-Motion Model
Bin Cao
Institute of Automation, Chinese Academy of Sciences
Sipeng Zheng
BeingBeyond
Ye Wang
Renmin University of China
Lujie Xia
Peking University
Qianshan Wei
Southeast University
Qin Jin
Renmin University of China
Jing Liu
Institute of Automation, Chinese Academy of Sciences
Zongqing Lu
Peking University
Abstract
Human motion generation involves synthesizing coherent human motion sequences conditioned on diverse multimodal inputs and holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators.
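A minimal sketch of what a part-aware residual quantizer could look like, assuming pooled per-part motion features; the codebook sizes, depth, and part split are illustrative assumptions, not MotionCtrl's exact configuration.

```python
# Each body part gets its own stack of residual codebooks; residuals are
# quantized depth-by-depth and the chosen indices become motion tokens.
import torch
import torch.nn as nn

class PartAwareResidualQuantizer(nn.Module):
    def __init__(self, num_parts=6, depth=4, codebook_size=256, dim=64):
        super().__init__()
        self.codebooks = nn.ParameterList(
            nn.Parameter(torch.randn(depth, codebook_size, dim))
            for _ in range(num_parts)
        )

    def forward(self, part_feats):
        # part_feats: (batch, num_parts, dim) motion features split by body part
        tokens, quantized = [], torch.zeros_like(part_feats)
        for p, books in enumerate(self.codebooks):
            residual = part_feats[:, p]
            for level in books:                       # level: (codebook_size, dim)
                dists = torch.cdist(residual, level)  # (batch, codebook_size)
                idx = dists.argmin(dim=-1)
                chosen = level[idx]
                quantized[:, p] = quantized[:, p] + chosen
                residual = residual - chosen
                tokens.append(idx)
        # tokens: one index per (part, depth) level
        return quantized, torch.stack(tokens, dim=-1)
```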
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Jin Cao
Zhejiang University
Hongrui Wu
Tongji University
Ziyong Feng
DeepGlint
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. The code will be released for reproducibility.
Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Yihong Cao
Hunan University
Jiaming Zhang
Karlsruhe Institute of Technology
Xu Zheng
HKUST(GZ)
Hao Shi
Zhejiang University
Kunyu Peng
Karlsruhe Institute of Technology
Abstract
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
Visual Relation Diffusion for Human-Object Interaction Detection
Ping Cao
Beijing Jiaotong University
Yepeng Tang
Beijing Jiaotong University
Chunjie Zhang
Beijing Jiaotong University
Xiaolong Zheng
Chinese Academy of Sciences
Chao Liang
Wuhan University
Yunchao Wei
Beijing Jiaotong University
Yao Zhao
Beijing Jiaotong University
Abstract
Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.
Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection
Yujeong Chae
Korea Advanced Institute of Science and Technology
Heejun Park
Korea Advanced Institute of Science and Technology
Hyeonseong Kim
Korea Advanced Institute of Science and Technology
Kuk-Jin Yoon
Korea Advanced Institute of Science and Technology
Abstract
Robust 3D object detection across diverse weather conditions is crucial for safe autonomous driving, and RADAR is increasingly leveraged for its resilience in adverse weather. Recent advancements have explored 4D RADAR and LiDAR-RADAR fusion to enhance 3D perception capabilities, specifically targeting weather robustness. However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLRFusion) framework for robust 3D object detection. We introduce a multi-path iterative interaction module that integrates LiDAR, RADAR power, and Doppler, enabling a structured feature fusion process. Doppler highlights dynamic regions, refining RADAR power and enhancing LiDAR features across multiple stages, improving detection confidence. Extensive experiments on the K-RADAR dataset demonstrate that our approach effectively exploits Doppler information, achieving state-of-the-art performance in both normal and adverse weather conditions.
GaussRender: Learning 3D Occupancy with Gaussian Rendering
Loïck Chambon
ValeoAI, Paris, France
Eloi Zablocki
Sorbonne University, Paris, France
Alexandre Boulch
Sorbonne University, Paris, France
Mickaël Chen
Hcompany.ai, Paris, France
Matthieu Cord
ValeoAI, Paris, France
Abstract
Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from visual inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce visible geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics such as RayIoU. The code is open-sourced at https://github.com/valeoai/GaussRender.
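The projective-consistency idea can be illustrated with a simplified rasterizer that projects voxel centers of predicted and ground-truth occupancy into a camera view and compares the resulting 2D semantic maps; the actual method uses differentiable Gaussian splatting, so this nearest-pixel sketch only shows the loss structure, and all shapes and thresholds are assumptions.

```python
import torch
import torch.nn.functional as F

def splat_semantics(sem, centers, K, T, hw):
    """sem: (N, C) per-voxel class scores (logits or one-hot), centers: (N, 3)
    voxel centers in world coords, K: (3, 3) intrinsics, T: (4, 4) world-to-camera."""
    h, w = hw
    cam = (T[:3, :3] @ centers.T + T[:3, 3:]).T          # world -> camera coords
    valid = cam[:, 2] > 0.1                              # keep voxels in front of the camera
    uv = (K @ cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).long()                 # perspective divide, pixel indices
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    canvas = torch.zeros(h, w, sem.shape[1])
    canvas[uv[inside, 1], uv[inside, 0]] = sem[valid][inside]
    return canvas

def projective_consistency_loss(pred_sem, gt_sem, centers, K, T, hw=(90, 160)):
    # Supervise the predicted occupancy through its 2D projection.
    return F.l1_loss(splat_semantics(pred_sem, centers, K, T, hw),
                     splat_semantics(gt_sem, centers, K, T, hw))
```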
SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
Sixian Chan
Zhejiang University of Technology, China
Zedong Li
Zhejiang University of Technology, China
Wenhao Li
Nanyang Technological University, Singapore
Shijian Lu
Nanyang Technological University, Singapore
Chunhua Shen
Zhejiang University, China
Xiaoqin Zhang
Zhejiang University of Technology, China
Abstract
Multi-modal object tracking has emerged as a significant research focus in computer vision due to its robustness in complex environments, such as exposure variations, blur, and occlusions. Despite existing studies integrating supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, this approach exhibits a critical limitation: it inherently prioritizes RGB information as the dominant modality, thereby underutilizing the complementary information of alternative modalities. To address this fundamental limitation, we present SMSTracker, an innovative tri-path score mask sigma fusion framework for multi-modal tracking, comprising three key modules. Firstly, we design a tri-path Score Mask Fusion (SMF) module to evaluate and quantify the reliability of each modality, allowing optimal exploitation of complementary features between modalities. Secondly, we introduce a pioneering Sigma Interaction (SGI) module to facilitate a sophisticated fusion of modal features across tri-branches. Furthermore, we propose a Drop Key Fine-tuning (DKF) strategy to address the inherent challenge of unequal data contribution in multi-modal learning scenarios, thereby enhancing the model's capacity for comprehensive multi-modal information processing. Finally, extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets demonstrate the significant performance improvements achieved by SMSTracker over existing state-of-the-art methods. Code and model are available at https://github.com/Leezed525/SMSTracker.
Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
Haochen Chang
School of Systems Science and Engineering, Sun Yat-sen University
Pengfei Ren
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Haoyang Zhang
Defense Innovation Institute, Academy of Military Sciences
Liang Xie
Defense Innovation Institute, Academy of Military Sciences
Hongbo Chen
School of Systems Science and Engineering, Sun Yat-sen University
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences
Abstract
In recent years, skeleton-based action recognition has gained significant attention due to its robustness in varying environmental conditions. However, most existing methods struggle to distinguish fine-grained actions due to subtle motion features and minimal inter-class variation, and they often fail to consider the underlying similarity relationships between action classes. To address these limitations, we propose a Hierarchical-aware Orthogonal Disentanglement framework (HiOD). We disentangle coarse-grained and fine-grained features by employing independent spatial-temporal granularity-aware bases, which encode semantic representations at varying levels of granularity. Additionally, we design a cross-granularity feature interaction mechanism that leverages complementary information between coarse-grained and fine-grained features. We further enhance the learning process through hierarchical prototype contrastive learning, which utilizes the parent class hierarchy to guide the learning of coarse-grained features while ensuring the distinguishability of fine-grained features within child classes. Extensive experiments on FineGYM, FSD-10, NTU RGB+D, and NTU RGB+D 120 datasets demonstrate the superiority of our method in fine-grained action recognition tasks.
LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
Wei-Jer Chang
UC Berkeley
Wei Zhan
UC Berkeley
Masayoshi Tomizuka
UC Berkeley
Manmohan Chandraker
NEC Labs America
Francesco Pittaluga
NEC Labs America
Abstract
Evaluating autonomous vehicles with controllability allows for scalable testing in counterfactual or structured settings, improving both efficiency and safety. We introduce LANGTRAJ, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LANGTRAJ enables flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that rely on domain-specific guidance functions, LANGTRAJ incorporates language conditioning during training for more intuitive traffic simulation control. In addition, we propose a novel closed-loop training strategy for diffusion models to enhance realism in closed-loop simulation. To support language-conditioned simulation, we develop a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, which we use to develop INTERDRIVE, a large-scale dataset offering diverse and interactive labels for training language-conditioned diffusion models. Validated on the Waymo Motion Dataset, LANGTRAJ demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing. Project website: https://langtraj.github.io/.
Learning Neural Scene Representation from iToF Imaging
Wenjie Chang
University of Science and Technology of China
Hanzhi Chang
University of Science and Technology of China
Yueyi Zhang
Miromind
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Abstract
Indirect Time-of-Flight (iToF) cameras are popular for 3D perception because they are cost-effective and easy to deploy. They emit modulated infrared signals to illuminate the scene and process the received signals to generate amplitude and phase images. The depth is calculated from the phase using the modulation frequency. However, the obtained depth often suffers from noise caused by multi-path interference, low signal-to-noise ratio (SNR), and depth wrapping. Building on recent advancements in neural scene representations, which have shown great potential in 3D modeling from multi-view RGB images, we propose leveraging this approach to reconstruct 3D representations from noisy iToF data. Our method utilizes the multi-view consistency of amplitude and phase maps, fusing information from all input views to generate an accurate scene representation. Considering the impact of infrared illumination, we propose a new rendering scheme for amplitude maps based on a signed distance function (SDF) and introduce a neural lighting function to model the appearance variations caused by active illumination. We also incorporate a phase-guided sampling strategy and a wrapping-aware phase-to-depth loss to utilize raw phase information and mitigate depth wrapping. Additionally, we add a noise-weight loss to prevent excessive smoothing of information across noisy multi-view measurements. Experiments conducted on synthetic and real-world datasets demonstrate that the proposed method outperforms state-of-the-art techniques.
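For reference, the phase-to-depth relation and the wrapping ambiguity mentioned above follow the standard iToF model; the modulation frequency below is only an illustrative value.

```python
# Worked example of the iToF phase-to-depth relation:
# depth = c * phase / (4 * pi * f_mod), with wrapping every c / (2 * f_mod) metres.
import numpy as np

C = 299_792_458.0          # speed of light, m/s

def phase_to_depth(phase, f_mod):
    """Depth from wrapped phase in [0, 2*pi) at modulation frequency f_mod (Hz)."""
    return C * phase / (4.0 * np.pi * f_mod)

def unambiguous_range(f_mod):
    """Maximum depth before wrapping occurs."""
    return C / (2.0 * f_mod)

f_mod = 20e6                               # 20 MHz modulation (illustrative)
print(unambiguous_range(f_mod))            # ~7.49 m before depth wrapping
print(phase_to_depth(np.pi, f_mod))        # ~3.75 m for a half-cycle phase shift
```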
ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
Dubing Chen
SKL-IOTSC, CIS, University of Macau
Jin Fang
SKL-IOTSC, CIS, University of Macau
Wencheng Han
SKL-IOTSC, CIS, University of Macau
Xinjing Cheng
Junbo Yin
CEMSE Division, King Abdullah University of Science and Technology
Chenzhong Xu
SKL-IOTSC, CIS, University of Macau
Fahad Shahbaz Khan
Mohamed bin Zayed University of Artificial Intelligence
Jianbing Shen
SKL-IOTSC, CIS, University of Macau
Abstract
3D semantic occupancy and flow prediction are fundamental to spatiotemporal scene understanding. This paper proposes a vision-based framework with three targeted improvements. First, we introduce an occlusion-aware adaptive lifting mechanism incorporating depth denoising. This enhances the robustness of 2D-to-3D feature transformation while mitigating reliance on depth priors. Second, we enforce 3D-2D semantic consistency via jointly optimized prototypes, using confidence- and category-aware sampling to address the long-tailed class problem. Third, to streamline joint prediction, we devise a BEV-centric cost volume to explicitly correlate semantic and flow features, supervised by a hybrid classification-regression scheme that handles diverse motion scales. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy prediction and joint occupancy-flow prediction. We also present a family of models offering a spectrum of efficiency-performance trade-offs. Our real-time version exceeds all existing real-time methods in speed and accuracy, ensuring its practical viability.
AutoScape: Geometry-Consistent Long-Horizon Scene Generation
Jiacheng Chen
Simon Fraser University
Ziyu Jiang
NEC Labs America
Mingfu Liang
Northwestern University
Bingbing Zhuang
NEC Labs America
Jong-Chyi Su
UC San Diego
Sparsh Garg
UC San Diego
Ying Wu
Northwestern University
Manmohan Chandraker
UC San Diego
Abstract
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively. Project page: https://auto-scape.github.io.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Weirong Chen
TU Munich
Ganlin Zhang
TU Munich
Felix Wimbauer
TU Munich
Rui Wang
Microsoft
Nikita Araslanov
TU Munich
Andrea Vedaldi
University of Oxford
Daniel Cremers
TU Munich
Abstract
Traditional SLAM systems, which rely on bundle adjustment, struggle with the highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, while the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements. We further ensure depth consistency across video frames with lightweight postprocessing based on scale maps. Our framework combines the core of traditional SLAM, bundle adjustment, with a robust learning-based 3D tracker. Integrating motion decomposition, bundle adjustment, and depth refinement, our unified framework, BA-Track, accurately tracks camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
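The motion decomposition at the core of this idea can be sketched as follows for a single tracked 3D point; the pose convention and toy numbers are assumptions for illustration, not the BA-Track implementation.

```python
# Split the observed displacement of a tracked point into a camera-induced
# part (what a static point would exhibit under the relative pose) and a
# residual part attributed to object motion.
import numpy as np

def decompose_track_motion(p_cam_t0, p_cam_t1, T_t1_t0):
    """p_cam_t0, p_cam_t1: (3,) point in camera coords at t0 and t1.
    T_t1_t0: (4, 4) rigid transform mapping camera-t0 coords to camera-t1 coords."""
    p_static = T_t1_t0[:3, :3] @ p_cam_t0 + T_t1_t0[:3, 3]   # camera-induced only
    camera_motion = p_static - p_cam_t0
    object_motion = p_cam_t1 - p_static                      # residual dynamic motion
    return camera_motion, object_motion

# A point 5 m ahead; the camera moves 0.1 m forward; the point also moves 0.05 m right.
T = np.eye(4); T[2, 3] = -0.1
cam, obj = decompose_track_motion(np.array([0.0, 0.0, 5.0]),
                                  np.array([0.05, 0.0, 4.9]), T)
print(cam)   # [ 0.   0.  -0.1]  apparent motion from camera movement
print(obj)   # [ 0.05 0.   0. ]  true object motion
```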
CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
Peiqi Chen
Wuhan University
Lei Yu
Ant Group
Yi Wan
Wuhan University
Yingying Pei
Wuhan University
Xinyi Liu
Wuhan University
Yongxiang Yao
Wuhan University
Yingying Zhang
Ant Group
Lixiang Ru
Ant Group
Liheng Zhong
Ant Group
Jingdong Chen
Ant Group
Ming Yang
Ant Group
Yongjun Zhang
Wuhan University
Abstract
Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of ∼2.2x at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
Jie Chen
University of Science and Technology of China
Zhangchi Hu
University of Science and Technology of China
Peixi Wu
University of Science and Technology of China
Huyue Zhu
University of Science and Technology of China
Hebei Li
University of Science and Technology of China
Xiaoyan Sun
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multi-resolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
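A compact sketch of a multi-resolution 4D hash encoder over (x, y, z, t) in the spirit described above; the hash primes, table size, and nearest-corner lookup (no interpolation) are simplifications and assumptions, not DASH's exact design.

```python
import torch
import torch.nn as nn

# Large primes commonly used for spatial hashing; the exact choice is an assumption here.
PRIMES = (1, 2654435761, 805459861, 3674653429)

class Hash4DEncoder(nn.Module):
    def __init__(self, levels=8, table_size=2**17, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.tables = nn.Parameter(torch.randn(levels, table_size, feat_dim) * 1e-2)
        self.res = [int(base_res * growth**l) for l in range(levels)]
        self.table_size = table_size

    def forward(self, xyzt):
        # xyzt: (N, 4) coordinates normalized to [0, 1]; nearest corner only,
        # so multi-linear interpolation is omitted for brevity.
        feats = []
        for l, r in enumerate(self.res):
            idx = (xyzt * r).long()                     # (N, 4) grid corner indices
            h = idx[:, 0] * PRIMES[0]
            for d in range(1, 4):
                h = h ^ (idx[:, d] * PRIMES[d])         # XOR-combine hashed axes
            h = h % self.table_size
            feats.append(self.tables[l][h])             # (N, feat_dim)
        return torch.cat(feats, dim=-1)                 # (N, levels * feat_dim)
```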
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen
HKISI, CAS
Yuqi Wang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Zhaoxiang Zhang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Abstract
World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning in the VQ token space for the first time, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
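The interleaved image/action "language" can be illustrated with a small helper that flattens per-frame VQ tokens and discretized action tokens into one sequence for next-token prediction; the vocabulary sizes and offset scheme are assumptions for illustration.

```python
# Build a single token sequence alternating between frame tokens and action
# tokens so a decoder-only transformer can model both with one objective.
from typing import List

IMG_VOCAB = 8192          # assumed VQ codebook size for image tokens
ACT_VOCAB = 256           # assumed number of discretized action bins

def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[int]:
    """frames[t]: VQ token ids for frame t; actions[t]: discretized action ids.
    Action tokens are shifted past the image vocabulary to share one id space."""
    assert len(frames) == len(actions)
    seq = []
    for img_tokens, act_tokens in zip(frames, actions):
        seq.extend(img_tokens)                          # observe frame t
        seq.extend(a + IMG_VOCAB for a in act_tokens)   # then act at step t
    return seq

# Two timesteps with 4 image tokens and 2 action tokens each (toy numbers).
sequence = interleave([[5, 9, 13, 2], [7, 7, 1, 0]], [[3, 10], [4, 11]])
# The transformer is trained to predict sequence[1:] from sequence[:-1].
```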
EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
Yixiang Chen
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Peiyan Li
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Yan Huang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jiabing Yang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Kehan Chen
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Liang Wang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Abstract
Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) over prior object-centric flow methods. More results can be found on our project website: https://ec-flow1.github.io.
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Xingyu Chen
Zhejiang University
Yue Chen
Zhejiang University
Yuliang Xiu
Westlake University
Andreas Geiger
University of Tübingen, Tübingen AI Center
Anpei Chen
University of Tübingen, Tübingen AI Center
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets.
EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Lu Chen
State Key Lab of CAD&CG, Zhejiang University
Yizhou Wang
The Chinese University of Hong Kong
Shixiang Tang
The Chinese University of Hong Kong
Qianhong Ma
Shanghai Jiao Tong University
Tong He
Shanghai Artificial Intelligence Laboratory
Wanli Ouyang
The Chinese University of Hong Kong
Xiaowei Zhou
State Key Lab of CAD&CG, Zhejiang University
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Sida Peng
State Key Lab of CAD&CG, Zhejiang University
Abstract
Learning an agent model that behaves like humans, capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective, is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
Xuehan Chen
Xi'an Jiaotong-Liverpool University, China
Guangyu Ren
Xi'an Jiaotong-Liverpool University, China
Tianhong Dai
Imperial College London, United Kingdom
Tania Stathaki
Imperial College London, United Kingdom
Hengyan Liu
Xi'an Jiaotong-Liverpool University, China
Abstract
Foundation models, such as the Segment Anything Model (SAM), have exhibited remarkable performance in conventional segmentation tasks, primarily due to their training on large-scale datasets. Nonetheless, challenges remain in specific downstream tasks, such as Camouflaged Object Detection (COD). Existing research primarily aims to enhance performance by integrating additional multimodal information derived from other foundation models. However, directly leveraging the information generated by these models may introduce additional biases due to domain shifts. To address this issue, we propose an Adaptive Refinement Module (ARM), which efficiently processes multimodal information while simultaneously refining the mask prompt. Furthermore, we construct an auxiliary embedding that effectively exploits the intermediate information generated within ARM, providing SAM with richer feature representations. Experimental results indicate that our proposed architecture surpasses most state-of-the-art (SOTA) models in the COD task, particularly excelling in structured target segmentation.
Event-based Tiny Object Detection: A Benchmark Dataset and Baseline
Nuo Chen
National University of Defense Technology, China
Chao Xiao
National University of Defense Technology, China
Yimian Dai
Nankai University
Shiman He
National University of Defense Technology, China
Miao Li
National University of Defense Technology, China
Wei An
National University of Defense Technology, China
Abstract
Small object detection (SOD) in the anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small Object Detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 x 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose the Event-based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.
Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation
Tiankai Chen
Southwest Jiaotong University
Yushu Li
South China University of Technology
Adam Goodge
Institute for Infocomm Research (I2R), A*STAR
Fei Teng
Southwest Jiaotong University
Xulei Yang
Institute for Infocomm Research (I2R), A*STAR
Tianrui Li
Southwest Jiaotong University
Xun Xu
Institute for Infocomm Research (I2R), A*STAR
Abstract
Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhance the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLMs. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods across synthetic and real-world datasets for 3D point cloud OOD detection.
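A hedged sketch of score propagation over a graph built from prototypes and test features; the kNN construction, damping factor, and iteration count are common choices for graph-based propagation rather than the paper's exact recipe.

```python
# Smooth initial VLM-derived OOD scores by iterating over a row-normalized
# kNN affinity matrix built from L2-normalized features.
import numpy as np

def propagate_scores(feats, init_scores, k=10, alpha=0.5, iters=20):
    # feats: (N, D) L2-normalized features (class prototypes + test points)
    # init_scores: (N,) initial OOD scores
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    W = np.zeros_like(sim)
    idx = np.argpartition(-sim, k, axis=1)[:, :k]  # top-k neighbors per node
    rows = np.arange(sim.shape[0])[:, None]
    W[rows, idx] = np.maximum(sim[rows, idx], 0.0)
    W = W / np.clip(W.sum(axis=1, keepdims=True), 1e-8, None)   # row-normalize
    scores = init_scores.copy()
    for _ in range(iters):
        scores = alpha * (W @ scores) + (1 - alpha) * init_scores
    return scores
```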
Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
Chen Chen
National University of Defense Technology, China
Kangcheng Bin
National University of Defense Technology, China
Ting Hu
National University of Defense Technology, China
Jiahao Qi
National University of Defense Technology, China
Xingyue Liu
National University of Defense Technology, China
Tianpeng Liu
National University of Defense Technology, China
Zhen Liu
National University of Defense Technology, China
Yongxiang Liu
National University of Defense Technology, China
Ping Zhong
National University of Defense Technology, China
Abstract
Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity due to limited imaging conditions. To this end, we introduce a high-diversity dataset, ATR-UMOD, covering varying scenarios, spanning altitudes from 80 m to 300 m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures applicability in practice when condition annotations are unavailable. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.
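A minimal sketch of prompt-guided soft gating between RGB and IR features, assuming the condition prompt has already been encoded by a text encoder; the layer sizes and scalar gate are illustrative assumptions, not the exact PCDF design.

```python
# Weight the RGB and IR contributions with a sigmoid gate conditioned on the
# encoded imaging-condition prompt (e.g. "night, fog, 300 m altitude").
import torch
import torch.nn as nn

class ConditionGatedFusion(nn.Module):
    def __init__(self, feat_dim=256, prompt_dim=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(prompt_dim + 2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, ir_feat, prompt_emb):
        # rgb_feat, ir_feat: (B, feat_dim) pooled modality features
        # prompt_emb: (B, prompt_dim) encoded condition prompt
        w = self.gate(torch.cat([rgb_feat, ir_feat, prompt_emb], dim=-1))  # (B, 1)
        return w * rgb_feat + (1 - w) * ir_feat
```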
GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion
Li-Heng Chen
Beijing Normal University
Zi-Xin Zou
VAST
Chang Liu
Beijing Normal University
Tianjiao Jing
Beijing Normal University
Yan-Pei Cao
VAST
Shi-Sheng Huang
Beijing Normal University
Hongbo Fu
Hong Kong University of Science and Technology
Hua Huang
Beijing Normal University
Abstract
Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to get multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous approaches, with geometrically more consistent surface reconstruction results, especially given sparse view inputs. Our source code is available at https://github.com/CountNemoChan/GCRayDiffusion
GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
Xiao Chen
The Chinese University of Hong Kong
Tai Wang
Shanghai AI Laboratory
Quanyi Li
Shanghai AI Laboratory
Tao Huang
Shanghai AI Laboratory
Jiangmiao Pang
Shanghai AI Laboratory
Tianfan Xue
The Chinese University of Hong Kong
Abstract
Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes.
GenHaze: Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing
Sixiang Chen
The Hong Kong University of Science and Technology (Guangzhou)
Tian Ye
The Hong Kong University of Science and Technology (Guangzhou)
Yunlong Lin
Xiamen University
Yeying Jin
Tencent
Yijun Yang
The Hong Kong University of Science and Technology (Guangzhou)
Haoyu Chen
The Hong Kong University of Science and Technology (Guangzhou)
Jianyu Lai
The Hong Kong University of Science and Technology (Guangzhou)
Song Fei
The Hong Kong University of Science and Technology (Guangzhou)
Zhaohu Xing
The Hong Kong University of Science and Technology (Guangzhou)
Fugee Tsung
The Hong Kong University of Science and Technology
Lei Zhu
The Hong Kong University of Science and Technology
Abstract
Real-world image dehazing is crucial for enhancing visual quality in computer vision applications. However, existing physics-based haze generation paradigms struggle to model the complexities of real-world haze and lack controllability, limiting the performance of existing baselines on real-world images. In this paper, we introduce GenHaze, a pioneering haze generation framework that enables the one-step generation of high-quality, reference-controllable hazy images. GenHaze leverages the pre-trained latent diffusion model (LDM) with a carefully designed clean-to-haze generation protocol to produce realistic hazy images. Additionally, by leveraging its fast, controllable generation of paired high-quality hazy images, we illustrate that existing dehazing baselines can be unleashed in a simple and efficient manner. Extensive experiments indicate that GenHaze achieves visually convincing and quantitatively superior hazy images. It also significantly improves multiple existing dehazing models across 7 non-reference metrics with minimal fine-tuning epochs. Our work demonstrates that LDM possesses the potential to generate realistic degradations, providing an effective alternative to prior generation pipelines.
HORT: Monocular Hand-held Objects Reconstruction with Transformers
Zerui Chen
Inria, École normale supérieure, CNRS, PSL Research University
Rolandos Alexandros Potamias
Imperial College London
Shizhe Chen
Inria, École normale supérieure, CNRS, PSL Research University
Cordelia Schmid
Inria, École normale supérieure, CNRS, PSL Research University
Abstract
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming when generating explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
Yiyang Chen
South China University of Technology
Shanshan Zhao
Alibaba International Digital Commerce Group
Lunhao Duan
Alibaba International Digital Commerce Group
Changxing Ding
South China University of Technology
Dacheng Tao
Nanyang Technological University
Abstract
Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noisefree images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.
High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach
Yuchong Chen
Southeast University
Jian Yu
Southeast University
Shaoyan Gai
Southeast University
Zeyu Cai
Southeast University
Feipeng Da
Southeast University
Abstract
In structured light systems, measurement accuracy tends to decline significantly when evaluating complex textured surfaces, particularly at boundaries between different colors. To address this issue, this paper conducts a detailed analysis to develop an error model that illustrates the relationship between phase error and image characteristics, specifically the blur level, grayscale value, and grayscale gradient. Based on this model, a high-precision approach for measuring complex textured targets is introduced, employing a multiple filtering approach. This approach first applies a sequence of filters to vary the blur level of the captured patterns, allowing calculation of phase differences under different blur conditions. Then, these phase differences are used in the constructed error model to identify the critical parameter causing phase errors. Finally, phase recovery is performed using the calibrated parameter, effectively reducing errors caused by complex textures. Experimental comparisons show that this method reduces the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by 40.31% and 40.78%, respectively. In multiple experiments, its performance generally surpassed that of existing methods, demonstrating improved accuracy and robustness.
Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
Jerred Chen
University of Oxford
Ronald Clark
University of Oxford
Abstract
In many robotics and VR/AR applications, fast camera motions lead to a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
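The "IMU-like" step can be written as a small least-squares problem using the classical motion-field equations in normalized coordinates; here the learned flow and depth are replaced by given arrays, so this is only a sketch of the solver under the small-motion assumption, not the paper's pipeline.

```python
# Solve for instantaneous translational and angular camera velocity from
# per-pixel flow and depth via the Longuet-Higgins/Prazdny motion field.
import numpy as np

def solve_velocity(xy, flow, depth):
    """xy: (N, 2) normalized pixel coords, flow: (N, 2) flow in those coords
    per unit time, depth: (N,) metric depth. Returns (T, Omega), each (3,)."""
    x, y, Z = xy[:, 0], xy[:, 1], depth
    zeros = np.zeros_like(x)
    # Rows of the linear system flow = A @ [Tx, Ty, Tz, wx, wy, wz]
    row_u = np.stack([-1 / Z, zeros, x / Z, x * y, -(1 + x**2), y], axis=1)
    row_v = np.stack([zeros, -1 / Z, y / Z, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([row_u, row_v], axis=0)            # (2N, 6)
    b = np.concatenate([flow[:, 0], flow[:, 1]], axis=0)  # (2N,)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]                               # translational, angular velocity
```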
InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling
Xiaoxue Chen
AIR, Tsinghua University
Bhargav Chandaka
University of Illinois Urbana-Champaign
Chih-Hao Lin
University of Illinois Urbana-Champaign
Ya-Qin Zhang
AIR, Tsinghua University
David Forsyth
University of Illinois Urbana-Champaign
Hao Zhao
AIR, Tsinghua University
Shenlong Wang
University of Illinois Urbana-Champaign
Abstract
We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR's intensity values, captured with active illumination in a different spectral range, offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB-LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions, achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
LONG3R: Long Sequence Streaming 3D Reconstruction
Zhuoguang Chen
Shanghai Artificial Intelligence Laboratory
Minghui Qin
IIIS, Tsinghua University
Tianyuan Yuan
IIIS, Tsinghua University
Zhe Liu
IIIS, Tsinghua University
Hang Zhao
IIIS, Tsinghua University
Abstract
Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streamiNG 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen
The University of Hong Kong
Yuying Ge
ARC Lab, Tencent PCG
Weiliang Tang
The Chinese University of Hong Kong
Yizhuo Li
The University of Hong Kong
Yixiao Ge
ARC Lab, Tencent PCG
Mingyu Ding
University of California, Berkeley
Ying Shan
ARC Lab, Tencent PCG
Xihui Liu
The University of Hong Kong
Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich 'corpus', can a similar generative pretraining approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging 'language' of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Yingjie Chen
Tongyi Lab, Alibaba Group
Yifang Men
Tongyi Lab, Alibaba Group
Yuan Yao
Tongyi Lab, Alibaba Group
Miaomiao Cui
Tongyi Lab, Alibaba Group
Liefeng Bo
Tongyi Lab, Alibaba Group
Abstract
Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive and consistent visual changes. Then, our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed approach. For more details and qualitative results, please refer to our anonymous project webpage: Perception-as-Control.
Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner
Zhimin Chen
Clemson University
Xuewei Chen
Clemson University
Xiao Guo
Michigan State University
Yingwei Li
Johns Hopkins University
Longlong Jing
The City University of New York
Liang Yang
The City University of New York
Bing Li
Clemson University
Abstract
Recently, multi-modal masked autoencoders (MAE) have been introduced in 3D self-supervised learning, offering enhanced feature learning by leveraging both 2D and 3D data to capture richer cross-modal representations. However, these approaches have two limitations: (1) they inefficiently require both 2D and 3D modalities as inputs, even though the inherent multi-view properties of 3D point clouds already contain the 2D modality; (2) the input 2D modality causes the reconstruction learning to unnecessarily rely on visible 2D information, hindering 3D geometric representation learning. To address these challenges, we propose a 3D to Multi-View Learner (Multi-View ML) that utilizes only the 3D modality as input and effectively captures rich spatial information in 3D point clouds. Specifically, we first project 3D point clouds to multi-view 2D images at the feature level based on 3D poses. Then, we introduce two components: (1) a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features; (2) a multi-scale multi-head (MSMH) attention mechanism that facilitates local-global information interactions in each decoder transformer block through attention heads at various scales. Additionally, a novel two-stage self-training strategy is proposed to align 2D and 3D representations. Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
Chen Chen
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Zhirui Wang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Taowei Sheng
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yi Jiang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yundu Li
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Peirui Cheng
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Luning Zhang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Kaiqiang Chen
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yanfeng Hu
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Xue Yang
Shanghai Jiao Tong University
Xian Sun
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Abstract
Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception such as occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame.
Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
Dubing Chen
SKL-IOTSC, CIS, University of Macau
Huan Zheng
SKL-IOTSC, CIS, University of Macau
Yucheng Zhou
SKL-IOTSC, CIS, University of Macau
Xianfei Li
COWAROBOT Co. Ltd.
Wenlong Liao
COWAROBOT Co. Ltd.
Tao He
COWAROBOT Co. Ltd.
Pai Peng
COWAROBOT Co. Ltd.
Jianbing Shen
SKL-IOTSC, CIS, University of Macau
Abstract
Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, with significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
Siyu Chen
Jimei University
Ting Han
Sun Yat-sen University
Changshe Zhang
Xidian University
Xin Luo
Jimei University
Meiliu Wu
University of Glasgow
Guorong Cai
Jimei University
Jinhe Su
Jimei University
Abstract
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to domain shifts, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/SY-Ch/DepthForge.
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Abstract
The creation of diverse and realistic driving scenarios has become essential to enhance perception and planning capabilities of the autonomous driving system. However, generating long-duration, surround-view consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended street multi-perspective videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates a DiT-based diffusion model equipped with cross-frame and cross-view modules across three stages with multiple training objectives, substantially boosting the diversity and quality of generated visual content. Importantly, we propose an innovative explicit viewpoint modeling approach for multi-view video generation to effectively improve motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), our UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2% in FID and 35.2% in FVD.
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Tsinghua University
Xufang Luo
Microsoft Research Asia
Dongsheng Li
Microsoft Research Asia
Abstract
Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
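As a rough illustration of optimizing an intermediate focus region from reward alone, a REINFORCE-style update can sample a candidate region and reinforce it according to answer correctness. The actual VisRL objective may differ; the policy interface, discrete region parametrization, and reward_fn below are assumptions.

```python
# Hedged sketch of reward-only optimization of an intermediate focus decision.
import torch

def reinforce_step(policy, optimizer, image_feats, question, reward_fn):
    logits = policy(image_feats, question)                # scores over candidate regions
    dist = torch.distributions.Categorical(logits=logits)
    region = dist.sample()                                # internal decision, no bbox label
    reward = reward_fn(region)                            # e.g., was the final answer correct?
    loss = -dist.log_prob(region) * reward                # REINFORCE gradient estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# toy usage with a linear scorer over pooled features (hypothetical setup)
feat_dim, num_regions = 64, 16
scorer = torch.nn.Linear(feat_dim, num_regions)
policy = lambda feats, q: scorer(feats.mean(dim=0))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
r = reinforce_step(policy, opt, torch.randn(10, feat_dim), "where is the cup?",
                   lambda region: float(region.item() == 3))
```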
Constraint-Aware Feature Learning for Parametric Point Cloud
Xi Cheng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Ruiqi Lei
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Di Huang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Zhichao Liao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Fengyuan Piao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Yan Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Pingfa Feng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Long Zeng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Abstract
Parametric point clouds are sampled from CAD shapes and are becoming increasingly common in industrial manufacturing. Most CAD-specific deep learning methods focus on geometric features, while overlooking constraints inherent in CAD shapes. This limits their ability to discern CAD shapes with similar appearances but different constraints. To tackle this challenge, we first analyze the constraint importance via simple validation experiments. Then, we introduce a deep learning-friendly constraint representation with three components, and design a constraint-aware feature learning network (CstNet), which includes two stages. Stage 1 extracts constraint representation from BRep data or point cloud based on local features. It enables better generalization ability to unseen datasets after pre-training. Stage 2 employs attention layers to adaptively adjust the weights of the three constraint components. It facilitates the effective utilization of constraints. In addition, we built the first multi-modal parametric-purpose dataset, i.e., Param20K, comprising about 20K CAD instances of 75 classes. On this dataset, CstNet achieved 3.49% (classification) and 26.17% (rotation robustness) accuracy improvements over the state-of-the-art. To the best of our knowledge, CstNet is the first constraint-aware deep learning method tailored for parametric point cloud analysis. Our project page with source code is available at: https://cstnetwork.github.io/.
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
Tongtong Cheng
Department of Computer Science, Chongqing University
Rongzhen Li
National Elite Institute of Engineering, Chongqing University
Yixin Xiong
Department of Computer Science, Chongqing University
Tao Zhang
Department of Computer Science, Chongqing University
Jing Wang
College of Computer Science and Technology, National University of Defense Technology
Kai Liu
Department of Computer Science, Chongqing University
Abstract
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relations, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDDX and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM
Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
Chong Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Sicheng Yu
The Hong Kong University of Science and Technology (Guangzhou)
Zijian Wang
The Hong Kong University of Science and Technology (Guangzhou)
Yifan Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Hao Wang
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.
RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
Chong Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Yu Hu
The Hong Kong University of Science and Technology (Guangzhou)
Sicheng Yu
The Hong Kong University of Science and Technology (Guangzhou)
Beizhen Zhao
The Hong Kong University of Science and Technology (Guangzhou)
Zijian Wang
The Hong Kong University of Science and Technology (Guangzhou)
Hao Wang
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein (MW2) distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in Sim(3) space. Furthermore, we design a joint 3DGS registration module that integrates the MW2 distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: https://3dagentworld.github.io/reggs/.
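The entropy-regularized optimal transport at the heart of the MW2 alignment can be sketched with a standard Sinkhorn iteration. The snippet below assumes a precomputed pairwise cost matrix between Gaussian components (e.g., squared 2-Wasserstein distances); it is a generic sketch, not the RegGS implementation.

```python
# Generic entropy-regularized Sinkhorn sketch for an optimal-transport alignment cost.
import torch

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """cost: (m, n) pairwise costs; a, b: marginal weights summing to 1."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.t() @ u + 1e-9)
    plan = torch.diag(u) @ K @ torch.diag(v)   # transport coupling between components
    return (plan * cost).sum(), plan           # entropic OT cost and coupling

m, n = 64, 80
cost = torch.rand(m, n)                        # stand-in for component-wise W2 costs
a = torch.full((m,), 1.0 / m)
b = torch.full((n,), 1.0 / n)
mw2_cost, plan = sinkhorn(cost, a, b)
```

In a registration setting, such a cost would be minimized over a Sim(3) transform applied to one of the Gaussian mixtures.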
Temporal-aware Query Routing for Real-time Video Instance Segmentation
Zesen Cheng
School of Electronic and Computer Engineering, Peking University, Shenzhen
Kehan Li
Alibaba Group
Yian Zhao
School of Electronic and Computer Engineering, Peking University, Shenzhen
Hang Zhang
Alibaba Group
Chang Liu
Department of Automation and BNRist, Tsinghua University, Beijing
Jie Chen
School of Electronic and Computer Engineering, Peking University, Shenzhen
Abstract
With the rise of applications such as embodied intelligence, developing real-time online video instance segmentation (VIS) has become increasingly important. However, through time profiling of the components in an advanced online VIS architecture (i.e., a transformer-based architecture), we find that the transformer decoder significantly hampers the inference speed. Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce a Temporal-Aware Query Routing (TAR) mechanism. We embed it before each transformer decoder layer. By fusing the optimal queries from the previous frame, the queries output by the preceding decoder layer, and their differential information, TAR predicts a binary classification score and then uses an argmax operation to determine whether the current layer should be skipped. Experimental results demonstrate that integrating TAR into the baselines achieves significant efficiency gains (24.7 → 34.6 FPS for MinVIS, 22.4 → 32.8 FPS for DVIS++) while also improving performance (e.g., on YouTube-VIS 2019, 47.4 → 48.4 AP for MinVIS, 55.5 → 55.7 AP for DVIS++). Furthermore, our analysis of the TAR mechanism shows that the number of skipped layers increases as the differences between adjacent video frames decrease, which suggests that our method effectively utilizes inter-frame differences to reduce redundant computations in the transformer decoder.
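The routing decision described above is conceptually small: fuse the previous frame's optimal queries, the current queries, and their difference, then predict a binary skip score. The sketch below shows this idea with illustrative layer sizes; it is not the paper's exact router.

```python
# Illustrative sketch of a query-based layer-skipping router.
import torch
import torch.nn as nn

class QueryRouter(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)
        self.cls = nn.Linear(dim, 2)           # scores for [run layer, skip layer]

    def forward(self, q_prev_frame, q_current):
        diff = q_current - q_prev_frame        # differential information between frames
        fused = self.fuse(torch.cat([q_prev_frame, q_current, diff], dim=-1))
        score = self.cls(fused.mean(dim=1))    # pool over queries: (B, 2)
        return score.argmax(dim=-1)            # 1 -> skip the next decoder layer

router = QueryRouter()
q_prev, q_cur = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
skip = router(q_prev, q_cur)                   # per-sample skip decision
```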
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Gunjan Chhablani
Georgia Tech
Xiaomeng Ye
Georgia Tech
Muhammad Zubair Irshad
Toyota Research Institute
Zsolt Kira
Georgia Tech
Abstract
The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on the real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87-0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Project page: https://gchhablani.github.io/embodied-splat.
Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
Seunggeun Chi
Purdue University
Enna Sachdeva
Honda Research Institute USA
Pin-Hao Huang
Honda Research Institute USA
Kwonjoon Lee
Honda Research Institute USA
Abstract
Amodal completion, the task of inferring the complete appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, including pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios due to their limited understanding of HOI. To address this challenge, we propose a novel approach that leverages physical prior knowledge alongside a specialized multi-regional inpainting technique tailored for HOI. By incorporating physical constraints derived from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to reside, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method employs customized denoising strategies across these regions within a diffusion model, thereby enhancing the accuracy and realism of generated completions in both shape and visual detail. Experimental results demonstrate that our approach substantially outperforms existing methods in HOI scenarios, advancing machine perception toward a more human-like understanding of dynamic environments. Furthermore, we show that our pipeline remains robust even without ground-truth contact annotations, broadening its applicability to tasks such as 3D reconstruction and novel view/pose synthesis.
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Zhixiang Chi
University of Toronto
Yanan Wu
China Agricultural University
Li Gu
Concordia University
Huan Liu
McMaster University
Ziqiang Wang
Concordia University
Yang Zhang
Beijing Jiaotong University
Yang Wang
Concordia University
Konstantinos N Plataniotis
University of Toronto
Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
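One way to picture the feedback loop is as follows: build a patch-to-patch affinity from the model's final predictions, prune it by confidence, and use it to re-aggregate intermediate features. The snippet is a hedged sketch of that idea; the exact isolation, pruning, and ensembling steps of the paper differ.

```python
# Hedged sketch: reuse output-level patch agreement as a spatial-coherence prior.
import torch

def feedback_attention(output_logits, value_tokens, keep_top=0.2):
    """output_logits: (N_patches, C) final patch-class scores;
       value_tokens:  (N_patches, D) intermediate features to re-aggregate."""
    probs = output_logits.softmax(dim=-1)
    affinity = probs @ probs.t()                         # patches agreeing on classes
    k = max(1, int(keep_top * affinity.size(-1)))
    thresh = affinity.topk(k, dim=-1).values[:, -1:]     # confidence-based pruning
    affinity = torch.where(affinity >= thresh, affinity, torch.zeros_like(affinity))
    attn = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return attn @ value_tokens                           # coherence-refined features

refined = feedback_attention(torch.randn(196, 21), torch.randn(196, 768))
```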
AJAHR: Amputated Joint Aware 3D Human Mesh Recovery
Hyunjin Cho
Chung-Ang University
Giyun Choi
Chung-Ang University
Jongwon Choi
Chung-Ang University
Abstract
Existing human mesh recovery methods assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss. This assumption introduces bias when applied to individuals with amputations, a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. Our model integrates a body-part amputation classifier, jointly trained with the mesh recovery network, to detect potential amputations. We also introduce Amputee 3D (A3D), a synthetic dataset offering a wide range of amputee poses for robust training. While maintaining competitive performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals. Additional materials can be found at: https://chojinie.github.io/project_AJAHR/
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Jungbin Cho
Yonsei University
Junwan Kim
Yonsei University
Jisoo Kim
Yonsei University
Minseo Kim
Yonsei University
Mingu Kang
Sungkyunkwan University
Sungeun Hong
Sungkyunkwan University
Tae-Hyun Oh
Yonsei University
Youngjae Yu
Yonsei University
Abstract
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this 'discord' between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals in diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Project website: https://whwjdqls.github.io/discord-motion/
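Decoding discrete tokens with rectified flow amounts to integrating a learned velocity field from noise to raw motion under token conditioning. The sketch below uses a plain Euler integrator and a throwaway stand-in for the trained velocity network; dimensions and step count are illustrative, not the DisCoRD configuration.

```python
# Hedged sketch of conditional rectified-flow decoding of motion tokens.
import torch

@torch.no_grad()
def rectified_flow_decode(velocity_net, token_emb, motion_dim, steps=16):
    """token_emb: (B, T, C) conditioning; returns (B, T, motion_dim) raw motion."""
    B, T, _ = token_emb.shape
    x = torch.randn(B, T, motion_dim)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt)
        v = velocity_net(x, t, token_emb)           # predicted velocity dx/dt
        x = x + dt * v                              # Euler step toward the data manifold
    return x

# toy usage with a throwaway velocity field standing in for the trained network
net = lambda x, t, c: c[..., : x.size(-1)] - x      # assumption: illustrative only
tokens = torch.randn(2, 60, 512)
motion = rectified_flow_decode(net, tokens, motion_dim=263)
```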
Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion
Hoonhee Cho
KAIST
Yuhwan Jeong
KAIST
Kuk-Jin Yoon
KAIST
Abstract
With advancements in sensor and display technologies, high-resolution imagery is becoming increasingly prevalent in diverse applications. As a result, optical flow estimation needs to adapt to larger image resolutions, where even moderate movements lead to substantial pixel displacements, making long-range motion estimation more critical than ever. However, existing datasets primarily focus on short-range flow in low-resolution settings, limiting the generalization of models to high-resolution scenarios with large displacements. Additionally, there is a lack of suitable datasets for evaluating model capacity in long-range motion estimation, further hindering progress in this area. To address this, we introduce RelayFlow-4K, a high-resolution 4K optical flow dataset designed to capture diverse motion patterns, including long-range intermediate frame flows. While such datasets provide valuable training resources, long-range estimation remains challenging due to increased matching ambiguity. Simply incorporating these datasets does not inherently improve performance. To this end, we propose a novel training framework that integrates matching cost distillation and incremental time-step learning to refine cost volume estimation and stabilize training. Additionally, we leverage the distance map, which measures the distance from unmatched regions to their nearest matched pixels, improving occlusion handling. Our approach significantly enhances long-range optical flow estimation in high-resolution settings. Our datasets and code are available at https://github.com/Chohoonhee/RelayFlow-4K.
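The distance-map cue mentioned above can be computed with an off-the-shelf Euclidean distance transform, assuming a binary mask of matched pixels is available. This is a generic sketch of the idea, not the paper's pipeline.

```python
# Distance from each unmatched pixel to its nearest matched pixel.
import numpy as np
from scipy.ndimage import distance_transform_edt

def unmatched_distance_map(match_mask: np.ndarray) -> np.ndarray:
    """match_mask: (H, W) bool, True where a correspondence exists."""
    # distance_transform_edt measures the distance to the nearest zero entry,
    # so pass the inverted mask: zeros at matched pixels.
    return distance_transform_edt(~match_mask)

mask = np.zeros((240, 320), dtype=bool)
mask[60:180, 80:240] = True                     # toy matched region
dmap = unmatched_distance_map(mask)             # 0 on matched pixels, grows with distance
```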
Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos
Changwoon Choi
Seoul National University
Jeongjun Kim
Seoul National University
Geonho Cha
NAVER Cloud
Minkwan Kim
Seoul National University
Dongyoon Wee
NAVER Cloud
Young Min Kim
Seoul National University
Abstract
Recent works on dynamic 3D neural field reconstruction assume the input from synchronized multi-view videos whose poses are known. The input constraints are often not satisfied in real-world setups, making the approach impractical. We show that unsynchronized videos from unknown poses can generate dynamic neural fields as long as the videos capture human motion. Humans are one of the most common dynamic subjects captured in videos, and their shapes and poses can be estimated using state-of-the-art libraries. While noisy, the estimated human shape and pose parameters provide a decent initialization point to start the highly non-convex and under-constrained problem of training a consistent dynamic neural representation. Given the shape and pose parameters of humans in individual frames, we formulate methods to calculate the time offsets between videos, followed by camera pose estimations that analyze the 3D joint positions. Then, we train the dynamic neural fields employing multiresolution grids while we concurrently refine both time offsets and camera poses. The setup still involves optimizing many parameters; therefore, we introduce a robust progressive learning strategy to stabilize the process. Experiments show that our approach achieves accurate spatio-temporal calibration and high-quality scene reconstruction in challenging conditions.
FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
Gene Chou
Netflix Eyeline Studios
Wenqi Xian
Netflix Eyeline Studios
Guandao Yang
Stanford University
Mohamed Abdelfattah
Cornell University
Bharath Hariharan
Cornell University
Noah Snavely
Cornell University
Ning Yu
Netflix Eyeline Studios
Paul Debevec
Netflix Eyeline Studios
Abstract
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044×1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data. We evaluate our approach across multiple datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth.
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Adrian Chow
University of Waterloo
Evelien Riddell
University of Waterloo
Yimu Wang
University of Waterloo
Sean Sedwards
University of Waterloo
Krzysztof Czarnecki
University of Waterloo
Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance. Our code is available at https://github.com/ahtchow/OV-SCAN.
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
Sanjoy Chowdhury
University of Maryland, College Park
Subrata Biswas
Meta Reality Labs
Sayan Nag
University of Toronto
Tushar Nagarajan
Meta Reality Labs
Calvin Murdock
Meta Reality Labs
Ishwarya Ananthabhotla
Meta Reality Labs
Yijun Qian
Meta Reality Labs
Vamsi Krishna Ithapu
Meta Reality Labs
Dinesh Manocha
University of Maryland, College Park
Ruohan Gao
University of Maryland, College Park
Abstract
Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets, EPIC-Kitchens, EasyCom, and Aria Everyday Activities, demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Guoliang You
University of Science and Technology of China
Wei Liu
University of Science and Technology of China
Xingchen Li
University of Science and Technology of China
Jianmin Ji
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Abstract
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. The code is available at https://github.com/cxmomo/GraspCoT.
ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung
Yale University
Hyoungseob Park
Yale University
Patrick Rim
Yale University
Xiaoran Zhang
Yale University
Jihe He
Yale University
Ziyao Zeng
Yale University
Safa Cicek
UCLA
Byung-Woo Hong
Chung-Ang University
James S. Duncan
Yale University
Alex Wong
Yale University
Abstract
We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some 'source' data, often predict erroneous outputs when transferred to 'target' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method 'Energy-based Test-time Adaptation', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
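A single adaptation step of this kind can be pictured as a gradient update that lowers the energy assigned to the current prediction. The function below is a hedged sketch: the depth completion model, the energy model, and which parameters the optimizer updates are all assumptions, not the exact ETA procedure.

```python
# Hedged sketch of one energy-minimizing test-time adaptation step.
import torch

def tta_step(depth_model, energy_model, optimizer, image, sparse_depth):
    pred = depth_model(image, sparse_depth)   # depth completion prediction
    energy = energy_model(pred).mean()        # high energy ~ far from the source distribution
    optimizer.zero_grad()
    energy.backward()                         # nudge (a subset of) model parameters
    optimizer.step()
    return pred.detach()

# toy usage with stand-in networks (1-channel image, 1-channel sparse depth)
depth_net = torch.nn.Conv2d(2, 1, 3, padding=1)
model = lambda img, sd: depth_net(torch.cat([img, sd], dim=1))
energy = torch.nn.Conv2d(1, 1, 3, padding=1)
opt = torch.optim.SGD(depth_net.parameters(), lr=1e-4)
pred = tta_step(model, energy, opt, torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```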
ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration
Andrea Conti
University of Bologna
Matteo Poggi
University of Bologna
Valerio Cambareri
Sony DepthSensing Solutions
Martin R. Oswald
University of Amsterdam
Stefano Mattoccia
University of Bologna
Abstract
Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to effectively use very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both real and synthetic sparse ToF datasets demonstrate the advantages of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.
SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino
University of Bologna
Pierluigi Zama Ramirez
University of Bologna
Luigi Lella
University of Bologna
Matteo Ragaglia
SACMI Imola
Alessandro Oliva
SACMI Imola
Giuseppe Lisanti
University of Bologna
Luigi Di Stefano
University of Bologna
Abstract
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (∼7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent single-view methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Debiased Teacher for Day-to-Night Domain Adaptive Object Detection
Yiming Cui
Hangzhou Dianzi University
Liang Li
Institute of Computing Technology, Chinese Academy of Sciences
Haibing Yin
Hangzhou Dianzi University
Yuhan Gao
Lishui Institute of Hangzhou Dianzi University
Yaoqi Sun
Lishui University
Chenggang Yan
Hangzhou Dianzi University
Abstract
Day-to-Night Domain Adaptive Object Detection (DNDAOD) is a significant challenge due to the low visibility and signal-to-noise ratio at night. Although recent self-training approaches achieve promising results, they fail to address three critical biases: distribution bias, training bias, and confirmation bias. Therefore, we propose a Debiased Teacher to address the above biases from three aspects: domain transforming, representation compensating, and pseudo label calibrating. Concretely, the day-to-night domain transforming module (DNDT) leverages physical priors to model some key day-night domain differences, thus transforming daytime images into night-like images. Then, the cross-domain representation compensating module (CDRC) selectively mixes objects from nighttime and night-like images to compensate for the model's general representation of nighttime objects. Further, to correct confirmation bias caused by learning from inaccurate pseudo labels, the pseudo label confirmation calibrating module (ConCal) is designed to obtain accurate pseudo labels for better nighttime knowledge learning. Experimental results on three benchmarks demonstrate that our method outperforms current SOTA methods by a large margin.
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Zhewei Dai
Huazhong University of Science and Technology
Shilei Zeng
Huazhong University of Science and Technology
Haotian Liu
Huazhong University of Science and Technology
Xurui Li
Huazhong University of Science and Technology
Feng Xue
University of Trento
Yu Zhou
Huazhong University of Science and Technology
Abstract
We introduce SeaS, a unified industrial generative model for automatically creating diverse anomalies, authentic normal products, and precise anomaly masks. While extensive research exists, most efforts either focus on specific tasks, i.e., anomalies or normal products only, or require separate models for each anomaly type. Consequently, prior methods either offer limited generative capability or depend on a vast array of anomaly-specific models. We demonstrate that U-Net's differentiated learning ability captures the distinct visual traits of slightly-varied normal products and diverse anomalies, enabling us to construct a unified model for all tasks. Specifically, we first introduce an Unbalanced Abnormal (UA) Text Prompt, comprising one normal token and multiple anomaly tokens. More importantly, our Decoupled Anomaly Alignment (DA) loss decouples anomaly attributes and binds them to distinct anomaly tokens of UA, enabling SeaS to create unseen anomalies by recombining these attributes. Furthermore, our Normal-image Alignment (NA) loss aligns the normal token to normal patterns, making generated normal products globally consistent and locally varied. Finally, SeaS produces accurate anomaly masks by fusing discriminative U-Net features with high-resolution VAE features. SeaS sets a new benchmark for industrial generation, significantly enhancing downstream applications, with average improvements of +8.66% pixel-level AP for synthesis-based AD approaches, +1.10% image-level AP for unsupervised AD methods, and +12.79% IoU for supervised segmentation models. Code is available at https://github.com/HUST-SLOW/SeaS.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Apple
Nina Wenzel
Apple
David Griffiths
Apple
Haiming Gang
Apple
Justin Lazarow
Apple
Gefen Kohavi
Apple
Kai Kang
Apple
Marcin Eichner
Apple
Yinfei Yang
Apple
Afshin Dehghan
Apple
Peter Grasch
Apple
Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. https://github.com/apple/ml-cubifyanything
Interpretable point cloud classification using multiple instance learning
Matt De Vries
Sentinal4D
Reed Naidoo
Institute of Cancer Research
Olga Fourkioti
Institute of Cancer Research
Lucas G Dent
University College London
Nathan Curry
Imperial College London
Chris Dunsby
University College London
Chris Bakal
Institute of Cancer Research
Abstract
Understanding 3D cell shape is crucial in biomedical research, where morphology serves as a key indicator of disease, cellular state, and drug response. However, many existing 3D point cloud classification models lack interpretability, limiting their utility for extracting biologically meaningful insights. In this work, we unify standard point cloud backbones and feature aggregation strategies within a Multiple Instance Learning (MIL) framework to enable inherently interpretable classification. Our approach, POINTMIL, improves classification performance while providing fine-grained point-level explanations without relying on post hoc analysis. We demonstrate state-of-the-art mACC (97.3%) and F1 (97.5%) on the IntrA biomedical dataset and evaluate the interpretability using quantitative and qualitative metrics. Additionally, we introduce ATLAS-1, a novel dataset of drug-treated 3D cancer cells, and use it to show how POINTMIL captures fine-grained morphological effects of chemical treatments. Beyond biomedical applications, POINTMIL generalises to standard benchmarks such as ModelNet40 and ScanObjectNN, offering interpretable 3D object recognition across domains.
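A common way to make MIL over points inherently interpretable is attention-based pooling, where per-point attention weights double as point-level explanations. The sketch below illustrates that mechanism with arbitrary layer sizes; it is not the released POINTMIL code.

```python
# Attention-based multiple-instance pooling over per-point features.
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_classes=2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.cls = nn.Linear(feat_dim, num_classes)

    def forward(self, point_feats):                             # (B, N, feat_dim)
        attn = torch.softmax(self.score(point_feats), dim=1)    # (B, N, 1)
        bag = (attn * point_feats).sum(dim=1)                   # bag = whole point cloud
        return self.cls(bag), attn.squeeze(-1)                  # logits, per-point weights

pool = AttentionMILPool()
logits, point_weights = pool(torch.randn(8, 1024, 256))         # weights explain the decision
```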
Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
Junyuan Deng
The Hong Kong University of Science and Technology
Wei Yin
Horizon Robotics
Xiaoyang Guo
Horizon Robotics
Qian Zhang
Horizon Robotics
Xiaotao Hu
The Hong Kong University of Science and Technology
Weiqiang Ren
Horizon Robotics
Xiao-Xiao Long
Nanjing University
Ping Tan
The Hong Kong University of Science and Technology
Abstract
In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach significantly outperforms baselines and provides broad benefits to 3D vision tasks.
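To make the "RANSAC operation" concrete, suppose the generated Camera Image can be converted to normalized ray coordinates per pixel (an assumption; the paper's exact encoding may differ). Then each pixel gives a linear constraint u = fx * x + cx, and a 2-point RANSAC robustly recovers the focal length and principal point per axis, as sketched below.

```python
# Hedged illustration: robust recovery of (fx, cx) from per-pixel ray coordinates.
import numpy as np

def ransac_fx_cx(u, x, iters=500, thresh=1.0, rng=np.random.default_rng(0)):
    """u: pixel columns, x: normalized ray x-coordinates, both shape (N,)."""
    best_inliers, best = -1, (None, None)
    for _ in range(iters):
        i, j = rng.choice(len(u), size=2, replace=False)
        if abs(x[i] - x[j]) < 1e-8:
            continue
        fx = (u[i] - u[j]) / (x[i] - x[j])          # minimal 2-point solution
        cx = u[i] - fx * x[i]
        inliers = np.abs(fx * x + cx - u) < thresh  # reprojection residual test
        if inliers.sum() > best_inliers:
            best_inliers, best = inliers.sum(), (fx, cx)
    return best

# toy data: ground-truth fx=600, cx=320 with a few gross outliers
x = np.random.uniform(-0.5, 0.5, 2000)
u = 600 * x + 320
u[:50] += np.random.uniform(-200, 200, 50)
fx, cx = ransac_fx_cx(u, x)
```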
Open-World Skill Discovery from Unsegmented Demonstration Videos
Jingwen Deng
Peking University
Zihao Wang
Peking University
Shaofei Cai
Peking University
Anji Liu
University of California, Los Angeles
Yitao Liang
Peking University
Abstract
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on random splitting or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semanticaware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. The SBD-generated segments yielded relative performance improvements of 63.7% and 52.1% for conditioned policies on short-term atomic tasks, and 11.3% and 20.8% for their corresponding hierarchical agents on long-horizon tasks, compared to unsegmented baselines. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page is at https://craftjarvis.github.io/SkillDiscovery/.
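The boundary rule itself is simple to express: flag a frame as a skill boundary when the action-prediction error of the pretrained unconditional model jumps sharply relative to the previous frame. The relative-increase criterion, threshold, and minimum gap below are illustrative, not the paper's settings.

```python
# Sketch of error-spike-based skill boundary detection.
import numpy as np

def detect_skill_boundaries(pred_errors, ratio=1.5, min_gap=30):
    """pred_errors: per-frame prediction error of the action model, shape (T,)."""
    boundaries, last = [], -min_gap
    for t in range(1, len(pred_errors)):
        jump = pred_errors[t] / (pred_errors[t - 1] + 1e-8)
        if jump > ratio and t - last >= min_gap:   # sudden error increase => new skill
            boundaries.append(t)
            last = t
    return boundaries

errors = np.abs(np.random.randn(1000)) + 0.1
errors[300:305] += 5.0                             # simulated skill switch
print(detect_skill_boundaries(errors))
```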
Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction
Youming Deng
Cornell University
Wenqi Xian
Netflix Eyeline Studios
Guandao Yang
Stanford University
Leonidas Guibas
Stanford University
Gordon Wetzstein
Stanford University
Steve Marschner
Cornell University
Paul Debevec
Netflix Eyeline Studios
Abstract
Large field-of-view (FOV) cameras can simplify and accelerate scene capture because they provide complete coverage with fewer views. However, existing reconstruction pipelines fail to take full advantage of large-FOV input data because they convert input views to perspective images, resulting in stretching that prevents the use of the full image. Additionally, they calibrate lenses using models that do not accurately fit real fisheye lenses in the periphery. We present a new reconstruction pipeline based on Gaussian Splatting that uses a flexible lens model and supports fields of view approaching 180 degrees. We represent lens distortion with a hybrid neural field based on an Invertible ResNet and use a cubemap to render wide-FOV images while retaining the efficiency of the Gaussian Splatting pipeline. Our system jointly optimizes lens distortion, camera intrinsics, camera poses, and scene representations using a loss measured directly against the original input pixels. We present extensive experiments on both synthetic and real-world scenes, demonstrating that our model accurately fits real-world fisheye lenses and that our end-to-end self-calibration approach provides higher-quality reconstructions than existing methods. More details and videos can be found at the project page: https://denghilbert.github.io/self-cali/.
Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures
Xinlong Ding
University of Science and Technology Beijing
Hongwei Yu
University of Science and Technology Beijing
Jiawei Li
University of Science and Technology Beijing
Feifan Li
University of Science and Technology Beijing
Yu Shang
Tsinghua University
Bochao Zou
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Abstract
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to a significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
RePoseD: Efficient Relative Pose Estimation With Known Depth Information
Yaqing Ding
Czech Technical University in Prague
Viktor Kocur
Comenius University in Bratislava
Václav Vávra
Czech Technical University in Prague
Zuzana Berger Haladová
Comenius University in Bratislava
Jian Yang
Nankai University
Torsten Sattler
Czech Technical University in Prague
Zuzana Kukelova
Czech Technical University in Prague
Abstract
Recent advances in monocular depth estimation methods (MDEs) and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question of whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) two calibrated cameras, (2) two cameras with an unknown shared focal length, and (3) two cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code is available at https://github.com/kocurvik/mdrp.
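A hedged sketch of the underlying algebra, with notation assumed rather than taken from the paper: for the j-th correspondence with normalized image points x1, x2 and monocular depth predictions d1, d2, the rigid-motion constraint with per-camera affine depth corrections reads

```latex
% s_i, o_i are the unknown per-camera scale and shift corrections; the
% scale-only variant sets o_i = 0, and one scale can be fixed to remove the
% global gauge freedom.
\[
\bigl(s_2\, d_2^{(j)} + o_2\bigr)\,\mathbf{x}_2^{(j)}
  \;=\;
  R\,\bigl(s_1\, d_1^{(j)} + o_1\bigr)\,\mathbf{x}_1^{(j)} + \mathbf{t},
\qquad j = 1,\dots,N .
\]
```

Each correspondence contributes three scalar equations in the pose and the depth-correction unknowns, which is what makes compact depth-aware solvers for the calibrated and unknown-focal configurations possible.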
Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
Jeonghyeok Do
Korea Advanced Institute of Science and Technology
Munchurl Kim
Korea Advanced Institute of Science and Technology
Abstract
In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization learning. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages TDSM to pull correct skeleton-text matches closer while pushing apart those of different action classes. TDSM significantly outperforms very recent state-of-the-art methods, with large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
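One plausible triplet-style instantiation of the described loss (our notation and margin form, not necessarily the paper's exact objective): the denoising error of a noised skeleton latent under the matched text condition should be smaller, by a margin, than under a text condition from a different action class.

```latex
\[
\mathcal{L}_{\mathrm{TD}}
= \mathbb{E}_{t,\,\epsilon}\Bigl[
  \max\Bigl(0,\;
  \bigl\lVert \epsilon - \epsilon_\theta(z_t, t, c^{+}) \bigr\rVert_2^2
  \;-\;
  \bigl\lVert \epsilon - \epsilon_\theta(z_t, t, c^{-}) \bigr\rVert_2^2
  \;+\; m\Bigr)\Bigr],
\]
```

where z_t is the noised skeleton feature, c+ the text features of the ground-truth action, c- those of a different class, and m a margin.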
DAMap: Distance-aware MapNet for High Quality HD Map Construction
Jinpeng Dong
Xi'an Jiaotong University
Chen Li
Xi'an Jiaotong University
Yutong Lin
Xi'an Jiaotong University
Jingwen Fu
Xi'an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Nanning Zheng
Xi'an Jiaotong University
Abstract
High-definition (HD) maps are an important component to support navigation and planning for autonomous driving vehicles. Predicting map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly at high-quality predictions due to inherent task misalignment. Two main factors are responsible for this misalignment: 1) inappropriate task labels, because one-to-many matching queries share the same labels, and 2) sub-optimal task features, due to a task-shared sampling mechanism. In this paper, we reveal these two inherent defects in current methods and develop a novel HD map construction method named DAMap to address them. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels to one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the DAFL. We perform extensive experiments and consistently achieve performance improvements on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules.
DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
Yue-Jiang Dong
Tsinghua University
Wang Zhao
ARC Lab, Tencent PCG
Jiale Xu
ARC Lab, Tencent PCG
Ying Shan
ARC Lab, Tencent PCG
Song-Hai Zhang
Tsinghua University
Abstract
Diffusion-based video depth estimation methods have achieved remarkable success. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
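A minimal sketch of the scale-synchronization idea (assumed interface; the actual method applies such corrections inside the diffusion guidance loop rather than as post-processing): consecutive sliding windows overlap in a few frames, so a per-window scale (and optionally shift) can be fitted on the overlap by least squares and used to keep all windows on a common scale.

```python
# Fit s (and b) so that s * curr + b ~= prev on the overlapping frames of two
# consecutive windows; the correction is then applied to the current window.
import numpy as np

def align_window(prev_overlap, curr_overlap, use_shift=True):
    """prev_overlap, curr_overlap: depth arrays of the shared frames."""
    x = curr_overlap.reshape(-1)
    y = prev_overlap.reshape(-1)
    if use_shift:
        A = np.stack([x, np.ones_like(x)], axis=1)
        (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    else:
        s, b = float(x @ y) / float(x @ x), 0.0
    return s, b

# Usage: depths of window k are rescaled so their overlap agrees with window k-1.
prev = np.random.rand(4, 64, 64) * 2.0 + 0.3       # already-synchronized window
curr = (prev - 0.3) / 2.0 + 0.01 * np.random.rand(4, 64, 64)
s, b = align_window(prev, curr)
print(round(float(s), 2), round(float(b), 2))       # approximately 2.0 and 0.3
```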
From One to More: Contextual Part Latents for 3D Generation
Shaocong Dong
HKUST
Lihe Ding
CUHK
Xiao Chen
CUHK
Yaokun Li
CUHK
Yuxin Wang
HKUST
Yucheng Wang
HKUST
Qi Wang
HKUST
Jaehyeok Kim
HKUST
Chenjian Gao
CUHK
Zhanpeng Huang
SenseTime Research
Zibin Wang
SenseTime Research
Tianfan Xue
CUHK
Dan Xu
HKUST
Abstract
To generate 3D objects, early research focused on multi-view-driven approaches relying solely on 2D renderings. Recently, the 3D native latent diffusion paradigm has demonstrated superior performance in 3D generation, because it fully leverages the geometric information provided in ground-truth 3D data. Despite its fast development, 3D diffusion still faces three challenges. First, the majority of these methods represent a 3D object by one single latent, regardless of its complexity. This may lead to detail loss when generating 3D objects with multiple complicated parts. Second, most 3D assets are designed part by part, yet the current holistic latent representation overlooks the independence of these parts and their interrelationships, limiting the model's generative ability. Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process, lacking detailed controllability. Therefore, motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. This part-based framework has several advantages: i) it reduces the encoding burden of intricate objects by decomposing them into simpler parts, ii) it facilitates part learning and part relationship modeling, and iii) it naturally supports part-level control. Furthermore, to ensure the coherence of part latents and to harness the powerful priors from foundation models, we propose a novel mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising. Benefiting from the part-based representation, we demonstrate that CoPart can support various applications including part editing, articulated object generation, and mini-scene generation. Moreover, we collect a new large-scale 3D part dataset named Partverse from Objaverse through automatic mesh segmentation and subsequent human post-annotation. By training on the proposed dataset, CoPart achieves promising part-based 3D generation with high controllability. Project page: https://copart3d.github.io.
Online Dense Point Tracking with Streaming Memory
Qiaole Dong
Fudan University
Yanwei Fu
Fudan University
Abstract
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. The SPOT framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT, with 10x fewer parameters, operates at least 2x faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and code at: https://dqiaole.github.io/SPOT/.
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh
Weizmann Institute of Science
Nimrod Shabtay
IBM Research
Eli Schwartz
IBM Research
Hilde Kuehne
IBM Research
Raja Giryes
Tel Aviv University
Rogerio Feris
MIT-IBM
Leonid Karlinsky
MIT-IBM
James Glass
MIT CSAIL
Assaf Arbelle
IBM Research
Shimon Ullman
Weizmann Institute of Science
M. Jehanzeb Mirza
MIT CSAIL
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and bounding box, and is tasked with localizing the same object type in a query image. Personalized localization is particularly important when several related objects could match a textual description, or when an object is hard to describe in words. To elicit personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs, exposing critical weaknesses in present-day VLMs and laying a foundation for future research in context-driven vision-language applications.
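A hypothetical sketch of how such an instruction-tuning sample could be assembled from a tracking sequence, with the category label replaced by a pseudo-name (the names, message format, and helper below are assumptions, not the authors' data format):

```python
# Turn frames that track the same object into an in-context localization
# dialogue; the real category label is swapped for a pseudo-name so the model
# must rely on the visual context rather than label priors.
import random

def build_personalized_sample(track_frames, query_frame, pseudo_names):
    """track_frames: list of (image_path, bbox); query_frame: (image_path, bbox)."""
    name = random.choice(pseudo_names)              # e.g. "blicket" instead of "mug"
    messages = []
    for img, box in track_frames:                   # in-context examples
        messages.append({"role": "user", "content": f"<image:{img}> Where is the {name}?"})
        messages.append({"role": "assistant", "content": f"The {name} is at {box}."})
    q_img, q_box = query_frame
    messages.append({"role": "user", "content": f"<image:{q_img}> Where is the {name}?"})
    target = f"The {name} is at {q_box}."           # supervision for fine-tuning
    return messages, target

msgs, tgt = build_personalized_sample(
    [("f001.jpg", (12, 40, 88, 120)), ("f014.jpg", (30, 42, 101, 125))],
    ("f027.jpg", (55, 44, 130, 128)),
    pseudo_names=["blicket", "dax", "wug"])
print(tgt)
```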
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
Xiaobiao Du
University of Technology Sydney
Yida Wang
Li Auto Inc.
Haiyang Sun
Li Auto Inc.
Zhuojie Wu
The University of Queensland
Hongwei Sheng
The University of Queensland
Shuyun Wang
The University of Queensland
Jiaying Ying
The University of Queensland
Ming Lu
City University of Macau
Tianqing Zhu
City University of Macau
Kun Zhan
Li Auto Inc.
Xin Yu
The University of Queensland
Abstract
3D cars are widely used in self-driving systems, virtual and augmented reality, and gaming applications. However, existing 3D car datasets are either synthetic or low-quality, limiting their practical utility and leaving a significant gap in high-quality real-world 3D car data. In this paper, we present the first large-scale 3D real car dataset, termed 3DRealCar, which offers three key features: (1) High-Volume: 2,500 cars meticulously scanned using smartphones to capture RGB images and point clouds with real-world dimensions; (2) High-Quality: each car is represented by an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) High-Diversity: the dataset encompasses a diverse collection of cars from over 100 brands, captured under three distinct lighting conditions (reflective, standard, and dark). We further provide detailed car parsing maps for each instance to facilitate research in automotive segmentation tasks. To focus on vehicles, background point clouds are removed, and all cars are aligned to a unified coordinate system, enabling controlled reconstruction and rendering. We benchmark state-of-the-art 3D reconstruction methods across different lighting conditions using 3DRealCar. Extensive experiments demonstrate that the standard lighting subset can be used to reconstruct high-quality 3D car models that significantly enhance performance on various car-related 2D and 3D tasks. Notably, our dataset reveals critical challenges faced by current 3D reconstruction methods under reflective and dark lighting conditions, providing valuable insights for future research. Our project is hosted at https://xiaobiaodu.github.io/3drealcar/.
Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
Ji Du
Nankai University
Xin Wang
The Hong Kong Polytechnic University
Fangwei Hao
Nankai University
Mingyang Yu
Nankai University
Chunyuan Chen
Nankai University
Jiesheng Wu
Anhui Normal University
Bin Wang
Nankai University
Jing Xu
The Hong Kong Polytechnic University
Ping Li
The Hong Kong Polytechnic University
Abstract
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice either hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which can be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE
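A simplified sketch of the retrieval step (our own minimal version, not the released code): each pixel feature votes between the camouflaged-object and environment prototype libraries via K-nearest-neighbour cosine similarity, and the votes form the pseudo-mask.

```python
import numpy as np

def knn_pseudo_mask(feat, obj_protos, env_protos, k=5):
    """feat: (H, W, C); obj_protos/env_protos: (N, C); returns (H, W) bool mask."""
    H, W, C = feat.shape
    f = feat.reshape(-1, C)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    protos = np.concatenate([obj_protos, env_protos], axis=0)
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    sim = f @ protos.T                                   # cosine similarity to all prototypes
    is_obj = np.arange(protos.shape[0]) < obj_protos.shape[0]
    topk = np.argsort(-sim, axis=1)[:, :k]               # indices of the k nearest prototypes
    votes = is_obj[topk].mean(axis=1)                    # fraction of object neighbours
    return (votes > 0.5).reshape(H, W)

mask = knn_pseudo_mask(np.random.rand(32, 32, 64),
                       np.random.rand(100, 64), np.random.rand(300, 64))
print(mask.shape, mask.dtype)
```

The multi-view variant described above would run this retrieval on features from several augmented views and merge the resulting masks.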
RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
Sicong Du
CaiNiao Inc., Alibaba Group
Jiarun Liu
CaiNiao Inc., Alibaba Group
Qifeng Chen
CaiNiao Inc., Alibaba Group
Hao-Xiang Chen
BNRist, Tsinghua University
Tai-Jiang Mu
BNRist, Tsinghua University
Sheng Yang
CaiNiao Inc., Alibaba Group
Abstract
A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: first, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to the reconstruction phase, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at https://github.com/CN-ADLab/RGE-GS.
RTMap: Real-Time Recursive Mapping with Change Detection and Localization
Yuheng Du
CaiNiao Inc., Alibaba Group
Sheng Yang
CaiNiao Inc., Alibaba Group
Lingxuan Wang
CaiNiao Inc., Alibaba Group
Zhenghua Hou
CaiNiao Inc., Alibaba Group
Chengying Cai
CaiNiao Inc., Alibaba Group
Zhitao Tan
CaiNiao Inc., Alibaba Group
Mingxia Chen
CaiNiao Inc., Alibaba Group
Shi-Sheng Huang
Beijing Normal University
Qiang Li
CaiNiao Inc., Alibaba Group
Abstract
While recent online HD mapping methods relieve the burden on offline pipelines and address map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolving memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling of HD map elements, (2) probabilistic localization w.r.t. the crowdsourced prior map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance on both prior-aided map quality and localization accuracy, showing our effectiveness in robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior map asynchronously. Our source code will be made publicly available at https://github.com/CN-ADLab/RTMap.
RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
Yuwen Du
Shanghai Jiao Tong University
Anning Hu
Shanghai Jiao Tong University
Zichen Chao
Nanjing University of Science and Technology
Yifan Lu
Shanghai Jiao Tong University
Junhao Ge
Shanghai Jiao Tong University
Genjia Liu
Shanghai Jiao Tong University
Weitao Wu
Nanjing University of Science and Technology
Lanjun Wang
Tianjin University
Siheng Chen
Shanghai Jiao Tong University
Abstract
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) a Camera Extrinsic Optimizer that ensures accurate 3D-to-2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS) that determines the placement of diverse digital assets within 3D space; (3) DepthSAM, which innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of the foreground; and (4) a Scalable Post-Processing Toolkit that generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74% on Rcooper-Intersection and 83.12% on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code can be accessed at: https://github.com/duyuwen-duen/RoCo-Sim
Counting Stacked Objects
Corentin Dumery
EPFL
Noa Etté
EPFL
Aoxiang Fan
EPFL
Ren Li
EPFL
Jingyi Xu
Stony Brook University
Hieu Le
EPFL
Pascal Fua
EPFL
Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world datasets with manually verified total counts. Our datasets and code can be found at https://corentindumery.github.io/projects/stacks.html
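Read literally, the two-subproblem decomposition suggests a simple counting rule (notation assumed):

```latex
\[
\hat{N} \;\approx\; \frac{V_{\mathrm{stack}} \cdot \hat{\rho}_{\mathrm{occ}}}{V_{\mathrm{object}}},
\]
```

where V_stack is the volume of the reconstructed stack or container region from multi-view geometry, rho_occ the predicted occupancy ratio, and V_object the volume of a single object instance.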
Is Tracking Really More Challenging in First Person Egocentric Vision?
Matteo Dunnhofer
University of Udine
Zaira Manigrasso
University of Udine
Christian Micheloni
University of Udine
Abstract
Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on this task.
SynCity: Training-Free Generation of 3D Worlds
Paul Engstler
University of Oxford
Aleksandar Shtedritski
University of Oxford
Iro Laina
University of Oxford
Christian Rupprecht
University of Oxford
Andrea Vedaldi
University of Oxford
Abstract
We propose SynCity, a method for generating explorable 3D worlds from textual descriptions. Our approach leverages pre-trained textual, image, and 3D generators without requiring fine-tuning or inference-time optimization. While most 3D generators are object-centric and unable to create large-scale worlds, we demonstrate how 2D and 3D generators can be combined to produce ever-expanding scenes. The world is generated tile by tile, with each new tile created within its context and seamlessly integrated into the scene. SynCity enables fine-grained control over the appearance and layout of the generated worlds, which are both detailed and diverse. Project page: https://research.paulengstler.com/syncity/
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Xiaojian Ma
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Rongpeng Su
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Jun Guo
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Rujie Wu
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Xi Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Abstract
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 6.5% on Ego4D-VQ3D, 2.6% on OpenEQA, and 15.3% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Ke Fan
Shanghai Jiao Tong University
Shunlin Lu
CUHK, Shenzhen
Minyue Dai
Fudan University
Runyi Yu
HKUST
Lixing Xiao
Zhejiang University
Zhiyang Dou
HKU
Junting Dong
Shanghai AI Laboratory
Lizhuang Ma
Shanghai Jiao Tong University, East China Normal University
Jingbo Wang
Shanghai AI Laboratory
Abstract
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, namely zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
Bing Fan
University of North Texas
Yunhe Feng
University of North Texas
Yapeng Tian
University of Texas at Dallas
James Chenhao Liang
U.S. Naval Research Laboratory
Yuewei Lin
Brookhaven National Laboratory
Yan Huang
University of North Texas
Heng Fan
University of North Texas
Abstract
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to a lack of sufficient target cues, leading to degraded performance. To address this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core idea is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features, improving target localization. PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are then used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, PRVQL, besides the given object cues, enjoys additional crucial target information from the video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on the challenging Ego4D benchmark, PRVQL achieves state-of-the-art results and largely surpasses other methods, showing its efficacy. Our code, models and results will be released at https://github.com/fb-reps/PRVQL.
RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
Baojie Fan
Nanjing University of Posts and Telecommunications
Xiaotian Li
Nanjing University of Posts and Telecommunications
Yuhan Zhou
Nanjing University of Posts and Telecommunications
Yuyu Jiang
Nanjing University of Posts and Telecommunications
Jiandong Tian
Shenyang Institute of Automation, Chinese Academy of Sciences
Huijie Fan
Shenyang Institute of Automation, Chinese Academy of Sciences
Abstract
The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly focus on processing large-scale voxels, which brings high computational costs and degrades details. Additionally, they struggle to accurately capture occluded targets and distant information. In this paper, we propose a novel LiDAR-Camera 3D semantic occupancy prediction framework called RIOcc, with collaborative feature refinement and a multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and enhances the efficiency of feature alignment. Then, multi-scale feature processing substantially expands the receptive fields. Meanwhile, in the LiDAR branch, we design Dual-branch Pooling (DBP) to adaptively enhance geometric features across both the channel and grid dimensions. In the camera branch, the Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.
Video Individual Counting for Moving Drones
Yaowu Fan
Sun Yat-sen University
Jia Wan
Harbin Institute of Technology (Shenzhen)
Tao Han
Hong Kong University of Science and Technology
Antoni B. Chan
City University of Hong Kong
Andy J. Ma
Sun Yat-sen University
Abstract
Video Individual Counting (VIC) has received increasing attention for its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation under highly varying viewpoints over time in crowded scenes. Existing methods rely on localization followed by association or classification, which struggle under dense and dynamic conditions due to inaccurate localization of small targets. To address these issues, we introduce the MovingDroneCrowd dataset, featuring videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. We further propose a Shared Density map-guided Network (SDNet) using a Depth-wise Cross-Frame Attention (DCFA) module to directly estimate shared density maps between consecutive frames, from which the inflow and outflow density maps are derived by subtracting the shared density maps from the global density maps. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our dataset and publicly available ones show the superiority of our method over the state of the art in highly dynamic and complex crowded scenes. Our dataset and code have been released publicly.
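In symbols (notation assumed, following the description above):

```latex
\[
D^{\mathrm{in}}_{t} = D_{t} - S_{t-1,t},
\qquad
D^{\mathrm{out}}_{t-1} = D_{t-1} - S_{t-1,t},
\qquad
\hat{N} = \sum_{\mathbf{p}} D_{1}(\mathbf{p})
        + \sum_{t=2}^{T}\sum_{\mathbf{p}} D^{\mathrm{in}}_{t}(\mathbf{p}),
\]
```

where D_t is the global density map of frame t and S_{t-1,t} the shared density map predicted by the DCFA module; summing the first frame's density and all subsequent inflow maps yields the number of unique pedestrians in the video.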
Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Xiao Fang
Carnegie Mellon University
Minhyek Jeon
Carnegie Mellon University
Zheyang Qin
Carnegie Mellon University
Stanislav Panev
Carnegie Mellon University
Celso de Melo
DEVCOM Army Research Laboratory
Shuowen Hu
DEVCOM Army Research Laboratory
Shayok Chakraborty
Florida State University
Fernando De la Torre
Carnegie Mellon University
Abstract
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: https://humansensinglab.github.io/AGenDA
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
Shuangkang Fang
Beihang University
I-Chao Shen
The University of Tokyo
Yufeng Wang
Beihang University
Yi-Hsuan Tsai
Google
Yi Yang
StepFun
Shuchang Zhou
StepFun
Wenrui Ding
Beihang University
Takeo Igarashi
The University of Tokyo
Ming-Hsuan Yang
UC Merced
Abstract
We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs' token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50x larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs' ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.
NeRF Is a Valuable Assistant for 3D Gaussian Splatting
Shuangkang Fang
Beihang University
I-Chao Shen
The University of Tokyo
Takeo Igarashi
The University of Tokyo
Yufeng Wang
Beihang University
ZeSheng Wang
Beihang University
Yi Yang
StepFun
Wenrui Ding
Beihang University
Shuchang Zhou
StepFun
Abstract
We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.
Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction
Yanwen Fang
The University of Hong Kong
Wenqi Jia
University of Illinois Urbana-Champaign
Xu Cao
University of Illinois Urbana-Champaign
Peng-Tao Jiang
vivo Mobile Communication Co., Ltd
Guodong Li
The University of Hong Kong
Jintai Chen
HKUST(GZ)
Abstract
Multi-person motion prediction becomes particularly challenging when handling highly interactive scenarios involving extreme motions. Previous works focused more on the case of 'moderate' motions (e.g., walking together), where predicting each pose in isolation often yields reasonable results. However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. PGformer incorporates a novel cross-query attention module to learn bidirectional dependencies between pose sequences and a proxy unit that subtly controls bidirectional spatial information flow. We evaluated PGformer on the challenging ExPI dataset, which involves large collaborative movements. Both quantitative and qualitative results demonstrate the superiority of PGformer in both short- and long-term predictions. We also test the proposed method on moderate movement datasets CMU-Mocap and MuPoTS-3D, generalizing PGformer to scenarios with more than two individuals with promising results. Code of PGformer is available at https://github.com/joyfang1106/pgformer.
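For intuition, a generic bidirectional cross-attention between two persons' pose feature sequences might look as follows (a hypothetical module; the paper's cross-query attention and proxy unit differ in their details):

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # Each person's features are refined by attending to the partner's sequence.
        a_ctx, _ = self.a_to_b(query=feat_a, key=feat_b, value=feat_b)
        b_ctx, _ = self.b_to_a(query=feat_b, key=feat_a, value=feat_a)
        return feat_a + a_ctx, feat_b + b_ctx

mod = BidirectionalCrossAttention()
a, b = torch.randn(2, 50, 128), torch.randn(2, 50, 128)   # (batch, time, dim)
out_a, out_b = mod(a, b)
print(out_a.shape, out_b.shape)
```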
SuperDec: 3D Scene Decomposition with Superquadrics Primitives
Elisabetta Fedele
ETH Zurich
Boyang Sun
ETH Zurich
Leonidas Guibas
Stanford University
Marc Pollefeys
ETH Zurich
Francis Engelmann
Stanford University
Abstract
We present SUPERDEC, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent methods use geometric primitives to obtain photorealistic 3D reconstructions, we instead leverage them to obtain a compact yet expressive representation. To this end, we design a novel architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our model on ShapeNet and demonstrate its generalization capabilities on object instances from ScanNet++ as well as on full Replica scenes. Finally, we show that our compact superquadric-based representation supports a wide range of downstream applications, including robotic manipulation and controllable visual content generation. Project page: https://super-dec.github.io.
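For reference, each primitive is commonly parameterized by the standard superquadric inside-outside function in its canonical frame (the decomposition network would additionally regress a rigid pose together with the parameters below for every primitive):

```latex
\[
\left(
  \left(\frac{x}{a_x}\right)^{\frac{2}{\varepsilon_2}}
 +\left(\frac{y}{a_y}\right)^{\frac{2}{\varepsilon_2}}
\right)^{\frac{\varepsilon_2}{\varepsilon_1}}
+\left(\frac{z}{a_z}\right)^{\frac{2}{\varepsilon_1}}
= 1,
\]
```

with scales (a_x, a_y, a_z) and shape exponents (epsilon_1, epsilon_2) that interpolate between box-like, ellipsoidal, and cylinder-like shapes, which is what makes the representation so compact.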
ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
Xiaokun Feng
School of Artificial Intelligence, UCAS
Shiyu Hu
School of Physical and Mathematical Sciences, NTU
Xuchen Li
School of Artificial Intelligence, UCAS
Dailing Zhang
School of Artificial Intelligence, UCAS
Meiqi Wu
School of Artificial Intelligence, UCAS
Jing Zhang
School of Artificial Intelligence, UCAS
Xiaotang Chen
School of Artificial Intelligence, UCAS
Kaiqi Huang
School of Artificial Intelligence, UCAS
Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack
Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction
Tuo Feng
ReLER, CCAI, Zhejiang University
Wenguan Wang
ReLER, CCAI, Zhejiang University
Yi Yang
ReLER, CCAI, Zhejiang University
Abstract
In autonomous driving, accurately predicting occupancy and motion is crucial for safe navigation within dynamic environments. However, existing methods often suffer from difficulties in handling complex scenes and uncertainty arising from sensor data. To address these issues, we propose a new Gaussian-based World Model (GWM), seamlessly integrating raw multi-modal sensor inputs. In 1st stage, Gaussian representation learner utilizes self-supervised pretraining to learn robust Gaussian representation. Gaussian representation integrates semantic and geometric information and establishes a robust probabilistic understanding of the environment. In 2nd stage, GWM seamlessly integrates learning, simulation, and planning into a unified framework, empowering the uncertainty-aware simulator & planner to jointly forecast future scene evolutions and vehicle trajectories. Simulator generates future scene predictions by modeling both static and dynamic elements, while planner calculates optimal paths to minimize collision risks, thus enhancing navigation safety. Overall, GWM employs a sensor-to-planning world model that directly processes raw sensor data, setting it apart from previous methods. Experiments show that GWM outperforms state-of-the-art approaches by 1.46% in semantic comprehension and 0.07m in motion prediction. Moreover, we provide an in-depth analysis of Gaussian representations under complex scenarios.
I2VControl: Disentangled and Unified Video Motion Synthesis Control
Wanquan Feng
Intelligent Creation Team, ByteDance
Tianhao Qi
University of Science and Technology of China (USTC)
Jiawei Liu
Intelligent Creation Team, ByteDance
Mingzhen Sun
Institute of Automation, Chinese Academy of Sciences (CASIA)
Pengqi Tu
Intelligent Creation Team, ByteDance
Tianxiang Ma
Intelligent Creation Team, ByteDance
Fei Dai
Intelligent Creation Team, ByteDance
Songtao Zhao
Intelligent Creation Team, ByteDance
Siyu Zhou
Intelligent Creation Team, ByteDance
Qian He
Intelligent Creation Team, ByteDance
Abstract
Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Project page: https://wanquanf.github.io/I2VControl.
Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization
Mingtao Feng
Xidian University
Longlong Mei
Xidian University
Zijie Wu
Xidian University
Jianqiao Luo
Hunan University
Fenghao Tian
Xidian University
Jie Feng
Xidian University
Weisheng Dong
Xidian University
Yaonan Wang
Hunan University
Abstract
Text to point cloud cross-modal localization is a crucial vision-language task for future human-robot collaboration. Existing coarse-to-fine frameworks assume that each query text precisely corresponds to the center area of a submap, limiting their applicability in real-world scenarios. This work redefines the task under a more realistic assumption, relaxing the one-to-one retrieval constraint by allowing partially matching query text and submap pairs. To address this challenge, we augment datasets with partially matching submaps and introduce an uncertainty-aware framework. Specifically, we model cross-modal ambiguity in fine-grained location regression by integrating uncertainty scores, represented as 2D Gaussian distributions, to mitigate the impact of challenging samples. Additionally, we propose an uncertainty-aware similarity metric that enhances similarity assessment between query text and submaps by propagating uncertainty into coarse place recognition, enabling the model to learn discriminative features, effectively handle partially matching samples and improve task synergy. Extensive experiments on KITTI360Pose and CityRefer demonstrate that our method achieves state-of-the-art performance across both stages. Our code is available at https://github.com/Afoolbird/PMSH
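One common way to realize such uncertainty-weighted location regression is a Gaussian negative log-likelihood (a generic aleatoric-uncertainty formulation shown here for intuition; the paper's exact loss may differ):

```latex
\[
\mathcal{L}_{\mathrm{loc}}
= \frac{\lVert \boldsymbol{\mu}_\theta - \mathbf{y} \rVert_2^2}{2\,\sigma_\theta^{2}}
+ \log \sigma_\theta^{2},
\]
```

so that ambiguous, partially matching samples can be assigned a larger predicted variance and contribute less to the regression term, while confident samples are fitted tightly.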
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng
UC Berkeley
Junyi Zhang
UC Berkeley
Qianqian Wang
UC Berkeley
Yufei Ye
Stanford University
Pengcheng Yu
Max Planck Institute for Intelligent Systems
Michael J. Black
Max Planck Institute for Intelligent Systems
Trevor Darrell
UC Berkeley
Angjoo Kanazawa
UC Berkeley
Abstract
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
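A generic form of such a reprojection objective (notation assumed; the exact weighting and robust penalty used in the paper may differ):

```latex
\[
\mathcal{L}_{\mathrm{reproj}}
= \sum_{t}\sum_{\mathbf{p}}
  \rho\!\left(\pi_t\!\bigl(X_t(\mathbf{p})\bigr) - \hat{\mathbf{p}}_t\right),
\]
```

where X_t(p) is the predicted world-frame 3D point for pixel p, pi_t the projection into frame t, p_hat_t the observed pixel location (or 2D track) in that frame, and rho a robust penalty; this supervises geometry and correspondence jointly without requiring 4D ground truth.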
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
School of Computer Science, Peking University
Yijiang Li
University of California, San Diego
Wanpeng Zhang
School of Computer Science, Peking University
Sipeng Zheng
unknown
Hao Luo
School of Computer Science, Peking University
Zihao Yue
Renmin University of China
Zongqing Lu
BeingBeyond
Abstract
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
FlowR: Flowing from Sparse to Dense 3D Reconstructions
Tobias Fischer
ETH Zurich
Samuel Rota Bulò
Meta Reality Labs Zurich
Yung-Hsu Yang
ETH Zurich
Nikhil Keetha
Meta Reality Labs Zurich
Lorenzo Porzi
Meta Reality Labs Zurich
Norman Müller
Meta Reality Labs Zurich
Katja Schwarz
Meta Reality Labs Zurich
Jonathon Luiten
Meta Reality Labs Zurich
Marc Pollefeys
ETH Zurich
Peter Kontschieder
Meta Reality Labs Zurich
Abstract
3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of applications like Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These models typically rely on a noise-to-data generative process conditioned only on a handful of reference input views, leading to hallucinations, inconsistent generation results, and subsequent reconstruction artifacts. Instead, we propose a multi-view, flow matching model that learns a flow to directly connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with consistent, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540×960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.
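The described data-to-data formulation is consistent with a standard (rectified) flow-matching objective, sketched here with assumed notation for the conditioning and schedule:

```latex
\[
x_\tau = (1-\tau)\,x_{\mathrm{sparse}} + \tau\,x_{\mathrm{dense}},
\qquad
\mathcal{L}_{\mathrm{FM}}
= \mathbb{E}_{\tau,\,(x_{\mathrm{sparse}},\,x_{\mathrm{dense}})}
  \bigl\lVert v_\theta(x_\tau, \tau, c)
  - \bigl(x_{\mathrm{dense}} - x_{\mathrm{sparse}}\bigr) \bigr\rVert_2^2,
\]
```

where x_sparse is a novel-view rendering from the sparse reconstruction, x_dense the corresponding rendering expected from a dense capture, and c the reference views; at inference the learned velocity field is integrated starting from the sparse rendering rather than from pure noise.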
Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
Tom Fischer
Saarland University
Xiaojie Zhang
University of Technology Nuremberg
Eddy Ilg
University of Technology Nuremberg
Abstract
Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at github.com/Fischer-Tom/unified-detectionand-pose-estimation.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu
Huazhong University of Science and Technology
Diankun Zhang
Xiaomi EV
Zongchuang Zhao
Huazhong University of Science and Technology
Jianfeng Cui
Xiaomi EV
Dingkang Liang
Huazhong University of Science and Technology
Chong Zhang
Xiaomi EV
Dingyuan Zhang
Huazhong University of Science and Technology
Hongwei Xie
Xiaomi EV
Bing Wang
Xiaomi EV
Xiang Bai
Huazhong University of Science and Technology
Abstract
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, it remains an open problem that few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a hOlistic E2E autonomous dRiving framework by vIsion-language instructed actiON generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives
Yuqian Fu
INSAIT, Sofia University 'St. Kliment Ohridski'
Runze Wang
Fudan University
Bin Ren
University of Trento
Guolei Sun
ETH Zurich
Biao Gong
unknown
Yanwei Fu
Fudan University
Danda Pani Paudel
unknown
Xuanjing Huang
Fudan University
Luc Van Gool
unknown
Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM [75], a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator's effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Code is available at: http://yuqianfu.com/ObjectRelator.
Beyond RGB: Adaptive Parallel Processing for RAW Object Detection
Shani Gamrian
Sony Research
Hila Barel
Sony Research
Feiran Li
Sony Research
Masakazu Yoshimura
Sony Group Corporation
Daisuke Iso
Sony Research
Abstract
Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges. Our code is available at https://github.com/SonyResearch/RawAdaptationModule.
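As a rough illustration of the parallel-ISP-and-fuse idea (not the paper's RAM implementation), the sketch below applies a few differentiable ISP-style functions to a RAW-like input in parallel and fuses them with a learned 1x1 convolution; the specific branches are assumptions.

```python
# Illustrative sketch: several differentiable ISP-style functions applied to a
# RAW-like input in parallel, then fused with learned weights. Branch choices
# are assumptions, not the paper's RAM configuration.
import torch
import torch.nn as nn

class ParallelISPFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.45))   # learnable tone curve exponent
        self.gain = nn.Parameter(torch.ones(1))          # learnable global gain
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)       # learned fusion of 3 branches

    def forward(self, raw):
        # raw: (B, 1, H, W) RAW-like intensities in [0, 1]
        eps = 1e-6
        branch_gamma = raw.clamp(min=eps) ** self.gamma                 # tone-curve branch
        branch_gain = (self.gain * raw).clamp(0.0, 1.0)                  # exposure/gain branch
        branch_log = torch.log1p(raw) / torch.log(torch.tensor(2.0))     # log-compression branch
        stacked = torch.cat([branch_gamma, branch_gain, branch_log], dim=1)
        return self.fuse(stacked)                                         # fused representation

module = ParallelISPFusion()
raw = torch.rand(2, 1, 128, 128)
features = module(raw)   # (2, 1, 128, 128), would feed a downstream detector
```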
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
Wanshui Gan
The University of Tokyo
Fang Liu
The University of Tokyo
Hongbin Xu
South China University of Technology
Ningkai Mo
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Naoto Yokoya
The University of Tokyo
Abstract
We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D ego pose from sensors during training. To address this limitation, we propose the Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps and semantic maps), which is time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth ego pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering). The relevant code is available at https://github.com/GANWANSHUI/GaussianOcc.git.
Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Suchisrit Gangopadhyay
Yale University
Jung-Hee Kim
Michigan State University
Xien Chen
Yale University
Patrick Rim
Yale University
Hyoungseob Park
Yale University
Alex Wong
Yale University
Abstract
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, in both indoor and outdoor settings, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: github.com/JungHeeKim29/calibration-token.
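A hedged sketch of the calibration-token mechanism as described: a small set of learnable tokens is prepended to the token sequence of a frozen encoder, and only those tokens are trained. The toy transformer below stands in for an actual FMDE.

```python
# Hedged sketch of "calibration tokens": learnable tokens prepended to a frozen
# ViT-style encoder's token sequence so fisheye inputs can be nudged toward the
# perspective-image latent distribution. The encoder is a toy stand-in.
import torch
import torch.nn as nn

class FrozenToyEncoder(nn.Module):
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.parameters():
            p.requires_grad_(False)   # the foundational model stays frozen

    def forward(self, tokens):
        return self.blocks(tokens)

class CalibrationTokens(nn.Module):
    def __init__(self, num_tokens=8, dim=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, patch_tokens):
        b = patch_tokens.shape[0]
        cal = self.tokens.expand(b, -1, -1)
        return torch.cat([cal, patch_tokens], dim=1)   # prepend calibration tokens

encoder = FrozenToyEncoder()
cal_tokens = CalibrationTokens()           # the only trainable parameters
fisheye_patches = torch.randn(2, 196, 64)  # toy patch embeddings of a fisheye image
latent = encoder(cal_tokens(fisheye_patches))
```

In a self-supervised setup like the one described, the training signal would come from enforcing consistency between estimates on perspective images and their fisheye-recalibrated counterparts.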
3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
Jianzhe Gao
Zhejiang University
Rui Liu
Zhejiang University
Wenguan Wang
Zhejiang University
Abstract
Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.
3D Mesh Editing using Masked LRMs
Will Gao
University of Chicago
Dilin Wang
Meta Reality Labs
Yuchen Fan
Meta Reality Labs
Aljaz Bozic
Meta Reality Labs
Tuur Stuyck
Meta Reality Labs
Zhengqin Li
Meta Reality Labs
Zhao Dong
Meta Reality Labs
Rakesh Ranjan
Meta Reality Labs
Nikolaos Sarafianos
Meta Reality Labs
Abstract
We present a novel approach to shape editing, building on recent progress in 3D reconstruction from multi-view images. We formulate shape editing as a conditional reconstruction problem, where the model must reconstruct the input shape with the exception of a specified 3D region, in which the geometry should be generated from the conditional signal. To this end, we train a conditional Large Reconstruction Model (LRM) for masked reconstruction, using multi-view consistent masks rendered from a randomly generated 3D occlusion, and using one clean viewpoint as the conditional signal. During inference, we manually define a 3D region to edit and provide an edited image from a canonical viewpoint to fill that region. We demonstrate that, in just a single forward pass, our method not only preserves the input geometry in the unmasked region through reconstruction capabilities on par with SoTA, but is also expressive enough to perform a variety of mesh edits from a single image guidance that past works struggle with, while being 2-10x faster than the top-performing prior work.
Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
Quankai Gao
University of Southern California
Iliyan Georgiev
Adobe Research
Tuanfeng Y. Wang
Adobe Research
Krishna Kumar Singh
Adobe Research
Ulrich Neumann
University of Southern California
Jae Shin Yoon
Adobe Research
Abstract
3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as applications, showing its ability to facilitate downstream generation tasks. Project page: https://github.com/Zerg-Overmind/Can3Tok
CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Yuanyuan Gao
Northwestern Polytechnical University
Hao Li
Northwestern Polytechnical University
Jiaqi Chen
Northwestern Polytechnical University
Zhengyu Zou
Northwestern Polytechnical University
Zhihang Zhong
Shanghai Artificial Intelligence Laboratory
Dingwen Zhang
Northwestern Polytechnical University
Xiao Sun
Shanghai Artificial Intelligence Laboratory
Junwei Han
Northwestern Polytechnical University
Abstract
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. To further enhance both overall quality and geometric accuracy, CityGS-X presents a progressive RGB-Depth-Normal training strategy. This approach enhances 3D consistency by jointly optimizing appearance and geometry representation through multi-view constraints and off-the-shelf depth priors within batch-level training. Through extensive experiments, CityGS-X consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4x4090 GPUs, a task on which alternative methods encounter Out-Of-Memory (OOM) issues and fail completely, placing CityGS-X far beyond the capacity of existing methods. Project Page: https://lifuguan.github.io/CityGS-X/
Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
Zhirui Gao
National University of Defense Technology
Renjiao Yi
National University of Defense Technology
Yaqiao Dai
National University of Defense Technology
Xuening Zhu
National University of Defense Technology
Wei Chen
National University of Defense Technology
Chenyang Zhu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential 'edge point cloud reconstruction and parametric curve fitting' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
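The bi-directional coupling can be illustrated with a toy example in which point-primitive centers are sampled from a differentiable parametric curve, so gradients on the primitives flow back to the curve parameters; the cubic Bezier form and all names below are assumptions for illustration, not the paper's parametrization.

```python
# Toy illustration of coupling a parametric curve with point-like primitives:
# Gaussian centers are sampled along a cubic Bezier curve so gradients on the
# centers flow back to the curve's control points.
import torch

def cubic_bezier(control_points, n_samples=32):
    """control_points: (4, 3) differentiable tensor; returns (n_samples, 3) centers."""
    t = torch.linspace(0.0, 1.0, n_samples).unsqueeze(1)          # (n, 1)
    p0, p1, p2, p3 = control_points
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
        + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

control_points = torch.randn(4, 3, requires_grad=True)
centers = cubic_bezier(control_points)        # primitive centers tied to the curve

# A placeholder loss on the centers stands in for a differentiable-rendering loss;
# its gradients reach the curve parameters, which is the basic mechanism that
# lets multi-view evidence optimize the curve directly.
loss = (centers ** 2).sum()
loss.backward()
print(control_points.grad.shape)              # torch.Size([4, 3])
```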
DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
Ziqi Gao
Shenzhen University
Qiufu Li
Shenzhen University
Linlin Shen
Shenzhen University
Abstract
Compared to 2D data, the scale of point cloud data available for training in different domains is quite limited. Researchers have been trying to combine data from different domains for masked autoencoder (MAE) pre-training to alleviate this data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode during fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus. The code will be released at https://github.com/CVI-SZU/DAP-MAE
Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation
Chen Gao
Beijing Jiaotong University
Shuo Zhang
Beijing Jiaotong University
Youfang Lin
Beijing Jiaotong University
Abstract
Disparity estimation is an essential step in processing and analyzing Light Field (LF) images. Recent methods construct the cost volume to exploit the correspondence of the LFs over the preset maximum disparity, limiting their ability to process large-parallax scenes. Different from constructing cost volume, the self-attention mechanism calculates the parallax attention between epipolar lines to find the matching points. However, for LFs that have different views, the related disparity scales are different in parallax attention since the baselines with the central view are different. Moreover, if the matching information is occluded in one view, the disparity information can be explored through other views. Therefore, mapping these attentions to the same scale and selecting effective matching information are key points for disparity estimation from parallax attention. In this paper, we explore parallax attention for LF and design an unsupervised method, named Epipolar Consistent Attention Aggregation Network (ECAAN). We first introduce an epipolar consistent scale unification block by considering the consistency relationships to standardize disparity scales of the parallax attention maps. Based on the intra-properties and inter-relationships of parallax attention, we further propose a consistent occlusion-free aggregation block to integrate the information from the occlusion-free areas. In addition, we design an improved photometric loss to constrain the model. ECAAN achieves state-of-the-art performance in LF depth estimation. Notably, ECAAN attains a mean square error of 0.2 on large-disparity LF datasets, achieving a 68% error reduction compared to the second-best method.
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao
CUHK
Kai Chen
HKUST
Bo Xiao
Huawei Cloud
Lanqing Hong
Huawei Noah's Ark Lab
Zhenguo Li
Huawei Noah's Ark Lab
Qiang Xu
CUHK
Abstract
The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3x resolution and 4x frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2's ability, unlocking broader applications in autonomous driving. Project page: flymin.github.io/magicdrive-v2/
Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
Zhirui Gao
National University of Defense Technology
Renjiao Yi
National University of Defense Technology
Yuhang Huang
National University of Defense Technology
Wei Chen
National University of Defense Technology
Chenyang Zhu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
Low-level 3D representations, such as point clouds, meshes, NeRFs and 3D Gaussians, are commonly used for modeling 3D objects and scenes. However, cognitive studies indicate that human perception operates at higher levels and interprets 3D environments by decomposing them into meaningful structural parts, rather than low-level elements like points or voxels. Structured geometric decomposition enhances scene interpretability and facilitates downstream tasks requiring component-level manipulation. In this work, we introduce PartGS, a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition, leveraging multi-view image inputs to uncover 3D structural information. Our method jointly optimizes superquadric meshes and Gaussians by coupling their parameters within a hybrid representation. On one hand, superquadrics enable the representation of a wide range of shape primitives, facilitating flexible and meaningful decompositions. On the other hand, 2D Gaussians capture detailed texture and geometric details, ensuring high-fidelity appearance and geometry reconstruction. Operating in a self-supervised manner, our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.
SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
Zihui Gao
Zhejiang University
Jia-Wang Bian
ByteDance Seed
Guosheng Lin
Nanyang Technological University
Hao Chen
Zhejiang University
Chunhua Shen
Zhejiang University
Abstract
Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at: https://github.com/aim-uofa/SurfaceSplat.
VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
Yash Garg
University of California, Riverside
Saketh Bachu
University of California, Riverside
Arindam Dutta
University of California, Riverside
Rohit Lal
University of California, Riverside
Sarosij Bose
University of California, Riverside
Calvin-Khang Ta
University of California, Riverside
M. Salman Asif
University of California, Riverside
Amit Roy-Chowdhury
University of California, Riverside
Abstract
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/
Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
Junhao Ge
Shanghai Jiao Tong University
Zuhong Liu
Shanghai Jiao Tong University
Longteng Fan
Shanghai Jiao Tong University
Yifan Jiang
Shanghai Jiao Tong University
Jiaqi Su
Shanghai Jiao Tong University
Yiming Li
New York University
Zhejun Zhang
ETH Zurich
Siheng Chen
Shanghai Jiao Tong University
Abstract
End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators have significant limitations for synthetic data generation: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization. Our code will be released at https://github.com/cancaries/SceneCrafter.
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh
Carnegie Mellon University
Shen Zheng
Carnegie Mellon University
Robert Tamburo
Carnegie Mellon University
Khiem Vuong
Carnegie Mellon University
Juan Alvarez-Padilla
Carnegie Mellon University
Hailiang Zhu
Carnegie Mellon University
Michael Cardei
Carnegie Mellon University
Nicholas Dunn
Carnegie Mellon University
Christoph Mertz
Carnegie Mellon University
Srinivasa G. Narasimhan
Carnegie Mellon University
Abstract
Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8x) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. For reading work zone signs, composing a detector and a text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5° (+9.9%) and 75.3% of pathways have AE < 0.5° (+8.1%).
Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping
Emanuele Giacomini
Sapienza University of Rome
Luca Di Giammarino
Sapienza University of Rome
Lorenzo De Rebotti
Sapienza University of Rome
Giorgio Grisetti
Sapienza University of Rome
Martin R. Oswald
University of Amsterdam
Abstract
LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite this success, maintaining an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing times. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that exclusively relies on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives uniquely from LiDAR measurements. Experiments show that our approach matches the current registration performance, while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotics estimation tasks.
Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
Uzay Gökay
University of Bonn
Federico Spurio
University of Bonn
Dominik R. Bach
University of Bonn
Juergen Gall
University of Bonn
Abstract
Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at github.com/bachlab/SMQ.
RoMo: Robust Motion Segmentation Improves Structure from Motion
Lily Goli
Google DeepMind
Sara Sabour
Google DeepMind
Mark Matthews
Google DeepMind
Marcus A. Brubaker
Google DeepMind
Dmitry Lagun
Google DeepMind
Alec Jacobson
Adobe Research
David J. Fleet
Google DeepMind
Saurabh Saxena
Google DeepMind
Andrea Tagliasacchi
Google DeepMind
Abstract
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually captured video. Estimating accurate camera poses from videos through structure-from-motion (SfM) relies on robustly separating static and dynamic parts of a video. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
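A minimal sketch of the epipolar cue used by such methods: fit a fundamental matrix robustly to flow correspondences and flag points with large Sampson error as likely dynamic. The synthetic data and threshold below are illustrative; in practice the correspondences would come from an optical-flow network and the mask would be combined with a video segmentation model.

```python
# Minimal sketch of an epipolar cue for motion segmentation: fit a fundamental
# matrix robustly to correspondences, then flag points with large Sampson error
# as likely dynamic. Data and threshold are illustrative.
import numpy as np
import cv2

def sampson_error(F, pts1, pts2):
    """Sampson distance of (N, 2) correspondences w.r.t. fundamental matrix F."""
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T      # row-wise F @ x1
    Ftx2 = x2 @ F       # row-wise F^T @ x2
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / (den + 1e-12)

# Synthetic two-view data: static 3D points seen by a translating camera, plus a
# few independently perturbed points that violate the epipolar constraint.
rng = np.random.default_rng(0)
X = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 8.0], size=(400, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
proj = lambda P: (P @ K.T)[:, :2] / (P @ K.T)[:, 2:]
pts1 = proj(X)
pts2 = proj(X - np.array([0.2, 0.0, 0.0]))           # camera translated along x
pts2[:40] += rng.normal(0.0, 5.0, size=(40, 2))       # simulate moving points

F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
dynamic_mask = sampson_error(F, pts1, pts2) > 2.0      # flags most perturbed points
```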
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
Zhefei Gong
Westlake University
Pengxiang Ding
Zhejiang University
Shangke Lyu
Westlake University
Siteng Huang
Zhejiang University
Mingyang Sun
Zhejiang University
Wei Zhao
Westlake University
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Donglin Wang
Westlake University
Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
Bingchen Gong
École Polytechnique
Diego Gomez
École Polytechnique
Abdullah Hamdi
Visual Geometry Group, University of Oxford
Abdelrahman Eldesokey
King Abdullah University of Science and Technology (KAUST)
Ahmed Abdelreheem
King Abdullah University of Science and Technology (KAUST)
Peter Wonka
King Abdullah University of Science and Technology (KAUST)
Abstract
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
Referring Expression Comprehension for Small Objects
Kanoko Goto
Institute of Science Tokyo
Takumi Hirose
Institute of Science Tokyo
Mahiro Ukai
Institute of Science Tokyo
Shuhei Kurita
National Institute of Informatics
Nakamasa Inoue
Institute of Science Tokyo
Abstract
Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are publicly available on the project page.
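A hedged sketch of the progressive zoom-in idea: detect, crop a context window around the current best box, re-run the detector on the crop, and map the result back to the original image. The detector below is a placeholder callable, not GroundingDINO or the paper's PIZA adapter.

```python
# Illustrative progressive zoom-in loop for grounding small objects. The
# `detector` argument is a placeholder for any text-conditioned detector that
# returns a single (x0, y0, x1, y1) box in the coordinates of the image it sees.
from PIL import Image

def progressive_zoom(image, expression, detector, steps=3, context=2.0):
    offset_x, offset_y = 0, 0
    crop = image
    x0, y0, x1, y1 = detector(crop, expression)
    for _ in range(steps):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        half = max(x1 - x0, y1 - y0) * context / 2
        left, top = int(max(0, cx - half)), int(max(0, cy - half))
        right = int(min(crop.width, cx + half))
        bottom = int(min(crop.height, cy + half))
        crop = crop.crop((left, top, right, bottom))
        offset_x, offset_y = offset_x + left, offset_y + top
        # The object now occupies a larger fraction of the detector's input,
        # which is the point of zooming progressively for small objects.
        x0, y0, x1, y1 = detector(crop, expression)
    return (offset_x + x0, offset_y + y0, offset_x + x1, offset_y + y1)

def dummy_detector(img, text):
    # Placeholder detector that always returns a centered box.
    return (img.width * 0.45, img.height * 0.45, img.width * 0.55, img.height * 0.55)

image = Image.new("RGB", (1280, 960))
print(progressive_zoom(image, "the small traffic sign on the right", dummy_detector))
```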
Knowledge-Guided Part Segmentation
Xuejian Gou
Xidian University
Fang Liu
Xidian University
Licheng Jiao
Xidian University
Shuo Li
Xidian University
Lingling Li
Xidian University
Hao Wang
Xidian University
Xu Liu
Xidian University
Puhua Chen
Xidian University
Wenping Ma
Xidian University
Abstract
In real-world scenarios, objects and their parts inherently possess both coarse-grained differences and intricate fine-grained structural relationships. These characteristics can be formalized as knowledge, leveraged for fine-grained part comprehension. However, existing part segmentation models consistently fail to capture these complex inter-part relationships, treating parts as independent entities and disregarding object-level distinctions. To address these limitations, we propose a novel Knowledge-Guided Part Segmentation (KPS) framework. Our approach automatically extracts structural relationships between parts using a large language model (LLM) and integrates them into a knowledge graph. Subsequently, a structural knowledge guidance module employs a graph convolutional network (GCN) to model these relationships. Furthermore, a coarse-grained object guidance module captures object-specific distinctions and integrates them as visual guidance. The integrated insights from the part structure and object differentiation guide the fine-grained part segmentation. Our KPS achieves notable improvements in segmentation performance, with a 4.96% mIoU gain on PartImageNet and a 3.73% gain on Pascal-Part. Moreover, in the open-vocabulary setting on Pascal-Part-116, it improves hIoU by 3.25%, highlighting the effectiveness of knowledge guidance in enhancing fine-grained part segmentation.
GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes
Pradyumn Goyal
UMass Amherst
Dmitry Petrov
UMass Amherst
Sheldon Andrews
Ecole de technologie superieure
Yizhak Ben-Shabat
Roblox
Hsueh-Ti Derek Liu
Roblox
Evangelos Kalogerakis
UMass Amherst, TU Crete
Abstract
We present GEOPARD, a transformer-based architecture for predicting articulation from a single static snapshot of a 3D shape. The key idea of our method is a pretraining strategy that allows our transformer to learn plausible candidate articulations for 3D shapes based on a geometric-driven search without manual articulation annotation. The search automatically discovers physically valid part motions that do not cause detachments or collisions with other shape parts. Our experiments indicate that this geometric pretraining strategy, along with carefully designed choices in our transformer architecture, yields state-of-the-art results in articulation inference in the PartNet-Mobility dataset.
Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
Bhavya Goyal
University of Wisconsin-Madison
Felipe Gutierrez-Barragan
Ubicept
Wei Lin
University of Wisconsin-Madison
Andreas Velten
University of Wisconsin-Madison, Ubicept
Yin Li
University of Wisconsin-Madison
Mohit Gupta
University of Wisconsin-Madison, Ubicept
Abstract
LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at https://bhavyagoyal.github.io/ppc.
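One way to picture a probabilistic point cloud is as an (N, 3) array of coordinates paired with an (N,) array of per-point confidences that downstream modules can consume, for example through confidence-weighted voxel pooling rather than hard filtering. The pooling below is an illustrative consumer, not the paper's detector integration.

```python
# Minimal sketch of the "probabilistic point cloud" idea: each point carries a
# confidence derived from the raw measurement, used here via confidence-weighted
# voxel pooling (an illustrative consumer, not the paper's pipeline).
import numpy as np

def weighted_voxel_centroids(points, confidence, voxel_size=1.0):
    """Confidence-weighted centroid and total confidence per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    centroids = np.zeros((len(uniq), 3))
    weights = np.zeros(len(uniq))
    np.add.at(weights, inverse, confidence)
    for d in range(3):
        np.add.at(centroids[:, d], inverse, confidence * points[:, d])
    return centroids / np.maximum(weights[:, None], 1e-8), weights

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(1000, 3))       # (N, 3) xyz
confidence = rng.uniform(0.0, 1.0, size=(1000,))    # (N,) per-point probability
centroids, voxel_conf = weighted_voxel_centroids(points, confidence)
```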
Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Jiasheng Guo
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Xin Gao
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Yuxiang Yan
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Guanghao Li
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Jian Pu
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Abstract
Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
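A toy decomposition in the spirit of the linear/nonlinear split described above: a learnable black-level-and-gain step followed by a learnable tone curve, both differentiable so a downstream detection loss could update them end-to-end. Parameter choices are assumptions, not the paper's Dark-ISP configuration.

```python
# Illustrative differentiable ISP split into a linear step (black level + gain)
# and a nonlinear step (gamma tone mapping), trainable with a detection loss.
import torch
import torch.nn as nn

class LinearCalibration(nn.Module):
    def __init__(self):
        super().__init__()
        self.black_level = nn.Parameter(torch.tensor(0.02))
        self.gain = nn.Parameter(torch.tensor(4.0))    # strong gain for low light

    def forward(self, raw):
        return ((raw - self.black_level) * self.gain).clamp(0.0, 1.0)

class ToneMapping(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.4545))

    def forward(self, x):
        return x.clamp(min=1e-6) ** self.gamma

isp = nn.Sequential(LinearCalibration(), ToneMapping())
raw = torch.rand(2, 1, 64, 64) * 0.1                    # dark RAW-plane-like input
out = isp(raw)
# In end-to-end training, `out` would feed a detector and gradients from the
# detection loss would update the ISP parameters.
```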
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo
Tsinghua University
Yikang Ding
MEGVII
Xiwu Chen
Mach Drive
Shuo Chen
Tsinghua University
Bohan Li
Shanghai Jiao Tong University
Yingshuang Zou
Tsinghua University
Xiaoyang Lyu
University of Hong Kong
Feiyang Tan
Mach Drive
Xiaojuan Qi
University of Hong Kong
Zhiheng Li
Tsinghua University
Hao Zhao
Tsinghua University
Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable temporal forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations. The project is available at https://royalmelon0505.github.io/DiST-4D
IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
Wenxuan Guo
Tsinghua University
Xiuwei Xu
Tsinghua University
Hang Yin
Tsinghua University
Ziwei Wang
Nanyang Technological University
Jianjiang Feng
Tsinghua University
Jie Zhou
Tsinghua University
Jiwen Lu
Tsinghua University
Abstract
Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL or on a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian Splatting (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during the agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations
Ruoxi Guo
Zhejiang University
Huaijin Pi
The University of Hong Kong
Zehong Shen
Zhejiang University
Qing Shuai
Zhejiang University
Zechen Hu
Deep Glint
Zhumei Wang
Deep Glint
Yajiao Dong
Deep Glint
Ruizhen Hu
Shenzhen University
Taku Komura
The University of Hong Kong
Sida Peng
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Ziyan Guo
Singapore University of Technology and Design
Zeyu Hu
LIGHTSPEED
De Wen Soh
Singapore University of Technology and Design
Na Zhao
Singapore University of Technology and Design
Abstract
Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: https://diouo.github.io/motionlab.github.io/.
Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Shuang Guo
TU Berlin and Robotics Institute
Friedhelm Hamann
TU Berlin and Robotics Institute, Science of Intelligence Excellence Cluster, Einstein Center for Digital Future
Guillermo Gallego
TU Berlin and Robotics Institute, Science of Intelligence Excellence Cluster, Einstein Center for Digital Future
Abstract
Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, even though those methods output only one quantity. Project page: https://github.com/tub-rip/E2FAI
WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
Yansong Guo
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Jie Hu
National University of Singapore
Yansong Qu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Recent advances in intuitive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time intuitive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a 40x speedup compared to existing SOTA models. Code will be released at https://github.com/Ethan16162/WildSeg3D.
HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network
Juhyung Ha
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Vibhas Kumar Vats
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Soon-heung Jung
Electronics and Telecommunications Research Institute, Daejeon
Alimoor Reza
Department of Mathematics and Computer Science, Drake University, Des Moines, IA
David J. Crandall
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Abstract
Point-cloud upsampling aims to generate dense point sets from sparse or incomplete 3D data. Most existing work uses a point-to-point framework. While this method achieves high geometric precision, it is slow because of irregular memory accesses to process unstructured point data. Alternatively, voxel-based methods offer computational efficiency by using regular grids, but struggle to preserve precise point locations due to discretization. To resolve this efficiency-precision trade-off, we introduce Hybrid Voxels, a representation that combines both voxel occupancy and a continuous point offset. We then present the Hybrid-Voxel Point-cloud Upsampling Network (HVPUNet), an efficient framework built upon this representation. HVPUNet integrates two key modules: (1) Shape Completion to restore missing geometry by filling empty voxels, and (2) Super-Resolution to enhance spatial resolution and capture finer surface details. We also use progressive refinement, operational voxel expansion, and implicit geometric learning. Experimental results demonstrate that HVPUNet can upsample point clouds at significantly lower computational cost than the state-of-the-art, but with comparable model accuracy.
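The hybrid-voxel representation can be illustrated with a toy encode/decode: each point is stored as an integer voxel index plus a continuous offset from the voxel center, so decoding recovers the point positions without discretization loss. Array layouts below are illustrative.

```python
# Toy encode/decode of the hybrid-voxel idea: voxel occupancy (integer index)
# plus a continuous offset from the voxel center, so decoding is lossless.
import numpy as np

def encode_hybrid_voxels(points, voxel_size=0.5):
    keys = np.floor(points / voxel_size).astype(np.int64)   # integer voxel index
    centers = (keys + 0.5) * voxel_size                      # voxel centers
    offsets = points - centers                                # continuous residual
    return keys, offsets

def decode_hybrid_voxels(keys, offsets, voxel_size=0.5):
    centers = (keys + 0.5) * voxel_size
    return centers + offsets                                  # recovers the points

rng = np.random.default_rng(0)
pts = rng.uniform(-5, 5, size=(2000, 3))
keys, offsets = encode_hybrid_voxels(pts)
recon = decode_hybrid_voxels(keys, offsets)
assert np.allclose(recon, pts)
```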
Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
Ruiyang Ha
ShanghaiTech University
Songyi Jiang
ShanghaiTech University
Bin Li
ShanghaiTech University
Bikang Pan
ShanghaiTech University
Yihang Zhu
ShanghaiTech University
Junjie Zhang
Xi'an Jiaotong-Liverpool University
Xiatian Zhu
University of Surrey
Shaogang Gong
Queen Mary University London
Jingya Wang
ShanghaiTech University
Abstract
Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce UniPrompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset and code are available at: https://mp-reid.github.io/.
CarGait: Cross-Attention based Re-ranking for Gait recognition
Gavriel Habib
OriginAI
Noa Barzilay
OriginAI
Or Shimshi
OriginAI
Rami Ben-Ari
OriginAI
Nir Darshan
OriginAI
Abstract
Gait recognition is a computer vision task that identifies individuals based on their walking patterns. Its performance is commonly evaluated by ranking a gallery of candidates and measuring the identification accuracy at Rank-K. Existing models are typically single-staged, searching for the probe's nearest neighbors in a gallery, using a global feature representation. While these models can excel at retrieving the correct identity within the top-K predictions, they often struggle when hard negatives are among the top shortlist, leading to relatively low performance at the highest ranks (e.g., Rank-1). In this paper, we introduce CarGait, a Re-ranking (re-ordering the top-K list) method for gait recognition, leveraging the fine-grained correlations between pairs of gait sequences, through cross-attention between gait strips. This re-ranking scheme can be adapted to existing single-stage models to enhance their final results. We demonstrate the capabilities of CarGait by extensive experiments on three common gait datasets, Gait3D, GREW, and OU-MVLP, and seven different gait models, showing consistent gains in Rank-1 and Rank-5 accuracy, while outperforming existing re-ranking approaches and a strong baseline.
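As a rough illustration of the two-stage retrieval scheme described above, the sketch below first ranks a gallery by cosine similarity on global features and then re-orders only the top-K shortlist with a finer pairwise scorer. The scorer is a toy placeholder standing in for a learned cross-attention matcher, and all names are assumptions, not CarGait's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rerank_top_k(probe_feat, gallery_feats, pair_scorer, k=10):
    """Two-stage retrieval: global ranking, then re-order only the top-K.

    pair_scorer(probe_feat, gallery_feat) stands in for a fine-grained
    pairwise matcher; higher score means a better match.
    """
    sims = l2_normalize(gallery_feats) @ l2_normalize(probe_feat)
    order = np.argsort(-sims)                 # stage 1: global ranking
    top_k = order[:k]
    fine = np.array([pair_scorer(probe_feat, gallery_feats[i]) for i in top_k])
    reranked = top_k[np.argsort(-fine)]       # stage 2: re-order the shortlist
    return np.concatenate([reranked, order[k:]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probe = rng.normal(size=256)
    gallery = rng.normal(size=(100, 256))
    # toy pairwise scorer: negative Euclidean distance (placeholder)
    scorer = lambda p, g: -np.linalg.norm(p - g)
    print(rerank_top_k(probe, gallery, scorer, k=10)[:10])
```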
DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
Yuval Haitman
General Motors, Technical Center Israel
Oded Bialer
General Motors, Technical Center Israel
Abstract
Radar-based object detection is essential for autonomous driving due to radar's long detection range. However, the sparsity of radar point clouds, especially at long range, poses challenges for accurate detection. Existing methods increase point density through temporal aggregation with ego-motion compensation, but this approach introduces scatter from dynamic objects, degrading detection performance. We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. Points from previous frames are shifted radially according to their dynamic Doppler component to eliminate radial scatter, with each point assigned a unique aggregation duration based on its Doppler and angle to minimize tangential scatter. DoppDrive is a point cloud density enhancement step applied before detection, compatible with any detector, and we demonstrate that it significantly improves object detection performance across various detectors and datasets. Our project page: https://yuvalhg.github.io/DoppDrive/
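A minimal sketch of the radial shift described above, assuming points are already ego-motion compensated into the current sensor frame and that positive Doppler means motion away from the sensor; the per-point aggregation-duration logic is not reproduced here.

```python
import numpy as np

def doppler_radial_shift(points_xyz, radial_velocity, dt):
    """Shift past-frame radar points along their radial direction.

    radial_velocity is the measured Doppler (m/s, positive = moving away
    from the sensor) and dt is the age of the frame (s). Shifting each point
    by v_r * dt removes the radial scatter a dynamic object would otherwise
    leave behind when several frames are aggregated.
    """
    ranges = np.linalg.norm(points_xyz, axis=1, keepdims=True)
    radial_dirs = points_xyz / np.clip(ranges, 1e-6, None)
    return points_xyz + radial_dirs * radial_velocity[:, None] * dt

if __name__ == "__main__":
    pts = np.array([[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]])
    v_r = np.array([5.0, -2.0])      # m/s away from / toward the sensor
    print(doppler_radial_shift(pts, v_r, dt=0.1))
```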
Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description
Anna-Maria Halacheva
INSAIT, Sofia University 'St. Kliment Ohridski'
Yang Miao
INSAIT, Sofia University 'St. Kliment Ohridski'
Jan-Nico Zaech
INSAIT, Sofia University 'St. Kliment Ohridski'
Xi Wang
INSAIT, Sofia University 'St. Kliment Ohridski', ETH Zurich, TU Munich
Luc Van Gool
INSAIT, Sofia University 'St. Kliment Ohridski'
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets and algorithms approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered in the research field. In this work, we address this shortcoming by introducing: (1) Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. Articulate3D provides 8 types of annotations for articulated objects, covering parts and detailed motion information, all stored in a standardized scene representation format designed for scalable 3D content creation, exchange and seamless integration into simulation environments. (2) USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects. We evaluate USDNet on Articulate3D as well as two existing datasets, demonstrating the advantage of our unified dense prediction approach. Furthermore, we highlight the value of Articulate3D through cross-dataset and cross-domain evaluations and showcase its applicability in downstream tasks such as scene editing through LLM prompting and robotic policy training for articulated object manipulation. We provide open access to our dataset, benchmark, and method's source code.
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Shadi Hamdan
Koc University, KUIS AI Center
Chonghao Sima
The University of Hong Kong
Zetong Yang
OpenDriveLab
Hongyang Li
The University of Hong Kong
Fatma Güney
Koc University, KUIS AI Center
Abstract
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models to respond in a timely manner to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms. Code and checkpoints can be found here.
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Zongyan Han
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Mohamed El Amine Boudjoghra
Technical University of Munich
Jiahua Dong
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Jinhong Wang
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Rao Muhammad Anwer
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Abstract
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
Shengdong Han
School of Computer Science, Nanjing University of Posts and Telecommunications
Shangdong Yang
School of Computer Science, Nanjing University of Posts and Telecommunications
Yuxuan Li
VCIP, CS, Nankai University
Xin Zhang
VCIP, CS, Nankai University
Xiang Li
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Jian Yang
VCIP, CS, Nankai University
Ming-Ming Cheng
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Yimian Dai
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Abstract
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other state-of-the-art models, available at https://github.com/GrokCV/GrokCSO.
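DISTA-Net builds on iterative shrinkage-thresholding. The static, textbook ISTA iteration it generalizes looks like the following; the learned, dynamically predicted step sizes and thresholds are what the paper adds and are not shown here.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, lam=0.1, n_iters=200):
    """Classic ISTA for 0.5 * ||Ax - y||^2 + lam * ||x||_1.

    Uses a fixed step size (1 / Lipschitz constant of the gradient) and a
    fixed threshold; a dynamic variant would predict both per iteration.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, lam * step)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(64, 256))
    x_true = np.zeros(256)
    x_true[rng.choice(256, 5, replace=False)] = rng.normal(size=5)
    y = A @ x_true
    x_hat = ista(A, y, lam=0.01)
    print("largest recovered entries:", np.argsort(-np.abs(x_hat))[:5])
```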
Extrapolated Urban View Synthesis Benchmark
Xiangyu Han
NYU
Zhen Jia
NYU
Boyi Li
NVIDIA
Yan Wang
NVIDIA
Boris Ivanovic
NVIDIA
Yurong You
NVIDIA
Lingjie Liu
UPenn
Yue Wang
NVIDIA
Marco Pavone
Stanford
Chen Feng
NYU
Yiming Li
NVIDIA
Abstract
Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct both quantitative and qualitative evaluations of state-of-the-art NVS methods across different evaluation settings. Our results show that current NVS methods are prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released the data to help advance self-driving and urban robotics simulation technology.
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
Han Han
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Wei Zhai
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Yang Cao
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Bin Li
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Zheng-jun Zha
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Abstract
Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, two event-based datasets for tracking any point are constructed by simulation. The method improves the Survival50 metric by 17.9% over the event-only tracking-any-point baseline. Moreover, on standard feature tracking benchmarks, it outperforms all existing methods, even those that combine events and video frames.
PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
Yufei Han
Beijing University of Posts and Telecommunications
Bowen Tie
Beijing University of Posts and Telecommunications
Heng Guo
Beijing University of Posts and Telecommunications, Xiong'an Aerospace Information Research Institute
Youwei Lyu
Beijing University of Posts and Telecommunications
Si Li
Beijing University of Posts and Telecommunications
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yunpeng Jia
Beijing University of Posts and Telecommunications
Zhanyu Ma
Beijing University of Posts and Telecommunications
Abstract
Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly in the case of recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method. Project page: https://yu-fei-han.github.io/polgs.
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
Haonan Han
Tsinghua University
Rui Yang
The University of Hong Kong
Huan Liao
Tsinghua University
Jiankai Xing
Tsinghua University
Zunnan Xu
Tsinghua University
Xiaoming Yu
Tencent
Junwei Zha
Tencent
Xiu Li
Tsinghua University
Wanhua Li
Harvard University
Abstract
Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that our REPARO offers a comprehensive approach to address the complexities of multi-object 3D scene generation from single images. The demo is available at https://reparo-3d.github.io/
SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
Liang Han
School of Software, Tsinghua University
Xu Zhang
China Telecom
Haichuan Song
Computer Science and Technology, East China Normal University
Kanle Shi
Kuaishou Technology
Yu-Shen Liu
School of Software, Tsinghua University
Zhizhong Han
Department of Computer Science, Wayne State University
Abstract
Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by the limited geometric cues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, which can produce high-quality geometry with sparse-view input, especially in scenarios with small overlap between views. Project page: https://hanl2010.github.io/SparseRecon/.
PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
Xiaoyang Hao
Southern University of Science and Technology
Han Li
Southern University of Science and Technology
Abstract
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
University of Edinburgh
Gen Li
University of Edinburgh
Shreyank N Gowda
University of Nottingham
Robert B. Fisher
University of Edinburgh
Jonathan Huang
Scaled Foundations
Anurag Arnab
Google DeepMind
Laura Sevilla-Lara
University of Edinburgh
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become particularly relevant. Some creative solutions include token selection and merging. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper we take a closer look at this phenomenon and observe 5 principles of the nature of visual tokens. For example, we observe that the value of tokens follows a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. We build on these and further insights to propose a lightweight video model, LITE, that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics-400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs. accuracy. Experiments also show that LITE generalizes across datasets and even other tasks without the need for retraining. The code is released at https://github.com/maggieHao/Efficient-LITE.
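The token-selection principle, keeping the few high-value tokens from a heavy-tailed (Pareto-like) value distribution, can be sketched as follows. The importance scores are a toy input here; in LITE they come from a learned lightweight scorer, and the function name is an assumption.

```python
import numpy as np

def select_tokens(tokens, values, keep_ratio=0.1):
    """Keep only the highest-value visual tokens.

    values is a per-token importance score. With a Pareto-like value
    distribution, a small keep_ratio preserves most of the signal.
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    keep_idx = np.sort(np.argsort(-values)[:n_keep])  # preserve original order
    return tokens[keep_idx], keep_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(1568, 768))   # e.g. 8 frames x 14x14 patches
    values = rng.pareto(a=3.0, size=1568)   # heavy-tailed toy importance scores
    kept, idx = select_tokens(tokens, values, keep_ratio=0.1)
    print(kept.shape[0], "tokens kept out of", tokens.shape[0])
```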
AllTracker: Efficient Dense Point Tracking at High Resolution
Adam W. Harley
Stanford University
Yang You
Stanford University
Xinglong Sun
Stanford University
Yang Zheng
Stanford University
Nikhil Raghuraman
Stanford University
Yunqi Gu
Stanford University
Sheldon Liang
Carnegie Mellon University
Wen-Hsuan Chu
Carnegie Mellon University
Achal Dave
Toyota Research Institute
Suya You
Army Research Laboratory
Rares Ambrus
Toyota Research Institute
Katerina Fragkiadaki
Carnegie Mellon University
Leonidas Guibas
Stanford University
Abstract
We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768 × 1024 pixels on a 40G GPU). A benefit of our design is that we can train jointly on optical flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available: https://alltracker.github.io
TorchAdapt: Towards Light-Agnostic Real-Time Visual Perception
Khurram Azeem Hashmi
DFKI
Karthik Palyakere Suresh
DFKI
Didier Stricker
DFKI
Muhammad Zeshan Afzal
DFKI
Abstract
Low-light conditions significantly degrade the performance of high-level vision tasks. Existing approaches either enhance low-light images without considering normal illumination scenarios, leading to poor generalization, or are tailored to specific tasks. We propose TorchAdapt, a real-time adaptive feature enhancement framework that generalizes robustly across varying illumination conditions without degrading performance in well-lit scenarios. TorchAdapt consists of two complementary modules: the Torch module enhances semantic features beneficial for downstream tasks, while the Adapt module dynamically modulates these enhancements based on input content. Leveraging a novel light-agnostic learning strategy, TorchAdapt aligns feature representations of enhanced and well-lit images to produce powerful illumination-invariant features. Extensive experiments on multiple high-level vision tasks, including object detection, face detection, instance segmentation, semantic segmentation, and video object detection, demonstrate that TorchAdapt consistently outperforms state-of-the-art low-light enhancement and task-specific methods in both low-light and light-agnostic settings. TorchAdapt thus provides a unified, flexible solution for robust visual perception across diverse lighting conditions.
Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
Boyong He
Institute of Artificial Intelligence, Xiamen University
Yuxiang Ji
Institute of Artificial Intelligence, Xiamen University
Zhuoyue Tan
Institute of Artificial Intelligence, Xiamen University
Liaoni Wu
Institute of Artificial Intelligence, Xiamen University
Abstract
Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply a consistency loss to align the auxiliary and ordinary branches, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at Fitness-Generalization-Transferability.
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He
Beijing Academy of Artificial Intelligence
Danshi Li
Galbot
Xinqiang Yu
Galbot
Zekun Qi
Galbot
Wenyao Zhang
Galbot
Jiayi Chen
Galbot
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Zhizheng Zhang
Galbot
Li Yi
Tsinghua University
He Wang
Beijing Academy of Artificial Intelligence
Abstract
As large models gain traction, vision-language models are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM with a flow-matching-based pose head producing instruction-aligned grasp poses for tabletop objects. To evaluate DexVLG's performance, we create benchmarks in simulations and conduct real-world experiments. Extensive experiments demonstrate DexVLG's strong zero-shot generalization capabilities, achieving an over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.
Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
Pei He
Xidian University
Lingling Li
Xidian University
Licheng Jiao
Xidian University
Ronghua Shang
Xidian University
Fang Liu
Xidian University
Shuang Wang
Xidian University
Xu Liu
Xidian University
Wenping Ma
Xidian University
Abstract
Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods. The code will be available at https://github.com/ChicalH/DCGL.
Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection
Qi He
Southwest Jiaotong University, Chengdu
Xiao Wu
Southwest Jiaotong University, Chengdu
Jun-Yan He
Meituan Inc.
Shuai Li
The Hong Kong Polytechnic University
Abstract
Source-Free Domain Adaptive Object Detection transfers knowledge from a labeled source domain to an unlabeled target domain while preserving data privacy by restricting access to source data during adaptation. Existing approaches predominantly leverage the Mean Teacher framework for self-training in the target domain. The exponential moving average (EMA) mechanism in the Mean Teacher stabilizes the training by averaging the student weights over training steps. However, in domain adaptation, its inherent lag in responding to emerging knowledge can hinder the rapid adaptation of the student to target-domain shifts. To address this challenge, Dual-rate Dynamic Teacher (DDT) with Asynchronous EMA (AEMA) is proposed, which implements group-wise parameter updates. In contrast to traditional EMA, which simultaneously updates all parameters, AEMA dynamically decomposes teacher parameters into two functional groups based on their contributions to capture the domain shift. By applying distinct smoothing coefficients to the two groups, AEMA simultaneously enables fast adaptation and historical knowledge retention. Comprehensive experiments carried out on three widely used traffic benchmarks have demonstrated that the proposed DDT achieves superior performance, outperforming SOTA methods by a clear margin. The codes are available at https://github.com/qih96/DDT.
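A minimal sketch of a group-wise (asynchronous) EMA update, assuming the teacher and student are given as dictionaries of parameter arrays and that the fast/slow split is provided externally; how DDT actually chooses that split is not reproduced here.

```python
import numpy as np

def asynchronous_ema_update(teacher, student, fast_keys, m_fast=0.9, m_slow=0.999):
    """Group-wise EMA: the fast group adapts quickly, the slow group retains history.

    teacher/student are dicts of parameter arrays with matching keys;
    fast_keys marks the parameters assigned the smaller smoothing coefficient.
    Standard EMA is recovered when m_fast == m_slow.
    """
    for name, w_student in student.items():
        m = m_fast if name in fast_keys else m_slow
        teacher[name] = m * teacher[name] + (1.0 - m) * w_student
    return teacher

if __name__ == "__main__":
    student = {"backbone.w": np.ones(4), "head.w": np.ones(4)}
    teacher = {"backbone.w": np.zeros(4), "head.w": np.zeros(4)}
    teacher = asynchronous_ema_update(teacher, student, fast_keys={"head.w"})
    print(teacher)  # head.w moves toward the student faster than backbone.w
```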
ERNet: Efficient Non-Rigid Registration Network for Point Sequences
Guangzhao He
Zhejiang University
Yuxi Xiao
ATE3D
Zhen Xu
ATE3D
Xiaowei Zhou
ATE3D
Sida Peng
Zhejiang University
Abstract
Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms the previous state of the art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than a 4x speedup compared to the previous best, offering a significant efficiency improvement.
RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection
Jianfang He
Institute of Automation, Chinese Academy of Sciences
Min Cao
Soochow University
Silong Peng
Institute of Automation, Chinese Academy of Sciences
Qiong Xie
Institute of Automation, Chinese Academy of Sciences
Abstract
Large vision-language models such as CLIP have made significant strides in zero-shot anomaly detection through prompt engineering. However, most existing methods typically process each test image individually, ignoring the practical rarity of abnormal patches in real-world scenarios. Although some batch-based approaches exploit the rarity by processing multiple samples concurrently, they generally introduce unacceptable latency for real-time applications. To mitigate these limitations, we propose RareCLIP, a novel online zero-shot anomaly detection framework that enables sequential image processing in real-time without requiring prior knowledge of the target domain. RareCLIP capitalizes on the zero-shot capabilities of CLIP and integrates a dynamic test-time rarity estimation mechanism. A key innovation of our framework is the introduction of a prototype patch feature memory bank, which aggregates representative features from historical observations and continuously updates their corresponding rarity measures. For each incoming image patch, RareCLIP computes a rarity score by aggregating the rarity measures of its nearest neighbors within the memory bank. Moreover, we introduce a prototype sampling strategy based on dissimilarity to enhance computational efficiency, as well as a similarity calibration strategy to enhance the robustness of rarity estimation. Extensive experiments demonstrate that RareCLIP attains state-of-the-art performance with 98.2% image-level AUROC on MVTec AD and 94.4% on VisA, while achieving a latency of 59.4 ms. Code is available at https://github.com/hjf02/RareCLIP.
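The rarity scoring described above can be sketched as a nearest-neighbor lookup in a prototype memory bank. The sketch below assumes L2-normalized features and a similarity-weighted aggregation, which is an illustrative choice rather than the paper's exact rule; the online bank updates, prototype sampling, and similarity calibration are omitted.

```python
import numpy as np

def patch_rarity(patch_feat, bank_feats, bank_rarity, k=5):
    """Score a patch by the rarity of its nearest prototypes in a memory bank.

    bank_feats holds prototype patch features from past observations and
    bank_rarity their running rarity measures; both would be maintained
    online in a streaming system but are plain arrays here.
    """
    sims = bank_feats @ patch_feat                 # cosine similarity (unit features)
    nn = np.argsort(-sims)[:k]
    weights = np.maximum(sims[nn], 0.0)
    weights = weights / (weights.sum() + 1e-8)
    return float(weights @ bank_rarity[nn])        # similarity-weighted rarity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(512, 640))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    rarity = rng.uniform(size=512)                 # high value = rarely seen so far
    patch = bank[3] + 0.05 * rng.normal(size=640)
    patch /= np.linalg.norm(patch)
    print(patch_rarity(patch, bank, rarity))
```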
Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
Fengchen He
Huazhong University of Science and Technology
Dayang Zhao
Huazhong University of Science and Technology
Hao Xu
Huazhong University of Science and Technology
Tingwei Quan
Huazhong University of Science and Technology
Shaoqun Zeng
Huazhong University of Science and Technology
Abstract
Many studies utilize dual-pixel (DP) sensor phase information for various applications, such as depth estimation and deblurring. However, since DP image features are entirely determined by the camera hardware, DP-depth paired datasets are very scarce, especially when performing depth estimation on customized cameras. To overcome this, studies simulate DP images using ideal optical models. However, these simulations often violate real optical propagation laws, leading to poor generalization to real DP data. To address this, we investigate the domain gap between simulated and real DP data, and propose solutions using the Simulating DP Images from Ray Tracing (Sdirt) scheme. Sdirt generates realistic DP images via ray tracing and integrates them into the depth estimation training pipeline. Experimental results show that models trained with Sdirt-simulated images generalize better to real DP data. The code and collected datasets will be available at https://github.com/LinYark/Sdirt.
SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
Xianglong He
Tsinghua University
Zi-Xin Zou
VAST
Chia-Hao Chen
Tsinghua University
Yuan-Chen Guo
VAST
Ding Liang
VAST
Chun Yuan
Tsinghua University
Wanli Ouyang
The Chinese University of Hong Kong
Yan-Pei Cao
VAST
Yangguang Li
VAST
Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024³ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ∼82% reduction in Chamfer Distance and a ∼88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
Wenkun He
Tsinghua University
Yun Liu
Tsinghua University
Ruitao Liu
Tsinghua University
Li Yi
Tsinghua University
Abstract
Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. The high correlations and mutual influences among bodies lead to two major challenges, for which we propose solutions. First, to satisfy the high demands for synchronization of different body motions, we mathematically derive a new set of alignment scores during the training process, and use maximum likelihood sampling on a dynamic graphical model for explicit synchronization during inference. Second, the high-frequency interactions between objects are often overshadowed by the large-scale low-frequency movements. To address this, we introduce frequency decomposition and explicitly represent high-frequency components in the frequency domain. Extensive experiments across five datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
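Frequency decomposition of a motion sequence can be illustrated with a plain FFT split; the cutoff frequency, frame rate, and the (T, D) signal layout below are assumptions for the sketch, not the paper's configuration.

```python
import numpy as np

def frequency_decompose(motion, cutoff=5.0, fps=30):
    """Split a motion sequence into low- and high-frequency parts via rFFT.

    motion is (T, D): T frames of a D-dimensional pose/translation signal.
    Frequencies above cutoff (Hz) form the high-frequency component (e.g.,
    fine contact dynamics); the rest is the low-frequency bulk motion.
    """
    spectrum = np.fft.rfft(motion, axis=0)
    freqs = np.fft.rfftfreq(motion.shape[0], d=1.0 / fps)
    low_spec = spectrum.copy()
    low_spec[freqs > cutoff] = 0.0
    high_spec = spectrum - low_spec
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    high = np.fft.irfft(high_spec, n=motion.shape[0], axis=0)
    return low, high

if __name__ == "__main__":
    t = np.linspace(0, 2, 60)[:, None]                        # 2 s at 30 fps
    motion = np.sin(2 * np.pi * 1.0 * t) + 0.1 * np.sin(2 * np.pi * 12.0 * t)
    low, high = frequency_decompose(motion, cutoff=5.0, fps=30)
    print(np.allclose(low + high, motion, atol=1e-8))         # exact decomposition
```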
2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
Marvin Heidinger
Computer Science Department, Technische Universität Darmstadt
Snehal Jauhri
Computer Science Department, Technische Universität Darmstadt
Vignesh Prasad
Computer Science Department, Technische Universität Darmstadt
Georgia Chalvatzaki
Computer Science Department, Technische Universität Darmstadt
Abstract
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios. Project-website: sites.google.com/view/2handedafforder
Kaputt: A Large-Scale Dataset for Visual Defect Detection
Sebastian Höfer
Amazon, Fulfillment Technologies & Robotics
Dorian F. Henning
Amazon, Fulfillment Technologies & Robotics
Artemij Amiranashvili
Amazon, Fulfillment Technologies & Robotics
Douglas Morrison
Amazon, Fulfillment Technologies & Robotics
Mariliza Tzes
Amazon, Fulfillment Technologies & Robotics
Ingmar Posner
Amazon, Fulfillment Technologies & Robotics, University of Oxford
Marc Matvienko
Amazon, Fulfillment Technologies & Robotics
Alessandro Rennola
Amazon, Fulfillment Technologies & Robotics
Anton Milan
Amazon, Fulfillment Technologies & Robotics
Abstract
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Lukas Höllein
Technical University of Munich
Aljaž Božič
Meta
Michael Zollhöfer
Meta
Matthias Nießner
Technical University of Munich
Abstract
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit Gaussian parameters of a scene in thousands of iterations, which can take up to an hour. To this end, we change the optimizer to LM that runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallelization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 20% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that accelerate 3DGS, thus enabling even faster speedups compared to vanilla 3DGS.
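For reference, a single dense Levenberg-Marquardt step on a toy least-squares problem is sketched below. 3DGS-LM never forms J^T J explicitly and instead evaluates Jacobian-vector products in custom CUDA kernels over image subsets, so this only illustrates the update rule being swapped in for ADAM.

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, params, damping=1e-2):
    """One Levenberg-Marquardt update for a least-squares objective.

    Solves (J^T J + damping * I) delta = -J^T r and returns params + delta.
    """
    r = residual_fn(params)
    J = jacobian_fn(params)
    H = J.T @ J + damping * np.eye(len(params))
    delta = np.linalg.solve(H, -J.T @ r)
    return params + delta

if __name__ == "__main__":
    # toy curve fit: y = a * exp(b * x), ground truth a=2.0, b=1.5
    x = np.linspace(0, 1, 50)
    y = 2.0 * np.exp(1.5 * x)
    res = lambda p: p[0] * np.exp(p[1] * x) - y
    jac = lambda p: np.stack([np.exp(p[1] * x),
                              p[0] * x * np.exp(p[1] * x)], axis=1)
    p = np.array([1.0, 1.0])
    for _ in range(20):
        p = lm_step(res, jac, p)
    print(p)   # approaches [2.0, 1.5]
```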
Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
Tianyu Hong
Tianjin University
Xiaobo Zhou
Tianjin University
Wenkai Hu
Tianjin University
Qi Xie
Tianjin University
Zhihui Ke
Tianjin University
Tie Qiu
Qinghai Minzu University
Abstract
Collaborative perception is considered a promising approach to address the inherent limitations of single-vehicle systems by sharing data among vehicles, thereby enhancing performance in perception tasks such as bird's-eye view (BEV) semantic segmentation. However, existing methods share the entire dense, scene-level BEV feature, which contains significant redundancy and lacks height information, ultimately leading to unavoidable bandwidth waste and performance degradation. To address these challenges, we present GSCOOP, the first collaborative semantic segmentation framework that leverages sparse, object-centric 3D Gaussians to fundamentally overcome communication bottlenecks. By representing scenes with compact Gaussians that preserve complete spatial information, GSCOOP achieves both high perception accuracy and communication efficiency. To further optimize transmission, we introduce the Priority-Based Gaussian Selection (PGS) module to adaptively select critical Gaussians and a Semantic Gaussian Compression (SGC) module to compress Gaussian attributes with minimal overhead. Extensive experiments on OPV2V and V2X-Seq demonstrate that GSCOOP achieves state-of-the-art performance, even with more than 500x lower communication volume. The code link is https://github.com/SHEVIP/GSCOOP.
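A rough sketch of priority-based selection followed by simple attribute compression: the priority scores and the uniform quantizer below are placeholders for illustration, not the PGS/SGC modules themselves.

```python
import numpy as np

def select_and_compress(means, features, priority, k=256, n_bits=8):
    """Pick the k highest-priority Gaussians and quantize their features.

    priority is a per-Gaussian importance score (assumed given here).
    Features are uniformly quantized to n_bits per value as a simple
    stand-in for attribute compression before transmission.
    """
    keep = np.argsort(-priority)[:k]
    f = features[keep]
    f_min, f_max = f.min(), f.max()
    levels = 2 ** n_bits - 1
    codes = np.round((f - f_min) / (f_max - f_min + 1e-8) * levels).astype(np.uint8)
    return {"means": means[keep].astype(np.float16),
            "codes": codes,
            "range": (float(f_min), float(f_max))}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    payload = select_and_compress(rng.normal(size=(10000, 3)),
                                  rng.normal(size=(10000, 32)),
                                  rng.uniform(size=10000))
    print(payload["codes"].shape, payload["means"].dtype)
```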
General Compression Framework for Efficient Transformer Object Tracking
Lingyi Hong
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Jinglun Li
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Xinyu Zhou
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Shilin Yan
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Pinxue Guo
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Kaixun Jiang
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Zhaoyu Chen
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Shuyong Gao
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Runze Li
Lenovo Research
Xingdong Sheng
Lenovo Research
Abstract
Previous works have attempted to improve tracking efficiency through lightweight architecture design or knowledge distillation from teacher models to compact student trackers. However, these solutions often sacrifice accuracy for speed to a great extent, and also have the problems of complex training process and structural limitations. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages to break the limitation of model structure. Additionally, we also design a unique replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior and simplifies the training process. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the compression process. CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of our CompressTracker. Our CompressTracker-SUTrack, compressed from SUTrack, retains about 99% performance on LaSOT (72.2% AUC) while achieving a 2.42x speed-up. Code is available at here.
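Replacement training can be sketched as a forward pass that randomly routes each stage through either the student or the teacher; the callable-stage interface and the replacement probability below are assumptions for illustration.

```python
import random

def replacement_forward(x, student_stages, teacher_stages, p_replace=0.5):
    """Forward pass where each student stage may be swapped for the teacher's.

    student_stages and teacher_stages are aligned lists of callables (each
    mapping features to features). Randomly routing through teacher stages
    forces every student stage to stay compatible with the teacher's
    intermediate representations.
    """
    for s_stage, t_stage in zip(student_stages, teacher_stages):
        stage = t_stage if random.random() < p_replace else s_stage
        x = stage(x)
    return x

if __name__ == "__main__":
    random.seed(0)
    teacher = [lambda v, k=k: v + k for k in range(4)]          # toy 4-stage teacher
    student = [lambda v, k=k: v + k + 0.01 for k in range(4)]   # slightly-off student
    print(replacement_forward(0.0, student, teacher))
```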
4D Visual Pre-training for Robot Learning
Chengkai Hou
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yanjie Ze
Shanghai Qizhi Institute
Yankai Fu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zeyu Gao
CASIA
Songbo Hu
Tsinghua University
Yue Yu
Tsinghua University
Shanghang Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Huazhe Xu
Shanghai Qizhi Institute
Abstract
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on the larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visualpretraining.github.io/.
FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
Hao-Yu Hou
National Tsing Hua University
Chun-Yi Lee
National Taiwan University
Motoharu Sonogashira
RIKEN
Yasutomo Kawanishi
RIKEN
Abstract
The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.
Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
Petr Hruby
ETH Zürich
Marc Pollefeys
ETH Zürich / Microsoft Spatial AI Lab
Abstract
We propose a novel approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. This allows pose estimation without explicitly modeling camera motion. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras. Our approach is designed as a foundational building block for rolling shutter structure-from-motion (SfM), where no motion model is required, and each scanline's pose can be computed independently. We classify minimal solvers for this problem in both generic and specialized settings, including cases with parallel lines and known gravity direction, assuming known intrinsics and no lens distortion. Furthermore, we develop minimal solvers for the parallel-lines scenario, both with and without gravity priors, by leveraging connections between this problem and the estimation of 2D structure from 1D cameras. Experiments on rolling shutter images from the Fastec dataset demonstrate the feasibility of our approach for initializing rolling shutter SfM, highlighting its potential for further development. The code will be made publicly available.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
Peng-Hao Hsu
National Tsing Hua University
Ke Zhang
Amazon
Fu-En Wang
Amazon
Tao Tu
Cornell University
Ming-Feng Li
Carnegie Mellon University
Yu-Lun Liu
National Yang Ming Chiao Tung University
Albert Y. C. Chen
Amazon
Min Sun
National Tsing Hua University
Cheng-Hao Kuo
Amazon
Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET, where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from the 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. The key to training a highly accurate single-stage detector is that both losses are learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images as input and demonstrates superior accuracy and speed (0.3 sec. per scene) on the ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform both a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.
Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
Zixuan Hu
School of Computer Science, Peking University
Dongxiao Li
School of Computer Science, Peking University
Xinzhu Ma
The Chinese University of Hong Kong
Shixiang Tang
The Chinese University of Hong Kong
Xiaotong Li
School of Computer Science, Peking University
Wenhan Yang
Peng Cheng Laboratory, Shenzhen, China
Ling-Yu Duan
School of Computer Science, Peking University
Abstract
Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types. The source code is available at https://github.com/hzcar/DUO.
DyGS-SLAM: Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes
Xinggang Hu
Dalian University of Technology
Chenyangguang Zhang
Tsinghua University
Mingyuan Zhao
University of Chinese Academy of Sciences
Yuanze Gui
Beijing University of Technology
Xiangkui Zhang
Dalian University of Technology
Xiangyang Ji
Tsinghua University
Abstract
In dynamic scenes, achieving accurate camera localization and reconstructing a long-term consistent map containing only the static background are two major challenges faced by Visual Simultaneous Localization and Mapping (VSLAM). In current traditional dynamic VSLAM systems, the methods used to handle dynamic objects are primarily designed for localization; if applied to reconstruction, they are prone to introducing motion artifacts. Meanwhile, mask compensation strategies in NeRF- or 3DGS-based dynamic VSLAM systems also face challenges, such as the inability to completely eliminate dynamic object artifacts and low real-time performance. To address these issues, we leverage object detection to extract semantic information and propose a dynamic feature detection algorithm based on both geometry and appearance. This algorithm accurately identifies known and unknown moving objects and determines their actual motion states. To mitigate the issue of insufficient detection box coverage, we design a dynamic object box correction algorithm based on clustering and Gaussian mixture models to comprehensively identify moving object regions. Furthermore, to overcome the limitations of sparse features in texture-scarce environments, we introduce a feature densification strategy based on image texture complexity, enhancing reconstruction quality while maintaining real-time performance. Extensive experimental evaluations demonstrate that our system achieves state-of-the-art localization and reconstruction performance in dynamic scenes and can run in real time on resource-constrained devices.
Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
Tongyan Hua
HKUST(GZ)
Lutao Jiang
HKUST(GZ)
Ying-Cong Chen
HKUST
Wufan Zhao
HKUST(GZ)
Abstract
Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
Yexin Huang
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Yongbin Lin
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Lishengsa Yue
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Zhihong Yao
School of Transportation and Logistics, Southwest Jiaotong University
Jie Wang
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Abstract
Human-machine interaction technology requires not only the distribution of human visual attention but also the prediction of the gaze-point trajectory. We introduce PILOT, a programmatic imitation learning approach that predicts a driver's eye movements based on a set of rule-based conditions. These conditions, derived from driving operations and traffic flow characteristics, define how gaze shifts occur. They are initially identified through incremental synthesis, a heuristic search method, and then refined via L-BFGS, a numerical optimization technique. These human-readable rules enable us to understand drivers' eye movement patterns and make efficient and explainable predictions. We also propose DATAD, a dataset that covers 12 types of autonomous driving takeover scenarios, collected from 60 participants and comprising approximately 600,000 frames of gaze-point data. Compared to existing eye-tracking datasets, DATAD includes additional driving metrics and surrounding traffic flow characteristics, providing richer contextual information for modeling gaze behavior. Experimental evaluations of PILOT on DATAD demonstrate superior accuracy and faster prediction speeds compared to four baseline models. Specifically, PILOT reduces the MSE of predicted trajectories by 38.59% to 88.02% and improves the accuracy of gaze-object predictions by 6.90% to 55.06%. Moreover, PILOT achieves these gains with approximately 30% lower prediction time, offering both more accurate and more efficient eye movement prediction.
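The L-BFGS refinement step can be illustrated with a toy rule whose threshold and gains are fitted numerically; the rule form, features, and synthetic data below are invented placeholders that only demonstrate the optimization pattern, not PILOT's actual rules.

```python
# Hedged sketch: refine the parameters of a hand-written gaze rule with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
features = rng.uniform(0, 3, size=(500, 2))       # e.g. [time headway, steering rate] (assumed)
gaze_x = 0.8 * (features[:, 0] < 1.5) + 0.1 * features[:, 1] + rng.normal(0, 0.05, 500)

def predict(params, feats):
    thr, gain, bias = params
    # Soft version of "if headway < thr then gaze shifts ahead", smooth enough for L-BFGS.
    trigger = 1.0 / (1.0 + np.exp((feats[:, 0] - thr) / 0.1))
    return gain * trigger + bias * feats[:, 1]

def mse(params):
    return np.mean((predict(params, features) - gaze_x) ** 2)

res = minimize(mse, x0=np.array([1.0, 0.5, 0.0]), method="L-BFGS-B")
print("refined rule parameters:", res.x, "MSE:", res.fun)
```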
Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang
Michigan State University
Xiaoming Liu
Michigan State University
Abstract
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
Yi Huang
DAMO Academy, Alibaba Group
Ke Zhang
Department of Electrical and Computer Engineering, Johns Hopkins University
Wei Liu
DAMO Academy, Alibaba Group
Yuanyuan Wang
Department of Biomedical Engineering, Fudan University
Vishal M. Patel
Department of Electrical and Computer Engineering, Johns Hopkins University
Le Lu
DAMO Academy, Alibaba Group
Xu Han
Department of Hepatobiliary and Pancreatic Surgery, The First Affiliated Hospital of College of Medicine, Zhejiang University
Dakai Jin
DAMO Academy, Alibaba Group
Ke Yan
DAMO Academy, Alibaba Group
Abstract
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability. Code will be released at this link.
Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
You Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Lichao Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Jiayi Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Shengchuan Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N^2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
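A minimal sketch of the mask-driven routing idea behind DHA is given below: boundary tokens receive full softmax attention while the rest use a cheap kernelized linear attention. Both branches and the boundary test are simplified stand-ins for the paper's design.

```python
# Hedged sketch: route tokens to O(N^2) or O(N) attention based on a boundary mask.
import torch
import torch.nn.functional as F

def route_attention(tokens, boundary_mask):
    """tokens: (N, C); boundary_mask: (N,) bool marking boundary-region tokens."""
    out = tokens.clone()
    q = k = v = tokens

    # Full softmax attention restricted to boundary tokens.
    b = boundary_mask
    if b.any():
        attn = F.softmax(q[b] @ k[b].T / q.shape[-1] ** 0.5, dim=-1)
        out[b] = attn @ v[b]

    # Kernelized linear attention (O(N)) for the remaining tokens.
    nb = ~boundary_mask
    if nb.any():
        phi_q, phi_k = F.elu(q[nb]) + 1, F.elu(k[nb]) + 1
        kv = phi_k.T @ v[nb]                              # (C, C)
        z = phi_q @ phi_k.sum(dim=0, keepdim=True).T      # (M, 1) normalizer
        out[nb] = (phi_q @ kv) / (z + 1e-6)
    return out

tokens = torch.randn(1024, 64)
boundary = torch.zeros(1024, dtype=torch.bool); boundary[:128] = True
print(route_attention(tokens, boundary).shape)            # torch.Size([1024, 64])
```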
LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression
Wenjie Huang
Shanghai Jiao Tong University
Qi Yang
University of Missouri-Kansas City
Shuting Xia
Shanghai Jiao Tong University
He Huang
Shanghai Jiao Tong University
Yiling Xu
Shanghai Jiao Tong University
Zhu Li
University of Missouri-Kansas City
Abstract
Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitations of encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first INR-based lossless point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a coding framework that operates at the level of a group of point clouds, together with an effective network initialization strategy, which reduces encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to achieve fast inference and a compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, at convergence on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project page is at https://huangwenjie2023.github.io/LINR-PCGC/.
Learning A Unified Template for Gait Recognition
Panjian Huang
School of Artificial Intelligence, Beijing Normal University
Saihui Hou
School of Artificial Intelligence, Beijing Normal University
Junzhou Huang
Department of Computer Science and Engineering, The University of Texas at Arlington
Yongzhen Huang
School of Artificial Intelligence, Beijing Normal University
Abstract
'What I cannot create, I do not understand.' Human wisdom reveals that creation is one of the highest forms of learning. For example, Diffusion Models have demonstrated remarkable semantic structure and memory in image generation, understanding, and restoration, which intuitively benefits representation learning. However, current gait networks rarely embrace this perspective, relying primarily on learning by contrasting gait samples under varying complex conditions, leading to semantic inconsistency and uniformity issues. To address these issues, we propose Origins with generative capabilities whose underlying philosophy is that different entities are generated from a unified template, inherently regularizing gait representations within a consistent and diverse semantic space to capture accurate gait differences. Admittedly, learning this unified template is exceedingly challenging, as it requires the comprehensiveness of the template to encompass gait representations with various conditions. Inspired by Diffusion Models, Origins diffuses the unified template into timestep templates for gait generative learning, and meanwhile transfers the unified template for gait representation learning. Especially, gait generative and representation learning serve as a unified framework for end-to-end joint training. Extensive experiments on CASIA-B, CCPG, SUSTech1K, Gait3D, GREW and CCGR-MINI demonstrate that Origins performs unified generative and representation learning, achieving superior performance.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang
Beihang University
Yuan-Chen Guo
VAST
Haoran Wang
Shanghai Jiao Tong University
Ran Yi
Shanghai Jiao Tong University
Lizhuang Ma
Shanghai Jiao Tong University
Yan-Pei Cao
VAST
Lu Sheng
Beihang University
Abstract
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to high computational costs and degradation in image quality due to scarce high-quality 3D data. This paper introduces MV-Adapter, an efficient and versatile adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation and opens up new possibilities due to its efficiency, adaptability and versatility.
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang
Imperial College London
Krystian Mikolajczyk
Imperial College London
Abstract
We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feedforward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Ziyue Huang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Yongchao Feng
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Ziqi Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Shuai Yang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Qingjie Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Yunhong Wang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Abstract
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of the model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Codes and models are available at: https://github.com/floatingstarZ/OpenRSDt.
RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Junwen Huang
Technical University of Munich
Shishir Reddy Vutukur
Technical University of Munich
Peter KT Yu
XYZ Robotics
Nassir Navab
Technical University of Munich
Slobodan Ilic
Technical University of Munich
Benjamin Busam
Technical University of Munich
Abstract
Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang
Shenzhen Campus of Sun Yat-sen University
Chengjian Feng
Meituan
Feng Yan
Meituan
Baihui Xiao
Meituan
Zequn Jie
Meituan
Yujie Zhong
Meituan
Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University
Lin Ma
Meituan
Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive serves as a promising solution for AD in the real world.
Towards Foundational Models for Single-Chip Radar
Tianshu Huang
Carnegie Mellon University
Akarsh Prabhakara
University of Wisconsin-Madison
Chuhan Chen
Carnegie Mellon University
Jay Karhade
Carnegie Mellon University
Deva Ramanan
Carnegie Mellon University
Matthew O'Toole
Carnegie Mellon University
Anthony Rowe
Carnegie Mellon University
Abstract
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10x data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10x increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
Ronggang Huang
South China University of Technology
Haoxin Yang
South China University of Technology
Yan Cai
South China University of Technology
Xuemiao Xu
South China University of Technology
Huaidong Zhang
South China University of Technology
Shengfeng He
Singapore Management University
Abstract
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
Jiaxin Huang
Zhejiang University
Sheng Miao
Zhejiang University
Bangbang Yang
ByteDance
Yuewen Ma
ByteDance
Yiyi Liao
Zhejiang University
Abstract
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection
Bo-Lun Huang
National Yang Ming Chiao Tung University
Zi-Xiang Ni
National Yang Ming Chiao Tung University
Feng-Kai Huang
National Taiwan University
Hong-Han Shuai
National Yang Ming Chiao Tung University
Wen-Huang Cheng
National Taiwan University
Abstract
Accurate and stable lane detection is crucial for the reliability of autonomous driving systems. A core challenge lies in predicting lane positions in complex scenarios, such as curved roads or when markings are ambiguous or absent. Conventional approaches leverage deep learning techniques to extract both high-level and low-level visual features, aiming to achieve a comprehensive understanding of the driving environment. However, these methods often rely on predefined anchors within a single-pass model, limiting their adaptability. The one-shot prediction paradigm struggles with precise lane estimation in challenging scenarios, such as curved roads or adverse conditions like low visibility at night. To address these limitations, we propose a novel cold diffusion-based framework that initializes lane predictions with predefined anchors and iteratively refines them. This approach retains the flexibility and progressive refinement capabilities of diffusion models while overcoming the constraints of traditional hot diffusion techniques. To further enhance the model's coarse-to-fine refinement capabilities, we introduce a multi-resolution image processing strategy, where images are analyzed at different timesteps to capture both global and local lane structure details. Besides, we incorporate a learnable noise variance schedule, enabling the model to dynamically adjust its learning process based on multi-resolution inputs. Experimental results demonstrate that our method significantly improves detection accuracy across a variety of challenging scenarios, outperforming state-of-the-art lane detection methods. Codes and trained weights are available at https://github.com/ntudr/CDiffLane
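The anchor-initialized iterative refinement can be sketched as follows, assuming a toy refiner network and omitting the multi-resolution schedule and learnable noise variance described above; this is an illustrative loop, not the released implementation.

```python
# Hedged sketch: cold-diffusion-style refinement that starts from anchor lanes
# and repeatedly predicts a residual toward the clean lane estimate.
import torch
import torch.nn as nn

class LaneRefiner(nn.Module):
    def __init__(self, num_points=72, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points + feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, num_points))
    def forward(self, lanes, img_feat, t):          # lanes: (B, L, P) x-offsets per lane
        B, L, P = lanes.shape
        cond = torch.cat([img_feat, t.expand(B, 1)], dim=-1)[:, None, :].expand(B, L, -1)
        return self.net(torch.cat([lanes, cond], dim=-1))

def refine_from_anchors(refiner, anchors, img_feat, steps=4):
    lanes = anchors.clone()
    for s in reversed(range(steps)):                # coarse-to-fine refinement steps
        t = torch.tensor([[s / steps]])
        lanes = lanes + refiner(lanes, img_feat, t) # predicted residual toward clean lanes
    return lanes

refiner = LaneRefiner()
anchors = torch.zeros(2, 10, 72)                    # 10 predefined anchor lanes per image
print(refine_from_anchors(refiner, anchors, torch.randn(2, 64)).shape)
```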
Everything is a Video: Unifying Modalities through Next-Frame Prediction
G. Thomas Hudson
Durham University
Dean Slack
Durham University
Thomas Winterbottom
Durham University
Jamie Sterling
Durham University
Chenghao Xiao
Durham University
Junjie Shentu
Durham University
Noura Al Moubayed
Durham University
Abstract
MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation
Jungwoo Huh
Yonsei University
Yeseung Park
Yonsei University
Seongjean Kim
Yonsei University
Jungsu Kim
Yonsei University
Sanghoon Lee
Yonsei University
Abstract
Human motion estimation models typically assume a fixed number of input frames, making them sensitive to variations in frame rate and leading to inconsistent motion predictions across different temporal resolutions. This limitation arises because input frame rates inherently determine the temporal granularity of motion capture, causing discrepancies when models trained on a specific frame rate encounter different sampling frequencies. To address this challenge, we propose MBTI (Masked Blending Transformers with Implicit Positional Encoding), a frame rate-agnostic human motion estimation framework designed to maintain temporal consistency across varying input frame rates. Our approach leverages a masked autoencoder (MAE) architecture with masked token blending, which aligns input tokens with a predefined high-reference frame rate, ensuring a standardized temporal representation. Additionally, we introduce implicit positional encoding, which encodes absolute time information using neural implicit functions, enabling more natural motion reconstruction beyond discrete sequence indexing. By reconstructing motion at a high reference frame rate and optional downsampling, MBTI ensures both frame rate generalization and temporal consistency. To comprehensively evaluate MBTI, we introduce EMDB-FPS, an augmented benchmark designed to assess motion estimation robustness across multiple frame rates in both local and global motion estimation tasks. To further assess MBTI's robustness, we introduce the Motion Consistency across Frame rates (MCF), a novel metric to quantify the deviation of motion predictions across different input frame rates. Our results demonstrate that MBTI outperforms state-of-the-art methods in both motion accuracy and temporal consistency, achieving the most stable and consistent motion predictions across varying frame rates.
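The implicit positional encoding of absolute time can be illustrated with Fourier features of continuous timestamps fed to a small MLP, so that 30 fps and 60 fps samplings of the same interval map into one code space; the feature sizes and frequencies below are assumptions, not the paper's configuration.

```python
# Hedged sketch: continuous-time positional encoding via Fourier features + MLP.
import torch
import torch.nn as nn

class ImplicitTimeEncoding(nn.Module):
    def __init__(self, num_freqs=8, dim=128):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
        self.mlp = nn.Sequential(nn.Linear(2 * num_freqs, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, t_sec):                                    # t_sec: (B, T) absolute seconds
        ang = t_sec[..., None] * self.freqs.to(t_sec.device)     # (B, T, F)
        feats = torch.cat([ang.sin(), ang.cos()], dim=-1)
        return self.mlp(feats)                                   # (B, T, dim)

enc = ImplicitTimeEncoding()
t30 = torch.arange(0, 1, 1 / 30)[None]              # 30 fps timestamps over one second
t60 = torch.arange(0, 1, 1 / 60)[None]              # 60 fps timestamps over the same second
print(enc(t30).shape, enc(t60).shape)               # both sequences share one code space
```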
Motion Synthesis with Sparse and Flexible Keyjoint Control
Inwoo Hwang
Seoul National University
Jinseok Bae
Seoul National University
Donggeun Lim
Seoul National University
Young Min Kim
Seoul National University
Abstract
Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector positions for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. The shared second stage then synthesizes a natural whole-body motion that precisely satisfies the task requirement from the dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios. Project page: http://inwoohwang.me/SFControl
SceneMI: Motion In-betweening for Modeling Human-Scene Interaction
Inwoo Hwang
Seoul National University
Bing Zhou
Snap Inc.
Young Min Kim
Seoul National University
Jian Wang
Snap Inc.
Chuan Guo
Snap Inc.
Abstract
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos. Project page: http://inwoohwang.me/SceneMI
Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping
Alberto Jaenal
Ericsson Research
Paula Carbó Cubero
Ericsson Research
José Araujo
Ericsson Research
André Mateus
Ericsson Research
Abstract
The growing presence of vision-based systems in the physical world comes with a major requirement: highly accurate estimation of the pose, a task typically addressed through methods based on local features. All of the available feature-based localization solutions are designed under the assumption that the same feature is used for mapping and localization. However, as the implementation provided by each vendor is based on heterogeneous feature extraction algorithms, collaboration between different devices is not straightforward or even impossible. Although there are some alternatives, such as re-extracting the features or reconstructing the image from them, these are impractical or costly to implement in a real pipeline. To overcome this, and inspired by the seminal Cross-Descriptor work [13], we propose Cross-Feature, a method that applies a patch-based training strategy to a simple MLP that projects features into a common embedding space. As a consequence, our proposal allows suitable correspondences to be established between features computed by heterogeneous algorithms, e.g., SIFT [25] and SuperPoint [10]. We experimentally demonstrate the validity of Cross-Feature by evaluating it on tasks such as Image Matching, Visual Localization, and a new Collaborative Visual Localization and Mapping scenario. We believe this is the first step towards full Visual Localization interoperability. Code is available at https://github.com/EricssonResearch/crossfeat.
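A minimal sketch of projecting heterogeneous descriptors into a common embedding space is shown below; the per-extractor MLPs, dimensions, and nearest-neighbour matching are illustrative assumptions rather than the released Cross-Feature recipe (which trains patch-based).

```python
# Hedged sketch: map SIFT (128-D) and SuperPoint (256-D) descriptors into one
# shared space so cross-extractor matches can be established.
import torch
import torch.nn as nn

class ToCommonSpace(nn.Module):
    def __init__(self, in_dims={"sift": 128, "superpoint": 256}, out_dim=128):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                                    nn.Linear(256, out_dim))
                                   for k, d in in_dims.items()})
    def forward(self, desc, kind):
        return nn.functional.normalize(self.proj[kind](desc), dim=-1)

model = ToCommonSpace()
sift = torch.randn(500, 128)          # descriptors from device A (placeholder data)
sp = torch.randn(480, 256)            # descriptors from device B (placeholder data)
sim = model(sift, "sift") @ model(sp, "superpoint").T
matches = sim.argmax(dim=1)           # nearest neighbours in the common space
print(matches.shape)
```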
Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
SungMin Jang
Konkuk University
Wonjun Kim
Konkuk University
Abstract
Open-vocabulary 3D semantic segmentation has been actively studied by incorporating language features into 3D scene representations. Even though many methods have shown notable improvement on this task, they still have difficulty making language embeddings consistent across different views. This inconsistency often results in mis-labeling, where different language embeddings are assigned to the same part of an object. To address this issue, we propose a simple yet powerful method that aligns language embeddings via identity information. The key idea is to locate language embeddings for the same identity closely in the latent space while pushing them apart otherwise. This approach allows the same object to have identical language embeddings in novel views with accurate semantic masks, which are well aligned with the input text. Furthermore, we propose a progressive mask expanding scheme that enables more accurate extraction of semantic mask boundaries. This scheme is very effective in preserving the boundary shape of the target region by allowing the model to consider the local relationship between segments. Experimental results on benchmark datasets demonstrate that our method delivers state-of-the-art performance in open-vocabulary 3D semantic segmentation. https://github.com/DCVL-3D/ILGS release
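The identity-based alignment can be illustrated with a supervised-contrastive loss that pulls language embeddings of the same instance together and pushes different instances apart; this generic form and its tensor names are stand-ins for the paper's actual objective.

```python
# Hedged sketch: identity-aware alignment of per-primitive language embeddings.
import torch
import torch.nn.functional as F

def identity_alignment_loss(lang_emb, ids, temperature=0.1):
    """lang_emb: (N, D) language features; ids: (N,) integer instance identities."""
    z = F.normalize(lang_emb, dim=-1)
    sim = z @ z.T / temperature                              # (N, N) scaled cosine similarities
    same = (ids[:, None] == ids[None, :]).float()
    same.fill_diagonal_(0)                                   # exclude self-pairs as positives
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(torch.eye(len(ids), dtype=torch.bool), float("-inf")),
        dim=1, keepdim=True)
    denom = same.sum(1).clamp(min=1)
    return -(same * log_prob).sum(1).div(denom).mean()

emb = torch.randn(16, 512, requires_grad=True)
ids = torch.randint(0, 4, (16,))
print(identity_alignment_loss(emb, ids))
```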
Splat-based 3D Scene Reconstruction with Extreme Motion-blur
Hyeonjoong Jang
KAIST
Dongyoung Choi
KAIST
Donggun Kim
KAIST
Woohyun Kang
KAIST
Min H. Kim
KAIST
Abstract
We propose a splat-based 3D scene reconstruction method from RGB-D input that effectively handles extreme motion blur, a frequent challenge in low-light environments. Under dim illumination, RGB frames often suffer from severe motion blur due to extended exposure times, causing traditional camera pose estimation methods, such as COLMAP, to fail. This results in inaccurate camera pose and blurry color input, compromising the quality of 3D reconstructions. Although recent 3D reconstruction techniques like Neural Radiance Fields and Gaussian Splatting have demonstrated impressive results, they rely on accurate camera trajectory estimation, which becomes challenging under fast motion or poor lighting conditions. Furthermore, rapid camera movement and the limited field of view of depth sensors reduce point cloud overlap, limiting the effectiveness of pose estimation with the ICP algorithm. To address these issues, we introduce a method that combines camera pose estimation and image deblurring using a Gaussian Splatting framework, leveraging both 3D Gaussian splats and depth inputs for enhanced scene representation. Our method first aligns consecutive RGB-D frames through optical flow and ICP, then refines camera poses and 3D geometry by adjusting Gaussian positions for optimal depth alignment. To handle motion blur, we model camera movement during exposure and deblur images by comparing the input with a series of sharp, rendered frames. Experiments on a new RGB-D dataset with extreme motion blur show that our method outperforms existing approaches, enabling high-quality reconstructions even in challenging conditions. This approach has broad implications for 3D mapping applications in robotics, autonomous navigation, and augmented reality. Both code and dataset are publicly available on https://github.com/KAISTVCLAB/gs-extreme-motion-blur.
Sparfels: Fast Reconstruction from Sparse Unposed Imagery
Shubhendu Jena
Inria, Univ. Rennes, CNRS, IRISA
Amine Ouasfi
Inria, Univ. Rennes, CNRS, IRISA
Mae Younes
Inria, Univ. Rennes, CNRS, IRISA
Adnane Boukhayma
Inria, Univ. Rennes, CNRS, IRISA
Abstract
We present a method for Sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations, to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting on reconstruction and novel view benchmarks based on established multi-view datasets. Code will be made available at https://shubhendujena.github.io/Sparfels-web/
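The splatted color variance along a ray can be written directly from compositing weights and splat colors, as in the sketch below; the toy inputs and normalization are assumptions, but the quantity computed is the standard second moment minus squared mean under the per-ray weights.

```python
# Hedged sketch: per-ray weighted color variance from splat weights and colors.
import torch

def splatted_color_variance(weights, colors, eps=1e-8):
    """weights: (R, S) compositing weights per ray; colors: (R, S, 3) splat colors."""
    w = weights / (weights.sum(dim=1, keepdim=True) + eps)   # normalize along the ray
    mean = (w[..., None] * colors).sum(dim=1)                # (R, 3) expected color E[c]
    second = (w[..., None] * colors ** 2).sum(dim=1)         # (R, 3) second moment E[c^2]
    return (second - mean ** 2).clamp(min=0).sum(dim=-1)     # (R,) per-ray variance

weights = torch.rand(4096, 16)          # 16 overlapping splats per ray (placeholder)
colors = torch.rand(4096, 16, 3)
var = splatted_color_variance(weights, colors)
print(var.mean())                        # reducing this moment sharpens recovered shape
```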
Robust Adverse Weather Removal via Spectral-based Spatial Grouping
Yuhwan Jeong
KAIST
Yunseo Yang
KAIST
Youngho Yoon
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address these issues, we propose the Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations.
Test-Time Prompt Tuning for Zero-Shot Depth Completion
Chanhwi Jeong
GIST
Inhwan Bae
GIST
Jin-Hwi Park
Chung-Ang University
Hae-Gon Jeon
Yonsei University
Abstract
Zero-shot depth completion using metric scales remains challenging, primarily due to performance limitations such as domain specificity and sensor characteristics. One recent emerging solution is to integrate monocular depth foundation models into depth completion frameworks, yet such efforts still face issues with suboptimal performance and often require further adaptation to the target task. Surprisingly, we find that a simple test-time training, which finetunes monocular depth foundation models on sparse depth measurements from sensors just as it is, yields reasonable results. However, this test-time training obviously incurs high computational costs and introduces biases towards specific conditions, making it impractical for real-world scenarios. In this paper, we introduce a new approach toward parameter-efficient zero-shot depth completion. Our key idea in this work is to leverage visual prompt tuning, achieving sensor-specific depth scale adaptation without forgetting foundational knowledge. Experimental results on diverse datasets demonstrate that our approach outperforms relevant state-of-the-art methods, showing superior generalization and efficiency. Code is publicly available at https://github.com/ch5374/TestPromptDC
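A hedged sketch of the test-time adaptation idea follows: freeze the depth model and optimize only a small prompt against the sparse sensor depth. For simplicity the prompt is injected additively at the input, which is a simplification of visual prompt tuning inside the network; the toy model, function names, and shapes are assumptions.

```python
# Hedged sketch: tune only a prompt tensor at test time against sparse depth.
import torch

def tune_prompt(depth_model, image, sparse_depth, valid_mask, steps=50, lr=1e-2):
    """image: (1, 3, H, W); sparse_depth, valid_mask: (1, 1, H, W)."""
    for p in depth_model.parameters():
        p.requires_grad_(False)                              # keep foundational knowledge frozen
    prompt = torch.zeros(1, 3, *image.shape[-2:], requires_grad=True)
    opt = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        pred = depth_model(image + prompt)                   # prompt injected at the input here
        loss = ((pred - sparse_depth).abs() * valid_mask).sum() / valid_mask.sum().clamp(min=1)
        opt.zero_grad(); loss.backward(); opt.step()
    return prompt.detach()

# Stand-in "depth model" and data, only to show the optimization loop runs.
toy_model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
                                torch.nn.Conv2d(16, 1, 3, padding=1))
img = torch.rand(1, 3, 64, 64)
sparse = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) < 0.02).float()             # ~2% of pixels carry sensor depth
print(tune_prompt(toy_model, img, sparse, mask).shape)
```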
MMGeo: Multimodal Compositional Geo-Localization for UAVs
Yuxiang Ji
Institute of Artificial Intelligence, Xiamen University
Boyong He
Institute of Artificial Intelligence, Xiamen University
Zhuoyue Tan
Institute of Artificial Intelligence, Xiamen University
Liaoni Wu
Institute of Artificial Intelligence, Xiamen University
Abstract
Multimodal geo-localization methods can inherently overcome the limitations of unimodal sensor systems by leveraging complementary information from different modalities. However, existing retrieval-based methods rely on a comprehensive multimodal database, which is often challenging to fulfill in practice. In this paper, we introduce a more practical problem for localizing drone-view images by collaborating multimodal data within a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. We present MMGeo that learns to push the composition of multimodal representations to the target reference map through a unified framework. By utilizing a comprehensive multimodal query (image, point cloud/depth/text), we can achieve more robust and accurate geo-localization, especially in unknown and complex environments. Additionally, we extend two visual geo-localization datasets, GTA-UAV and UAV-VisLoc, to multi-modality, establishing the first UAV geo-localization datasets that combine image, point cloud, depth and text data. Experiments demonstrate the effectiveness of MMGeo for UAV multimodal compositional geo-localization, as well as the generalization capabilities to real-world scenarios. The code and dataset are at https://github.com/Yux1angJi/MMGeo.
OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving
Mingqian Ji
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Shanshan Zhang
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Jian Yang
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Abstract
Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noises. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, the side-product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2% mAP and 64.8% NDS on the nuScenes test benchmark. Code is available at https://github.com/Mingqj/OcRFDet.
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji
ShanghaiTech University
Ye Shi
ShanghaiTech University
Zichen Jin
ShanghaiTech University
Kangyi Chen
ShanghaiTech University
Lan Xu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Jingya Wang
ShanghaiTech University
Abstract
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration. Project page: https://humanx-interaction.github.io/
H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
Heng Jia
Zhejiang University
Linchao Zhu
Zhejiang University
Na Zhao
Singapore University of Technology and Design
Abstract
Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
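The Plücker-coordinate ray encoding used for camera-aware attention can be computed per pixel as a direction plus its moment, as sketched below under a simple pinhole assumption; the shapes and example intrinsics are illustrative.

```python
# Hedged sketch: 6D Plücker ray codes (direction d, moment o x d) for every pixel.
import torch

def plucker_rays(K, c2w, H, W):
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world; returns (H, W, 6)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs_cam = torch.stack([(u - K[0, 2]) / K[0, 0],
                            (v - K[1, 2]) / K[1, 1],
                            torch.ones_like(u)], dim=-1)          # (H, W, 3) camera-frame rays
    R, o = c2w[:3, :3], c2w[:3, 3]
    d = torch.nn.functional.normalize(dirs_cam @ R.T, dim=-1)     # world-space directions
    m = torch.cross(o.expand_as(d), d, dim=-1)                    # moment o x d
    return torch.cat([d, m], dim=-1)                              # independent of the point on the ray

K = torch.tensor([[300., 0, 128], [0, 300., 128], [0, 0, 1]])
rays = plucker_rays(K, torch.eye(4), 256, 256)
print(rays.shape)                                                  # torch.Size([256, 256, 6])
```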
PrimHOI: Compositional Human-Object Interaction via Reusable Primitives
Kai Jia
Beijing Institute of Technology
Tengyu Liu
National Key Laboratory of General Artificial Intelligence, BIGAI
Yixin Zhu
Peking University
Mingtao Pei
Beijing Institute of Technology
Siyuan Huang
National Key Laboratory of General Artificial Intelligence, BIGAI
Abstract
Synthesizing realistic Human-Object Interaction (HOI) motions is essential for creating believable digital characters and intelligent robots. Existing approaches rely on data-intensive learning models that struggle with the compositional structure of daily HOI motions, particularly for complex multi-object manipulation tasks. The exponential growth of possible interaction scenarios makes comprehensive data collection prohibitively expensive. The fundamental challenge is synthesizing unseen, complex HOI sequences without extensive task-specific training data. Here we show that PrimHOI generates complex HOI motions through spatial and temporal composition of generalizable interaction primitives defined by relative geometry. Our approach demonstrates that repetitive local contact patterns (grasping, clamping, and supporting) serve as reusable building blocks for diverse interaction sequences. Unlike previous data-driven methods requiring end-to-end training for each task variant, PrimHOI achieves zero-shot transfer to unseen scenarios through hierarchical primitive planning. Experimental validation demonstrates substantial improvements in adaptability, diversity, and motion quality compared to existing approaches.
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
Juntao Jian
Shenzhen University
Xiuping Liu
Dalian University of Technology
Zixuan Chen
Dalian University of Technology
Manyi Li
Shandong University
Jian Liu
Shenyang University of Technology
Ruizhen Hu
Shenzhen University
Abstract
Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. However, it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution serves as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate remarkable performance compared with existing approaches. Project page: https://g-dexgrasp.github.io/
Diffusion-based Source-biased Model for Single Domain Generalized Object Detection
Han Jiang
University of Science and Technology of China
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Yongdong Zhang
University of Science and Technology of China
Abstract
Single domain generalized object detection aims to train an object detector on a single source domain and generalize it to any unseen domain. Although existing approaches based on data augmentation exhibit promising results, they overlook domain discrepancies across multiple augmented domains, which limits the performance of object detectors. To tackle these problems, we propose a novel diffusion-based framework, termed SDG-DiffDet, to mitigate the impact of domain gaps on object detectors. The proposed SDG-DiffDet consists of a memory-guided diffusion module and a source-guided denoising module. Specifically, in the memory-guided diffusion module, we design feature statistics memories that mine diverse style information from local parts to augment source features. The augmented features further serve as noise in the diffusion process, enabling the model to capture differences between practical domain distributions. In the source-guided denoising module, we design a text-guided condition to facilitate distribution transfer from any unseen distribution to the source distribution in the denoising process. By combining these two designs, our proposed SDG-DiffDet effectively models feature augmentation and target-to-source distribution transfer within a unified diffusion framework, thereby enhancing the detection performance on unseen domains. Extensive experiments demonstrate that the proposed SDG-DiffDet achieves state-of-the-art performance across two challenging scenarios.
Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Zeren Jiang
University of Oxford
Chuanxia Zheng
University of Oxford
Iro Laina
University of Oxford
Diane Larlus
Naver Labs Europe
Andrea Vedaldi
University of Oxford
Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
Jianfei Jiang
University of Science and Technology Beijing
Qiankun Liu
University of Science and Technology Beijing
Haochen Yu
University of Science and Technology Beijing
Hongyuan Liu
University of Science and Technology Beijing
Liyong Wang
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Abstract
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
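For intuition on how a monocular prior can supervise only the relative structure of the predicted depth, here is a minimal sketch of a relative-consistency style loss, assuming a median/MAD normalization of both depth maps before an L1 comparison. The function name, normalization, and masking are illustrative assumptions; MonoMVSNet's actual loss may be formulated differently.

```python
import torch

def relative_consistency_loss(pred_depth, mono_depth, mask, eps=1e-6):
    """Scale/shift-invariant L1 consistency between MVS and monocular depth.

    pred_depth, mono_depth, mask: (B, H, W) tensors; mask marks valid pixels.
    Illustrative sketch only; statistics are computed over all pixels for brevity.
    """
    def normalize(d):
        # Median/MAD normalization removes the unknown scale and shift of
        # monocular depth so that only relative structure is compared.
        med = d.flatten(1).median(dim=1).values.view(-1, 1, 1)
        mad = (d - med).abs().flatten(1).mean(dim=1).view(-1, 1, 1)
        return (d - med) / (mad + eps)

    diff = (normalize(pred_depth) - normalize(mono_depth)).abs()
    return (diff * mask).sum() / (mask.sum() + eps)
```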
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Wen Jiang
University of Pennsylvania
Boshu Lei
University of Pennsylvania
Katrina Ashton
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
We present an active mapping system which plans for both long-horizon exploration goals and short-term actions using a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLM) or do not consider challenges in localization uncertainty, which is critical in embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based objective. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Hanxiao Jiang
Columbia University
Hao-Yu Hsu
University of Illinois Urbana-Champaign
Kaifeng Zhang
Columbia University
Hsin-Ni Yu
University of Illinois Urbana-Champaign
Shenlong Wang
University of Illinois Urbana-Champaign
Yunzhu Li
Columbia University
Abstract
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning. Project Page: https://jianghanxiao.github.io/phystwin-web/
Real3D: Towards Scaling Large Reconstruction Models with Real Images
Hanwen Jiang
The University of Texas at Austin
Qixing Huang
The University of Texas at Austin
Georgios Pavlakos
The University of Texas at Austin
Abstract
Training single-view Large Reconstruction Models (LRMs) follows the fully supervised route, requiring multi-view supervision. However, the multi-view data typically comes from synthetic 3D assets, which are hard to scale further and are not representative of the distribution of real-world object shapes. To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. Simultaneously, to deal with the noise of real data, Real3D also presents an automatic data curation approach to gather high-quality examples that have a positive impact on training. Our experiments show that Real3D consistently outperforms prior work in diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes.
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Jian-Jian Jiang
Sun Yat-sen University
Xiao-Ming Wu
Sun Yat-sen University
Yi-Xiang He
Sun Yat-sen University
Ling-An Zeng
Sun Yat-sen University
Yi-Lin Wei
Sun Yat-sen University
Dandan Zhang
Imperial College London
Wei-Shi Zheng
Sun Yat-sen University
Abstract
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to account for due to the cooperation they enforce on the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
Dadong Jiang
Tianjin University
Zhi Hou
Shanghai Artificial Intelligence Laboratory
Zhihui Ke
Tianjin University
Xianghui Yang
Tencent
Xiaobo Zhou
Tianjin University
Tie Qiu
Tianjin University
Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a simple yet effective plug-and-play module called TimeFormer to enable existing deformable 3D Gaussian reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments on multi-view and monocular dynamic scenes validate the qualitative and quantitative improvements brought by TimeFormer. Project page: https://patrickddj.github.io/TimeFormer
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
Minchao Jiang
School of Computer Science and Technology, Xidian University
Shunyu Jia
School of Computer Science and Technology, Xidian University
Jiaming Gu
Algorithm R&D Center, Qing Yi (Shanghai)
Xiaoyuan Lu
Shanghai Pudong Cryptography Research Institute
Guangming Zhu
School of Computer Science and Technology, Xidian University
Anqi Dong
Division of Decision and Control Systems and Department of Mathematics, KTH Royal Institute of Technology
Liang Zhang
School of Computer Science and Technology, Xidian University
Abstract
3D Gaussian Splatting (3DGS) has become a driving force in high-quality, real-time rendering for novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, the Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while avoiding semantic ambiguity. Extensive experiments, including ablation studies, demonstrate VoteSplat's effectiveness in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, and hierarchical segmentation. Our code is available at VoteSplat.
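To make the voting step concrete, the toy sketch below lets each Gaussian cast a 3D vote (its position plus a learned offset), accumulates the votes in a voxel grid, and returns the densest cell as an object center. The grid-based accumulation, names, and cell size are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def hough_vote_center(gauss_xyz, offsets, cell=0.05):
    """Accumulate 3D votes and return the center of the densest vote cell.

    gauss_xyz: (N, 3) Gaussian positions; offsets: (N, 3) offsets toward the
    instance center. Illustrative sketch only.
    """
    votes = gauss_xyz + offsets                       # each Gaussian casts one vote
    keys = np.floor(votes / cell).astype(np.int64)    # quantize votes into voxels
    uniq, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inv = inv.reshape(-1)
    best = np.argmax(counts)                          # densest voxel = putative center
    center = votes[inv == best].mean(axis=0)
    return center, int(counts[best])
```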
GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
Yifan Jiao
Institute of Software Chinese Academy of Sciences
Yunhao Li
Institute of Software Chinese Academy of Sciences
Junhua Ding
University of North Texas
Qing Yang
University of North Texas
Song Fu
University of North Texas
Heng Fan
University of North Texas
Libo Zhang
Institute of Software Chinese Academy of Sciences
Abstract
In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results exhibit that these models heavily degrade on GSOT3D, and more efforts are required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to further advance 3D tracking in future research and applications. Our benchmark and model as well as the evaluation toolkit and results are publicly available at https://github.com/ailovejinx/GSOT3D.
6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
Yufeng Jin
Computer Science Department, Technische Universität Darmstadt
Vignesh Prasad
Computer Science Department, Technische Universität Darmstadt
Snehal Jauhri
Computer Science Department, Technische Universität Darmstadt
Mathias Franzius
Honda Research Institute Europe GmbH
Georgia Chalvatzaki
Hessian.AI, Darmstadt
Abstract
Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation and tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Yudong Jin
Zhejiang University
Sida Peng
Zhejiang University
Xuan Wang
Ant Research
Tao Xie
Zhejiang University
Zhen Xu
Zhejiang University
Yifan Yang
Zhejiang University
Yujun Shen
Ant Research
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/.
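A rough sketch of the alternating spatio-temporal sliding schedule is shown below: windows first slide over viewpoints, then over timestamps, and each window of latents would be denoised jointly in turn. Window size, stride, and the loop structure are assumptions for illustration; the actual latent-grid denoiser is not reproduced.

```python
def sliding_windows(num_views, num_frames, win=4, stride=2):
    """Yield (axis, indices) pairs for alternating spatial/temporal denoising passes."""
    for start in range(0, max(num_views - win, 0) + 1, stride):
        yield "space", list(range(start, start + win))   # denoise across viewpoints
    for start in range(0, max(num_frames - win, 0) + 1, stride):
        yield "time", list(range(start, start + win))    # denoise across timestamps

# Usage sketch (denoise_step is a placeholder for one diffusion update):
# for axis, idx in sliding_windows(num_views=16, num_frames=48):
#     latent_grid = denoise_step(latent_grid, axis, idx)
```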
Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation
Shuo Jin
Xi'an Jiaotong-Liverpool University
Siyue Yu
Xi'an Jiaotong-Liverpool University
Bingfeng Zhang
China University of Petroleum (East China)
Mingjie Sun
Soochow University
Yi Dong
University of Liverpool
Jimin Xiao
Xi'an Jiaotong-Liverpool University
Abstract
Training-free open-vocabulary semantic segmentation has advanced with vision-language models like CLIP, which exhibit strong zero-shot abilities. However, CLIP's attention mechanism often wrongly emphasises specific image tokens, namely outliers, which results in irrelevant over-activation. Existing approaches struggle with these outliers that arise in intermediate layers and propagate through the model, ultimately degrading spatial perception. In this paper, we propose a Self-adaptive Feature Purifier framework (SFP) to suppress propagated outliers and enhance semantic representations for open-vocabulary semantic segmentation. Specifically, based on an in-depth analysis of attention responses between image and class tokens, we design a self-adaptive outlier mitigator to detect and mitigate outliers at each layer for propagated feature purification. In addition, we introduce a semantic-aware attention enhancer to augment attention intensity in semantically relevant regions, which strengthens the purified feature to focus on objects. Further, we introduce a hierarchical attention integrator to aggregate multi-layer attention maps to refine spatially coherent feature representations for final segmentation. Our proposed SFP enables robust outlier suppression and object-centric feature representation, leading to more precise segmentation. Extensive experiments show that our method achieves state-of-the-art performance and surpasses existing methods by an average of 4.6% mIoU on eight segmentation benchmarks. The code is released at: https://github.com/Kimsure/SFP.
GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer
Xin Jin
Chang'an University
Haisheng Su
Shanghai Jiao Tong University
Cong Ma
SenseAuto Research
Kai Liu
SenseAuto Research
Wei Wu
SenseAuto Research
Fei Hui
Chang'an University
Junchi Yan
Shanghai Jiao Tong University
Abstract
Lidar-based 3D detection is one of the most popular research fields in autonomous driving. 3D detectors typically detect specific targets in a scene according to the pattern formed by the spatial distribution of point clouds. However, existing voxel-based methods usually adopt MLP and global pooling (e.g., PointNet, CenterPoint) as the voxel feature encoder, which makes it less effective to extract detailed spatial structure information from raw points, leading to information loss and inferior performance. In this paper, we propose a novel graph-based transformer to encode voxel features by condensing the full and detailed geometry of points, termed GeoFormer. We first represent points within a voxel as a graph, based on relative distances, to capture its spatial geometry. Then, we introduce a geometry-guided transformer architecture to encode voxel features, where the adjacent geometric clues are used to re-weight point feature similarities, enabling more effective extraction of geometric relationships between point pairs at varying distances. We highlight that GeoFormer is a plug-and-play module which can be seamlessly integrated to enhance the performance of existing voxel-based detectors. Extensive experiments conducted on three popular outdoor datasets demonstrate that our GeoFormer achieves state-of-the-art performance in both effectiveness and robustness comparisons.
Stereo Any Video: Temporally Consistent Stereo Matching
Junpeng Jing
Imperial College London
Weixun Luo
Imperial College London
Ye Mao
Imperial College London
Krystian Mikolajczyk
Imperial College London
Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively enhance robustness, accuracy, and temporal consistency, establishing a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
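For readers unfamiliar with correlation volumes, the sketch below builds the standard per-row all-pairs cost volume between left and right stereo features; the paper's all-to-all-pairs variant additionally correlates across time, which is not shown. Shapes and the scaling factor are illustrative assumptions.

```python
import torch

def all_pairs_correlation(feat_left, feat_right):
    """Per-row all-pairs correlation between stereo feature maps.

    feat_left, feat_right: (B, C, H, W). Returns a (B, H, W, W) volume of
    scaled dot-product similarities between every left/right pixel pair on
    the same image row. Illustrative sketch only.
    """
    B, C, H, W = feat_left.shape
    fl = feat_left.permute(0, 2, 3, 1)   # (B, H, W, C)
    fr = feat_right.permute(0, 2, 1, 3)  # (B, H, C, W)
    return torch.matmul(fl, fr) / C ** 0.5
```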
Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
Hao Ju
University of Macau
Shaofei Huang
University of Macau
Si Liu
Beihang University
Zhedong Zheng
University of Macau
Abstract
Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent inter-platform matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from the existing transform methods, e.g., polar transform, our BEVs preserve more fine-grained details without significant distortion. To facilitate discriminative intra-platform representation learning, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at 30° and 45° elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other competitive methods, our proposed approach exhibits robustness at lower elevations with more occlusions. The code is available at: https://github.com/HaoDot/Video2BEV-Open.
Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
Sanghun Jung
University of Washington
Jingjing Zheng
Amazon Lab126
Ke Zhang
Amazon Lab126
Nan Qiao
Amazon Lab126
Albert Y. C. Chen
Amazon Lab126
Lu Xia
Amazon Lab126
Chi Liu
Amazon Lab126
Yuyin Sun
Amazon Lab126
Xiao Zeng
Amazon Lab126
Hsiang-Wei Huang
University of Washington
Byron Boots
University of Washington
Min Sun
National Tsing Hua University
Cheng-Hao Kuo
Amazon Lab126
Abstract
Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed in existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows a two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapping or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain an object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
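One plausible reading of a "standardized maximum similarity" score is sketched below: each proposal's best class similarity is standardized against that class's similarity statistics over all proposals before thresholding. This is an assumption-laden illustration, not the paper's exact formula.

```python
import numpy as np

def standardized_max_similarity(sim):
    """sim: (P, C) cosine similarities between P proposals and C class prompts.

    Returns, per proposal, its best class and a z-scored version of the best
    similarity; low scores can be thresholded away as false positives.
    Illustrative sketch only.
    """
    mu = sim.mean(axis=0, keepdims=True)             # per-class mean over proposals
    sigma = sim.std(axis=0, keepdims=True) + 1e-6    # per-class spread
    z = (sim - mu) / sigma
    best_cls = sim.argmax(axis=1)
    sms = z[np.arange(sim.shape[0]), best_cls]
    return best_cls, sms
```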
IM360: Large-scale Indoor Mapping with 360 Cameras
Dongki Jung
University of Maryland, College Park
Jaehoon Choi
University of Maryland, College Park
Yonghan Lee
University of Maryland, College Park
Dinesh Manocha
University of Maryland, College Park
Abstract
We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360° images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D. Project page: https://jdk9405.github.io/IM360/
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
Changwon Kang
Hanyang University
Jisong Kim
Hanyang University
Hongjae Shin
Seoul National University
Junseo Park
Seoul National University
Jun Won Choi
Seoul National University
Abstract
The goal of multi-task learning is to learn to conduct multiple tasks simultaneously based on a shared data representation. While this approach can improve learning efficiency, it may also cause performance degradation due to task conflicts that arise when optimizing the model for different objectives. To address this challenge, we introduce MAESTRO, a structured framework designed to generate task-specific features and mitigate feature interference in multi-task 3D perception, including 3D object detection, bird's-eye view (BEV) map segmentation, and 3D occupancy prediction. MAESTRO comprises three components: the Class-wise Prototype Generator (CPG), the Task-Specific Feature Generator (TSFG), and the Scene Prototype Aggregator (SPA). CPG groups class categories into foreground and background groups and generates group-wise prototypes. The foreground and background prototypes are assigned to the 3D object detection task and the map segmentation task, respectively, while both are assigned to the 3D occupancy prediction task. TSFG leverages these prototype groups to retain task-relevant features while suppressing irrelevant features, thereby enhancing the performance of each task. SPA enhances the prototype groups assigned to 3D occupancy prediction by utilizing the information produced by the 3D object detection head and the map segmentation head. Extensive experiments on the nuScenes and Occ3D benchmarks demonstrate that MAESTRO consistently outperforms existing methods across 3D object detection, BEV map segmentation, and 3D occupancy prediction tasks.
Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
Jae-Young Kang
KAIST
Hoonhee Cho
KAIST
Kuk-Jin Yoon
KAIST
Abstract
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev
Meta AI
Yuri Makarov
Meta AI
Jianyuan Wang
Meta AI
Natalia Neverova
Meta AI
Andrea Vedaldi
Meta AI
Christian Rupprecht
Visual Geometry Group, University of Oxford
Abstract
We introduce CoTracker3, a new state-of-the-art point tracker. With CoTracker3, we revisit the design of recent trackers, removing components and reducing the number of parameters while also improving performance. We also explore the interplay of synthetic and real data. Recent trackers are trained on synthetic videos due to the difficulty of collecting tracking annotations for real data. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. We thus suggest using off-the-shelf trackers as teachers to annotate real videos with pseudo-labels. Compared to other recent attempts at using real data for learning trackers, this scheme is much simpler and achieves better results using 1,000 times less data. CoTracker3 is available here in online (causal) and offline variants.
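The sketch below illustrates one simple way to turn several off-the-shelf teacher trackers into pseudo-labels: keep the consensus track wherever the teachers agree within a pixel tolerance. The agreement rule and array shapes are our assumptions for illustration, not CoTracker3's actual labelling pipeline.

```python
import numpy as np

def pseudo_label_tracks(teacher_preds, agree_px=2.0):
    """Fuse teacher predictions into pseudo-labels with an agreement mask.

    teacher_preds: list of (T, N, 2) arrays, one per teacher tracker, giving
    tracked positions over T frames for N query points. Illustrative sketch only.
    """
    preds = np.stack(teacher_preds)                                  # (K, T, N, 2)
    consensus = preds.mean(axis=0)                                   # (T, N, 2)
    spread = np.linalg.norm(preds - consensus, axis=-1).max(axis=0)  # (T, N)
    valid = spread < agree_px                                        # keep agreeing points only
    return consensus, valid
```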
Towards Safer and Understandable Driver Intention Prediction
Mukilan Karuppasamy
IIIT Hyderabad
Shankar Gangisetty
IIIT Hyderabad
Shyam Nandan Rai
Politecnico di Torino
Carlo Masone
Politecnico di Torino
C V Jawahar
IIIT Hyderabad
Abstract
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before the maneuver occurs, which is crucial for driver safety, i.e., driver intent prediction (DIP), a task that plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatiotemporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multi-label t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
Princeton365: A Diverse Dataset with Accurate Camera Pose
Karhan Kayan
Princeton University
Stamatis Alexandropoulos
Princeton University
Rishabh Jain
Princeton University
Yiming Zuo
Princeton University
Erich Liang
Princeton University
Jia Deng
Princeton University
Abstract
We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360° camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our metric allows comparison of SLAM performance across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360° camera trajectories. Please visit princeton365.cs.princeton.edu for the dataset, code, videos, and submission.
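To give a feel for a flow-induced pose-error measure, the sketch below projects a set of world points with both the ground-truth and the estimated camera pose and reports the mean pixel displacement between the two projections, i.e., the optical flow the pose error would induce. Point sampling and aggregation here are assumptions; the benchmark defines the actual metric.

```python
import numpy as np

def mean_induced_flow(points_w, K, T_gt, T_est):
    """Mean pixel displacement induced by a pose error.

    points_w: (N, 3) world points; K: (3, 3) intrinsics;
    T_gt, T_est: (4, 4) world-to-camera poses. Illustrative sketch only.
    """
    def project(T):
        pc = T[:3, :3] @ points_w.T + T[:3, 3:4]   # camera-space points, (3, N)
        uv = (K @ pc)[:2] / pc[2:3]                # perspective projection, (2, N)
        return uv.T
    flow = project(T_est) - project(T_gt)
    return float(np.linalg.norm(flow, axis=1).mean())
```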
Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
Wajahat Khalid
School of Cyber Science and Technology, University of Science and Technology of China
Bin Liu
School of Cyber Science and Technology, University of Science and Technology of China
Xulin Li
School of Cyber Science and Technology, University of Science and Technology of China
Muhammad Waqas
School of Cyber Science and Technology, University of Science and Technology of China
Muhammad Sher Afgan
School of Cyber Science and Technology, University of Science and Technology of China
Abstract
Aerial-Ground Person Re-Identification (AG-ReID) is a practical yet challenging task that involves cross-platform matching between aerial and ground cameras. Existing person Re-Identification (Re-ID) methods are primarily designed for homogeneous camera settings, such as ground-to-ground or aerial-to-aerial matching. Therefore, these conventional Re-ID approaches underperform due to the significant viewpoint discrepancies introduced by cross-platform cameras in the AG-ReID task. To address this limitation, we propose a novel and efficient approach, termed View-Invariant Feature Learning for Aerial-Ground Person Re-Identification (VIF-AGReID), which explores view-invariant features without leveraging any auxiliary information. Our approach introduces two key components: (1) Patch-Level RotateMix (PLRM), an augmentation strategy that enhances rotational diversity within local regions of training samples, enabling the model to capture fine-grained view-invariant features, and (2) View-Invariant Angular Loss (VIAL), which mitigates the impact of perspective variations by imposing angular constraints that exponentially penalize large angular deviations, optimizing the similarity of positive pairs while enhancing dissimilarity for hard negatives. These components interact synergistically to drive view-invariant feature learning, enhancing robustness across diverse viewpoints. Extensive experiments on the CARGO, AG-ReIDv1, and AG-ReIDv2 benchmarks demonstrate the effectiveness of our method in addressing the AG-ReID task.
CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching
Minjoo Ki
Yonsei University
Daejung Kim
Naver Labs
Kisung Kim
Naver Labs
Seon Joo Kim
Yonsei University
Jinhan Lee
Naver Labs
Abstract
Text-to-video retrieval is a powerful tool for navigating vast video databases. This is especially useful in autonomous driving to retrieve scenes from a text query to simulate and evaluate a driving system in desired scenarios. However, traditional ranking-based retrieval methods often return partial matches that fail to satisfy all query conditions. To address this, we introduce Inclusive Text-to-Video Retrieval, which retrieves only videos that meet all specified conditions, regardless of additional irrelevant elements. We propose CARIM, a driving scene retrieval framework that employs inclusive text matching. By utilizing a Vision-Language Model and a Large Language Model to generate compressed captions for driving scenes, we reformulate text-to-video retrieval as a more efficient text-to-text retrieval problem, eliminating modality mismatch and heavy annotation cost. We present a novel positive and negative data curation strategy and an attention-based scoring mechanism tailored for driving scene retrieval. Experiments show that CARIM outperforms state-of-the-art retrieval methods, excelling in edge cases where traditional models fail.
Removing Cost Volumes from Optical Flow Estimators
Simon Kiefhaber
Department of Computer Science, Technical University of Darmstadt
Stefan Roth
Department of Computer Science, Technical University of Darmstadt
Simone Schaub-Meyer
Department of Computer Science, Technical University of Darmstadt
Abstract
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being 1.2x faster and having a 6x lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at 20 FPS using only 500 MB of GPU memory.
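One way to realize removing the cost volume over the course of training is to scale the cost-volume branch by a weight that decays to zero partway through training, after which the branch can be dropped entirely at inference. The toy schedule below is only an assumption used to illustrate the idea, not the paper's training strategy.

```python
def cost_volume_weight(step, total_steps, drop_start=0.6):
    """Multiplier for the cost-volume branch: 1.0 early, linearly decayed to 0.0.

    drop_start is the fraction of training after which the decay begins.
    Illustrative sketch only.
    """
    frac = step / float(total_steps)
    if frac < drop_start:
        return 1.0
    return max(0.0, 1.0 - (frac - drop_start) / (1.0 - drop_start))
```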
2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
Jeongyun Kim
Seoul National University
Seunghoon Jeong
Seoul National University
Giseop Kim
DGIST
Myung-Hwan Jeon
Kumoh National Institute of Technology
Eunji Jun
Hyundai Motor Group
Ayoung Kim
Seoul National University
Abstract
Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain-reaction movement of remaining objects without the need for rescanning. TRAN-D is evaluated on both synthetic and real-world sequences, and it consistently demonstrates robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, TRAN-D reduces the mean absolute error by over 39% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, TRAN-D reaches a δ < 2.5 cm accuracy of 48.46%, over 1.5 times that of baselines, which use six images. Code and more results are available at https://jeongyun0609.github.io/TRAN-D/.
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Junho Kim
EverEx
Hyungjin Chung
EverEx
Byung-Hoon Kim
EverEx
Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pretrained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. Our method encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints, while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art on the MP-100 benchmark in the 1-shot and even 5-shot settings, marking a significant advancement in the field of category-agnostic pose estimation. Code is available here.
DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models
Hyeonwoo Kim
Seoul National University
Sangwon Baik
Seoul National University
Hanbyul Joo
Seoul National University
Abstract
Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such an ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with a Low-Rank Adaptation (LoRA) module, which fine-tunes a pre-trained MDM to learn HOI motion concepts from limited HOI motion samples, and (2) a motion diffusion model for 4D object poses conditioned on the produced human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create novel HOI motions, even for multiple HOI motion concepts, demonstrating the advantage of our pipeline with LoRA in integrating dynamic HOI concepts. Through extensive experiments, we demonstrate that DAViD outperforms baselines in synthesizing HOI motion.
From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
Youngho Kim
KAIST
Hoonhee Cho
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
Gwanghyun Kim
NVIDIA
Xueting Li
NVIDIA
Ye Yuan
NVIDIA
Koki Nagano
NVIDIA
Tianye Li
NVIDIA
Jan Kautz
NVIDIA
Se Young Chun
Seoul National University
Umar Iqbal
NVIDIA
Abstract
Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
Learning 3D Scene Analogies with Neural Contextual Scene Maps
Junho Kim
Seoul National University
Gwangtak Bae
Seoul National University
Eun Sun Lee
Seoul National University
Young Min Kim
Seoul National University
Abstract
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. Project page including the code is available at: https://82magnolia.github.io/3d_scene_analogies/.
Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
Wontae Kim
IPAI, Seoul National University
Keuntek Lee
Department of ECE, INMC, Seoul National University
Nam Ik Cho
IPAI, Seoul National University
Abstract
The image enhancement methods based on 3D lookup tables (3D LUTs) efficiently reduce both model size and runtime by interpolating pre-calculated values at the vertices. However, the 3D LUT methods have a limitation due to their lack of spatial information, as they convert color values on a point-by-point basis. Although spatial-aware 3D LUT methods address this limitation, they introduce additional modules that require a substantial number of parameters, leading to increased runtime as image resolution increases. To address this issue, we propose a method for generating image-adaptive LUTs by focusing on the redundant parts of the tables. Our efficient framework decomposes a 3D LUT into a linear sum of low-dimensional LUTs and employs singular value decomposition (SVD). Furthermore, we enhance the modules for spatial feature fusion to be more cache-efficient. Extensive experimental results demonstrate that our model effectively decreases both the number of parameters and runtime while maintaining spatial awareness and performance. The code is available at https://github.com/WontaeaeKim/SVDLUT.
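As a small illustration of the redundancy argument, the sketch below flattens a 3D LUT, keeps only its top singular components, and measures the error of the resulting low-rank approximation. The particular flattening and rank are assumptions; the paper's decomposition into low-dimensional LUTs may be structured differently.

```python
import numpy as np

def low_rank_lut(lut, rank=8):
    """Rank-r SVD approximation of a (D, D, D, 3) color lookup table.

    Returns the approximated LUT and its mean absolute error. Illustrative sketch only.
    """
    D = lut.shape[0]
    mat = lut.reshape(D, -1)                        # flatten one color axis vs. the rest
    U, S, Vt = np.linalg.svd(mat, full_matrices=False)
    approx = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # keep the top-r singular components
    approx = approx.reshape(lut.shape)
    return approx, float(np.abs(approx - lut).mean())
```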
PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation
Jun-Hee Kim
Korea University
Jumin Han
Korea University
Seong-Whan Lee
Korea University
Abstract
Standard 3D human pose estimation (HPE) benchmarks employ root-centering, which normalizes poses relative to the pelvis but discards absolute root position information. While effective for evaluation, this approach limits real-world applications such as motion tracking, AR/VR, and human-computer interaction, where absolute root position is essential. Moreover, incorporating root position into these models often leads to performance degradation. To address these limitations, we introduce PoseAnchor, a unified framework that seamlessly integrates root position estimation while improving overall pose accuracy. PoseAnchor leverages Iterative Hard Thresholding Robust Least Squares Regression (ITRR), a novel robust regression approach introduced to 3D HPE for the first time. ITRR effectively mitigates the impact of noisy 2D detections, enabling more accurate root position estimation. With ITRR, PoseAnchor enables zero-shot root localization, allowing existing models to estimate absolute root positions without retraining or architectural modifications. ITRR identifies a support set of reliable joints based on their spatial relationships to achieve robust root estimation, effectively filtering out unreliable joints. Beyond zero-shot localization, PoseAnchor incorporates ITRR into a Data-Driven Training framework that selectively utilizes the support set to optimize pose learning. By dynamically filtering high-confidence joint data, PoseAnchor mitigates noise while improving robustness.
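The support-set idea described above, alternating a least-squares fit with hard thresholding of residuals to keep only reliable observations, can be sketched generically as follows. This is a textbook-style iterative-hard-thresholding robust regression loop, not the exact ITRR of PoseAnchor; the line-fitting toy data are purely illustrative.

```python
import numpy as np

def robust_lstsq_hard_threshold(A, b, support_size, n_iters=10):
    """Generic iterative-hard-thresholding robust least squares (illustrative sketch).

    Alternates: (1) solve least squares on the current support set of rows,
    (2) re-select the `support_size` rows with the smallest residuals.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # start from all observations
    support = np.arange(A.shape[0])
    for _ in range(n_iters):
        residuals = np.abs(A @ x - b)
        support = np.argsort(residuals)[:support_size]
        x = np.linalg.lstsq(A[support], b[support], rcond=None)[0]
    return x, support

# toy example: recover y = 2x + 1 despite a few gross outliers
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 30)
ys = 2 * xs + 1 + 0.01 * rng.normal(size=30)
ys[:5] += 5.0                                          # corrupt 5 observations
A = np.stack([xs, np.ones_like(xs)], axis=1)
x_hat, inliers = robust_lstsq_hard_threshold(A, ys, support_size=22)
print(x_hat)                                           # close to [2, 1]
```

In the pose setting, the rows of A would come from per-joint geometric constraints on the root, and the support set plays the role of the reliable joints the abstract refers to.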
Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
Min Kim
KAIST
Younho Jeon
KAIST
Sungho Jo
KAIST
Abstract
Wearable Inertial Measurement Units (IMUs) allow non-intrusive motion tracking, but limited sensor placements can introduce uncertainty in capturing detailed full-body movements. Existing methods mitigate this issue by selecting more physically plausible motion patterns but do not directly address inherent uncertainties in the data. We introduce the Probabilistic Inertial Poser (ProbIP), a novel probabilistic model that transforms sparse IMU data into human motion predictions without physical constraints. ProbIP utilizes RU-Mamba blocks to predict a matrix Fisher distribution over rotations, effectively estimating both rotation matrices and associated uncertainties. To refine motion distribution through layers, our Progressive Distribution Narrowing (PDN) technique enables stable learning across a diverse range of motions. Experimental results demonstrate that ProbIP achieves state-of-the-art performance on multiple public datasets with six or fewer IMU sensors. Our contributions include the development of ProbIP with RU-Mamba blocks for probabilistic motion estimation, applying Progressive Distribution Narrowing (PDN) for uncertainty reduction, and evidence of superior results with six-sensor and reduced-sensor configurations. The code will be available at https://github.com/MinKim14/ProbIP-ICCV2025.
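The matrix Fisher distribution over rotations mentioned above has a standard closed-form mode: projecting its parameter matrix onto SO(3) via an SVD, with the singular values indicating concentration (i.e. how certain the rotation is). The sketch below shows only this standard result, not ProbIP's RU-Mamba parameterization.

```python
import numpy as np

def matrix_fisher_mode(F):
    """Mode of a matrix Fisher distribution MF(R; F) over SO(3).

    Standard result: with F = U diag(s) V^T, the mode is U diag(1, 1, det(U V^T)) V^T,
    i.e. the projection of F onto SO(3). Larger singular values mean higher
    concentration (lower rotational uncertainty).
    """
    U, s, Vt = np.linalg.svd(F)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt, s

# sanity check: a parameter matrix proportional to a rotation recovers that rotation
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
mode, concentration = matrix_fisher_mode(10.0 * Rz)    # scale acts as concentration
print(np.allclose(mode, Rz), concentration)
```

Predicting F directly therefore gives both a rotation estimate (the mode) and an uncertainty measure (the singular values) from a single network output.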
SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
Jongsuk Kim
KAIST
Jaeyoung Lee
KAIST
Gyojin Han
KAIST
Dong-Jae Lee
KAIST
Minki Jeong
AI Center, Samsung Electronics
Junmo Kim
KAIST
Abstract
Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving (E2E AD). Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird's-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.
Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging
Ruangrawee Kitichotkul
Boston University
Shashwath Bharadwaj
Boston University
Joshua Rapp
Mitsubishi Electric Research Laboratories
Yanting Ma
Mitsubishi Electric Research Laboratories
Alexander Mehta
University of California, Berkeley
Vivek K Goyal
Boston University
Abstract
Conventional wisdom suggests that single-photon lidar (SPL) should operate in low-light conditions (< 0.05 photons per laser pulse repetition) to minimize dead-time effects. Many methods have been developed to mitigate these effects in synchronous SPL systems. However, solutions for free-running SPL remain limited despite the advantage of reduced histogram distortion from dead times. To improve the accuracy of free-running SPL, we propose a computationally efficient joint maximum likelihood estimator of the signal flux, the background flux, and the depth using only histograms, along with a complementary regularization framework that incorporates a learned point cloud score model as a prior. Simulations and experiments demonstrate that free-running SPL yields lower estimation errors than its synchronous counterpart under identical conditions, with our regularization further improving accuracy.
DONUT: A Decoder-Only Model for Trajectory Prediction
Markus Knoche
RWTH Aachen University
Daan de Geus
RWTH Aachen University
Bastian Leibe
RWTH Aachen University
Abstract
Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an ‘overprediction' strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Through experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
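The decoder-only unrolling described above amounts to feeding the ever-growing sequence of past and already-predicted waypoints back into a single model at every step. The sketch below shows that rollout loop with a hypothetical model interface and a dummy constant-velocity "model"; DONUT's architecture and overprediction head are not reproduced.

```python
import torch

def autoregressive_rollout(model, history, n_future_steps):
    """Decoder-only unrolling of a trajectory (illustrative sketch).

    model:   callable mapping a (B, T, 2) sequence of xy waypoints to the next
             waypoint of shape (B, 2) -- hypothetical interface.
    history: (B, T_obs, 2) observed past trajectory.
    """
    seq, preds = history, []
    for _ in range(n_future_steps):
        next_wp = model(seq)                        # always sees up-to-date context
        preds.append(next_wp)
        seq = torch.cat([seq, next_wp[:, None]], dim=1)
    return torch.stack(preds, dim=1)                # (B, n_future_steps, 2)

# dummy "model": constant-velocity extrapolation, just to exercise the loop
def const_velocity(seq):
    return seq[:, -1] + (seq[:, -1] - seq[:, -2])

hist = torch.tensor([[[0.0, 0.0], [1.0, 0.0]]])     # moving +1 in x per step
print(autoregressive_rollout(const_velocity, hist, 3))   # [[2,0],[3,0],[4,0]]
```

An overprediction variant would additionally ask the model for waypoints several steps ahead at each iteration as an auxiliary target, while only the next waypoint is appended to the sequence.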
GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
Karlo Koledić
University of Zagreb Faculty of Electrical Engineering and Computing
Luka Petrović
University of Zagreb Faculty of Electrical Engineering and Computing
Ivan Marković
University of Zagreb Faculty of Electrical Engineering and Computing
Ivan Petrović
University of Zagreb Faculty of Electrical Engineering and Computing
Abstract
Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup. Project website: https://unizgferlamor.github.io/gvdepth/
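The vertical-image-position cue referred to above follows from simple pinhole geometry: for a point on a flat ground plane seen by a level camera at height h, depth is inversely proportional to how far below the horizon the point projects. The sketch below shows only this classical cue, not GVDepth's canonical representation or probabilistic fusion.

```python
import numpy as np

def ground_plane_depth(v, fy, cy, cam_height):
    """Depth of a ground-plane pixel under a pinhole model with a level camera.

    v:          vertical pixel coordinate(s) of a point on the ground (below the horizon).
    fy, cy:     vertical focal length and principal point, in pixels.
    cam_height: camera height above the ground plane, in meters.

    For a level camera the horizon maps to v = cy, and projecting (X, h, Z) gives
    v - cy = fy * h / Z, hence Z = fy * h / (v - cy).
    """
    return fy * cam_height / (np.asarray(v, dtype=float) - cy)

# a ground pixel 100 px below the principal point, fy = 1000 px, camera 1.5 m high
print(ground_plane_depth(v=600, fy=1000.0, cy=500.0, cam_height=1.5))   # 15.0 m
```

Because the formula depends on fy and the camera mounting, a naive network can overfit to one setup; a canonical representation that factors these parameters out is what enables the cross-dataset generalization claimed above.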
Embodied Navigation with Auxiliary Task of Action Description Prediction
Haru Kondoh
Institute of Science Tokyo
Asako Kanezaki
Institute of Science Tokyo
Abstract
The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction
Chamin Hewa Koneputugodage
The Australian National University
Dylan Campbell
The Australian National University
Stephen Gould
The Australian National University
Abstract
Recent methods for point cloud surface normal estimation predominantly use the generalized winding number field induced by the normals. Optimizing the field towards satisfying desired properties, such as the input points being on the surface defined by the field, provides a principled way to obtain globally consistent surface normals. However, we show that the existing winding number formulation for point clouds is a poor approximation near the input surface points, diverging as the query point approaches a surface point. This is problematic for methods that rely on the accuracy and stability of this approximation, requiring heuristics to compensate. Instead, we derive a more accurate approximation that is properly bounded and converges to the correct value. We then examine two distinct approaches that optimize for globally consistent normals using point cloud winding numbers. We show how the original unbounded formulation influences key design choices in both methods and demonstrate that substituting our formulation yields substantive improvements with respect to normal estimation and surface reconstruction accuracy.
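For reference, the existing point-cloud winding number formulation that the abstract critiques is a dipole sum over oriented, area-weighted surface samples. The sketch below implements that standard formulation (which indeed blows up as the query approaches a sample, since the denominator vanishes); the paper's bounded alternative is not reproduced here.

```python
import numpy as np

def winding_number(query, points, normals, areas):
    """Standard generalized winding number for an oriented point cloud.

    query:   (3,) query point q
    points:  (N, 3) surface samples p_i
    normals: (N, 3) unit outward normals n_i
    areas:   (N,) per-point area weights a_i

    w(q) = sum_i a_i * (p_i - q) . n_i / (4 * pi * ||p_i - q||^3)
    """
    d = points - query
    r = np.linalg.norm(d, axis=1)
    return np.sum(areas * np.einsum("ij,ij->i", d, normals) / (4.0 * np.pi * r**3))

# sanity check on a unit sphere: w ~ 1 inside, ~ 0 outside
rng = np.random.default_rng(0)
n = rng.normal(size=(20000, 3))
p = n / np.linalg.norm(n, axis=1, keepdims=True)     # points on the sphere, normals = p
areas = np.full(len(p), 4.0 * np.pi / len(p))        # equal area weights
print(winding_number(np.zeros(3), p, p, areas))              # ~ 1.0 (inside)
print(winding_number(np.array([0.0, 0.0, 3.0]), p, p, areas))  # ~ 0.0 (outside)
```

The divergence near surface points is visible directly in the r**3 denominator, which is the behavior the improved bounded formulation is designed to remove.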
EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
Athinoulla Konstantinou
University of Aberdeen
Georgios Leontidis
University of Aberdeen
Mamatha Thota
University of Lincoln
Aiden Durrant
University of Aberdeen
Abstract
Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on geometric tasks, including rotation and translation, achieving a supervised-level R2 of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 R2, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures. Code, dataset, and weights are released at http://github.com/AberdeenML/EquiCaps.
RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration
Longxin Kou
Tianjin University
Fei Ni
Tianjin University
Yan Zheng
Tianjin University
Peilong Han
Tianjin University
Jinyi Liu
Tianjin University
Haiqin Cui
Tianjin University
Rui Liu
Tianjin University
Jianye Hao
Tianjin University
Abstract
Recent advances in robotics have produced numerous valuable large-scale demonstration datasets, yet their potential remains underutilized due to annotation limitations. Current datasets often suffer from sparse temporal annotations and inconsistent labeling granularity, particularly for complex long-horizon demonstrations. Traditional manual annotation methods are expensive and poorly scalable, while existing automated methods struggle with temporal coherence and semantic richness across extended demonstrations. To this end, we propose RoboAnnotatorX, a reliable annotation tool that enhances a multimodal large language model to generate high-quality, context-rich annotations for complex long-horizon demonstrations. Specifically, we introduce a multi-scale token-efficient encoder that maintains computational efficiency while simultaneously capturing fine-grained visual details and preserving temporal information by jointly integrating scene-level anchoring, clip-level temporal dynamics, and video-level global modeling. We further construct a comprehensive dataset, RoboXVQA, that synthesizes diverse QA pairs from both real-world and simulated data, bridging the significant domain gap in robotics demonstrations. Moreover, we leverage a curriculum-inspired three-stage training scheme to progressively develop capabilities from basic visual perception to sophisticated temporal reasoning. Extensive experiments demonstrate that RoboAnnotatorX significantly outperforms existing approaches in annotation quality and exhibits strong generalization across diverse robotic environments, helping unlock the full potential of existing robotic datasets. The details and visualizations are available at the project website.
Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
Jens U. Kreber
University of Augsburg
Joerg Stueckler
University of Augsburg
Abstract
Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP can improve constraint consistency and provides a tradeoff with generative ability.
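The two guidance signals described above can be illustrated directly from per-part SDFs: an alignment term pulling observed scan points onto the object's zero level set, and a non-penetration term penalizing regions that lie inside two parts at once. The sketch below uses a hypothetical callable-SDF interface and analytic sphere SDFs as stand-ins for PhysNAP's learned part SDFs; it is not the paper's guidance implementation.

```python
import torch

def alignment_loss(part_sdfs, observed_points):
    """Pull the partial point cloud onto the surface of the generated object.

    part_sdfs:       list of callables, each mapping (M, 3) points to (M,) SDF values
                     (hypothetical interface).
    observed_points: (M, 3) partial scan points.
    The object surface is the min over part SDFs; scan points should lie on its zero set.
    """
    sdf = torch.stack([f(observed_points) for f in part_sdfs], dim=0).min(dim=0).values
    return (sdf ** 2).mean()

def non_penetration_loss(part_sdfs, sample_points):
    """Penalize volume shared by two different parts (both SDFs negative)."""
    sdf = torch.stack([f(sample_points) for f in part_sdfs], dim=0)   # (P, M)
    penalty = 0.0
    for i in range(len(part_sdfs)):
        for j in range(i + 1, len(part_sdfs)):
            overlap = torch.relu(-sdf[i]) * torch.relu(-sdf[j])       # > 0 only if inside both
            penalty = penalty + overlap.mean()
    return penalty

# toy parts: two unit spheres whose centers are 1.0 apart, so they interpenetrate
sphere = lambda c: (lambda x: torch.linalg.norm(x - c, dim=-1) - 1.0)
parts = [sphere(torch.tensor([0.0, 0.0, 0.0])), sphere(torch.tensor([1.0, 0.0, 0.0]))]
pts = torch.rand(1024, 3) * 4 - 2
print(alignment_loss(parts, pts[:64]))
print(non_penetration_loss(parts, pts) > 0)   # True: the spheres overlap
```

In a guided reverse-diffusion step, gradients of such losses with respect to the generated shape parameters would nudge samples toward alignment and physical plausibility.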
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz
Telekooperation Lab, Technical University Darmstadt
Max Mühlhäuser
Telekooperation Lab, Technical University Darmstadt
Alejandro Sanchez Guinea
Telekooperation Lab, Technical University Darmstadt
Abstract
Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding tasks, such as human activity recognition (HAR), retrieval, or person re-identification (RE-ID). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton↔Pointcloud↔IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments on MSR-Action3D and HMPEAR. Code and models are publicly available at https://github.com/thkreutz/despite.
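A common way to learn such a joint space across more than two modalities is to sum a CLIP-style symmetric InfoNCE loss over every pair of modalities on batch-aligned windows. The sketch below shows that generic construction; DeSPITE's exact pairing, encoders, and weighting may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_infonce(za, zb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE between two batches of embeddings (B, D)."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multimodal_contrastive_loss(embeddings):
    """Sum pairwise InfoNCE over all modality pairs.

    embeddings: dict of modality name -> (B, D) tensor, batch-aligned so that
    row i of every modality comes from the same time window.
    """
    names = list(embeddings)
    loss = 0.0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            loss = loss + pairwise_infonce(embeddings[names[i]], embeddings[names[j]])
    return loss

# toy batch of 8 synchronized samples with 32-dim embeddings per modality
B, D = 8, 32
feats = {m: torch.randn(B, D) for m in ["skeleton", "pointcloud", "imu", "text"]}
print(multimodal_contrastive_loss(feats))
```

Once trained, nearest-neighbor search in the shared space directly supports the matching and retrieval tasks listed in the abstract.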
Benchmarking Egocentric Visual-Inertial SLAM at City Scale
Anusha Krishnan
ETH Zürich
Shaohui Liu
ETH Zürich
Paul-Edouard Sarlin
Google
Oscar Gentilhomme
ETH Zürich
David Caruso
Meta Reality Labs Research
Maurizio Monge
Meta Reality Labs Research
Richard Newcombe
Meta Reality Labs Research
Jakob Engel
Meta Reality Labs Research
Marc Pollefeys
ETH Zürich, Microsoft Spatial AI Lab
Abstract
Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visualinertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at lamaria.ethz.ch.
CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
Abhinav Kumar
Michigan State University
Yuliang Guo
Bosch Research North America, Bosch Center for AI
Zhihao Zhang
Michigan State University
Xinyu Huang
Bosch Research North America, Bosch Center for AI
Liu Ren
Bosch Research North America, Bosch Center for AI
Xiaoming Liu
Michigan State University
Abstract
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than 45%, achieving SoTA performance on the CARLA dataset.
Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
Pulkit Kumar
University of Maryland, College Park
Shuaiyi Huang
University of Maryland, College Park
Matthew Walmer
University of Maryland, College Park
Sai Saketh Rambhatla
University of Maryland, College Park, GenAI, Meta
Abhinav Shrivastava
University of Maryland, College Park
Abstract
Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. Our project page is available here.
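A Histogram of Oriented Displacements of the kind named above can be built by binning frame-to-frame displacement directions of a tracked point, weighted by displacement magnitude. The sketch below is a common construction of such a descriptor, not necessarily Trokens' exact variant (which may use different binning or normalization).

```python
import numpy as np

def histogram_of_oriented_displacements(traj, n_bins=8):
    """Histogram of Oriented Displacements for one 2D track (illustrative sketch).

    traj: (T, 2) pixel positions of a tracked point over T frames.
    Each frame-to-frame displacement votes into an orientation bin,
    weighted by its magnitude, then the histogram is L1-normalized.
    """
    disp = np.diff(traj, axis=0)                          # (T-1, 2)
    angles = np.arctan2(disp[:, 1], disp[:, 0])           # in (-pi, pi]
    mags = np.linalg.norm(disp, axis=1)
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins, mags)
    return hist / (hist.sum() + 1e-8)

# a point moving mostly to the right, then briefly upward
traj = np.array([[0, 0], [2, 0], [4, 0], [6, 0], [6, 2]], dtype=float)
print(histogram_of_oriented_displacements(traj))          # mass split ~0.75 / ~0.25
```

Such per-trajectory histograms summarize intra-trajectory dynamics compactly; relationships between tokens from different trajectories would then capture the inter-trajectory structure the abstract describes.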
ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
Sanjoy Kundu
Auburn University
Shanmukha Vellamcheti
Auburn University
Sathyanarayanan N. Aakur
Auburn University
Abstract
Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while efficiently minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0-L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.
RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
Pou-Chun Kung
University of Michigan
Skanda Harisha
University of Michigan
Ram Vasudevan
University of Michigan
Aline Eid
University of Michigan
Katherine A. Skinner
University of Michigan
Abstract
High-fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6× SSIM) and improved geometric reconstruction (−40% RMSE / 1.5× accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at https://umautobots.github.io/radarsplat.
RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
Johannes Künzel
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Anna Hilsmann
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Peter Eisert
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Abstract
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
Thermal Polarimetric Multi-view Stereo
Takahiro Kushida
Ritsumeikan University
Kenichiro Tanaka
Ritsumeikan University
Abstract
This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination and material properties. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using multi-view thermal polarimetric images. Experimental results demonstrate that our approach effectively reconstructs fine details in transparent, translucent, and heterogeneous objects, outperforming existing techniques.
MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection
Donghyeon Kwon
POSTECH
Youngseok Yoon
POSTECH
Hyeongseok Son
Samsung Electronics
Suha Kwak
POSTECH
Abstract
Camera-based 3D object detection has gained attention for its cost-effectiveness, but it generally lags behind LiDAR-based approaches due to its lack of explicit 3D spatial cues. To take the best of both camera- and LiDAR-based detectors, we propose MemDistill, a novel cross-modal knowledge distillation framework for 3D object detection. MemDistill transfers rich 3D knowledge from a LiDAR-based teacher model to a camera-based student model through a dedicated memory unit and a scene-dependent memory retrieval module. To be specific, our framework distills the teacher's 3D knowledge, optimizes the memory to store that knowledge compactly, and learns the retriever that searches the memory to produce 3D features relevant to the input scene, compensating for the missing LiDAR modality. Experiments on the nuScenes dataset demonstrate that MemDistill significantly improves performance of its camera-only baseline, achieving the state of the art in camera-based 3D object detection.
One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images
Byeongjun Kwon
KAIST
Munchurl Kim
KAIST
Abstract
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased depth estimation accuracy and tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches, resulting in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the sparsity of real-world ground-truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be seamlessly integrated into existing depth estimation models. It preserves the performance of original depth estimation models even under grid-based inference on high-resolution images, exhibiting minimal depth discontinuities along patch boundaries. Moreover, our PRO achieves significantly faster inference speed compared to prior patch-based methods.
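The overlap-consistency idea in component (i) above can be illustrated with a simple loss that accumulates per-patch depth predictions into the full frame and penalizes disagreement wherever patches overlap. This is a generic sketch with hypothetical shapes and offsets, not PRO's exact grouped training procedure.

```python
import torch

def overlap_consistency_loss(patch_preds, patch_offsets, full_hw):
    """Penalize disagreement between depth patches where they overlap (illustrative).

    patch_preds:   list of (h, w) depth predictions.
    patch_offsets: list of (top, left) offsets of each patch in the full image.
    full_hw:       (H, W) of the full image.
    Returns the mean per-pixel variance over regions covered by 2+ patches.
    """
    H, W = full_hw
    sum_map, sq_map, cnt = torch.zeros(H, W), torch.zeros(H, W), torch.zeros(H, W)
    for pred, (top, left) in zip(patch_preds, patch_offsets):
        h, w = pred.shape
        sum_map[top:top + h, left:left + w] += pred
        sq_map[top:top + h, left:left + w] += pred ** 2
        cnt[top:top + h, left:left + w] += 1
    overlap = cnt > 1
    if not overlap.any():
        return torch.tensor(0.0)
    mean = sum_map[overlap] / cnt[overlap]
    var = sq_map[overlap] / cnt[overlap] - mean ** 2
    return var.mean()

# two 4x4 patches overlapping by 2 columns in a 4x6 frame
p1, p2 = torch.ones(4, 4), torch.full((4, 4), 2.0)
print(overlap_consistency_loss([p1, p2], [(0, 0), (0, 2)], (4, 6)))   # 0.25 > 0
```

Minimizing such a term during training encourages patch predictions to agree at their seams, which is what suppresses the boundary discontinuities at test time.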
ViLU: Learning Vision-Language Uncertainties for Failure Prediction
Marc Lafon
Conservatoire National des Arts et Métiers, CEDRIC, Paris, France
Yannis Karmim
Conservatoire National des Arts et Métiers, CEDRIC, Paris, France
Julio Silva-Rodríguez
ETS Montreal
Paul Couairon
Sorbonne Université, CNRS, Paris, France
Clément Rambour
Sorbonne Université, CNRS, Paris, France
Raphaël Fournier-Sniehotta
Sorbonne Université, CNRS, Paris, France
Ismail Ben Ayed
ETS Montreal
Jose Dolz
ETS Montreal
Nicolas Thome
Sorbonne Université, CNRS, Paris, France
Abstract
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: ViLU Repository.
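The failure-prediction setup above, training a binary correct-vs-incorrect classifier on frozen embeddings with a weighted BCE loss, can be sketched as follows. The sketch collapses ViLU's cross-attention and image-conditioned text representation into a simple concatenation + MLP, and the class weight is an assumed value; it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class FailurePredictor(nn.Module):
    """Binary failure predictor over frozen VLM embeddings (simplified sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_emb, pred_txt_emb, ctx_txt_emb):
        # concatenation stands in for the cross-attention fusion described above
        return self.mlp(torch.cat([img_emb, pred_txt_emb, ctx_txt_emb], dim=-1)).squeeze(-1)

dim, B = 64, 16
head = FailurePredictor(dim)
img, txt, ctx = (torch.randn(B, dim) for _ in range(3))
is_failure = torch.randint(0, 2, (B,)).float()          # 1 = the VLM prediction was wrong

# weighted BCE: up-weight the (typically rarer) failure class; the weight is an assumption
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(3.0))
loss = loss_fn(head(img, txt, ctx), is_failure)
loss.backward()
print(float(loss))
```

Because only embeddings are consumed, such a head can be trained post hoc without touching the underlying VLM, which is the setting the abstract highlights.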
CAVIS: Context-Aware Video Instance Segmentation
Seunghun Lee
DGIST, Daegu, Korea
Jiwan Seo
DGIST, Daegu, Korea
Kiljoon Han
DGIST, Daegu, Korea
Minwoo Choi
DGIST, Daegu, Korea
Sunghoon Im
DGIST, Daegu, Korea
Abstract
In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we design the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, known for its particularly challenging videos. Project page: this https URL
CF3: Compact and Fast 3D Feature Fields
Abstract
3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Jungdae Lee
Institute of Science Tokyo
Taiki Miyanishi
The University of Tokyo
Shuhei Kurita
National Institute of Informatics
Koya Sakamoto
The University of Tokyo
Daichi Azuma
The University of Tokyo
Yutaka Matsuo
The University of Tokyo
Nakamasa Inoue
Institute of Science Tokyo
Abstract
Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km² across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (Seq2seq, CMA and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Jungho Lee
Yonsei University
Donghyeong Kim
Yonsei University
Dogyoon Lee
Yonsei University
Suhwan Cho
Yonsei University
Minhyeok Lee
Yonsei University
Wonjoon Lee
Yonsei University
Taeoh Kim
NAVER Cloud
Dongyoon Wee
NAVER Cloud
Sangyoun Lee
Yonsei University
Abstract
3D Gaussian Splatting (3DGS) has gained significant attention due to its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, which preserve the shape and size of the object but rely on discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur. Project page is available at https://JhoYonsei.github.io/CoMoGaussian.
Combinative Matching for Geometric Shape Assembly
Nahyuk Lee
POSTECH
Juhong Min
POSTECH
Junhong Lee
POSTECH
Chunghyun Park
POSTECH
Minsu Cho
POSTECH
Abstract
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. Specifically, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape' and ‘opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art.
EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
Yongjin Lee
ThorDrive Co., Ltd
Hyeon-Mun Jeong
ThorDrive Co., Ltd
Yurim Jeon
Seoul National University
Sanghyun Kim
ThorDrive Co., Ltd, Seoul National University
Abstract
Multi-modal sensor fusion in Bird's Eye View (BEV) representation has become the leading approach for 3D object detection. However, existing methods often rely on depth estimators or transformer encoders to transform image features into BEV space, which reduces robustness or introduces significant computational overhead. Moreover, the insufficient geometric guidance in view transformation results in ray-directional misalignments, limiting the effectiveness of BEV representations. To address these challenges, we propose Efficient View Transformation (EVT), a novel 3D object detection framework that constructs a well-structured BEV representation, improving both accuracy and efficiency. Our approach focuses on two key aspects. First, Adaptive Sampling and Adaptive Projection (ASAP), which utilizes LiDAR guidance to generate 3D sampling points and adaptive kernels, enables more effective transformation of image features into BEV space and a refined BEV representation. Second, an improved query-based detection framework, incorporating group-wise mixed query selection and geometry-aware cross-attention, effectively captures both the common properties and the geometric structure of objects in the transformer decoder. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3% NDS with real-time inference speed.
FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
Donghyun Lee
Seoul National University
Dawoon Jeong
Seoul National University
Jae W. Lee
Seoul National University
Hongil Yoon
Google
Abstract
Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
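For context, the "distance trend" exploited above is the sequence of farthest-point distances produced by standard farthest point sampling (FPS), which decreases smoothly as more samples are drawn. The sketch below implements exact FPS and exposes that curve; FastPoint's curve-prediction shortcut itself is not reproduced.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Exact farthest point sampling, also returning the per-iteration sample distances.

    The returned `curve` (max distance to the current sample set at each step) is
    the quantity whose predictable, smoothly decreasing trend FastPoint exploits
    to avoid exhaustive distance updates; this sketch only shows the exact baseline.
    """
    selected = [0]
    dist = np.linalg.norm(points - points[0], axis=1)   # distance to nearest sample so far
    curve = []
    for _ in range(1, n_samples):
        idx = int(np.argmax(dist))
        curve.append(float(dist[idx]))
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected), np.array(curve)

pts = np.random.default_rng(0).random((2048, 3))
idx, curve = farthest_point_sampling(pts, 64)
print(curve[:5], curve[-5:])     # roughly monotone decreasing
```

Predicting the tail of this curve instead of computing it lets a sampler accept points whose distance exceeds the predicted threshold without evaluating all pairwise distances, which is the source of the reported speedup.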
InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
Jungmin Lee
Chung-Ang University
Seonghyuk Hong
National Research Institute of Cultural Heritage
Juyong Lee
Chung-Ang University
Jaeyoon Lee
Chung-Ang University
Jongwon Choi
Chung-Ang University
Abstract
We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and nondestructive testing capabilities across various domains.
Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
Giwon Lee
KAIST
Wooseong Jeong
KAIST
Daehee Park
DGIST
Jaewoo Jeong
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation
Seunghyun Lee
KAIST
Tae-Kyun Kim
KAIST
Abstract
Latest diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. The existing methods, however, suffer from slow convergence during training, learning the encoder with the diffusion denoising network in an end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations through two key components. First, the proposed method pretrains the encoder with the direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed so that the exploration-exploitation trade-off is effectively balanced, eliminating the need for the additional evaluation network. The sampling guidance maintains multimodal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at final steps. Extensive experiments on multiple benchmarks including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.
LOMM: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
Seunghun Lee
DGIST
Jiwan Seo
DGIST
Minwoo Choi
DGIST
Kiljoon Han
DGIST
Jahoon Jeong
DGIST
Zane Durante
Stanford University
Ehsan Adeli
Stanford University
Sang Hyun Park
DGIST
Sunghoon Im
DGIST
Abstract
In this paper, we introduce Latest Object Memory (LOM), a system for robustly tracking and continuously updating the latest states of objects by explicitly modeling their presence across video frames. LOM enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the video segmentation process. Building upon LOM, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation, significantly improving long-term instance tracking. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new state-of-the-art in video instance segmentation. Notably, our LOMM achieves an AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: this https URL
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
Han-Hung Lee
Simon Fraser University
Qinghong Han
Simon Fraser University
Angel X. Chang
Simon Fraser University
Abstract
In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.
PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
Seunggwan Lee
Korea University
Hwanhee Jung
Korea University
Byoungsoo Koh
KOCCA
Qixing Huang
The University of Texas at Austin
Sang Ho Yoon
KAIST
Sangpil Kim
Korea University
Abstract
A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, a flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.
Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection
Jin-Hee Lee
DGIST
Jae-Keun Lee
DGIST
Jeseok Kim
DGIST
Kwon Soon
DGIST
Abstract
To ensure safe autonomous driving in complex urban environments, it is essential not only to develop high-performance object detection models but also to establish a diverse and representative dataset that captures a wide range of urban scenarios and object characteristics. To address these challenges, we introduce a new multi-class 3D LiDAR dataset that comprehensively reflects various urban environments and object types, along with a robust 3D semi-supervised object detection (SSOD) framework. Our SSOD framework leverages a novel multiple teachers model, where similar object classes are grouped and supervised by category-specialized teacher networks. This category-specific collaborative guidance enables the student network to learn more effectively, leading to improved object detection performance. Additionally, we propose the Pseudo-points Generator (PointGen), a simple yet effective technique designed to enhance the generation of high-quality pseudo-labels for the teacher network, mitigating the impact of sparse LiDAR point clouds. Extensive experiments on the Waymo Open Dataset (WOD), KITTI, and our newly introduced dataset validate the effectiveness of both our dataset and SSOD framework. Experimental results demonstrate that our approach consistently outperforms state-of-the-art 3D SSOD methods across all evaluated datasets. To encourage further research in this domain, we will publicly release our multi-class LiDAR dataset and source code on our GitHub repository.
HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
Qinqian Lei
National University of Singapore
Bo Wang
University of Mississippi
Robby T. Tan
National University of Singapore
Abstract
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
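The low-rank factorization described above, splitting per-class VLM text features into class-shared basis features and per-class weights, can be illustrated with a truncated SVD of the class-by-dimension text feature matrix. This is a generic sketch; HOLa additionally adapts the weights per class and regularizes them with LLM-derived action priors, which are not reproduced here.

```python
import numpy as np

def low_rank_text_features(text_feats, rank):
    """Factor per-class text features into shared basis vectors and per-class weights.

    text_feats: (C, D) matrix of VLM text embeddings, one row per HOI class.
    Returns weights (C, rank) and basis (rank, D) such that weights @ basis
    approximates text_feats.
    """
    U, S, Vt = np.linalg.svd(text_feats, full_matrices=False)
    weights = U[:, :rank] * S[:rank]
    basis = Vt[:rank]
    return weights, basis

C, D, rank = 100, 512, 16
feats = np.random.default_rng(0).normal(size=(C, D))
w, b = low_rank_text_features(feats, rank)
recon = w @ b
print(np.linalg.norm(recon - feats) / np.linalg.norm(feats))  # relative error of the rank-16 fit
```

Because the basis is shared across classes, information learned from seen classes is carried over to unseen ones through the same basis, while the per-class weights remain a small, adaptable set of parameters.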
MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps
Jiahui Lei
University of Pennsylvania
Kyle Genova
Google DeepMind
George Kopanas
Google
Noah Snavely
Google
Leonidas Guibas
Google
Abstract
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
Ting Lei
Peking University
Shaofeng Yin
Peking University
Qingchao Chen
Peking University
Yuxin Peng
Peking University
Yang Liu
Peking University
Abstract
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model's ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model's attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.
Occupancy Learning with Spatiotemporal Memory
Ziyang Leng
University of California, Los Angeles
Jiawei Yang
University of Southern California
Wenlong Yi
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Abstract
3D occupancy has become a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%. The code and model are available at https://github.com/matthew-leng/ST-Occ.
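A minimal sketch of such memory conditioning, under our own assumptions about shapes and layer choices (the released ST-Occ module is more elaborate), could use cross-attention from current voxel features to memory tokens:

```python
# Illustrative memory-attention block: current occupancy features attend to a
# scene-level spatiotemporal memory. Dimensions and the residual design are assumptions.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (B, N_voxels, C) features of the current frame
        # memory:  (B, N_mem, C) scene-level spatiotemporal memory tokens
        fused, _ = self.attn(query=current, key=memory, value=memory)
        return self.norm(current + fused)   # residual update of the current representation

B, N, M, C = 2, 1024, 256, 128
out = MemoryAttention(C)(torch.randn(B, N, C), torch.randn(B, M, C))
print(out.shape)   # torch.Size([2, 1024, 128])
```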
4D Gaussian Splatting SLAM
Yanyan Li
Hangzhou Dianzi University
Youxu Fang
Hangzhou Dianzi University
Zunjie Zhu
Hangzhou Dianzi University
Kunyi Li
Technical University of Munich
Yong Ding
Zhejiang University
Federico Tombari
Google
Abstract
Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while the sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighbor images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.
AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
Peizheng Li
Mercedes-Benz AG
Shuxiao Ding
Mercedes-Benz AG
You Zhou
Mercedes-Benz AG
Qingwen Zhang
KTH Royal Institute of Technology
Onat Inak
Mercedes-Benz AG
Larissa Triess
Mercedes-Benz AG
Niklas Hanselmann
Mercedes-Benz AG
Marius Cordts
Mercedes-Benz AG
Andreas Zell
University of Tübingen
Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. Methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, often fails to achieve reliable performance because of inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. Code is available at: https://github.com/EdwardLeeLPZ/AGO.
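A hedged sketch of similarity-based grounding, based only on our reading of the abstract (feature dimensions, temperature, and the pseudo-label source are assumptions, not the AGO release):

```python
# Illustrative grounding loss: 3D voxel embeddings are scored against class-prompt
# text embeddings and supervised with 3D pseudo-labels via cross-entropy.
import torch
import torch.nn.functional as F

def grounding_loss(voxel_emb, text_emb, pseudo_labels, temperature=0.07):
    # voxel_emb: (N, D) 3D embeddings; text_emb: (K, D) class-prompt embeddings
    # pseudo_labels: (N,) int64 class indices derived from VLM 2D predictions
    voxel_emb = F.normalize(voxel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = voxel_emb @ text_emb.t() / temperature   # cosine-similarity logits
    return F.cross_entropy(logits, pseudo_labels)

loss = grounding_loss(torch.randn(4096, 256), torch.randn(17, 256),
                      torch.randint(0, 17, (4096,)))
print(loss.item())
```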
Adversarial Exploitation of Data Diversity Improves Visual Localization
Sihang Li
New York University
Siqi Tan
New York University
Bowen Chang
New York University
Jing Zhang
New York University
Chen Feng
New York University
Yiming Liu
New York University
Abstract
Visual localization, which estimates a camera's pose within a known scene, is a fundamental capability for autonomous systems. While absolute pose regression (APR) methods have shown promise for efficient inference, they often struggle with generalization. Recent approaches attempt to address this through data augmentation with varied viewpoints, yet they overlook a critical factor: appearance diversity. In this work, we identify appearance variation as the key to robust localization. Specifically, we first lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring ability, enabling the synthesis of diverse training data that varies not just in poses but also in environmental conditions such as lighting and weather. To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 33% on indoor datasets, and 38% and 44% on outdoor datasets. Most notably, our method shows remarkable robustness in dynamic driving scenarios under varying weather conditions and in day-to-night scenarios, where previous APR methods fail.
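As an illustration of the adversarial component only (a generic sketch; the paper's discriminator architecture and feature choice are not specified here), a domain discriminator can be trained to separate real from synthetic features while the regression branch is trained to fool it:

```python
# Illustrative syn-to-real adversarial objective with a feature-level discriminator.
# Layer sizes and the choice of features are assumptions for demonstration.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_feat, syn_feat):
    real_logits = disc(real_feat)
    syn_logits = disc(syn_feat.detach())        # do not backprop into the feature branch
    return bce(real_logits, torch.ones_like(real_logits)) + \
           bce(syn_logits, torch.zeros_like(syn_logits))

def generator_adv_loss(syn_feat):
    # encourages synthetic-branch features to look "real" to the discriminator
    syn_logits = disc(syn_feat)
    return bce(syn_logits, torch.ones_like(syn_logits))

print(discriminator_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
print(generator_adv_loss(torch.randn(8, 256)).item())
```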
Amodal Depth Anything: Amodal Depth Estimation in the Wild
Zhenyu Li
KAUST
Mykola Lavreniuk
Space Research Institute NASU-SSAU
Jian Shi
KAUST
Shariq Farooq Bhat
KAUST
Peter Wonka
KAUST
Abstract
Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions (Fig. 1). Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 50.7% improvement in RMSE over the previous SoTA on the ADIW dataset.
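The scale-and-shift idea can be illustrated with a standard least-squares fit (a generic sketch; the paper's exact alignment and blending procedure may differ):

```python
# Illustrative scale-and-shift alignment: solve for s, t minimizing
# || s * pred + t - target ||^2 over valid pixels, then apply them to pred.
import numpy as np

def align_scale_shift(pred: np.ndarray, target: np.ndarray, mask: np.ndarray):
    p, t = pred[mask], target[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    return scale * pred + shift

pred = np.random.rand(64, 64)
target = 2.5 * pred + 0.3
aligned = align_scale_shift(pred, target, np.ones_like(pred, dtype=bool))
print(np.abs(aligned - target).max())                 # ~0 for this synthetic example
```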
Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
Yunhao Li
Institute of Software Chinese Academy of Sciences
Yifan Jiao
Institute of Software Chinese Academy of Sciences
Dan Meng
OPPO Research Institute
Heng Fan
University of North Texas
Libo Zhang
Institute of Software Chinese Academy of Sciences
Abstract
Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential for object tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a Trajectory Consistency Reinforcement (TCR) strategy that benefits tracking performance by improving target identity and category consistency. In addition, we present TraCLIP, a plug-and-play trajectory classification module. It integrates Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE) strategies to fully leverage trajectory information from visual and language perspectives for enhancing the classification results. Extensive experiments on OV-TAO show that our TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. We will release TRACT at https://github.com/Nathan-Li123/TRACT.
Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge
Yanqi Li
Beihang University
Jianwei Niu
Beihang University
Tao Ren
Institute of Software Chinese Academy of Sciences
Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence -- a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
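As a toy illustration of the visual co-occurrence statistics such a framework relies on (not the CODet implementation; category names are made up), one can count how often category pairs appear together across images:

```python
# Illustrative co-occurrence counting from per-image category sets; these counts
# could then be compared against LLM-validated textual dependencies before use.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(image_categories):
    """image_categories: list of sets of category names, one set per image."""
    counts = Counter()
    for cats in image_categories:
        for a, b in combinations(sorted(cats), 2):
            counts[(a, b)] += 1
    return counts

images = [{"person", "surfboard", "sea"}, {"person", "dog"}, {"person", "surfboard"}]
print(cooccurrence_counts(images)[("person", "surfboard")])   # 2
```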
Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li
Beijing Normal University
Xiaoxiao Wang
University of Chinese Academy of Sciences
Meiling Li
Fudan University
Boming Miao
Beijing Normal University
Peng Sun
Central University of Finance and Economics
Yunjian Zhang
Tsinghua University
Xiangyang Ji
Tsinghua University
Yao Zhu
Tsinghua University
Abstract
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization - RRDataset encompasses high-quality images from seven major scenarios (War & Conflict, Disasters & Accidents, Political & Social Events, Medical & Public Health, Culture & Religion, Labor & Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness - examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness - assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms. Our dataset is publicly available at: https://zenodo.org/records/14963880.
Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
Lei-Lei Li
Xi'an Jiaotong University
Jianwu Fang
National University of Singapore
Junbin Xiao
National University of Singapore
Shanmin Pang
Xi'an Jiaotong University
Hongkai Yu
Cleveland State University
Chen Lv
Nanyang Technological University
Jianru Xue
Xi'an Jiaotong University
Tat-Seng Chua
National University of Singapore
Abstract
Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate capability testing for responding to accidents that are too costly to reproduce in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model Causal-VidSyn for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.
CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
Jinming Li
Shanghai University
Yichen Zhu
Midea Group
Zhibin Tang
unknown
Junjie Wen
East China Normal University
Minjie Zhu
East China Normal University
Xiaoyu Liu
Shanghai University
Chengmeng Li
Shanghai University
Ran Cheng
Midea Group
Yaxin Peng
Shanghai University
Yan Peng
Shanghai University
Feifei Feng
Midea Group
Abstract
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI's recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) object affordance - what object to manipulate and where it is; (2) grasp affordance - the specific object part to grasp; (3) spatial affordance - the optimal space to place the object; and (4) movement affordance - the collision-free path for movement. We further transform each affordance into two prompting formats: visual affordance and textual affordance. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision and robustness. Our experiments demonstrate that CoA-VLA outperforms state-of-the-art robot foundation models, including OpenVLA and Octo, on a variety of tasks. Furthermore, CoA-VLA exhibits strong generalization capabilities, including recognizing unseen object poses, identifying free space, and avoiding obstacles in novel environments.
Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios
Deng Li
Tianjin University
Aming Wu
Hefei University of Technology
Yang Li
Tianjin University
Yaowei Wang
Peng Cheng Laboratory
Yahong Han
Tianjin University
Abstract
In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process into specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness of our approach. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.
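For context, a minimal LoRA-style layer is sketched below (an assumption for illustration: the paper's dual-path domain-aware adapter and its diffusion-based parameter generator are considerably more involved; this only shows the kind of low-rank parameters such a generator would have to synthesize):

```python
# Illustrative LoRA-style linear layer: frozen base weight plus a trainable
# low-rank update, y = Wx + (B A x) * scaling.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                              # pretrained layer, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(4, 256)).shape)               # torch.Size([4, 256])
```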
DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
Ruining Li
University of Oxford
Chuanxia Zheng
University of Oxford
Christian Rupprecht
University of Oxford
Andrea Vedaldi
University of Oxford
Abstract
Most 3D object generators prioritize aesthetic quality, often neglecting the physical constraints necessary for practical applications. One such constraint is that a 3D object should be self-supporting, i.e., remain balanced under gravity. Previous approaches to generating stable 3D objects relied on differentiable physics simulators to optimize geometry at test time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models with external feedback, we propose Direct Simulation Optimization (DSO). This framework leverages feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator directly outputs stable 3D objects. We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective we introduce to align diffusion models without requiring pairwise preferences. Our experiments demonstrate that the fine-tuned feed-forward generator, using either the DPO or DRO objective, is significantly faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework functions even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
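A hedged sketch of the DPO branch of this idea (illustrative only; the log-probability terms are placeholders, and DRO, the paper's pairwise-free objective, is not reproduced here):

```python
# Illustrative DPO-style loss driven by simulator stability scores: the "win"
# sample is the generation the physics simulator rated as more stable.
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # logp_*: summed log-probabilities of the winning/losing samples under the
    # fine-tuned model; ref_logp_*: the same under the frozen reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.randn(16), torch.randn(16), torch.randn(16), torch.randn(16))
print(loss.item())
```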
EDM: Efficient Deep Feature Matching
Xi Li
Realsee
Tong Rao
Realsee
Cihui Pan
Realsee
Abstract
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
ETH Zürich
Yutong Chen
ETH Zürich
Yiqian Wu
ETH Zürich
Kaifeng Zhao
ETH Zürich
Marc Pollefeys
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Yingyan Li
Chinese Academy of Sciences
Yuqi Wang
unknown
Yang Liu
unknown
Jiawei He
unknown
Lue Fan
unknown
Zhaoxiang Zhang
unknown
Abstract
End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.
Estimating 2D Camera Motion with Hybrid Motion Basis
Haipeng Li
University of Electronic Science and Technology of China
Tianhao Zhou
University of Electronic Science and Technology of China
Zhanglei Yang
University of Electronic Science and Technology of China
Yi Wu
Xiaomi Corporation
Yan Chen
Xiaomi Corporation
Zijing Mao
Xiaomi Corporation
Shen Cheng
Dexmal
Bing Zeng
University of Electronic Science and Technology of China
Shuaicheng Liu
University of Electronic Science and Technology of China
Abstract
Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
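One plausible instantiation of a Laplace-based probabilistic loss is the Laplace negative log-likelihood sketched below (our assumption for illustration; the released CamFlow loss may be parameterized differently):

```python
# Illustrative Laplace negative log-likelihood for motion regression:
# NLL = |pred - target| / b + log(2b), with a predicted log-scale per pixel.
import torch

def laplace_nll(pred: torch.Tensor, log_b: torch.Tensor, target: torch.Tensor):
    # pred, target: predicted / ground-truth motion fields; log_b: predicted log-scale
    b = log_b.exp()
    return (torch.abs(pred - target) / b + log_b + torch.log(torch.tensor(2.0))).mean()

pred = torch.randn(2, 2, 64, 64)
log_b = torch.zeros(2, 2, 64, 64)
target = torch.randn(2, 2, 64, 64)
print(laplace_nll(pred, log_b, target).item())
```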
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li
University of Science and Technology of China
Meng Tian
Huawei Noah's Ark Lab
Zhenyu Lin
Huawei Noah's Ark Lab
Jiangtong Zhu
Huawei Noah's Ark Lab
Dechang Zhu
Huawei Noah's Ark Lab
Haiqiang Liu
Huawei Noah's Ark Lab
Yueyi Zhang
University of Science and Technology of China
Zhiwei Xiong
University of Science and Technology of China
Xinhai Zhao
Huawei Noah's Ark Lab
Abstract
Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained benchmark featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems. The benchmark is available at https://github.com/Depth2World/VLADBench.
Future-Aware Interaction Network For Motion Forecasting
Shijie Li
I2R, A*STAR
Chunyu Liu
CEPRI
Xun Xu
I2R, A*STAR
Si Yong Yeo
LKCMedicine, NTU
Xulei Yang
I2R, A*STAR
Abstract
Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code is available here.
GARF: Learning Generalizable 3D Reassembly for Real-World Fractures
Sihang Li
New York University
Zeyu Jiang
New York University
Grace Chen
New York University
Chenyang Xu
New York University
Siqi Tan
New York University
Xue Wang
New York University
Irving Fang
New York University
Kristof Zyskowski
Yale University
Shannon P. McPherron
Max Planck Institute
Radu Iovita
New York University
Chen Feng
New York University
Jing Zhang
New York University
Abstract
3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce two-session flow matching, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate FRACTURA, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy on the Breaking Bad Everyday dataset. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types. GARF's code, data and demo are available at https://ai4ce.github.io/GARF/.
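To illustrate the flow-matching ingredient in isolation (a simplified sketch under our own assumptions: poses are treated as flat 9-D vectors, and GARF's fracture-aware encoder, SE(3) handling, and two-session inference are omitted):

```python
# Illustrative conditional flow-matching training step: regress the constant
# velocity of a straight path from a noise sample toward the target pose vector.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(9 + 1, 128), nn.ReLU(), nn.Linear(128, 9))

def flow_matching_step(x1: torch.Tensor) -> torch.Tensor:
    """x1: (B, 9) target pose parameters (e.g., 6D rotation + translation)."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1)                  # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                    # point on the interpolation path
    v_target = x1 - x0                              # constant target velocity
    v_pred = velocity_net(torch.cat([xt, t], dim=1))
    return ((v_pred - v_target) ** 2).mean()

print(flow_matching_step(torch.randn(32, 9)).item())
```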
GENMO: A GENeralist Model for Human MOtion
Jiefeng Li
NVIDIA
Jinkun Cao
NVIDIA
Haotian Zhang
NVIDIA
Davis Rempe
NVIDIA
Jan Kautz
NVIDIA
Umar Iqbal
NVIDIA
Ye Yuan
NVIDIA
Abstract
Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.
GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences
Hanlin Li
University of Science and Technology of China
Wenming Weng
University of Science and Technology of China
Yueyi Zhang
MiroMind
Zhiwei Xiong
University of Science and Technology of China
Abstract
Scene flow provides the fundamental information of the scene dynamics. Existing scene flow estimation methods typically rely on the correlation between only a consecutive point cloud pair, which makes them limited to the instantaneous state of the scene and face challenges in real-world scenarios with factors like occlusion, noise, and diverse motion of background and foreground. In this paper, we study the joint sequential scene flow estimation and future scene flow prediction on point cloud sequences. The expanded sequential input introduces long-term and high-order motion information. We propose GenFlow3D, a recurrent neural network model which integrates diffusion in the decoder to better incorporate the two tasks and enhance the ability to extract general motion patterns. A transformer-based denoising network is adopted to help capture useful information. Depending on the input point clouds, discriminative condition signals are generated to guide the diffusion decoder to switch among different modes specific for scene flow estimation and prediction in a multi-scale manner. GenFlow3D is evaluated on the real-world datasets nuScenes and Argoverse 2, and demonstrates superior performance compared with the existing methods. Our code is available at https://github.com/ustc-hlli/GenFlow3D.
Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching
Zhaoyang Li
University of Science and Technology of China
Yuan Wang
University of Science and Technology of China
Guoxin Xiong
University of Science and Technology of China
Wangkai Li
University of Science and Technology of China
Yuwen Pan
University of Science and Technology of China
Tianzhu Zhang
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Abstract
Generalized few-shot point cloud segmentation (GFS-3DSeg) aims to segment objects of both base and novel classes using abundant base class samples and limited novel class samples. Existing GFS-3DSeg methods encounter bottlenecks due to the scarcity of novel class data and inter-class confusion. In this paper, we propose the LLM-Assisted Hyper-Relation Matching (LARM) framework, which leverages the wealth of prior knowledge in Large Language Models (LLM) to enrich novel category prototypes and introduces a hyper-relation matching strategy to mitigate false matches between point features and category prototypes caused by inter-class confusion. The proposed LARM enjoys several merits. First, the vast knowledge embedded in LLM can be an effective complement to vanilla category prototypes, enabling them to exhibit greater robustness. Second, the hyper-relation matching strategy harnesses the structural information implicit in the inter-class relationships, making it more robust than individual feature comparisons. Extensive experiments on two benchmarks demonstrate that LARM outperforms previous state-of-the-art methods by large margins.
Global-Aware Monocular Semantic Scene Completion with State Space Models
Shijie Li
I2R, A*STAR
Zhongyao Cheng
I2R, A*STAR
Rong Li
HKUST(GZ)
Shuai Li
University of Bonn
Juergen Gall
University of Bonn
Xun Xu
I2R, A*STAR
Xulei Yang
I2R, A*STAR
Abstract
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code is available here.
Global Regulation and Excitation via Attention Tuning for Stereo Matching
Jiahao LI
City University of Hong Kong
Xinhong Chen
City University of Hong Kong
Zhengmin JIANG
City University of Hong Kong
Qian Zhou
City University of Hong Kong
Yung-Hui Li
Hon Hai Research Institute
Jianping Wang
City University of Hong Kong
Abstract
Stereo matching has achieved significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and ranks second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.
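A minimal sketch of attention restricted to epipolar lines for rectified stereo, reflecting one reading of the Matching Attention idea (shapes, scaling, and the absence of positional encoding are assumptions; the released GREAT modules differ):

```python
# Illustrative attention along epipolar lines: for rectified pairs, each left-image
# pixel attends only to right-image positions in the same row.
import torch
import torch.nn.functional as F

def matching_attention(feat_left, feat_right):
    # feat_left, feat_right: (B, C, H, W) feature maps of the left/right images
    B, C, H, W = feat_left.shape
    q = feat_left.permute(0, 2, 3, 1)                  # (B, H, W, C)
    k = v = feat_right.permute(0, 2, 3, 1)             # (B, H, W, C)
    scores = torch.einsum("bhwc,bhvc->bhwv", q, k) / C ** 0.5
    attn = F.softmax(scores, dim=-1)                   # softmax over right-image columns
    out = torch.einsum("bhwv,bhvc->bhwc", attn, v)
    return out.permute(0, 3, 1, 2)                     # (B, C, H, W) context features

print(matching_attention(torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)).shape)
```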
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Zhenxin Li
Fudan University
Shihao Wang
The Hong Kong Polytechnic University
Shiyi Lan
NVIDIA
Zhiding Yu
NVIDIA
Zuxuan Wu
Fudan University
Jose M. Alvarez
NVIDIA
Abstract
End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment; they struggle to react quickly to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving. Code will be available at https://github.com/woxihuanjiangguo/Hydra-NeXt.
IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Chen Li
Institute of High-Performance Computing, Agency for Science, Technology and Research
Chinthani Sugandhika
Nanyang Technological University
Yeo Keat Ee
Institute of High-Performance Computing, Agency for Science, Technology and Research
Eric Peh
Institute of High-Performance Computing, Agency for Science, Technology and Research
Hao Zhang
Institute of High-Performance Computing, Agency for Science, Technology and Research
Hong Yang
Institute of High-Performance Computing, Agency for Science, Technology and Research
Deepu Rajan
Nanyang Technological University
Basura Fernando
Institute of High-Performance Computing, Agency for Science, Technology and Research
Abstract
Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at https://github.com/LUNAProject22/IMoRe.
Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories
Yicong Li
National University of Singapore
Yiyang Chen
National University of Singapore
Zhenyuan Ma
National University of Singapore
Junbin Xiao
National University of Singapore
Xiang Wang
University of Science and Technology of China
Angela Yao
University of Science and Technology of China
Abstract
Language-guided Affordance Segmentation (LASO) aims to identify actionable object regions based on text instructions. At the core of its practicality is learning generalizable affordance knowledge that captures functional regions across diverse objects. However, current LASO solutions struggle to extend learned affordances to object categories that are not encountered during training. Scrutinizing these designs, we identify limited generalizability on unseen categories, stemming from (1) underutilized generalizable patterns in the intermediate layers of both 3D and text backbones, which impedes the formation of robust affordance knowledge, and (2) the inability to handle substantial variability in affordance regions across object categories due to a lack of structural knowledge of the target region. Towards this, we introduce a GeneraLized frAmework on uNseen CategoriEs (GLANCE), incorporating two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. Extensive experiments on two benchmark datasets demonstrate that GLANCE outperforms state-of-the-art methods (SoTAs), with notable improvements in generalization to unseen categories. Our code is available at https://github.com/Monoxide-Chen/Affordance.
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li
AI Research
Chunyu Xie
Beihang University
Ji Ao
AI Research
Dawei Leng
AI Research
Yuhui Yin
AI Research
Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.
Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking
Guangyao Li
Xiamen University
Siping Zhuang
Xiamen University
Yajun Jian
Xiamen University
Yan Yan
Xiamen University
Hanzi Wang
Xiamen University
Abstract
Referring multi-object tracking (RMOT) aims to detect and track specific objects based on natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, often failing to exploit fine-grained linguistic cues that are crucial for distinguishing objects with similar characteristics. Notably, these cues play distinct roles at different tracking stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose DKGTrack, a novel RMOT method that enhances language comprehension for precise object tracking by decoupling language expressions into localized descriptions and motion states. To improve the accuracy of language-guided object identification, we introduce a Static Semantic Enhancement (SSE) module, which enhances region-level vision-language alignment through hierarchical cross-modal feature interaction, providing more discriminative object representations for tracking. Furthermore, we propose a Motion Perception Alignment (MPA) module that explicitly aligns object queries with motion descriptions, enabling accurate object trajectory prediction across frames. Experimental results on multiple RMOT benchmarks demonstrate the effectiveness of our method, which achieves competitive performance in challenging tracking scenarios. The code is available at https://github.com/acyddl/DKGTrack.
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Gen Li
University of Edinburgh
Nikolaos Tsagkas
University of Edinburgh
Jifei Song
Huawei Noah's Ark Lab
Ruaridh Mon-Williams
University of Edinburgh
Sethu Vijayakumar
University of Edinburgh
Kun Shao
Huawei Noah's Ark Lab
Laura Sevilla-Lara
University of Edinburgh
Abstract
Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves 77.1% successful grasping among 179 trials, including evaluations on seen, unseen classes, and cluttered scenes. Project page: https://reagan1311.github.io/affgrasp.
M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
Yan Li
School of Systems Science and Engineering, Sun Yat-sen University
Yang Xu
Tianjin University
Changhao Chen
The Hong Kong University of Science and Technology (Guangzhou)
Zhongchen Shi
Defense Innovation Institute, Academy of Military Sciences (AMS)
Wei Chen
Defense Innovation Institute, Academy of Military Sciences (AMS)
Liang Xie
Defense Innovation Institute, Academy of Military Sciences (AMS)
Hongbo Chen
School of Systems Science and Engineering, Sun Yat-sen University
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences (AMS)
Abstract
Inertial tracking (IT), independent of the environment and external infrastructure, has long been the ideal solution for providing location services to humans. Despite significant strides in inertial tracking empowered by deep learning, prevailing neural inertial tracking predominantly utilizes conventional spatial-temporal features from inertial measurements. Unfortunately, the frequency domain dimension is usually overlooked in the current literature. To this end, in this paper, we propose a Multi-Domain Mixture of Experts model for Neural Inertial Tracking, named M2EIT. Specifically, M2EIT first leverages ResNet as a spatial decomposition expert to capture spatial relationships between multivariate time series, and State Space Model (SSM)-based Bi-Mamba as the other expert, which focuses on learning temporal correlations. In the frequency domain, we then introduce a Wavelet-based frequency decomposition expert, which decomposes IMU samples into low-frequency and high-frequency bands using the Haar wavelet transform to simulate motion patterns at different temporal scales. To bridge the semantic gap across multiple domains and integrate them adaptively, we design the Multi-Representation Alignment Router (MAR), which consists of a dual cross-domain translation layer, followed by a dynamic router, to achieve multi-domain semantic alignment and optimize expert contributions. Extensive experiments conducted on three real-world datasets demonstrate that the proposed M2EIT can achieve SOTA results in neural inertial tracking.
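The Haar split underlying the frequency expert can be illustrated in a few lines (a generic single-level decomposition; M2EIT's wavelet expert and its routing mechanism are not reproduced):

```python
# Illustrative single-level Haar wavelet split of an IMU channel into
# low-frequency (approximation) and high-frequency (detail) bands.
import numpy as np

def haar_decompose(x: np.ndarray):
    """x: (T,) signal with even length T; returns (low, high), each of length T/2."""
    x = x.reshape(-1, 2)
    low = (x[:, 0] + x[:, 1]) / np.sqrt(2.0)    # approximation coefficients
    high = (x[:, 0] - x[:, 1]) / np.sqrt(2.0)   # detail coefficients
    return low, high

t = np.linspace(0, 1, 200)
accel = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(200)   # toy IMU channel
low, high = haar_decompose(accel)
print(low.shape, high.shape)                    # (100,) (100,)
```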
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Zhen Xing
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Rui Wang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Hui Zhang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Qi Dai
Microsoft Research Asia
Zuxuan Wu
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Abstract
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics.
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
Zhuo Li
WeChat, Tencent Inc
Mingshuang Luo
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Ruibing Hou
Peng Cheng Laboratory
Xin Zhao
MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Hao Liu
WeChat, Tencent Inc
Hong Chang
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Zimo Liu
Peng Cheng Laboratory
Chen Li
WeChat, Tencent Inc
Abstract
Human motion generation has been widely studied due to its crucial role in areas such as digital humans and humanoid robot control. However, many current motion generation approaches disregard physics constraints, frequently resulting in physically implausible motions with pronounced artifacts such as floating and foot sliding. Meanwhile, training an effective motion physics optimizer with noisy motion data remains largely unexplored. In this paper, we propose Morph, a Motion-Free physics optimization framework, consisting of a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on expensive real-world motion data. Specifically, the motion generator is responsible for providing large-scale synthetic, noisy motion data, while the motion physics refinement module utilizes these synthetic data to learn a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. Additionally, we introduce a prior reward module to enhance the stability of the physics optimization process and generate smoother and more stable motions. These physically refined motions are then used to fine-tune the motion generator, further enhancing its capability. This collaborative training paradigm enables mutual enhancement between the motion generator and the motion physics refinement module, significantly improving practicality and robustness in real-world applications. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion quality while improving physical plausibility drastically. Figure 1 shows examples of physical inconsistencies in generated motions: ground penetration, leaning backward, interpenetration, foot sliding, floating, and unnatural rotation.
MultiModal Action Conditioned Video Simulation
Yichen Li
MIT CSAIL
Antonio Torralba
MIT CSAIL
Abstract
Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations
Rongqing Li
Beijing Institute of Technology
Changsheng Li
Beijing Institute of Technology
Ruilin Lv
Beijing Institute of Technology
Yuhang Li
Beijing Institute of Technology
Yang Gao
Meituan
Xiaolu Zhang
Ant Group
JUN ZHOU
Ant Group
Abstract
Trajectory prediction aims to forecast an agent's future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations, resulting in the collapse of existing approaches. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose NATRA, a Noise-Agnostic framework capable of tackling the problem of TRAjectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. It optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Considering that the trajectory structure may be destroyed by optimizing mutual information alone, we introduce an additional reconstruction loss to preserve the structural information of the produced observed trajectories. Moreover, we propose a ranking loss to further enhance performance. Because NATRA does not rely on any specific module tailored to particular noise distributions, it can handle arbitrary types of noise in principle. Additionally, our proposed NATRA can be easily integrated into existing trajectory prediction models. Extensive experiments on both synthetic and real-world noisy datasets demonstrate the effectiveness of our method.
PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection
Xiao Li
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Yiming Zhu
University of Science and Technology Beijing
Yifan Huang
University of Science and Technology Beijing
Wei Zhang
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Yingzhe He
Huawei Technologies
Jie Shi
Huawei Technologies
Xiaolin Hu
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Abstract
Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in ℓ∞ attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. Notably, it improved the detection accuracy by 29.7% over previous defense methods under one recent adversarial texture attack. Code is available at https://github.com/LixiaoTHU/oddefense-PatchAT
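The key idea is a composite perturbation: a small, gradient-guided adversarial patch combined with an imperceptible image-wide perturbation. Below is a rough PyTorch sketch of one such composite step; the patch-placement rule (highest gradient energy), the step sizes, and the toy stand-in "detector" are illustrative assumptions rather than the authors' exact recipe.

import torch
import torch.nn.functional as F

def composite_adv_step(model, images, targets, loss_fn,
                       eps_global=2/255, alpha=1/255, patch_size=32):
    """One composite perturbation step: an imperceptible global signed-gradient
    update plus a large-magnitude update restricted to a small patch region."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), targets)
    grad = torch.autograd.grad(loss, images)[0]

    # (a) global perturbation, kept within a small budget eps_global
    global_delta = (alpha * grad.sign()).clamp(-eps_global, eps_global)

    # (b) patch location: block with the largest accumulated gradient magnitude
    saliency = grad.abs().sum(dim=1, keepdim=True)                # B x 1 x H x W
    pooled = F.avg_pool2d(saliency, patch_size, stride=patch_size)
    B, _, Hp, Wp = pooled.shape
    idx = pooled.view(B, -1).argmax(dim=1)
    ys, xs = (idx // Wp) * patch_size, (idx % Wp) * patch_size

    adv = (images + global_delta).detach()
    for b in range(B):
        y, x = ys[b].item(), xs[b].item()
        # unconstrained (large-step) update inside the patch only
        adv[b, :, y:y+patch_size, x:x+patch_size] += \
            8/255 * grad[b, :, y:y+patch_size, x:x+patch_size].sign()
    return adv.clamp(0, 1)

# toy usage with a stand-in "detector" (a single conv) and MSE loss
model = torch.nn.Conv2d(3, 1, 3, padding=1)
imgs = torch.rand(2, 3, 128, 128)
tgts = torch.rand(2, 1, 128, 128)
adv = composite_adv_step(model, imgs, tgts, F.mse_loss)
print(adv.shape)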
PointGAC: Geometric-Aware Codebook for Masked Point Modeling
Abiao Li
Jiangxi University of Finance and Economics
Chenlei Lv
Shenzhen University
Yuming Fang
Jiangxi University of Finance and Economics
Yifan Zuo
Jiangxi University of Finance and Economics
Jian Zhang
University of Technology Sydney
Guofeng Mei
Fondazione Bruno Kessler
Abstract
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinates or features of masked regions. However, they tend to overconstrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC.
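The online k-means codebook is the piece that turns regression into cluster assignment. A minimal sketch of such an EMA-style online codebook is given below, assuming a hypothetical OnlineCodebook class; the momentum update and codebook size are generic choices, not the paper's exact maintenance mechanism.

import torch

class OnlineCodebook:
    """EMA-style online k-means codebook: teacher features pull their nearest
    codebook vectors toward the batch mean of the assigned cluster."""
    def __init__(self, num_codes=512, dim=256, momentum=0.99):
        self.codes = torch.randn(num_codes, dim)
        self.m = momentum

    @torch.no_grad()
    def assign(self, feats):                      # feats: N x dim
        d = torch.cdist(feats, self.codes)        # N x K distances
        return d.argmin(dim=1)                    # hard cluster ids

    @torch.no_grad()
    def update(self, feats):
        ids = self.assign(feats)
        for k in ids.unique():
            mean_k = feats[ids == k].mean(dim=0)
            self.codes[k] = self.m * self.codes[k] + (1 - self.m) * mean_k
        return ids

codebook = OnlineCodebook()
teacher_feats = torch.randn(1024, 256)      # features of unmasked patches
targets = codebook.update(teacher_feats)    # cluster ids the student should match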
Proactive Scene Decomposition and Reconstruction
Baicheng Li
School of Intelligence Science and Technology, Peking University
Zike Yan
AIR, Tsinghua University
Dong Wu
School of Intelligence Science and Technology, Peking University
Hongbin Zha
School of Intelligence Science and Technology, Peking University
Abstract
Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
Teng Li
College of Computer Science & Technology, Zhejiang University
Guangcong Zheng
College of Computer Science & Technology, Zhejiang University
Rui Jiang
College of Computer Science & Technology, Zhejiang University
Shuigen Zhan
College of Computer Science & Technology, Zhejiang University
Tao Wu
College of Computer Science & Technology, Zhejiang University
Yehao Lu
College of Computer Science & Technology, Zhejiang University
Yining Lin
Supremind
Chuanyun Deng
Central Media Technology Institute, 2012 Lab, Huawei
Yepan Xiong
Central Media Technology Institute, 2012 Lab, Huawei
Min Chen
Central Media Technology Institute, 2012 Lab, Huawei
Lin Cheng
Central Media Technology Institute, 2012 Lab, Huawei
Xi Li
College of Computer Science & Technology, Zhejiang University
Abstract
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scales, ensuring compatibility and scale consistency across diverse real-world images. In inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and also allows the framework to maintain dynamic and coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and out-of-domain images. We further enable applications like camera-controlled looping video generation and generative frame interpolation. Project page: zgctroy.github.io/RealCam-I2V.
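One way to ground the relative-to-metric scaling step is to align depths of the same pixels under both scales and take a robust ratio. The sketch below is only an illustration under that assumption (median-ratio alignment); the paper's actual alignment procedure may differ.

import numpy as np

def metric_scale_factor(relative_depths, metric_depths, eps=1e-6):
    """Estimate a single global scale mapping relative-scale geometry to metric
    scale via the median ratio between a monocular metric depth map and depths
    of the same pixels in the relative reconstruction (illustrative only)."""
    ratios = metric_depths / np.maximum(relative_depths, eps)
    return float(np.median(ratios))

# toy usage: 500 sampled pixels with both a relative and a metric depth value
rel = np.random.uniform(0.5, 2.0, 500)                 # unit-less relative depths
met = 3.2 * rel + np.random.normal(0, 0.05, 500)       # "metric" depths in meters
s = metric_scale_factor(rel, met)
relative_translations = np.random.randn(10, 3)         # stand-in relative camera poses
metric_translations = s * relative_translations        # rescaled to metric units
print(round(s, 2))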
Robust Low-light Scene Restoration via Illumination Transition
Ze Li
The Hong Kong University of Science and Technology, Hong Kong SAR
Feng Zhang
Nanjing University of Posts and Telecommunications, Nanjing, China
Xiatian Zhu
University of Surrey, Guildford, United Kingdom
Meng Zhang
The Hong Kong University of Science and Technology, Hong Kong SAR
Yanghong Zhou
The Hong Kong Polytechnic University, Hong Kong SAR
P. Y. Mok
The Hong Kong University of Science and Technology, Hong Kong SAR
Abstract
Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illumination transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illumination transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at https://pegasus2004.github.io/RoSe.
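A common way to impose a low-rank constraint on a learned field is to penalize its nuclear norm (the sum of singular values). The PyTorch sketch below shows this generic surrogate; the tensor shape and weight are assumptions, and the paper's low-rank denoising module is presumably more specific.

import torch

def low_rank_penalty(transition, rank_weight=1e-3):
    """Encourage a low-rank illumination transition representation by
    penalizing its nuclear norm (generic low-rank surrogate, illustrative)."""
    mat = transition.flatten(1)            # e.g. (num_rays, feature_dim)
    singular_values = torch.linalg.svdvals(mat)
    return rank_weight * singular_values.sum()

# toy usage: a batch of per-ray transition features
transition = torch.randn(2048, 16, requires_grad=True)
loss = low_rank_penalty(transition)
loss.backward()   # gradients flow back into the transition representation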
SAS: Segment Any 3D Scene with Integrated 2D Priors
Zhuoyuan Li
University of Science and Technology of China
Jiahao Lu
University of Science and Technology of China
Jiacheng Deng
University of Science and Technology of China
Hanzhi Chang
University of Science and Technology of China
Lifan Wu
University of Science and Technology of China
Yanzhe Liang
University of Science and Technology of China
Tianzhu Zhang
Deep Space Exploration Laboratory
Abstract
The open-vocabulary capability of 3D models is increasingly valued, as traditional methods trained on fixed categories fail to recognize unseen objects in complex, dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open-vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify each 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open-vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
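The fusion step can be pictured as weighting each 2D model's per-point features by its per-category capability. The sketch below is a simplified, hypothetical fusion rule built on that idea; the function name, shapes, and softmax weighting are assumptions, not the paper's formulation.

import torch

def capability_weighted_fusion(point_feats, capabilities, class_probs):
    """Fuse per-point features from several 2D models, weighting each model by
    its capability for the classes each point is likely to belong to.
    point_feats: M x N x D, capabilities: M x C, class_probs: N x C."""
    weights = class_probs @ capabilities.t()          # N x M raw model scores
    weights = torch.softmax(weights, dim=1).t()       # M x N normalized weights
    return (weights.unsqueeze(-1) * point_feats).sum(dim=0)   # N x D fused features

feats = torch.randn(2, 10000, 512)      # two 2D models, 10k projected points
caps = torch.rand(2, 20)                # per-model capability over 20 classes
probs = torch.softmax(torch.randn(10000, 20), dim=1)
fused = capability_weighted_fusion(feats, caps, probs)
print(fused.shape)                      # torch.Size([10000, 512])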
SD2Actor: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation
Jiayi Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Language-conditioned robot manipulation in the continuous spectrum presents a persistent challenge due to the difficulty of mapping states to target actions. Previous methods face limitations in effectively modeling object states, primarily due to their reliance on executing ambiguous instructions devoid of explicit state information. In response, we present SD2Actor, a zero-shot robotic manipulation framework that possesses the capability to generate precise actions in continuous states. Specifically, given novel instructions, we aim to generate instruction-following and accurate robot manipulation actions. Instead of time-consuming optimization and finetuning, our zero-shot method generalizes to any object state with a wide range of translations and versatile rotations. At its core, we quantify multiple base states in the training set and utilize their combination to refine the target action generated by the diffusion model. To obtain novel state representations, we initially employ LLMs to extract the novel state from the instruction and decompose it into multiple learned base states. We then employ the linear combination of base state embeddings to produce novel state features. Moreover, we introduce an orthogonalization loss to constrain the state embedding space, which ensures the validity of linear interpolation. Experiments demonstrate that SD2Actor outperforms state-of-the-art methods across a diverse range of manipulation tasks in the ARNOLD Benchmark. Moreover, SD2Actor can effectively learn generalizable policies from a limited number of human demonstrations, achieving promising accuracy in a variety of real-world manipulation tasks.
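Two ingredients of this pipeline lend themselves to a short sketch: composing a novel state as a linear combination of base state embeddings, and an orthogonalization penalty that keeps those bases well separated. The code below is a generic illustration under those assumptions; the number of base states, dimensionality, and exact loss form are not taken from the paper.

import torch

def orthogonality_loss(base_embeddings):
    """Penalize off-diagonal correlations between base state embeddings so that
    linear combinations of them remain well behaved (a common recipe; the
    paper's exact orthogonalization loss may differ)."""
    E = torch.nn.functional.normalize(base_embeddings, dim=1)   # K x D
    gram = E @ E.t()                                            # K x K
    off_diag = gram - torch.eye(E.size(0))
    return (off_diag ** 2).mean()

# novel state as a weighted combination of learned base states
base = torch.randn(8, 64, requires_grad=True)     # 8 base states, 64-d each
weights = torch.softmax(torch.randn(8), dim=0)    # e.g. produced from an LLM parse
novel_state = weights @ base                      # 64-d novel state embedding
loss = orthogonality_loss(base)
loss.backward()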
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li
University of Amsterdam
Qi Ma
ETH Zürich
Runyi Yang
INSAIT, Sofia University 'St. Kliment Ohridski'
Huapeng Li
ETH Zürich
Mengjiao Ma
Nanjing University of Aeronautics and Astronautics
Bin Ren
INSAIT, Sofia University 'St. Kliment Ohridski'
Nikola Popovic
INSAIT, Sofia University 'St. Kliment Ohridski'
Nicu Sebe
University of Trento
Ender Konukoglu
ETH Zürich
Theo Gevers
University of Amsterdam
Martin R. Oswald
University of Amsterdam
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines. Our code, model, and datasets will be released at SceneSplat.
ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
Abstract
Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI's superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
Jinxi Li
LAR Group, The Hong Kong Polytechnic University
Ziyang Song
LAR Group, The Hong Kong Polytechnic University
Bo Yang
LAR Group, The Hong Kong Polytechnic University
Abstract
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters. Our datasets and code are available at https://github.com/vLAR-group/TRACE.
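The segmentation-by-clustering property mentioned at the end is easy to picture: particles moving under the same rigid motion share similar learned parameters, so off-the-shelf clustering separates objects or parts. A minimal sketch, assuming a hypothetical 7-dimensional per-particle parameter vector and a known number of clusters:

import numpy as np
from sklearn.cluster import KMeans

# hypothetical per-particle physical parameters learned by the model,
# e.g. translational velocity (3), angular velocity (3), and a damping term (1)
num_particles = 5000
phys_params = np.random.randn(num_particles, 7)

# particles governed by the same rigid motion end up with similar parameters,
# so simple k-means clustering yields an object / part segmentation
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(phys_params)
print(np.bincount(labels))   # particle count per discovered object or part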
Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Changhao Li
School of Computational Science and Engineering, Georgia Institute of Technology
Xinrui Chen
Shenzhen International Graduate School, Tsinghua University
Ji Wang
School of Software, Tsinghua University
Kang Zhao
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua University
Jianfei Chen
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua University
Abstract
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our project is publicly accessible at https://dfq-dojo.github.io/dfq-toolkit-web.
Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory
Daixun Li
Xidian University
Yusi Zhang
Xidian University
Mingxiang Cao
Xidian University
Donglai Liu
Xidian University
Weiying Xie
Xidian University
Tianlin Hui
Xidian University
Lunkai Lin
AgileX Robotics
Zhiqiang Xie
AgileX Robotics
Yunsong Li
Xidian University
Abstract
Vision-Language-Action (VLA) is crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we propose MindExplore, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution; this task-oriented acting enables outstanding generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed for planning long-horizon task sequences and providing meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, guided by these signals and multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. It also integrates a lightweight Multimodal Diffusion Policy (MMDP) to enhance spatial perception by fusing multi-visual-modality features. Besides, a pioneering memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create SandGo-1k and SandThink-21k, the first expert-level multimodal embodied dataset and CoT dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01× more successful than existing methods in unstructured and dynamic environments.
Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process
Yuanze Li
Harbin Institute of Technology
Shihao Yuan
Harbin Institute of Technology
Haolin Wang
Harbin Institute of Technology
Qizhang Li
Harbin Institute of Technology
Ming Liu
Harbin Institute of Technology
Chen Xu
Pengcheng Lab, Guangzhou
Guangming Shi
Pengcheng Lab, Guangzhou
Wangmeng Zuo
Harbin Institute of Technology
Abstract
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at https://github.com/tzjtatata/Triad.
U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
Xiaofan Li
Baidu Inc.
Zhihao Xu
Baidu Inc.
Chenming Wu
Baidu Inc.
Zhao Yang
Baidu Inc.
Yumeng Zhang
Baidu Inc.
Jiang-Jiang Liu
Baidu Inc.
Haibao Yu
Baidu Inc.
Xiaoqing Ye
Baidu Inc.
Yuan Wang
Baidu Inc.
Shirui Li
Baidu Inc.
Xun Sun
Baidu Inc.
Ji Wan
Baidu Inc.
Jun Wang
Baidu Inc.
Abstract
Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.
UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Peiming Li
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Ziyi Wang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Yulin Yuan
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Xiangming Meng
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Junsong Yuan
State University of New York at Buffalo
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
Zhixuan Li
College of Computing and Data Science, Nanyang Technological University
Hyunse Yoon
Department of Electrical and Electronic Engineering, Yonsei University
Sanghoon Lee
Department of Electrical and Electronic Engineering, Yonsei University
Weisi Lin
College of Computing and Data Science, Nanyang Technological University
Abstract
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset are released on this page.
Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation
Chen Liang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Zhicheng Shi
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Wenguan Wang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Yi Yang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Abstract
Language-based human motion understanding focuses on describing human motions using natural language descriptions. Conversely, human motion generation aims to generate human motions from textual inputs. Despite significant progress in both fields, further advancements are hindered by two primary challenges: i) Both tasks rely heavily on vast amounts of paired motion-language data for model training. However, human labeling is costly, making it increasingly unsustainable as model scales increase. ii) Existing models often learn the two tasks in parallel. The strong reciprocity between them has not been fully explored. In response, this work proposes Dual Reciprocal Learning (DRL) for language-based human motion understanding and generation. DRL establishes a symmetric learning framework where both tasks collaboratively evolve in a closed-loop, bootstrapping manner, effectively leveraging the reciprocity between them. In DRL, the tasks serve as evaluators for each other, enabling the generation of informative feedback signals even with easily acquired unpaired, unidirectional motion or language data. Furthermore, to mitigate dataset-specific bias in existing evaluations, we propose a generalized protocol that extends evaluation to a general-domain cross-modal feature space. Experimental results on standard benchmarks demonstrate that DRL achieves remarkable performance boosts over representative baselines in both tasks across evaluation protocols.
Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion
Quanmin Liang
School of Computer Science and Engineering, Sun Yat-Sen University
Qiang Li
Xpeng Motors Technology Co Ltd
Shuai Liu
School of Computer Science and Engineering, Sun Yat-Sen University
Xinzi Cao
School of Computer Science and Engineering, Sun Yat-Sen University
Jinyi Lu
School of Computer Science and Engineering, Sun Yat-Sen University
Feidiao Yang
Department of Intelligent Computing, Pengcheng Laboratory
Wei Zhang
Department of Intelligent Computing, Pengcheng Laboratory
Kai Huang
School of Computer Science and Engineering, Sun Yat-Sen University
Yonghong Tian
Department of Intelligent Computing, Pengcheng Laboratory
Abstract
Applying the pretraining-finetuning paradigm to event cameras presents significant challenges due to the scarcity of large-scale event datasets and the inherently sparse nature of event data, which increases the risk of overfitting during extensive pretraining. In this paper, we explore the transfer of pretrained image knowledge to the domain of event cameras to address this challenge. The key to our approach lies in adapting event data representations to align with image pretrained models while simultaneously integrating spatiotemporal information and mitigating data sparsity. To achieve this, we propose a lightweight SpatioTemporal information fusion Prompting (STP) method, which progressively fuses the spatiotemporal characteristics of event data through a dynamic perception module with multi-scale spatiotemporal receptive fields, enabling compatibility with image pretrained models. STP enhances event data representation by capturing local information within a large receptive field and performing global information exchange along the temporal dimension. This strategy effectively reduces sparse regions in event data while refining fine-grained details, all while preserving its inherent spatiotemporal structure. Our method significantly outperforms previous state-of-the-art approaches across classification, semantic segmentation, and optical flow estimation tasks. For instance, it achieves a top-1 accuracy of 68.87% (+4.04%) on N-ImageNet with only 1/10 of the pretraining parameters and 1/3 of the training epochs. Our code is available at https://github.com/Lqm26/STP.
EventUPS: Uncalibrated Photometric Stereo Using an Event Camera
Jinxiu Liang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Bohan Yu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Siqi Yang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Haotian Zhuang
Tsinghua University
Jieji Ren
Shanghai Jiaotong University
Peiqi Duan
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
We present EventUPS, the first uncalibrated photometric stereo (UPS) method using an event camera, a neuromorphic sensor that asynchronously detects brightness changes with microsecond resolution. Traditional frame-based UPS methods are hindered by high bandwidth demands and limited use in dynamic scenes. These methods require dense image correspondence under varying illumination and are incompatible with the fundamentally different sensing paradigm of event data. Our approach introduces three key innovations: an augmented null space formulation that directly relates each event to joint constraints on surface normals and lighting, naturally handling ambient illumination; a continuous parameterization of time-varying illumination that connects asynchronous events to synchronized lighting estimation; and a lighting fixture with known relative geometry that reduces ambiguity to a convex-concave uncertainty. We validate EventUPS using a custom-built LED lighting system. Experimental results show that our method achieves accuracy surpassing its frame-based counterpart while requiring only 5% of the data bandwidth.
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo Liang
The Chinese University of Hong Kong
Yiwu Zhong
The Chinese University of Hong Kong
Zi-Yuan Hu
The Chinese University of Hong Kong
Yeyao Tao
The Chinese University of Hong Kong
Liwei Wang
The Chinese University of Hong Kong
Abstract
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask.
Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion
Jiawei Liang
Shenzhen Campus of Sun Yat-sen University
Siyuan Liang
Nanyang Technological University
Tianrui Lou
Shenzhen Campus of Sun Yat-sen University
Ming Zhang
National Key Laboratory of Science and Technology on Information System Security
Wenjin Li
Nsfocus
Dunqiu Fan
Nsfocus
Xiaochun Cao
Shenzhen Campus of Sun Yat-sen University
Abstract
Object detection is widely used in real-world applications such as autonomous driving, yet adversarial camouflage poses a significant threat by deceiving detectors from multiple viewpoints. Existing techniques struggle to maintain consistent attack efficacy across different viewpoints. To address this, we propose GRAC, an adversarial camouflage framework that enhances attack effectiveness across viewpoints and distances. First, we identify conflicts in gradient updates across angles and introduce gradient reweighting to resolve them, enabling coordinated optimization. Second, we model light interactions to simulate illumination changes, improving robustness under varying lighting conditions. Additionally, we address non-uniform texture updates arising from inconsistent sampling density during rendering by applying pooling-based texture regularization to improve smoothness. Extensive experiments in both simulated and physical environments demonstrate that GRAC outperforms existing methods across diverse conditions.
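Gradient conflicts across viewpoints can be illustrated with a projection-style combination rule, where any per-view gradient component that points against another view's gradient is removed before averaging. The PCGrad-style sketch below shows this generic idea only; it is not claimed to be GRAC's exact reweighting rule, and the texture shape is hypothetical.

import torch

def reweight_conflicting_grads(grads):
    """Combine per-viewpoint gradients while removing pairwise conflicts by
    projecting out components that oppose another view's gradient."""
    out = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g.flatten(), h.flatten())
            if dot < 0:  # conflicting directions: remove the opposing component
                g -= dot / (h.norm() ** 2 + 1e-12) * h
        out.append(g)
    return torch.stack(out).mean(dim=0)

# toy usage: attack-loss gradients of a 3 x 64 x 64 texture from three viewpoints
per_view = [torch.randn(3, 64, 64) for _ in range(3)]
texture_update = reweight_conflicting_grads(per_view)
print(texture_update.shape)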
Instance-Level Video Depth in Groups Beyond Occlusions
Yuan Liang
South China University of Technology
Yang Zhou
South China University of Technology
Ziming Sun
South China University of Technology
Tianyi Xiang
South China University of Technology
Guiqing Li
South China University of Technology
Shengfeng He
Singapore Management University
Abstract
Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem from two key aspects: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which assigns a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state-of-the-art on the GID dataset and multiple benchmarks. Our code and dataset can be found at https://github.com/ViktorLiang/GID.
Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
Yingping Liang
Beijing Institute of Technology
Yutao Hu
School of Computer Science and Engineering, Southeast University
Wenqi Shao
Shanghai AI Laboratory
Ying Fu
Beijing Institute of Technology
Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching. Code is available at https://github.com/Sharpiless/L2M.
Perspective-Invariant 3D Object Detection
Ao Liang
National University of Singapore
Lingdong Kong
National University of Singapore
Dongyue Lu
National University of Singapore
Youquan Liu
Fudan University
Jian Fang
Shenyang Institute of Automation, Chinese Academy of Sciences
Huaici Zhao
Shenyang Institute of Automation, Chinese Academy of Sciences
Wei Tsang Ooi
National University of Singapore
Abstract
With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available. (Ao, Lingdong, and Dongyue contributed equally to this work.)
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Tianming Liang
Sun Yat-sen University
Kun-Yu Lin
Sun Yat-sen University
Chaolei Tan
Sun Yat-sen University
Jianguo Zhang
Southern University of Science and Technology
Wei-Shi Zheng
Sun Yat-sen University
Jian-Fang Hu
Sun Yat-sen University
Abstract
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% J&F on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
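Confidence-aware query pruning can be pictured as keeping only the top-scoring decoder queries before the expensive mask-decoding step. The PyTorch sketch below illustrates that generic idea; the keep ratio, query count, and scoring are assumptions rather than ReferDINO's exact schedule.

import torch

def prune_queries(query_feats, confidences, keep_ratio=0.25):
    """Keep only the most confident object queries before mask decoding."""
    num_keep = max(1, int(keep_ratio * query_feats.size(1)))
    topk = confidences.topk(num_keep, dim=1).indices               # B x num_keep
    idx = topk.unsqueeze(-1).expand(-1, -1, query_feats.size(-1))
    return query_feats.gather(1, idx), topk

queries = torch.randn(2, 300, 256)     # B x N x D decoder queries
scores = torch.rand(2, 300)            # per-query confidence scores
kept, kept_ids = prune_queries(queries, scores)
print(kept.shape)                      # torch.Size([2, 75, 256])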
Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement
Qian Liang
University of Science and Technology of China
Ruixu Geng
University of Science and Technology of China
Jinbo Chen
Nanyang Technological University
Haoyu Wang
University of Science and Technology of China
Yan Chen
University of Science and Technology of China
Yang Hu
University of Science and Technology of China
Abstract
Remote physiological measurement (RPM) based on video and radar has made significant progress in recent years. However, unimodal methods based solely on video or radar sensors have notable limitations due to their measurement principles, and multimodal RPM that combines these modalities has emerged as a promising direction. Despite its potential, the lack of large-scale multimodal data and the significant modality gap between video and radar pose substantial challenges in building robust video-radar RPM models. To handle these problems, we suggest leveraging unimodal pre-training and present the Spatial alignment and Temporal Matching (SATM) Adapter to effectively fine-tune pre-trained unimodal backbones into a multimodal RPM model. Given the distinct measurement principles of video- and radar-based methods, we propose Spatial Alignment to align the spatial distribution of their features. Furthermore, Temporal Matching is applied to mitigate waveform discrepancies between video and radar signals. By integrating these two modules into adapters, the unimodal backbones could retain their modality-specific knowledge while effectively extracting complementary features from each other. Extensive experiments across various challenging scenarios, including low light conditions and head motions, demonstrate that our approach significantly surpasses the state-of-the-art methods.
UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
Zhengyin Liang
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University
Hui Yin
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University
Min Liang
Beijing University of Technology
Qianqian Du
Beijing Jiaotong University
Ying Yang
Beijing Jiaotong University
Hua Huang
Beijing Jiaotong University
Abstract
Modality or domain distribution shifts pose formidable challenges in 3D semantic segmentation. Existing methods predominantly address either cross-modal or cross-domain adaptation in isolation, leading to insufficient exploration of semantic associations and complementary features in heterogeneous data. To bridge this gap, we present UniDxMD, a unified representation method for cross-modal unsupervised domain adaptation (UDA) in 3D semantic segmentation that simultaneously tackles both cross-modal and cross-domain adaptation objectives. Our core insight is deriving a unified discrete representation from heterogeneous data to mitigate distribution shifts, inspired by vector quantization. Specifically, we propose a differentiable, cluster-based soft quantization mechanism (CSQM) that maps heterogeneous data (spanning modalities and domains) into a shared discrete latent space. Then, we introduce latent space regularization (LSR), leveraging joint prototypes that satisfy semantic relation consistency as learnable anchors to enhance the compactness and semantic discriminability of the discrete latent space. Our method paves the way for advancing cross-modal UDA in 3D semantic segmentation towards the unified representation. Extensive results across four challenging cross-modal UDA scenarios demonstrate the superiority of our method. Code is available here.
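A cluster-based soft quantization can be written as a softmax-weighted mixture over a shared codebook, which keeps the mapping differentiable. The sketch below is a generic soft-VQ formulation used to illustrate CSQM; the codebook size, temperature, and feature dimensions are assumptions.

import torch

def soft_quantize(features, codebook, temperature=0.1):
    """Differentiable cluster-based soft quantization: each feature becomes a
    softmax-weighted mixture of codebook vectors (shared discrete latent space)."""
    d = torch.cdist(features, codebook)                 # N x K distances
    assign = torch.softmax(-d / temperature, dim=1)     # soft cluster assignment
    quantized = assign @ codebook                       # N x D quantized features
    return quantized, assign

feats_2d = torch.randn(4096, 96)     # e.g. image-branch features
feats_3d = torch.randn(20000, 96)    # e.g. point-branch features
codebook = torch.randn(256, 96, requires_grad=True)
q2d, a2d = soft_quantize(feats_2d, codebook)
q3d, a3d = soft_quantize(feats_3d, codebook)   # both modalities share one latent space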
I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Zhimin Liao
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ping Wei
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ruijie Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Shuaijia Chen
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Haoxuan Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ziyang Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offer substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this issue, we propose I2-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, I2-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on the transformation matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that I2-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires only 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available at https://github.com/lzzzzzm/II-World.
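Multi-scale residual quantization means each stage quantizes whatever the previous stages failed to explain, so coarse structure and fine detail land in different codebooks. A generic residual-VQ sketch is shown below to illustrate the intra-scene tokenizer idea; the number of stages, codebook sizes, and feature shapes are assumptions, not the architecture's actual settings.

import torch

def residual_quantize(x, codebooks):
    """Multi-stage residual quantization: each stage quantizes the residual left
    by the previous stages and emits one set of token ids per stage."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:                                  # one codebook per stage/scale
        ids = torch.cdist(residual, cb).argmin(dim=1)     # nearest code per feature
        q = cb[ids]
        recon = recon + q
        residual = residual - q
        codes.append(ids)
    return recon, codes

scene_feats = torch.randn(1024, 128)                      # flattened 3D scene features
codebooks = [torch.randn(512, 128) for _ in range(3)]     # three quantization stages
recon, codes = residual_quantize(scene_feats, codebooks)
print(len(codes), recon.shape)                            # 3 torch.Size([1024, 128])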
LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
Wei Liao
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Chunyan Xu
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Chenxu Wang
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Zhen Cui
Beijing Normal University, Beijing, China
Abstract
Sparse annotation in remote sensing object detection poses significant challenges due to dense object distributions and category imbalances. Although existing Dense Pseudo-Label methods have demonstrated substantial potential in pseudo-labeling tasks, they remain constrained by selection ambiguities and inconsistencies in confidence estimation. In this paper, we introduce an LLM-assisted semantic guidance framework tailored for sparsely annotated remote sensing object detection, exploiting the advanced semantic reasoning capabilities of large language models (LLMs) to distill high-confidence pseudo-labels. By integrating LLM-generated semantic priors, we propose a Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust supervision across varying data distributions. Additionally, we develop an Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning branch by mitigating the influence of confounding background information. Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method outperforms existing single-stage detector-based frameworks, significantly improving detection performance under sparse annotations. Our source code is available at https://github.com/wuxiuzhilianni/RSST.
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
Xinyao Liao
Nanyang Technological University
Xianfang Zeng
StepFun
Liao Wang
StepFun
Gang Yu
StepFun
Guosheng Lin
Nanyang Technological University
Chi Zhang
Westlake University
Abstract
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text, and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. After that, an optional rethinking step can be adopted to ensure the generated video aligns well with the motion information in the prompt. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We further construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
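The analytical optical flow composition is described only at a high level; as a generic illustration of its camera-motion component, the sketch below computes the flow induced by a relative camera pose given per-pixel depth under a pinhole model. The function name, the pinhole assumption, and the toy inputs are all assumptions; object-trajectory flow would be composed on top in a full system.

```python
# Generic sketch: optical flow induced by camera motion (R, t) given depth and
# intrinsics K, via back-projection and reprojection. Illustrative only.
import numpy as np

def camera_flow(depth: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, HW)
    # Back-project to 3D in the first camera, move to the second camera, reproject.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # (3, HW)
    pts2 = R @ pts + t.reshape(3, 1)
    proj = K @ pts2
    uv2 = proj[:2] / proj[2:3]
    return (uv2 - pix[:2]).T.reshape(h, w, 2)                           # (H, W, 2)

K = np.array([[500.0, 0, 64], [0, 500.0, 64], [0, 0, 1]])
flow = camera_flow(np.full((128, 128), 2.0), K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(flow.mean(axis=(0, 1)))  # constant horizontal flow for a pure x-translation
```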
CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
Xiao Lin
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Yun Peng
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Liuyi Wang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Xianyou Zhong
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Minghao Zhu
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Yi Feng
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Jingwei Yang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Chengju Liu
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Qijun Chen
State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai, China
Abstract
In the effort to achieve robust and generalizable category-level object pose estimation, recent methods primarily focus on learning fundamental representations from data. However, the inherent biases within the data are often overlooked: the repeated training samples and similar environments may mislead the models to over-rely on specific patterns, hindering models' performance on novel instances. In this paper, we present CleanPose, a novel method that mitigates the data biases to enhance category-level pose estimation by integrating causal learning and knowledge distillation. By incorporating key causal variables (structural information and hidden confounders) into causal modeling, we propose the causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further confront the data bias at the feature level, we devise a residual-based knowledge distillation approach to transfer unbiased semantic knowledge from a 3D foundation model, providing comprehensive causal supervision. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be available at https://github.com/chrislin0621/CleanPose.
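For reference, front-door adjustment as invoked above is the standard identity below, where Z mediates the effect of the input X on the prediction Y; mapping these symbols onto CleanPose's structural information and confounders is an illustrative reading, not the paper's exact formulation.

```latex
% Textbook front-door adjustment identity; the correspondence of X, Z, Y to
% CleanPose's variables is an illustrative assumption.
P\big(Y \mid \mathrm{do}(X = x)\big)
  = \sum_{z} P(z \mid x) \sum_{x'} P\big(Y \mid x', z\big)\, P(x')
```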
ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring
Xiaopeng Lin
The Hong Kong University of Science and Technology (Guangzhou)
Yulong Huang
The Hong Kong University of Science and Technology (Guangzhou)
Hongwei Ren
The Hong Kong University of Science and Technology (Guangzhou)
Zunchang Liu
The Hong Kong University of Science and Technology (Guangzhou)
Hongxiang Huang
The Hong Kong University of Science and Technology (Guangzhou)
Yue Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Haotian Fu
The Hong Kong University of Science and Technology (Guangzhou)
Bojun Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjust neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios dynamically. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
Chen Lin
CCB & CCM, Flatiron Institute
Weizhi Du
University of Michigan, Ann Arbor
Zhixiang Min
Stevens Institute of Technology
Baochen She
Stanford University
Enrique Dunn
Stevens Institute of Technology
Sonya M. Hanson
CCB & CCM, Flatiron Institute
Abstract
We explore a quaternion adjugate matrix-based representation for rotational motion in the Perspective-n-Point (PnP) problem. Leveraging quadratic quaternion terms within a Determinant Ratio Matrix (DRaM) estimation framework, we extend its application to perspective scenarios, providing a robust and efficient initialization for iterative PnP pose estimation. Notably, by solving the orthographic projection least-squares problem, DRaM provides a reliable initialization that enhances the accuracy and stability of iterative PnP solvers. Experiments on synthetic and real data demonstrate its efficiency, accuracy, and robustness, particularly under high noise conditions. Furthermore, our nonminimal formulation ensures numerical stability, making it effective for real-world applications.
Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
Junru Lin
University of Toronto
Chirag Vashist
Stanford University
Mikaela Angelina Uy
Nvidia
Colton Stearns
Nvidia
Xuan Luo
Google
Leonidas Guibas
Stanford University
Ke Li
Simon Fraser University
Abstract
Existing dynamic scene interpolation methods typically assume that the motion between consecutive timesteps is small enough so that displacements can be locally approximated by linear models. In practice, even slight deviations from this small-motion assumption can cause conventional techniques to fail. In this paper, we introduce Global Motion Corresponder (GMC), a novel approach that robustly handles large motion and achieves smooth transitions. GMC learns unary potential fields that predict SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity. We demonstrate that our method significantly outperforms existing baselines on 3D scene interpolation when the two states undergo large global motions. Furthermore, our method enables extrapolation capabilities where other baseline methods cannot.
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Zijun Lin
Nanyang Technological University
Shuting He
Shanghai University of Finance and Economics
Cheston Tan
Centre for Frontier AI Research, A*STAR
Bihan Wen
Nanyang Technological University
Abstract
Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as 'it', 'here' and 'the same' to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow, a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.
MCOP: Multi-UAV Collaborative Occupancy Prediction
Zefu Lin
University of Chinese Academy of Sciences (UCAS)
Wenbo Chen
Institute of Automation, Chinese Academy of Sciences (CASIA)
Xiaojuan Jin
Institute of Automation, Chinese Academy of Sciences (CASIA)
Yuran Yang
Beijing University of Posts and Telecommunications (BUPT)
Lue Fan
Institute of Automation, Chinese Academy of Sciences (CASIA)
Yixin Zhang
Tencent
Yufeng Zhang
University of Chinese Academy of Sciences (UCAS)
Zhaoxiang Zhang
University of Chinese Academy of Sciences (UCAS)
Abstract
Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experimental results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.
Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception
Hongwei Lin
Xiamen University
Dongyu Pan
Xiamen University
Qiming Xia
Xiamen University
Hai Wu
Xiamen University
Cheng Wang
Xiamen University
Siqi Shen
Xiamen University
Chenglu Wen
Xiamen University
Abstract
Recently, learning-based multi-agent cooperative perception has garnered widespread attention. However, the inherent vulnerabilities of neural networks, combined with the risks posed by cooperative communication as a wide-open backdoor, render these systems highly susceptible to adversarial attacks. Existing attack methods lack stealth as they perturb transmitted information indiscriminately, producing numerous false positives that are readily detected by consensus-based defenses. This paper proposes Pretend Benign (PB), a novel stealthy adversarial attack method that exploits vulnerabilities in cooperative perception to enable the attacker to disguise itself as a benign cooperator. To achieve this, we first introduce the Attack Region Selection (ARS) module, which divides the perception area into subregions based on confidence levels to pinpoint optimal attack locations. Then, we propose Multi-target Adversarial Perturbation Generation (MAPG), which maintains consensus, gains the victim's trust, and thereby reverses the normal cooperative role of perception. To mitigate the latency in adversarial signal generation and communication, we further propose a real-time attack by predicting future information through historical feature flow. Extensive experiments on the OPV2V and V2XSet datasets demonstrate that PB effectively bypasses state-of-the-art defense methods, underscoring its stealth and efficacy.
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin
University of Science and Technology of China
Mengqi Huang
University of Science and Technology of China
Shuhan Zhuang
University of Science and Technology of China
Zhendong Mao
University of Science and Technology of China
Abstract
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task. Project Page: realgeneral web; GitHub Link: https://github.com/Lyne1/RealGeneral
SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
Shengjie Lin
Toyota Technological Institute at Chicago
Jiading Fang
Toyota Technological Institute at Chicago
Muhammad Zubair Irshad
Toyota Research Institute
Vitor Campagnolo Guizilini
Toyota Research Institute
Rares Andrei Ambrus
Toyota Research Institute
Greg Shakhnarovich
Toyota Technological Institute at Chicago
Matthew R. Walter
Toyota Technological Institute at Chicago
Abstract
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SPLART, a self-supervised, category-agnostic framework that uses 3D Gaussian Splatting (3DGS) to reconstruct and infer the kinematics of articulated objects from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SPLART augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SPLART exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SPLART's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Xinran Ling
AMAP, Alibaba Group
Chen Zhu
AMAP, Alibaba Group
Meiqi Wu
AMAP, Alibaba Group
Hangyu Li
AMAP, Alibaba Group
Xiaokun Feng
CRISE, Institute of Automation, Chinese Academy of Sciences
Cundian Yang
AMAP, Alibaba Group
Aiming Hao
AMAP, Alibaba Group
Jiashu Zhu
AMAP, Alibaba Group
Jiahong Wu
AMAP, Alibaba Group
Xiangxiang Chu
AMAP, Alibaba Group
Abstract
Video generation has advanced rapidly, along with its evaluation methods, yet assessing the motion of generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics, we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation, a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism, we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we release VMBench at https://github.com/AMAP-ML/VMBench, setting a new standard for evaluating and advancing motion generation models.
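The benchmark's headline number is a Spearman rank correlation between automatic motion metrics and human preference annotations; the snippet below shows how such a correlation is typically computed with SciPy, using toy scores.

```python
# Minimal sketch of validating a motion metric against human annotations via
# Spearman's rank correlation. The score arrays are toy data.
from scipy.stats import spearmanr

metric_scores = [0.62, 0.35, 0.80, 0.51, 0.44]   # automatic motion-quality scores
human_scores  = [4, 2, 5, 3, 3]                  # human preference ratings

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```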
4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
Ling Liu
IIIS, Tsinghua University
Jun Tian
IIIS, Tsinghua University
Li Yi
IIIS, Tsinghua University
Abstract
4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu
Northwestern Polytechnical University
Qizhi Chen
Shanghai AI Laboratory
Zhigang Wang
Shanghai AI Laboratory
Yiwen Tang
Northwestern Polytechnical University
Yiting Zhang
Shanghai AI Laboratory
Chi Yan
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Xuelong Li
TeleAI
Bin Zhao
Northwestern Polytechnical University
Abstract
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code will be released at https://github.com/Ideal-ljl/AerialVG.
CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
Changxing Liu
Shanghai Jiao Tong University
Genjia Liu
Shanghai Jiao Tong University
Zijun Wang
Shanghai Jiao Tong University
Jinchang Yang
Shanghai Jiao Tong University
Siheng Chen
Shanghai Jiao Tong University
Abstract
Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under a critic-feedback paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. Code will be released on https://github.com/cxliu0314/CoLMDriver.
Controllable 3D Outdoor Scene Generation via Scene Graphs
Yuheng Liu
Texas A&M University
Xinke Li
City University of Hong Kong
Yuning Zhang
Southwest Jiaotong University
Lu Qi
UC Merced
Xin Li
Texas A&M University
Wenping Wang
Texas A&M University
Chongshou Li
Southwest Jiaotong University
Xueting Li
NVIDIA
Ming-Hsuan Yang
UC Merced
Abstract
Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving and gaming. However, current methods offer limited or non-intuitive user control. In this work, we propose a method that uses a scene graph as a user-friendly control format to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense Bird's Eye View (BEV) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. Users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs. Code is available at https://github.com/yuhengliu02/control-3d-scene.
CountSE: Soft Exemplar Open-set Object Counting
Shuai Liu
School of Software Engineering, Xi'an Jiaotong University
Peng Zhang
School of Software Engineering, Xi'an Jiaotong University
Shiwei Zhang
School of Software Engineering, Xi'an Jiaotong University
Wei Ke
School of Software Engineering, Xi'an Jiaotong University
Abstract
Open-set counting is garnering increasing attention due to its capability to enumerate objects of arbitrary category. It can be generally categorized into two methodologies: text-guided zero-shot counting methods and exemplar-guided few-shot counting methods. Previous text-guided zero-shot methods only provide limited object information through text, resulting in poor performance. Besides, though exemplar-guided few-shot approaches gain better results, they rely heavily on manually annotated visual exemplars, resulting in low efficiency and high labor intensity. Therefore, we propose CountSE, which simultaneously achieves high efficiency and high performance. CountSE is a new text-guided zero-shot object counting algorithm that generates multiple precise soft exemplars at different scales to enhance counting models driven solely by semantics. Specifically, to obtain richer object information and address the diversity in object scales, we introduce Semantic-guided Exemplar Selection, a module that generates candidate soft exemplars at various scales and selects those with high similarity scores. Then, to ensure accuracy and representativeness, Clustering-based Exemplar Filtering is introduced to refine the candidate exemplars by effectively eliminating inaccurate exemplars through clustering analysis. In the text-guided zero-shot setting, CountSE outperforms all state-of-the-art methods on the FSC-147 benchmark by at least 15%. Additionally, experiments on two other widely used datasets demonstrate that CountSE significantly outperforms all previous text-guided zero-shot counting methods and is competitive with the most advanced exemplar-guided few-shot methods. Code is available at https://github.com/pppppz22/CountSE.
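Semantic-guided Exemplar Selection is described as scoring scale-varied candidates against the text semantics and keeping the high-similarity ones; the hedged sketch below shows one plausible form of that ranking step using cosine similarity. The shapes, the top-k rule, and all names are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch: rank candidate exemplar features against a class text embedding
# by cosine similarity and keep the top-k. Illustrative assumptions throughout.
import torch
import torch.nn.functional as F

def select_soft_exemplars(cand_feats: torch.Tensor, text_emb: torch.Tensor, k: int = 3):
    """cand_feats: (N, D) candidate exemplar features; text_emb: (D,) class embedding."""
    sims = F.cosine_similarity(cand_feats, text_emb.unsqueeze(0), dim=-1)  # (N,)
    topk = sims.topk(k)
    return topk.indices, topk.values

cands = torch.randn(20, 256)   # candidate exemplar features at mixed scales (toy)
text = torch.randn(256)        # class text embedding (toy)
idx, scores = select_soft_exemplars(cands, text)
print(idx.tolist(), [round(s, 3) for s in scores.tolist()])
```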
Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
Enyu Liu
Huazhong University of Science and Technology
En Yu
Huazhong University of Science and Technology
Sijia Chen
Huazhong University of Science and Technology
Wenbing Tao
Huazhong University of Science and Technology
Abstract
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes
Yan Liu
College of Computer Science and Technology, Zhejiang University
Zehao Chen
College of Computer Science and Technology, Zhejiang University
Haojie Yan
College of Computer Science and Technology, Zhejiang University
De Ma
College of Computer Science and Technology, Zhejiang University
Huajin Tang
College of Computer Science and Technology, Zhejiang University
Qian Zheng
College of Computer Science and Technology, Zhejiang University
Gang Pan
College of Computer Science and Technology, Zhejiang University
Abstract
Synthesizing novel space-time views from a monocular video is a highly ill-posed problem, and its effectiveness relies on accurately reconstructing motion and appearance of the dynamic scene. Frame-based methods for novel space-time view synthesis in dynamic scenes rely on simplistic motion assumptions due to the absence of inter-frame cues, which makes them fail under complex motion. Event cameras capture inter-frame cues with high temporal resolution, giving them promising potential to handle complex motion. However, this remains difficult due to event noise and sparsity. To mitigate the impact caused by event noise and sparsity, we propose E-NeMF, which alleviates the impact of event noise with Parametric Motion Representation and mitigates the event sparsity with Flow Prediction Module. Experiments on multiple real-world datasets demonstrate our superior performance in handling complex motion. Codes will be released at https://github.com/zjubmi-lab/E-NeMF.
Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
Ruyang Liu
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Shangkun Sun
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Haoran Tang
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Wei Gao
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Ge Li
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Abstract
Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the 'key' is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, which first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
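As a rough illustration of motion-based token pruning of this kind, the sketch below drops patch tokens whose average optical-flow magnitude is low, keeping only motion-rich regions. The patch size, threshold, and row-major token layout are assumptions, not the paper's exact rule.

```python
# Hedged sketch: prune visual tokens by patch-level optical-flow magnitude.
import torch

def prune_tokens_by_flow(tokens: torch.Tensor, flow: torch.Tensor,
                         patch: int = 14, keep_thresh: float = 1.0):
    """tokens: (N, D), one token per patch; flow: (H, W, 2) optical flow for the frame."""
    mag = flow.norm(dim=-1)                                              # (H, W)
    # Average flow magnitude per patch, flattened in row-major token order.
    patch_mag = mag.unfold(0, patch, patch).unfold(1, patch, patch).mean(dim=(-1, -2))
    keep = patch_mag.reshape(-1) > keep_thresh                           # (N,)
    return tokens[keep], keep

tokens = torch.randn(64, 768)            # 8x8 grid of patch tokens (toy)
flow = torch.randn(112, 112, 2)          # toy flow for a 112x112 frame, patch = 14
kept, mask = prune_tokens_by_flow(tokens, flow)
print(kept.shape, int(mask.sum()))
```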
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Tianqi Liu
School of AIA, Huazhong University of Science and Technology
Zihao Huang
School of AIA, Huazhong University of Science and Technology
Zhaoxi Chen
S-Lab, Nanyang Technological University
Guangcong Wang
Great Bay University
Shoukang Hu
School of AIA, Huazhong University of Science and Technology
Liao Shen
School of AIA, Huazhong University of Science and Technology
Huiqiang Sun
School of AIA, Huazhong University of Science and Technology
Zhiguo Cao
School of AIA, Huazhong University of Science and Technology
Wei Li
S-Lab, Nanyang Technological University
Ziwei Liu
S-Lab, Nanyang Technological University
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
Jiaming Liu
School of Computer Science, Shanghai Jiao Tong University
Linghe Kong
School of Computer Science, Shanghai Jiao Tong University
Guihai Chen
School of Computer Science, Shanghai Jiao Tong University
Abstract
Segment anything model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-DSA, which performs COD for RGB-D inputs via Dual Stream Adapters. While keeping the SAM architecture intact, dual stream adapters are expanded on the image encoder to learn potential complementary information from RGB images and depth images, and fine-tune the mask decoder and its depth-aware replica to perform dual-stream mask prediction. In practice, the dual stream adapters are embedded into the attention block of the image encoder in a parallel manner to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual stream embeddings that do not directly interact with each other, we augment the association of dual stream embeddings using bidirectional knowledge distillation including a model distiller and a modal distiller. In addition, to predict the masks for RGB and depth attention maps, we integrate the two types of image embeddings which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to synchronize the consistency of image embeddings and prompt embeddings. Experimental results on four COD benchmarks show that our SAM-DSA achieves excellent detection performance gains over SAM and achieves state-of-the-art results with a given fine-tuning paradigm.
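The adapters are attached to the frozen image encoder in a parallel manner; the generic sketch below shows the usual pattern of a bottleneck adapter running in parallel with a frozen transformer block. The bottleneck width, scaling factor, and wrapper class are illustrative assumptions rather than the SAM-DSA design.

```python
# Generic sketch of a bottleneck adapter attached in parallel to a frozen block.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.5):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.scale = scale

    def forward(self, x):
        return self.scale * self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Frozen encoder block plus a trainable adapter applied in parallel."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)          # keep the backbone weights intact
        self.adapter = ParallelAdapter(dim)

    def forward(self, x):
        return self.block(x) + self.adapter(x)

block = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
adapted = AdaptedBlock(block, dim=256)
tokens = torch.randn(2, 196, 256)            # (batch, tokens, dim) toy input
print(adapted(tokens).shape)
```

Only the adapter parameters receive gradients, which is what keeps the pre-trained encoder intact while still letting depth-specific cues be injected.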
Learning Efficient and Generalizable Human Representation with Human Gaussian Model
Yifan Liu
Tsinghua University
Shengjun Zhang
Tsinghua University
Chensheng Dai
Tsinghua University
Yang Chen
Nanyang Technological University
Hao Liu
WeChat Vision, Tencent Inc.
Chen Li
WeChat Vision, Tencent Inc.
Yueqi Duan
Tsinghua University
Abstract
Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict Gaussians for each frame independently, without fully capturing the relations of Gaussians from different timestamps. To address this, we propose Human Gaussian Graph to model the connection between predicted Gaussians and human SMPL mesh, so that we can leverage information from all frames to recover an animatable human representation. Specifically, the Human Gaussian Graph contains dual layers where Gaussians are the first layer nodes and mesh vertices serve as the second layer nodes. Based on this structure, we further propose the intra-node operation to aggregate various Gaussians connected to one mesh vertex, and inter-node operation to support message passing among mesh node neighbors. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.
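The intra-node operation aggregates the Gaussians connected to a mesh vertex; a minimal sketch of that style of aggregation, using a nearest-vertex assignment and a scatter-mean, is shown below. The nearest-neighbour connectivity, feature sizes, and toy data are assumptions, not the paper's graph construction.

```python
# Minimal sketch: attach each Gaussian to its nearest mesh vertex and average
# the attached Gaussian features per vertex (scatter-mean). Illustrative only.
import torch

def aggregate_to_vertices(gauss_xyz, gauss_feat, vert_xyz):
    """gauss_xyz: (G, 3), gauss_feat: (G, D), vert_xyz: (V, 3) -> (V, D)."""
    nearest = torch.cdist(gauss_xyz, vert_xyz).argmin(dim=-1)        # (G,)
    V, D = vert_xyz.shape[0], gauss_feat.shape[1]
    summed = torch.zeros(V, D).index_add_(0, nearest, gauss_feat)
    counts = torch.zeros(V).index_add_(0, nearest, torch.ones(len(nearest)))
    return summed / counts.clamp(min=1).unsqueeze(-1)

gauss_xyz, gauss_feat = torch.rand(5000, 3), torch.randn(5000, 32)   # toy Gaussians
vert_xyz = torch.rand(6890, 3)                                       # toy vertex set
vert_feat = aggregate_to_vertices(gauss_xyz, gauss_feat, vert_xyz)
print(vert_feat.shape)
```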
MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
Zhixuan Liu
Carnegie Mellon University
Haokun Zhu
Carnegie Mellon University
Rui Chen
Carnegie Mellon University
Jonathan Francis
Carnegie Mellon University
Soonmin Hwang
Hanyang University
Ji Zhang
Carnegie Mellon University
Jean Oh
Carnegie Mellon University
Abstract
We introduce a diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a multi-channel inference-time optimization that avoids error accumulation common in sequential or single-room constraints in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during the denoising process when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Resources and code are at https://mosaic-cmubig.github.io.
MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
Xinhang Liu
School of Electronics and Information, Northwestern Polytechnical University
Jiawei Shi
School of Electronics and Information, Northwestern Polytechnical University
Zheng Dang
CVLab, EPFL, Switzerland
Yuchao Dai
School of Electronics and Information, Northwestern Polytechnical University
Abstract
We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have large network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thus decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Despite using fewer reference images, experiments on seven core datasets in the BOP challenge show that our method achieves comparable results with other methods that require more reference images and larger network parameters.
Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
Jingyu Liu
Renmin University of China
Zijie Xin
Renmin University of China
Yuhan Fu
Renmin University of China
Ruixiang Zhao
Renmin University of China
Bangxiang Lan
Renmin University of China
Xirong Li
Renmin University of China
Abstract
Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current methods for sketch animation perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we identify two major challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch, which is based on iterative optimization through Score Distillation Sampling (SDS) and thus animates a multi-object sketch in a training-data-free manner. To tackle the two challenges with a divide-and-conquer strategy, MoSketch has four novel modules, i.e., LLM-based scene decomposition, LLM-based motion planning, multi-grained motion refinement, and compositional SDS. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications.
OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
Shiyong Liu
Huawei Noah's Ark Lab
Xiao Tang
Huawei Noah's Ark Lab
Zhihao Li
Huawei Noah's Ark Lab
Yingfan He
The Chinese University of Hong Kong (Shenzhen)
Chongjie Ye
The Chinese University of Hong Kong (Shenzhen)
Jianzhuang Liu
Shenzhen Institutes of Advanced Technology
Binxiao Huang
The University of Hong Kong
Shunbo Zhou
Huawei Embodied Intelligence Lab
Xiaofei Wu
Huawei Noah's Ark Lab
Abstract
In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speed compared to existing state-of-the-art approaches. Project page: https://occlugaussian.github.io.
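The region-based rendering culls Gaussians that are invisible to the region containing the current viewpoint; the coarse sketch below precomputes a per-region visibility mask from the training cameras. For simplicity the cameras are clustered by position only and the visibility matrix is toy data, both simplifications of the position-plus-co-visibility criterion described above.

```python
# Coarse sketch of region-based culling: cluster training cameras into regions,
# then keep, per region, only Gaussians seen by at least one of its cameras.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cam_pos = rng.uniform(size=(120, 3))               # training camera positions (toy)
visible = rng.random((120, 50000)) < 0.05          # (cameras, Gaussians) visibility (toy)

regions = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(cam_pos)
region_masks = {r: visible[regions == r].any(axis=0) for r in range(4)}
centers = np.stack([cam_pos[regions == r].mean(axis=0) for r in range(4)])

def gaussians_for_view(view_pos: np.ndarray) -> np.ndarray:
    """Return indices of Gaussians to render for the region containing view_pos."""
    r = int(np.argmin(np.linalg.norm(centers - view_pos, axis=1)))
    return np.flatnonzero(region_masks[r])

print(len(gaussians_for_view(np.array([0.5, 0.5, 0.5]))))
```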
Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization
Wang Liu
Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University
Wei Gao
Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Peng Cheng Laboratory
Abstract
Information quantization has been widely adopted in multimedia content, such as images, videos, and point clouds. The goal of information quantization is to achieve efficient storage and transmission by reducing data precision or redundancy. However, the information distortion caused by quantization will lead to the degradation of signal fidelity and the performance of downstream tasks. This paper focuses on the geometry quantization distortion of point clouds and proposes a unified learning-based quality enhancement framework for omni-scene point clouds. Based on the characteristics of geometry quantization distortion, we analyze and find that existing upsampling methods are not competitive in dealing with point reduction and geometry displacement simultaneously caused by coordinate quantization. Therefore, we design a general rooting-growing-pruning paradigm to efficiently perceive the geometry feature of quantized point clouds and improve the quality significantly. In addition, a novel loss constraint term related to the quantization step parameter is proposed to further improve quality and accelerate model convergence. To the best of our knowledge, this is the first unified quality enhancement framework for object and scene point clouds with coordinate quantization. Extensive experiments verify the superiority of the proposed method on multi-scale point clouds with different levels of quantization distortion, including object (ModelNet40, 8iVFB) and scene (KITTI). In particular, the enhanced point clouds improve the performance of downstream analysis tasks, including classification and 3D object detection.
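The coordinate quantization studied above has the two effects the abstract names: point reduction (nearby points collapse onto the same grid cell) and geometry displacement (surviving points are shifted by up to half a step per axis). The small sketch below reproduces both effects on a toy cloud; the step values are arbitrary.

```python
# Small sketch of coordinate quantization: snapping point coordinates to a grid
# of step `q` both removes duplicate points and displaces the remaining ones.
import numpy as np

def quantize_coords(points: np.ndarray, step: float) -> np.ndarray:
    """points: (N, 3) float coordinates -> deduplicated quantized cloud."""
    snapped = np.round(points / step) * step
    return np.unique(snapped, axis=0)

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(10000, 3))
for q in (0.01, 0.05, 0.1):
    quant = quantize_coords(cloud, q)
    disp = np.abs(np.round(cloud / q) * q - cloud).max()
    print(f"step={q}: {len(quant)} points kept, max displacement {disp:.4f}")
```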
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu
NVIDIA
Mikaela Angelina Uy
NVIDIA
Donglai Xiang
NVIDIA
Hao Su
UCSD
Sanja Fidler
NVIDIA; University of Toronto; Vector Institute
Nicholas Sharp
NVIDIA
Jun Gao
NVIDIA; University of Toronto; Vector Institute
Abstract
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! https://research.nvidia.com/labs/toronto-ai/partfield-release/
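The abstract notes that the learned feature field can be clustered into a hierarchical part decomposition; the sketch below shows that post-processing step on toy per-point features, cutting an agglomerative hierarchy at several granularities. The clustering algorithm, feature shapes, and cluster counts are assumptions, not PartField's exact procedure.

```python
# Hedged sketch: cluster a per-point feature field into parts at several
# granularities to obtain a coarse-to-fine decomposition. Toy features only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(4096, 128))     # per-point features from the field (toy)

for n_parts in (2, 4, 8):                      # cut the hierarchy at different levels
    labels = AgglomerativeClustering(n_clusters=n_parts).fit_predict(point_feats)
    print(n_parts, np.bincount(labels).tolist())
```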
PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
Longliang Liu
Huazhong University of Science and Technology
Miaojie Feng
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Jijun Xiang
Huazhong University of Science and Technology
Xuan Zhu
Huazhong University of Science and Technology
Xin Yang
Optics Valley Laboratory
Abstract
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features of the primitive branch, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: https://github.com/longliangLiu/PriOr-Flow.
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Yueh-Cheng Liu
Technical University of Munich
Lukas Höllein
Technical University of Munich
Matthias Nießner
Technical University of Munich
Angela Dai
Technical University of Munich
Abstract
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for heuristics in densification. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by 48% in comparison to state-of-the-art methods.
SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
Xiangzeng Liu
Xidian University
Chi Wang
Xidian University
Guanglu Shi
Xidian University
Xiaodong Zhang
Xidian University
Qiguang Miao
Xidian University
Miao Fan
Navinfo Europe B.V
Abstract
Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5◦ in indoor pose estimation, establishing a new state-of-the-art.
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Zhenyang Liu
Fudan University
Yikai Wang
Nanyang Technological University
Kuanning Wang
Fudan University
Longfei Liang
NeuHelium Co., Ltd
Xiangyang Xue
Fudan University
Yanwei Fu
Fudan University
Abstract
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real-world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion-based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single-view RGB-D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real-world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real-world robotic task success rate by 8.6%.
TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset
Chang Liu
ADLab, Tencent
Mingxu Zhu
ADLab, Tencent
Zheyuan Zhang
ADLab, Tencent
Linna Song
ADLab, Tencent
Xiao Zhao
ADLab, Tencent
Qingliang Luo
ADLab, Tencent
Qi Wang
ADLab, Tencent
Chufan Guo
ADLab, Tencent
Kuifeng Su
ADLab, Tencent
Abstract
End-to-end autonomous driving technology has recently become a focal point of research and application in autonomous driving. State-of-the-art (SOTA) methods are often trained and evaluated on the nuScenes dataset. However, the nuScenes dataset, introduced in 2019 for 3D perception tasks, faces several limitations, such as insufficient scale, simple scenes, and homogeneous driving behaviors, that restrict the upper-bound development of end-to-end autonomous driving algorithms. In light of these issues, we propose a novel, large-scale real-world dataset specifically designed for end-to-end autonomous driving tasks, named TAD-E2E, which is 25x larger than nuScenes, has 1.7x its scene complexity, and features a highly diverse range of driving behaviors. We replicated SOTA methods on the TAD-E2E dataset and observed that these methods no longer performed well, as expected. Additionally, in response to the challenging scenarios presented in the TAD-E2E dataset, we devised a multimodal sparse end-to-end method that significantly outperforms SOTA methods. Ablation studies demonstrate the effectiveness of our method, and we analyze the contributions of each module. The dataset will be released in the near future.
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers
An-Lun Liu
National Yang Ming Chiao Tung University
Yu-Wei Chao
NVIDIA
Yi-Ting Chen
National Yang Ming Chiao Tung University
Abstract
In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/
Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge
Linshen Liu
Johns Hopkins University
Boyan Su
Johns Hopkins University
Junyue Jiang
Johns Hopkins University
Guanlin Wu
Johns Hopkins University
Cong Guo
Duke University
Ceyu Xu
HKUST
Hao Frank Yang
Johns Hopkins University
Abstract
This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike existing works, EMC2 introduces a novel scenario-aware MoE architecture optimized for fusing complementary sparse 3D point clouds and dense 2D images to achieve robust multimodal representations for detection. Furthermore, EMC2 integrates an adaptive multimodal data bridge with multi-scale region proposing and scenario-aware routing, dynamically dispatching features to complementary experts based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.
Underwater Visual SLAM with Depth Uncertainty and Medium Modeling
Rui Liu
ReLER, CCAI, Zhejiang University
Sheng Fan
ReLER, CCAI, Zhejiang University
Wenguan Wang
ReLER, CCAI, Zhejiang University
Yi Yang
ReLER, CCAI, Zhejiang University
Abstract
Underwater visual simultaneous localization and mapping (SLAM) faces critical challenges in light attenuation and degraded geometric consistency. Despite recent advances of visual SLAM in indoor and urban scenes, these approaches typically assume a clear medium and neglect medium-light interactions, leading to performance degradation in underwater environments. To overcome these limitations, we propose DUV-SLAM, a dense underwater visual SLAM framework that integrates uncertainty-aware geometry estimation with physics-inspired neural scattering modeling. Our method introduces two core innovations: i) depth uncertainty quantification derived from differentiable bundle adjustment, which propagates geometric confidence to guide mapping optimization; and ii) a neural-Gaussian hybrid representation that combines adaptive 3D Gaussians for underwater reconstruction with a neural field capturing wavelength-dependent medium properties, optimized using a combination of photometric, geometric, and distribution losses. Experiments on synthetic and real-world datasets demonstrate that DUV-SLAM achieves high-quality monocular reconstruction while maintaining real-time efficiency and robust tracking accuracy.
Unified Open-World Segmentation with Multi-Modal Prompts
Yang Liu
Zhejiang University
Yufei Yin
Hangzhou Dianzi University
Chenchen Jing
Zhejiang University of Technology
Muzhi Zhu
Zhejiang University
Hao Chen
Zhejiang University
Yuling Xi
Zhejiang University
Bo Feng
Apple
Hao Wang
Apple
Shiyu Li
Apple
Chunhua Shen
Zhejiang University
Abstract
In this work, we present COSINE, a unified open-world segmentation model that Consolidates Open-vocabulary Segmentation and IN-context sEgmentation with multimodal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches. Our code is released at https://github.com/aim-uofa/COSINE.
Video Motion Graphs
Haiyang Liu
The University of Tokyo
Zhan Xu
Adobe Research
Fa-Ting Hong
Adobe Research
Hsin-Ping Huang
Adobe Research
Yi Zhou
Adobe Research
Yang Zhou
Adobe Research
Abstract
We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectories. Results show that Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found here.
When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
Pan Liu
Central South University
Jinshi Liu
Central South University
Abstract
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with the network's tendency toward overconfidence, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, directly discarding low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state-of-the-art methods. Code and model weights are available at: https://github.com/PanLiuCSU/CSL.
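As a rough, hypothetical illustration of sample-specific pseudo-label selection in the spirit described above (the paper formulates this as a convex optimization, which is not reproduced here), the sketch below splits each image's pixel-confidence distribution with a tiny 1-D two-cluster k-means and then randomly masks part of the reliable set; the clustering choice and mask ratio are assumptions.

```python
import numpy as np

def select_pseudo_labels(conf: np.ndarray, mask_ratio: float = 0.3, iters: int = 20, seed: int = 0):
    """conf: (H, W) per-pixel confidence of the predicted class.
    Returns boolean maps of reliable pixels and of reliable pixels that are
    kept after random masking (masked pixels are left for context learning)."""
    c = conf.ravel()
    lo, hi = c.min(), c.max()                      # initialise two 1-D cluster centres
    for _ in range(iters):                         # tiny k-means on confidences
        assign = np.abs(c - lo) < np.abs(c - hi)
        if not assign.any() or assign.all():       # degenerate split, stop early
            break
        lo, hi = c[assign].mean(), c[~assign].mean()
    boundary = (lo + hi) / 2.0                     # per-image decision boundary
    reliable = conf >= boundary
    rng = np.random.default_rng(seed)
    keep = reliable & (rng.uniform(size=conf.shape) > mask_ratio)
    return reliable, keep

conf = np.random.default_rng(1).beta(5, 2, size=(64, 64))
reliable, kept = select_pseudo_labels(conf)
print(reliable.mean(), kept.mean())
```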
mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
Bingyi Liu
Wuhan University Of Technology
Jian Teng
Wuhan University Of Technology
Hongfei Xue
University of North Carolina at Charlotte
Enshu Wang
Wuhan University
Chuanhui Zhu
Wuhan University Of Technology
Pu Wang
University of North Carolina at Charlotte
Libing Wu
Wuhan University
Abstract
Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from being transmitted and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.
PseudoMapTrainer: Learning Online Mapping without HD Maps
Christian Löwens
Bosch Research
Thorben Funke
Bosch Research
Jingchao Xie
Bosch Research; Technical University of Munich
Alexandru Paul Condurache
Automated Driving, Bosch; University of Lübeck
Abstract
Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pretrain an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.
HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Jiaxin Lu
University of Texas at Austin
Chun-Hao Paul Huang
Adobe Research
Uttaran Bhattacharya
Adobe Research
Qixing Huang
University of Texas at Austin
Yi Zhou
Adobe Research
Abstract
Figure 1. Overview of the HUMOTO dataset: mocap 4D human-object interaction animations with multiple objects, featuring detailed and accurate interaction modeling (in particular hand poses), objects precisely modeled by artists, and human text annotations at different levels of abstraction.
We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 735 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains, with practical applications in animation, robotics, and embodied AI systems. (The work was mainly conducted at Adobe Research.) Project Page: https://jiaxin-lu.github.io/humoto/.
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Yifan Lu
NVIDIA
Xuanchi Ren
NVIDIA
Jiawei Yang
University of Southern California
Tianchang Shen
NVIDIA
Zhangjie Wu
NVIDIA
Jun Gao
NVIDIA
Yue Wang
University of Southern California
Siheng Chen
Shanghai Jiao Tong University
Mike Chen
NVIDIA
Sanja Fidler
NVIDIA
Jiahui Huang
NVIDIA
Abstract
We present InfiniCube, a scalable and controllable method to generate unbounded and dynamic 3D driving scenes with high fidelity. Previous methods for scene generation are constrained either by their applicability to indoor scenes or by their lack of controllability. In contrast, we take advantage of recent advances in 3D and video generative models to achieve large dynamic scene generation with flexible controls like HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned 3D voxel generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of pixel-aligned guidance buffers, synthesizing a consistent appearance on long-video generation for large-scale scenes. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift videos to dynamic 3D Gaussians with controllable objects. Our method generates realistic and dynamic 3D driving scenes, and extensive experiments validate the effectiveness of our model design.
Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
Jiaxin Lu
University of Texas at Austin
Gang Hua
Amazon
Qixing Huang
University of Texas at Austin
Abstract
The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstructing complete shapes for the reassembly problem. Existing approaches focus primarily on piecewise information for both part and fracture assembly, often overlooking the integration of a complete-object prior. Jigsaw++ distinguishes itself by learning a shape prior of complete objects. It employs the proposed 'retargeting' strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to current methods. Through extensive evaluations on the Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.
ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving
Yuhang Lu
ShanghaiTech University
Jiadong Tu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Xinge Zhu
The Chinese University of Hong Kong
Abstract
End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: 4dvlab.github.io/project page/realad
Serialization based Point Cloud Oversegmentation
Chenghui Lu
Huaqiao University
Jianlong Kwan
Huaqiao University
Dilong Li
Huaqiao University
Ziyi Chen
Huaqiao University
Haiyan Guan
Nanjing University of Information Science and Technology
Abstract
Point cloud oversegmentation, as a fundamental preprocessing step for 3D understanding, is a challenging task due to its spatial proximity and semantic similarity requirements. Most existing works struggle to efficiently group semantically consistent points into superpoints while maintaining spatial proximity. In this paper, we propose a novel serialization-based point cloud oversegmentation method, which leverages serialization to avoid complex spatial queries, directly accessing neighboring points through sequence locality for similarity matching and superpoint clustering. Specifically, we first serialize point clouds onto a Hilbert curve and partition them into spatially continuous initial segments. Then, to guarantee the internal semantic consistency of superpoints, we design an adaptive update algorithm that clusters superpoints by matching feature similarities between neighboring segments and refines segment features via Cross-Attention. Experiments on large-scale indoor and outdoor datasets demonstrate state-of-the-art performance in point cloud oversegmentation. Moreover, the method is also adaptable to semantic segmentation and achieves promising performance. The code is available at https://github.com/CHL-glitch/SPCNet.
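The serialization idea can be sketched as follows, using a Morton (Z-order) code as a simpler stand-in for the Hilbert curve used in the paper; the quantization resolution and segment size are illustrative.

```python
import numpy as np

def morton_code(xyz_q: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of quantized x, y, z coordinates (Z-order curve)."""
    codes = np.zeros(len(xyz_q), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((xyz_q[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def serialize_and_segment(points: np.ndarray, bits: int = 10, seg_size: int = 64):
    """Order points along a space-filling curve and cut the sequence into
    spatially continuous initial segments of roughly `seg_size` points."""
    mins, maxs = points.min(0), points.max(0)
    q = ((points - mins) / (maxs - mins + 1e-9) * (2 ** bits - 1)).astype(np.int64)
    order = np.argsort(morton_code(q, bits))            # serialization step
    segments = [order[i:i + seg_size] for i in range(0, len(order), seg_size)]
    return order, segments

pts = np.random.default_rng(0).uniform(size=(1000, 3))
order, segs = serialize_and_segment(pts)
print(len(segs), "initial segments")
```

Neighbouring segments in this sequence are also spatial neighbours, which is what allows similarity matching and clustering to proceed without explicit spatial queries.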
VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
Haoang Lu
Xi'an Jiaotong University
Yuanqi Su
Xi'an Jiaotong University
Xiaoning Zhang
unknown
Longjun Gao
unknown
Yu Xue
unknown
Le Wang
unknown
Abstract
This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.
monoVLN: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation
Renjie Lu
Sun Yat-sen University
Yu Zhou
Sun Yat-sen University
Hao Cheng
Hunan University
Jingke Meng
Sun Yat-sen University
Wei-Shi Zheng
Sun Yat-sen University
Abstract
Vision and Language Navigation (VLN) requires agents to navigate 3D environments by following natural language instructions. While existing methods predominantly assume access to panoramic observations, many practical robots are equipped with monocular RGB-D cameras, creating a significant configuration disparity. In this work, we address this critical gap by developing a novel 3DGS-based framework for monocular VLN agents, focusing on the intrinsic information-incompleteness challenge. Our approach incorporates two key innovations: (1) an implicit partial completion module for inferring representations of missing regions in incompletely rendered panoramic feature maps, and (2) an uncertainty-aware active perception strategy that enables the agent to actively acquire visual observations when uncertain about its decision. Extensive experiments on the R2R-CE and RxR-CE datasets demonstrate that our monoVLN outperforms all existing monocular methods, improving the success rate on R2R-CE by 8% over previous monocular methods. We also validate monoVLN in real-world environments, providing a practical solution for real-world VLN.
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
Rundong Luo
Cornell University
Matthew Wallingford
University of Washington
Ali Fahardi
University of Washington
Noah Snavely
Cornell University
Wei-Chiu Ma
Cornell University
Abstract
360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the 'tunnel vision' of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from in-the-wild perspective videos. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.
Gradient Decomposition and Alignment for Incremental Object Detection
Wenlong Luo
Northwestern Polytechnical University
Shizhou Zhang
Northwestern Polytechnical University
De Cheng
Xidian University
Yinghui Xing
Northwestern Polytechnical University
Guoqiang Liang
Northwestern Polytechnical University
Peng Wang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Incremental object detection (IOD) is crucial for enabling AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories, allowing the model to adapt to dynamic environments without forgetting prior information. Existing IOD methods primarily employ knowledge distillation to mitigate catastrophic forgetting, yet these approaches overlook class overlap issues, often resulting in suboptimal performance. In this paper, we propose a novel framework for IOD that leverages a decoupled gradient alignment technique on top of the specially proposed pseudo-labeling strategy. Our method employs a Gaussian Mixture Model to accurately estimate pseudo-labels of previously learned objects in current training images, effectively functioning as a knowledge-replay mechanism. This strategy reinforces prior knowledge retention and prevents the misclassification of unannotated foreground objects from earlier classes as background. Furthermore, we introduce an adaptive gradient decomposition and alignment method to maintain model stability while facilitating positive knowledge transfer. By aligning gradients from both old and new classes, our approach preserves previously learned knowledge while enhancing plasticity for new tasks. Extensive experiments on two IOD benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance compared to state-of-the-art methods. The code and datasets are available at https://github.com/FHR-L/GDA-IOD.
MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling
Guan Luo
Tsinghua University
Jianfeng Zhang
ByteDance Seed
Abstract
High-quality textured mesh reconstruction from sparse-view images remains a fundamental challenge in computer graphics and computer vision. Traditional large reconstruction models operate in a single-scale manner, forcing the models to simultaneously capture global structure and local details, often resulting in compromised reconstructed shapes. In this work, we propose MS3D, a novel multi-scale 3D reconstruction framework. At its core, our method introduces a hierarchical structured latent representation for multi-scale modeling, coupled with a multi-scale feature extraction and integration mechanism. This enables progressive reconstruction, effectively decomposing the complex task of detailed geometry reconstruction into a sequence of easier steps. This coarse-to-fine approach effectively captures multi-frequency details, learns complex geometric patterns, and generalizes well across diverse objects while preserving fine-grained details. Extensive experiments demonstrate MS3D outperforms state-of-the-art methods and is broadly applicable to both image- and text-to-3D generation. The entire pipeline reconstructs high-quality textured meshes in under five seconds.
Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Katie Z Luo
Cornell University
Minh-Quan Dao
Inria
Zhenzhen Liu
Cornell University
Mark Campbell
Cornell University
Wei-Lun Chao
The Ohio State University
Kilian Q Weinberger
Cornell University
Ezio Malis
Inria
Vincent Frémont
École Centrale de Nantes
Bharath Hariharan
Cornell University
Mao Shan
University of Sydney
Stewart Worrall
University of Sydney
Julie Stephany Berrio Perez
University of Sydney
Abstract
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. Mixed Signals is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. We hope our work advances research in the emerging, impactful field of V2X perception. Dataset details at https://mixedsignalsdataset.cs.cornell.edu/.
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Jiangran Lyu
Peking University
Ziming Li
Peking University
Xuesong Shi
unknown
Chaoyi Xu
unknown
Yizhou Wang
Peking University
He Wang
Peking University
Abstract
Non-prehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they heavily rely on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% using only single-view point cloud observations in simulation. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.
ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
Yanzhe Lyu
University of Science and Technology of China
Kai Cheng
unknown
Xin Kang
unknown
Xuejin Chen
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Abstract
Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view synthesis, achieving high fidelity and efficiency. However, it often struggles to capture rich details and complete geometry. Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. Our approach is capable of adaptively retrieving details and complementing missing geometry. To further support this method, we propose a pipeline named ResGS. Specifically, we integrate a Gaussian image pyramid for progressive supervision and implement a selection scheme that prioritizes the densification of coarse Gaussians over time. Extensive experiments demonstrate that our method achieves SOTA rendering quality. Consistent performance improvements can be achieved by applying our residual split to various 3D-GS variants, underscoring its versatility and potential for broader adoption in 3D-GS-based applications. Project page: https://yanzhelyu.github.io/resgs.github.io/.
BezierGS: Dynamic Urban Scene Reconstruction with Bezier Curve Gaussian Splatting
Zipei Ma
Fudan University
Junzhe Jiang
Fudan University
Yurui Chen
Fudan University
Li Zhang
Fudan University
Abstract
The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene component reconstruction and novel view synthesis.
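The underlying trajectory representation, a Bézier curve evaluated at a normalized timestamp, can be written down directly; the cubic degree and control points below are toy values, not learned parameters from the paper.

```python
import numpy as np
from math import comb

def bezier_point(control_points: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a Bézier curve of arbitrary degree at normalized time t in [0, 1].
    control_points: (n+1, 3) array; in a BézierGS-style model these would be learnable."""
    n = len(control_points) - 1
    basis = np.array([comb(n, i) * (1 - t) ** (n - i) * t ** i for i in range(n + 1)])
    return basis @ control_points   # (3,) position at time t

# A toy cubic trajectory for one dynamic object, sampled at five timestamps.
ctrl = np.array([[0.0, 0.0, 0.0],
                 [2.0, 1.0, 0.0],
                 [4.0, 1.0, 0.5],
                 [6.0, 0.0, 0.5]])
trajectory = np.stack([bezier_point(ctrl, t) for t in np.linspace(0.0, 1.0, 5)])
print(trajectory)
```

Because the curve is differentiable in its control points, pose errors can in principle be corrected by gradient descent on a rendering loss, which is the property the abstract relies on.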
DCHM: Depth-Consistent Human Modeling for Multiview Detection
Jiahao Ma
Australian National University
Tianyu Wang
unknown
Miaomiao Liu
unknown
David Ahmedt-Aristizabal
unknown
Chuong Nguyen
unknown
Abstract
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the project page.
Find Any Part in 3D
Ziqi Ma
California Institute of Technology
Yisong Yue
California Institute of Technology
Georgia Gkioxari
California Institute of Technology
Abstract
Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1,755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Shijie Ma
ARC Lab, Tencent PCG
Yuying Ge
ARC Lab, Tencent PCG
Teng Wang
ARC Lab, Tencent PCG
Yuxin Guo
ARC Lab, Tencent PCG
Yixiao Ge
ARC Lab, Tencent PCG
Ying Shan
ARC Lab, Tencent PCG
Abstract
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., 6.0% on OpenAI CLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.
InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
Yiyi Ma
Shenzhen International Graduate School, Tsinghua University
Yuanzhi Liang
Institute of Artificial Intelligence, China Telecom
Xiu Li
Shenzhen International Graduate School, Tsinghua University
Chi Zhang
Institute of Artificial Intelligence, China Telecom
Xuelong Li
Institute of Artificial Intelligence, China Telecom
Abstract
We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area. Project website: https://myy888.github.io/InterSyn/
MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting
Shaojie Ma
Zhejiang University
Yawei Luo
Zhejiang University
Wei Yang
Huazhong University of Science and Technology
Yi Yang
Zhejiang University
Abstract
3D reconstruction and simulation, although interrelated, have distinct objectives: reconstruction requires a flexible 3D representation that can adapt to diverse scenes, while simulation needs a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to address this challenge. MaGS constrains 3D Gaussians to roam near the mesh, creating a mutually adsorbed mesh-Gaussian 3D representation. Such a representation harnesses both the rendering flexibility of 3D Gaussians and the structured property of meshes. To achieve this, we introduce RMD-Net, a network that learns motion priors from video data to refine mesh deformations, alongside RGD-Net, which models the relative displacement between the mesh and Gaussians to enhance rendering fidelity under mesh constraints. To generalize to novel, user-defined deformations beyond input video without reliance on temporal data, we propose MPE-Net, which leverages inherent mesh information to bootstrap RMD-Net and RGD-Net. Due to the universality of meshes, MaGS is compatible with various deformation priors such as ARAP, SMPL, and soft physics simulation. Extensive experiments on the D-NeRF, DG-Mesh, and PeopleSnapshot datasets demonstrate that MaGS achieves state-of-the-art performance in both reconstruction and simulation. Project page: https://wcwac.github.io/MaGS-page/
MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
Yikun Ma
Sun Yat-sen University
Yiqing Li
Sun Yat-sen University
Jiawei Wu
Sun Yat-sen University
Xing Luo
Peng Cheng Laboratory
Zhi Jin
Sun Yat-sen University
Abstract
Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex motions, such as rotation and stretching, and to ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex motion editing among multi-view images. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks. Code is available at https://github.com/MrMa-yikun/MotionDiff.
ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection
Hongchi Ma
Harbin Institute of Technology
Guanglei Yang
Harbin Institute of Technology
Debin Zhao
Harbin Institute of Technology
Yanli Ji
Sun Yat-Sen University
Wangmeng Zuo
Harbin Institute of Technology
Abstract
Industrial visual inspection is crucial for detecting defects in manufactured products, but it traditionally relies on human operators, leading to inefficiencies. Industrial Visual Anomaly Detection (IVAD) has emerged as a promising solution, with methods such as zero-shot, few-shot, and reconstruction-based techniques. However, zero-shot methods struggle with subtle anomalies, and reconstruction-based methods fail to capture fine-grained details. Few-shot methods, which use limited samples and prompts, offer a more efficient approach. Despite their promise, challenges remain in managing intra-class variation among references and in effectively extracting more representative anomaly features. This paper presents Retrieval-enhanced Multi-modal Prompt Fusion Anomaly Detection (ReMP-AD), a framework that introduces Intra-Class Token Retrieval (ICTR) to reduce noise in the memory bank and Vision-Language Prior Fusion (VLPF) to guide the encoder in capturing more distinctive and relevant features of anomalies. Experiments on the VisA and MVTec-AD datasets demonstrate that ReMP-AD outperforms existing methods, achieving 97.8%/94.1% performance in 4-shot anomaly segmentation and classification. Our approach also shows strong results on the PCB-Bank dataset, highlighting its effectiveness in few-shot industrial anomaly detection. Code is available at https://github.com/cshcma/ReMP-AD.git
On the Recovery of Cameras from Fundamental Matrices
Rakshith Madhavan
Politecnico di Milano
Federica Arrigoni
Politecnico di Milano
Abstract
The viewing graph is a compact tool to encode the geometry of multiple views: nodes represent uncalibrated cameras and edges represent fundamental matrices (when available). Most research focuses on theoretical analyses, exploring for which viewing graphs it is possible (in principle) to retrieve cameras from fundamental matrices, in the sense that the problem admits a unique solution for noiseless data. However, the practical task of recovering cameras from noisy fundamental matrices is still open, as available methods are limited to special graphs (such as those covered by triplets). In this paper, we develop the first method that can deal with the recovery of cameras from noisy fundamental matrices in a general viewing graph. Experimental results demonstrate the promise of the proposed approach on a variety of synthetic and real scenarios.
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
Subhajit Maity
University of Central Florida
Ayan Kumar Bhunia
University of Surrey
Subhadeep Koley
University of Surrey
Pinaki Nath Chowdhury
University of Surrey
Aneeshan Sain
University of Surrey
Yi-Zhe Song
University of Surrey
Abstract
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
Connor Malone
Queensland University of Technology
Somayeh Hussaini
Queensland University of Technology
Tobias Fischer
Queensland University of Technology
Michael Milford
Queensland University of Technology
Abstract
Visual Place Recognition (VPR) enables coarse localization by comparing query images to a reference database of geo-tagged images. Recent breakthroughs in deep learning architectures and training regimes have led to methods with improved robustness to factors like environment appearance change, but with the downside that the required training and/or matching compute scales with the number of distinct environmental conditions encountered. Here, we propose Hyperdimensional One Place Signatures (HOPS) to simultaneously improve the performance, compute and scalability of these state-of-the-art approaches by fusing the descriptors from multiple reference sets captured under different conditions. HOPS scales to any number of environmental conditions by leveraging the Hyperdimensional Computing framework. Extensive evaluations demonstrate that our approach is highly generalizable and consistently improves recall performance across all evaluated VPR methods and datasets by large margins. Arbitrarily fusing reference images without compute penalty enables numerous other useful possibilities, three of which we demonstrate here: improved performance with reduced dimensionality descriptors, stacking synthetic images, and coarse localization to an entire traverse or environmental section.
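A minimal sketch of the kind of hyperdimensional fusion the abstract describes: each condition's reference descriptor is mapped into a high-dimensional bipolar space, bound to a per-condition key vector, and bundled by summation into a single signature. The binding/bundling operators, random projection, and dimensionality are generic hyperdimensional-computing choices, not necessarily HOPS's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                   # hyperdimensional signature length

def random_bipolar(dim: int) -> np.ndarray:
    """Random +/-1 key vector, one per environmental condition."""
    return rng.choice([-1.0, 1.0], size=dim)

def project(desc: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map a VPR descriptor into the hyperdimensional space (sign of a random projection)."""
    return np.sign(proj @ desc)

def fuse_conditions(descs_per_condition, proj, condition_keys):
    """Bind each condition's descriptor to its key and bundle them into one signature."""
    sig = np.zeros(D)
    for desc, key in zip(descs_per_condition, condition_keys):
        sig += key * project(desc, proj)   # element-wise binding, additive bundling
    return np.sign(sig)

desc_dim = 512
proj = rng.normal(size=(D, desc_dim))
keys = [random_bipolar(D) for _ in range(3)]           # e.g. day / night / rain
refs = [rng.normal(size=desc_dim) for _ in range(3)]   # one reference descriptor per condition
signature = fuse_conditions(refs, proj, keys)

# Querying: a (noisy) day-time descriptor bound to the day key still correlates
# with the fused signature, because the other bound terms act as noise.
query = np.sign(keys[0] * project(refs[0] + 0.1 * rng.normal(size=desc_dim), proj))
print("similarity:", float(query @ signature) / D)
```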
AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion
Mao Mao
Zhejiang University
Xujie Shen
Zhejiang University
Guyuan Chen
Zhejiang University
Boming Zhao
Zhejiang University
Jiarui Hu
Zhejiang University
Hujun Bao
Zhejiang University
Zhaopeng Cui
Zhejiang University
Abstract
Neural 3D modeling and novel view synthesis with Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) typically require multi-view images with wide baselines and accurate camera poses as input. However, scenarios with accidental camera motions are rarely studied. In this paper, we propose AccidentalGS, the first method for neural 3D modeling and novel view synthesis from accidental camera motions. To achieve this, we present a novel joint optimization framework that considers geometric and photometric errors, using a simplified camera model for stability. We also introduce a novel online adaptive depth-consistency loss to prevent the overfitting of the Gaussian model to input images. Extensive experiments on both synthetic and real-world datasets show that AccidentalGS achieves more accurate camera poses and realistic novel views compared to existing methods, and supports 3D modeling and neural rendering even for the Moon with telescope-like images.
Tree Skeletonization from 3D Point Clouds by Denoising Diffusion
Elias Ariel Marks
University of Bonn
Lucas Nunes
University of Bonn
Federico Magistri
University of Bonn
Matteo Sodano
University of Bonn
Rodrigo Marcuzzi
University of Bonn
Lars Zimmermann
University of Bonn
Jens Behley
University of Bonn
Cyrill Stachniss
University of Bonn
Abstract
The natural world presents complex organic structures, such as tree canopies, that humans can interpret even when only partially visible. Understanding tree structures is key for forest monitoring, orchard management, and automated harvesting applications. However, reconstructing tree topologies from sensor data, called tree skeletonization, remains a challenge for computer vision approaches. Traditional methods for tree skeletonization rely on handcrafted features, regression, or generative models, whereas recent advances focus on deep learning approaches. Existing methods often struggle with occlusions caused by dense foliage, limiting their applicability over the annual vegetation cycle. Furthermore, the lack of real-world data with reference information limits the evaluation of these methods to synthetic datasets, which does not validate generalization to real environments. In this paper, we present a novel approach for tree skeletonization that combines a generative denoising diffusion probabilistic model for predicting node positions and branch directions with a classical minimum spanning tree algorithm to infer tree skeletons from 3D point clouds, even with strong occlusions. Additionally, we provide a dataset of an apple orchard with 280 trees scanned 10 times during the growing season with corresponding reference skeletons, enabling quantitative evaluation. Experiments show the superior performance of our approach on real-world data and competitive results compared to state-of-the-art approaches on synthetic benchmarks.
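The classical final step mentioned in the abstract, extracting a tree skeleton from predicted node positions with a minimum spanning tree, can be sketched with SciPy; connecting candidates through a k-nearest-neighbour graph (and the value of k) is an assumption about one reasonable way to build the input graph.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def skeleton_edges(nodes: np.ndarray, k: int = 8):
    """Connect predicted skeleton nodes with a distance-weighted k-NN graph
    and extract its minimum spanning tree as the tree skeleton."""
    tree = cKDTree(nodes)
    dists, idx = tree.query(nodes, k=k + 1)          # first neighbour is the node itself
    rows = np.repeat(np.arange(len(nodes)), k)
    cols = idx[:, 1:].ravel()
    weights = dists[:, 1:].ravel()
    graph = coo_matrix((weights, (rows, cols)), shape=(len(nodes), len(nodes)))
    mst = minimum_spanning_tree(graph).tocoo()
    return list(zip(mst.row.tolist(), mst.col.tolist()))

nodes = np.random.default_rng(0).normal(size=(50, 3))  # stand-in for diffusion-predicted nodes
edges = skeleton_edges(nodes)
print(len(edges), "skeleton edges")   # N-1 edges if the k-NN graph is connected
```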
LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
Juliette Marrie
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Romain Menegaux
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Michael Arbel
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Diane Larlus
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Julien Mairal
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object segmentation tasks, highlighting the versatility of our approach.
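A schematic sketch of the graph-diffusion step: per-Gaussian features (e.g., coarse mask scores) are repeatedly averaged over a k-NN graph weighted by feature similarity. The Gaussian-kernel affinity, neighbourhood size, and number of iterations are illustrative placeholders rather than the paper's exact operator.

```python
import numpy as np
from scipy.spatial import cKDTree

def diffuse_features(feats: np.ndarray, sims: np.ndarray, pos: np.ndarray,
                     k: int = 16, steps: int = 10, alpha: float = 0.5):
    """feats: (N, C) features to refine (e.g. coarse mask scores per Gaussian).
    sims:  (N, F) features defining pairwise similarity (e.g. DINOv2-like).
    pos:   (N, 3) Gaussian centres used to build the k-NN graph."""
    idx = cKDTree(pos).query(pos, k=k + 1)[1][:, 1:]           # neighbours, self excluded
    diff = sims[:, None, :] - sims[idx]                        # (N, k, F)
    w = np.exp(-np.linalg.norm(diff, axis=-1) ** 2)            # similarity weights
    w /= w.sum(axis=1, keepdims=True) + 1e-9
    out = feats.copy()
    for _ in range(steps):
        neighbour_avg = (w[..., None] * out[idx]).sum(axis=1)  # weighted neighbour average
        out = (1 - alpha) * out + alpha * neighbour_avg        # lazy diffusion step
    return out

rng = np.random.default_rng(0)
pos, sims = rng.normal(size=(500, 3)), rng.normal(size=(500, 64))
coarse = (rng.uniform(size=(500, 1)) > 0.5).astype(float)      # coarse binary mask scores
print(diffuse_features(coarse, sims, pos).shape)
```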
Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor R. Medeiros
ETS Montreal
Atif Belal
ETS Montreal
Srikanth Muralidharan
ETS Montreal
Eric Granger
ETS Montreal
Marco Pedersoli
ETS Montreal
Abstract
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and traditional detectors. Recently, vision-language detectors (VLDs), such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt VLDs to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality-prompt decoupled residual, facilitating a more robust adaptation. We empirically benchmark our method for modality adaptation on YOLO-World and Grounding DINO on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to full fine-tuning while preserving the models' zero-shot capability. Our code is available at https://github.com/heitorrapela/ModPrompt.
Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
Yapeng Meng
Tsinghua University
Yihan Lin
Tsinghua University
Taoyi Wang
Tsinghua University
Yuguo Chen
Tsinghua University
Lijian Wang
Tsinghua University
Rong Zhao
Tsinghua University
Abstract
Recording and reconstructing high-speed scenes poses a significant challenge. While high-speed cameras can capture fine temporal details, their extremely high bandwidth demands make continuous recording unsustainable. Conversely, traditional RGB cameras, typically operating at 30 FPS, rely on frame interpolation to synthesize high-speed motion, often introducing artifacts and motion blur. Sensors inspired by the human visual system, such as event cameras, offer high-speed sparse temporal or spatial variation data, partially alleviating these issues. However, existing methods still suffer from RGB blur, temporal aliasing, and loss of event information. To overcome these challenges, we leverage a novel complementary vision sensor, Tianmouc, which outputs high-speed, multi-bit, sparse spatio-temporal difference information with RGB frames. Building on this unique sensing modality, we introduce a Cascaded Bi-directional Recurrent Diffusion Model (CBRDM) that achieves accurate, sharp, color-rich video frame reconstruction. Our method outperforms state-of-the-art RGB interpolation algorithms in quantitative evaluations and surpasses event-based methods in real-world comparisons. Code and dataset are at https://github.com/Tianmouc/GenRec.
Temporal Rate Reduction Clustering for Human Motion Segmentation
Xianghan Meng
Beijing University of Posts and Telecommunications
Zhengyu Tong
Beijing University of Posts and Telecommunications
Zhiyuan Huang
Beijing University of Posts and Telecommunications
Chun-Guang Li
Beijing University of Posts and Telecommunications
Abstract
Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in videos capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering (TR2C), which jointly learns structured representations and affinity to segment the sequences of frames in video. Specifically, the structured representations learned by TR2C enjoy temporal consistency and align well with a UoS structure, which is favorable for addressing the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors. The code is available at: https://github.com/mengxianghan123/TR2C.
GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration
Li Mi
EPFL
Manon Béchaz
EPFL
Zeming Chen
EPFL
Antoine Bosselut
EPFL
Devis Tuia
EPFL
Abstract
Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. We validate these capabilities through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
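For readers unfamiliar with curiosity-driven intrinsic rewards, the sketch below shows one common instantiation: the reward is the prediction error of a learned forward dynamics model, so it is goal-agnostic by construction. The observation and action dimensions and the squared-error form are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Toy forward dynamics model used to compute a curiosity reward:
    predict the embedding of the next observation from the current one."""
    def __init__(self, obs_dim=128, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, obs_emb, action_onehot):
        return self.net(torch.cat([obs_emb, action_onehot], dim=-1))

def curiosity_reward(model, obs_emb, action_onehot, next_obs_emb):
    """Goal-agnostic intrinsic reward: large when the environment model is
    surprised, encouraging exploration of poorly modelled regions."""
    with torch.no_grad():
        pred = model(obs_emb, action_onehot)
        return ((pred - next_obs_emb) ** 2).mean(dim=-1)

model = ForwardModel()
r = curiosity_reward(model, torch.randn(8, 128), torch.eye(4)[torch.randint(0, 4, (8,))], torch.randn(8, 128))
```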
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
Cui Miao
National University of Defense Technology
Tao Chang
National University of Defense Technology
Meihan Wu
National University of Defense Technology
Hongbin Xu
Bytedance Seed
Chun Li
Shenzhen MSU-BIT University
Ming Li
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Xiaodong Wang
National University of Defense Technology
Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction-Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
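A minimal sketch of a dual-gating mixture-of-experts layer is given below: a token-side router selects top-k experts, and each expert additionally applies its own gate deciding how strongly it contributes. The dimensions, the sigmoid self-gate, and the top-k rule are assumptions made for illustration; the paper's exact DGMoE design may differ.

```python
import torch
import torch.nn as nn

class DualGatingMoE(nn.Module):
    """Illustrative dual-gating MoE layer: tokens are routed to top-k experts
    (token-side gate), and each expert also has a learned self-gate that can
    suppress its own contribution (expert-side gate)."""
    def __init__(self, dim=256, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )
        self.self_gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_experts))
        self.k = k

    def forward(self, tokens):                        # tokens: (B, N, dim)
        scores = self.router(tokens).softmax(-1)      # token-side routing weights
        topk = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk.indices[..., slot]             # (B, N) chosen expert ids
            w = topk.values[..., slot].unsqueeze(-1)  # (B, N, 1) routing weight
            for e, (expert, gate) in enumerate(zip(self.experts, self.self_gates)):
                mask = (idx == e)
                if mask.any():
                    m = mask.unsqueeze(-1).float()
                    g = torch.sigmoid(gate(tokens))   # expert decides its activation
                    out = out + m * w * g * expert(tokens)
        return out

layer = DualGatingMoE()
y = layer(torch.randn(2, 16, 256))   # (2, 16, 256)
```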
Multi-view Gaze Target Estimation
Qiaomu Miao
Stony Brook University
Vivek Raju Golani
Stony Brook University
Jingyi Xu
Stony Brook University
Progga Paromita Dutta
Stony Brook University
Minh Hoai
The University of Adelaide
Dimitris Samaras
Stony Brook University
Abstract
This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at https://www3.cs.stonybrook.edu/~cvl/multiview_gte.html.
Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Ziliang Miao
The University of Hong Kong
Runjian Chen
The University of Hong Kong
Yixi Cai
KTH Royal Institute of Technology
Buwei He
KTH Royal Institute of Technology
Wenquan Zhao
Southern University of Science and Technology
Wenqi Shao
Shanghai AI Laboratory
Bo Zhang
Shanghai AI Laboratory
Fu Zhang
The University of Hong Kong
Abstract
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems such as self-driving vehicles. While previous supervised approaches rely on costly manual annotations, LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method designed to alleviate this annotation burden. TOP learns powerful spatiotemporal representations by predicting the occupancy states of temporal overlapping points that are commonly observed in current and adjacent scans. To further ground these representations in the current scene's geometry, we introduce an auxiliary pretraining objective of reconstructing the occupancy of the current scan. Extensive experiments on the nuScenes and SemanticKITTI datasets validate our method's effectiveness. TOP consistently outperforms existing supervised and self-supervised pre-training baselines across both point-level Intersection-over-Union (IoU) and object-level Recall metrics. Notably, it achieves a relative improvement of up to 28.77% over a training-from-scratch baseline and demonstrates strong transferability across LiDAR setups. Our code is publicly available at https://github.com/ZiliangMiao/TOP.
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
Xingyu Miao
Durham University
Haoran Duan
Tsinghua University
Quanhao Qian
DAMO Academy, Alibaba Group
Jiuniu Wang
DAMO Academy, Alibaba Group
Yang Long
Durham University
Ling Shao
UCAS-Terminus AI Lab, UCAS
Deli Zhao
DAMO Academy, Alibaba Group
Ran Xu
DAMO Academy, Alibaba Group
Gongjie Zhang
DAMO Academy, Alibaba Group
Abstract
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
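The geometric core of lifting a single image to 3D, once a metric depth map and camera intrinsics are available, is standard pinhole unprojection, sketched below. The depth estimation, camera calibration, and scale calibration stages described in the abstract are not reproduced, and the intrinsics in the example are placeholders.

```python
import numpy as np

def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Unproject a metric depth map (H, W) into a 3D point cloud (N, 3)
    using a pinhole intrinsic model; invalid zero-depth pixels are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Example: dummy 480x640 depth map with assumed intrinsics.
pts = lift_depth_to_points(np.full((480, 640), 2.0),
                           fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```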
Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models
Mateusz Michalkiewicz
Rice University
Sheena Bai
Rice University
Mahsa Baktashmotlagh
The University of Queensland
Varun Jampani
Stability AI
Guha Balakrishnan
Rice University
Abstract
In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint - and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying accidental, stable, and other viewpoints using feature representations alone, without accessing the actual images at inference time. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of other viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.
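As a rough illustration of what viewpoint instability can mean operationally, the snippet below scores a set of features extracted from nearby viewpoints of the same object by one minus their mean pairwise cosine similarity. This is a generic stand-in metric, not necessarily the definition used by the authors.

```python
import torch

def viewpoint_instability(features):
    """Instability proxy for features of one object rendered from V nearby
    viewpoints, shaped (V, D): 1 - mean off-diagonal cosine similarity.
    Higher values indicate a less viewpoint-stable embedding."""
    f = torch.nn.functional.normalize(features, dim=-1)
    sim = f @ f.t()                                   # (V, V)
    v = sim.shape[0]
    off_diag = (sim.sum() - sim.diagonal().sum()) / (v * (v - 1))
    return 1.0 - off_diag

feats = torch.randn(8, 768)          # e.g. 8 renders around one object
score = viewpoint_instability(feats)
```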
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
Marko Mihajlovic
ETH Zürich
Siwei Zhang
ETH Zürich
Gen Li
ETH Zürich
Kaifeng Zhao
ETH Zürich
Lea Müller
UC Berkeley
Siyu Tang
ETH Zürich
Abstract
Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms the prior volumetric occupancy model COAP with 10× faster inference, 6× lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL's strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
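The idea of forming a compact decoder by blending a small bank of weight matrices with predicted coefficients can be sketched as below. Layer sizes, the softmax over coefficients, and the conditioning input are illustrative assumptions rather than the paper's exact NBW architecture.

```python
import torch
import torch.nn as nn

class NeuralBlendWeightsMLP(nn.Module):
    """Sketch of blending weight matrices: a small bank of weight bases is
    mixed with predicted, condition-dependent coefficients to form a compact
    per-query MLP that maps 3D points to an occupancy logit."""
    def __init__(self, in_dim=3, hidden=64, num_bases=8, cond_dim=16):
        super().__init__()
        self.bases_w1 = nn.Parameter(torch.randn(num_bases, in_dim, hidden) * 0.1)
        self.bases_w2 = nn.Parameter(torch.randn(num_bases, hidden, 1) * 0.1)
        self.coeff = nn.Linear(cond_dim, num_bases)   # shape/pose-dependent mixing

    def forward(self, xyz, cond):                     # xyz: (B, N, 3), cond: (B, cond_dim)
        a = self.coeff(cond).softmax(-1)              # (B, num_bases)
        w1 = torch.einsum("bk,kio->bio", a, self.bases_w1)
        w2 = torch.einsum("bk,kio->bio", a, self.bases_w2)
        h = torch.relu(torch.einsum("bni,bio->bno", xyz, w1))
        return torch.einsum("bni,bio->bno", h, w2)    # (B, N, 1) occupancy logits

model = NeuralBlendWeightsMLP()
logits = model(torch.randn(2, 1024, 3), torch.randn(2, 16))
```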
Discontinuity-aware Normal Integration for Generic Central Camera Models
Francesco Milano
ETH Zurich
Manuel López-Antequera
Meta
Naina Dhingra
Meta
Roland Siegwart
ETH Zurich
Robert Thiel
Meta
Abstract
Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
S2M2: Scalable Stereo Matching Model for Reliable Depth Estimation
Junhong Min
Samsung Electronics
Youngpil Jeon
Samsung Electronics
Jimin Kim
Samsung Electronics
Minyong Choi
Samsung Electronics
Abstract
The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with S²M²: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. S²M² establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.
R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception
Jonas Mirlach
XITASO GmbH
Lei Wan
Karlsruhe Institute of Technology
Andreas Wiedholz
XITASO GmbH
Hannan Ejaz Keen
XITASO GmbH
Andreas Eich
LiangDao GmbH
Abstract
In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of Vulnerable Road Users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across 150 traffic scenarios, with 7 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available at https://github.com/XITASO/r-livit.
PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
Clinton Ansun Mo
The University of Sydney
Kun Hu
Edith Cowan University
Chengjiang Long
Meta Reality Labs
Dong Yuan
The University of Sydney
Wan-Chi Siu
Hong Kong Polytechnic University
Zhiyong Wang
The University of Sydney
Abstract
Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and avoid the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture. The code is available at: https://github.com/MiniEval/PUMPS. Figure 1. Overview of PUMPS pre-training, zero-shot evaluation, and fine-tuning pipelines. PUMPS consists of an auto-encoder (encoder-decoder modules) and latent synthesis component, which are pre-trained successively.
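Linear assignment-based point pairing, mentioned above as the way the TPC reconstruction is optimised, can be illustrated by Hungarian matching between predicted and target points followed by averaging the matched distances. The Euclidean cost and the per-frame formulation are assumptions made for this sketch, not the authors' exact training loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def paired_reconstruction_loss(pred, target):
    """Pair reconstructed points with target points via linear assignment
    (Hungarian matching) and average the matched distances.
    pred, target: (N, 3) arrays for a single frame."""
    cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

loss = paired_reconstruction_loss(np.random.rand(128, 3), np.random.rand(128, 3))
```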
TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
Mohammad Mohammadi
University of Toronto
Ziyi Wu
University of Toronto
Igor Gilitschenski
University of Toronto
Abstract
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pretraining largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pretraining framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about the long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
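A generic way to accumulate events into a pseudo-grayscale target, in the spirit described above, is a leaky integrator over event polarities, sketched below. The decay factor, the split into ten temporal slices, and the event tuple layout are assumptions; the paper's accumulation scheme is more elaborate.

```python
import numpy as np

def accumulate_events(events, height, width, decay=0.97):
    """Leaky integration of an event stream into pseudo-grayscale frames.
    Each event row is (x, y, t, polarity in {-1, +1}); `decay` controls how
    fast old activity fades, so each frame reflects long-term history."""
    frame = np.zeros((height, width), dtype=np.float32)
    frames = []
    for chunk in np.array_split(events, 10):   # 10 temporal slices
        frame *= decay                         # forget old activity
        for x, y, _, p in chunk:
            frame[int(y), int(x)] += p
        frames.append(frame.copy())
    return np.stack(frames)

evts = np.column_stack([np.random.randint(0, 320, 1000),
                        np.random.randint(0, 240, 1000),
                        np.sort(np.random.rand(1000)),
                        np.random.choice([-1, 1], 1000)])
video = accumulate_events(evts, height=240, width=320)
```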
DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
Munish Monga
Sony Research India
Vishal Chudasama
Sony Research India
Pankaj Wasnik
Sony Research India
Biplab Banerjee
Indian Institute of Technology, Bombay
Abstract
Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD), only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a single measure. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
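Plain task-arithmetic merging, which DuET builds on, amounts to adding scaled task vectors (fine-tuned minus base weights) back onto the base model. The sketch below shows only this baseline; the sign-conflict mitigation and Directional Consistency Loss from the paper are not included, and the scaling factor is a placeholder.

```python
import torch

def merge_task_vectors(base_state, finetuned_states, alpha=0.5):
    """Task-arithmetic merging: for each floating-point parameter, add the
    scaled sum of task vectors (fine-tuned minus base) onto the base weights.
    Integer buffers (e.g. counters) are simply kept from the base model."""
    merged = {}
    for k, v in base_state.items():
        if v.is_floating_point():
            delta = sum(s[k] - v for s in finetuned_states)
            merged[k] = v + alpha * delta
        else:
            merged[k] = v.clone()
    return merged

# Usage with any torch model (hypothetical checkpoints):
# merged = merge_task_vectors(base.state_dict(),
#                             [task_a.state_dict(), task_b.state_dict()])
# model.load_state_dict(merged)
```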
Selective Contrastive Learning for Weakly Supervised Affordance Grounding
WonJun Moon
Sungkyunkwan University
Hyun Seok Seong
Sungkyunkwan University
Jae-Pil Heo
Sungkyunkwan University
Abstract
Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method.
DIMO: Diverse 3D Motion Generation for Arbitrary Objects
Linzhan Mou
University of Pennsylvania
Jiahui Lei
University of Pennsylvania
Chen Wang
University of Pennsylvania
Lingjie Liu
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
Diff2I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior
Juncheng Mu
Tsinghua University
Chengwei Ren
Tsinghua University
Weixiang Zhang
Tsinghua University
Liang Pan
Shanghai AI Laboratory
Xiao-Ping Zhang
Shenzhen Ubiquitous Data Enabling Key Lab
Yue Gao
Tsinghua University
Abstract
Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff2I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and the PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff2I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark. Code will be available at https://github.com/mujc2021/Diff2I2P.
O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
Lorenzo Mur-Labadia
University of Zaragoza
Maria Santos-Villafranca
University of Zaragoza
Jesus Bermudez-Cameo
University of Zaragoza
Alejandro Perez-Yus
University of Zaragoza
Ruben Martinez-Cantin
University of Zaragoza
Jose J. Guerrero
University of Zaragoza
Abstract
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22% and +76% in Ego2Exo and Exo2Ego IoU against the official challenge baselines, and gains of +13% and +6% compared with the SOTA while using 1% of the training parameters.
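The mask-pooling step behind an object-level descriptor, i.e. averaging a dense feature map over each candidate mask, can be written compactly as below. The feature dimensionality and mask count are placeholders, and the context encoding, cross-attention, and contrastive loss of O-MaMa are not shown.

```python
import torch

def mask_pooled_descriptors(dense_features, masks):
    """Pool a dense feature map (C, H, W) over binary mask candidates
    (M, H, W) to obtain one descriptor per object mask, as in mask-matching
    pipelines that pool DINOv2-style features over SAM-style masks."""
    c, h, w = dense_features.shape
    feats = dense_features.reshape(c, -1)                    # (C, H*W)
    m = masks.reshape(masks.shape[0], -1).float()            # (M, H*W)
    pooled = m @ feats.t()                                   # (M, C) summed features
    return pooled / m.sum(dim=1, keepdim=True).clamp(min=1)  # mean per mask

desc = mask_pooled_descriptors(torch.randn(384, 64, 64),
                               torch.randint(0, 2, (10, 64, 64)))
```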
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
Nithin Gopalakrishnan Nair
Johns Hopkins University
Srinivas Kaza
Google
Xuan Luo
Google
Vishal M. Patel
Johns Hopkins University
Stephen Lombardi
Google
Jungyeon Park
Google
Abstract
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.
PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Hyeongjin Nam
Seoul National University
Donghwan Kim
Seoul National University
Gyeongsik Moon
Korea University
Kyoung Mu Lee
Seoul National University
Abstract
The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction.
Hierarchical 3D Scene Graphs Construction Outdoors
Jon Nyffeler
ETH Zürich
Federico Tombari
Google
Daniel Barath
ETH Zürich
Abstract
Understanding and structuring outdoor environments in 3D is critical for numerous applications, including robotics, urban planning, and autonomous navigation. In this work, we propose a pipeline to construct hierarchical 3D scene graphs from outdoor data, consisting of posed images and 3D reconstructions. Our approach systematically extracts and organizes objects and their subcomponents, enabling representations that span from entire buildings to their facades and individual windows. By leveraging geometric and semantic relationships, our method efficiently groups objects into meaningful hierarchies while ensuring robust spatial consistency. We integrate efficient feature extraction, hierarchical object merging, and relationship inference to generate structured scene graphs that capture both global and local dependencies. Our approach scales to large outdoor environments while maintaining efficiency, and we demonstrate its effectiveness on real-world datasets. We also demonstrate that these constructed outdoor scene graphs are beneficial for downstream applications, such as 3D scene alignment. The code is available on GitHub.
PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
Sakuya Ota
Institute of Science Tokyo
Qing Yu
LY Corporation
Kent Fujiwara
LY Corporation
Satoshi Ikehata
National Institute of Informatics (NII)
Ikuro Sato
Institute of Science Tokyo
Abstract
Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into semantically relevant pairwise interactions, and leverages pretrained two-person interaction diffusion models to incrementally compose group interactions. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.
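One example of the kind of physics-based penalty used during noise optimization is a simple penetration term that grows when the joints of two characters come closer than a minimum distance, as sketched below. The joint representation and the threshold are assumptions; PINO's actual penalty terms are richer than this.

```python
import torch

def penetration_penalty(joints_a, joints_b, min_dist=0.15):
    """Penalize pairs of characters whose joints come closer than `min_dist`
    metres, discouraging overlap/penetration during noise optimization.
    joints_a, joints_b: (J, 3) joint positions of two characters."""
    d = torch.cdist(joints_a, joints_b)      # (J, J) pairwise distances
    return torch.relu(min_dist - d).sum()

pen = penetration_penalty(torch.randn(22, 3), torch.randn(22, 3))
```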
Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding
Shuyi Ouyang
Zhejiang University
Ziwei Niu
Zhejiang University
Hongyi Wang
Zhejiang University
Yen-Wei Chen
Ritsumeikan University
Lanfen Lin
Zhejiang University
Abstract
Referring Visual Grounding (RVG) tasks revolve around utilizing vision-language interactions to incorporate object information from language expressions, thereby enabling targeted object detection or segmentation within images. Transformer-based methods have enabled effective interaction through attention mechanisms, achieving notable performance in RVG tasks. However, existing strategies for RVG, which involve direct interaction between visual and linguistic features, face three key challenges: (i) tendency to focus on a single target, (ii) insufficient control over linguistic noise, and (iii) high computational cost. To address these challenges, we propose a Region-aware Anchoring Mechanism (RaAM) that mediates vision-language interactions. In RaAM, region-aware anchors engage in alternating interactions with vision and language modalities, acting as indicators for object presence across different regions within the image. RaAM (i) directs attention to multiple target regions for better localization, (ii) reduces cross-modal redundancy by using anchors as buffers, and (iii) lowers time complexity. In addition, we design region- and pixel-level loss functions to enhance object presence assessment and edge precision. We evaluate our RaAM-RVG on four benchmark datasets and integrate RaAM into various models by replacing their interaction design. Results show that RaAM outperforms state-of-the-art methods with lower computational cost.
Self-Supervised Sparse Sensor Fusion for Long Range Perception
Edoardo Palladin
Torc Robotics
Samuel Brucker
Torc Robotics
Filippo Ghilotti
Torc Robotics
Praveen Narayanan
Torc Robotics
Mario Bijelic
Torc Robotics
Felix Heide
Princeton University
Abstract
Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows autonomy to be extended from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pretraining scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and, compared to existing methods, achieves a 26.6% improvement in mAP for object detection and a 30.5% decrease in Chamfer Distance for LiDAR forecasting at distances of up to 250 meters.
Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions
Yuwen Pan
University of Science and Technology of China
Rui Sun
University of Science and Technology of China
Wangkai Li
University of Science and Technology of China
Tianzhu Zhang
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Abstract
Semantic segmentation under adverse conditions is critical for reliable visual perception in challenging weather environments. These extreme scenarios introduce distortions, such as low contrast and reduced visibility, making traditional segmentation models struggle. The scarcity of labeled data in such conditions makes it difficult to train models directly for these environments. Unsupervised domain adaptation (UDA) has been proposed as a solution to transfer knowledge from labeled source domains (normal weather) to unlabeled target domains (adverse weather). However, existing methods face significant challenges, particularly due to weather unawareness and feature heterogeneity. Many models fail to account for the unique characteristics of different weather conditions, and the significant feature discrepancies between normal and adverse weather images hinder effective adaptation. In this paper, we propose a novel weather-aware aggregation and adaptation network that leverages characteristic knowledge to achieve weather homogenization and enhance scene perception. Specifically, we introduce amplitude prompt aggregation to capture essential characteristics from the Fourier frequency domain that are indicative of different weather conditions. Additionally, we employ weather heterogeneity adaptation to mitigate the inter-domain heterogeneity, thereby achieving feature homogenization across diverse environments. Extensive experimental results on multiple challenging benchmarks demonstrate that our method achieves consistent improvements for semantic segmentation under adverse conditions.
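The Fourier amplitude cue referred to above can be extracted in a few lines: the amplitude spectrum mostly carries style and degradation statistics (fog, rain, night), while the phase preserves structure, which is why amplitude is a natural signal for weather-aware prompting. The snippet shows only this extraction step, not the prompt aggregation or adaptation modules.

```python
import torch

def amplitude_spectrum(image):
    """Return the Fourier amplitude of an image batch (B, C, H, W); the
    amplitude summarizes global appearance statistics indicative of the
    capture conditions, independent of most scene structure."""
    freq = torch.fft.fft2(image, norm="ortho")
    return torch.abs(freq)

amp = amplitude_spectrum(torch.randn(2, 3, 256, 256))   # (2, 3, 256, 256)
```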
Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds
Weihong Pan
Zhejiang University
Xiaoyu Zhang
SenseTime Research
Hongjia Zhai
Zhejiang University
Xiaojun Xiang
SenseTime Research
Hanqing Jiang
SenseTime Research
Guofeng Zhang
Zhejiang University
Abstract
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis and real-time rendering. However, it heavily relies on high-quality initial sparse points from Structure-from-Motion (SfM), which often struggles in textureless regions, degrading the geometry and visual quality of 3DGS. To address this limitation, we propose a novel initialization pipeline, achieving high-fidelity reconstruction from dense image sequences without relying on SfM-derived point clouds. Specifically, we first propose an effective depth alignment method that aligns the estimated monocular depth with the depth rendered from an under-optimized coarse Gaussian model using an unbiased depth rasterization approach, and then ensembles them. After that, to efficiently process dense image sequences, we incorporate a progressive segmented initialization process to generate the initial points. Extensive experiments demonstrate the superiority of our method over previous approaches and its compatibility with other advanced 3D Gaussian models. Notably, our method outperforms the SfM-based method by a 14.4% improvement in LPIPS on the Mip-NeRF360 dataset and a 30.7% improvement on the Tanks and Temples dataset.
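Aligning a monocular depth map to a depth map rendered from a coarse model is commonly done with a least-squares scale-and-shift fit, sketched below. This shows only the generic affine alignment step; the unbiased depth rasterization and ensembling described in the abstract are not reproduced.

```python
import numpy as np

def align_monocular_depth(mono_depth, rendered_depth, valid_mask):
    """Least-squares scale-and-shift alignment of a monocular depth map to a
    rendered depth map: solve rendered ~= s * mono + t over valid pixels and
    apply the fitted affine transform to the full monocular map."""
    m = mono_depth[valid_mask].reshape(-1)
    r = rendered_depth[valid_mask].reshape(-1)
    A = np.stack([m, np.ones_like(m)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * mono_depth + t

mono = np.random.rand(480, 640) + 0.5
rendered = 2.0 * mono + 0.1               # synthetic example with known scale/shift
aligned = align_monocular_depth(mono, rendered, rendered > 0)
```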
LookOut: Real-World Humanoid Egocentric Navigation
Boxiao Pan
Stanford University
Adam W. Harley
Stanford University
Francis Engelmann
Stanford University
C. Karen Liu
Stanford University
Leonidas J. Guibas
Stanford University
Abstract
The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR/AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting/slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.
Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification
Zhiqi Pang
Harbin Institute of Technology
Chunyu Wang
Harbin Institute of Technology
Lingling Zhao
Harbin Institute of Technology
Junjie Wang
Nanjing Medical University
Abstract
Color variations, a key challenge in the unsupervised visible-infrared person re-identification (UVI-ReID) task, have garnered significant attention. While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model's robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, in this paper, we focus on improving the robustness of pseudo-labels to color variations through data augmentation and propose an augmented and softened matching (ASM) method. Specifically, we first develop the cross-modality augmented matching (CAM) module, which performs channel augmentation on visible images to generate augmented images. Then, based on the fusion of the visible-infrared and augmented-infrared centroid similarity matrices, CAM establishes cross-modality correspondences that are robust to color variations. To increase training stability, we design a soft-labels momentum update (SMU) strategy, which converts traditional one-hot labels into soft-labels through momentum updates, thus adapting to CAM. During the optimization phase, we introduce the cross-modality soft contrastive loss and cross-modality hard contrastive loss to promote modality-invariant learning from the perspectives of shared and diversified features, respectively. Extensive experimental results validate the effectiveness of the proposed method, showing that ASM not only outperforms state-of-the-art unsupervised methods but also competes with some supervised methods.
ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting
Sandro Papais
University of Toronto
Letian Wang
University of Toronto
Brian Cheong
University of Toronto
Steven L. Waslander
University of Toronto
Abstract
We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.
A Unified Framework for Motion Reasoning and Generation in Human Interaction
Jeongeun Park
Korea University
Sungjoon Choi
Korea University
Sangdoo Yun
Naver AI Lab
Abstract
Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these coordinated interactions. Furthermore, a unified and versatile model is required to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To tackle these problems, we introduce MoLaM, the Interactive Motion-LAnguage Model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies primarily focusing on uni-directional tasks (e.g., text-to-motion or motion-to-text), MoLaM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the lack of an appropriate dataset to address this challenge, we introduce Inter-MT2, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, spanning 153K interactive motion samples. Inter-MT2 covers diverse instructional scenarios including editing, question answering, and story generation, with interactive motions leveraging off-the-shelf large language models and motion diffusion models. We extensively evaluate the versatility of MoLaM across multiple interactive motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. Remarkably, MoLaM is the first model capable of effectively addressing all these tasks with a single unified framework, achieving competitive performance compared to task-specific methods.
Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model
Daehee Park
DGIST
Monu Surana
Qualcomm Research
Pranav Desai
Qualcomm Research
Ashish Mehta
Qualcomm Research
Reuben MV John
Qualcomm Research
Kuk-Jin Yoon
KAIST
Abstract
While data-driven trajectory prediction has enhanced the reliability of autonomous driving systems, it still struggles with rarely observed long-tail scenarios. Prior works addressed this by modifying model architectures, such as using hypernetworks. In contrast, we propose refining the training process to unlock each model's potential without altering its structure. We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. It actively identifies rare tail samples where the model fails and augments these samples with a controllable diffusion model during training. In our framework, generating scenarios that are diverse, realistic, and preserve tail-case characteristics is paramount. Accordingly, we design a tail-aware generation method that applies tailored diffusion guidance to generate trajectories that both capture rare behaviors and respect traffic rules. Unlike prior simulation methods focused solely on scenario diversity, GALTraj is the first to show how simulator-driven augmentation benefits long-tail learning in trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park
Purdue University
Can Cui
Purdue University
Yunsheng Ma
Purdue University
Ahmadreza Moradipari
Toyota InfoTech Labs
Rohit Gupta
Toyota InfoTech Labs
Kyungtae Han
Toyota InfoTech Labs
Ziran Wang
Purdue University
Abstract
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. NuPlanQA is available at our GitHub repository.
SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Chaesong Park
Seoul National University
Eunbin Seo
Hyundai Motor Group
Jihyeon Hwang
Seoul National University
Jongwoo Lim
Seoul National University
Abstract
In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics - Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy - which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
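The three evaluation metrics named above (MAE, RMSE, and threshold-based accuracy) follow their usual definitions from depth and surface estimation; a compact implementation is given below. The threshold values in the example are placeholders, not the paper's settings.

```python
import numpy as np

def height_metrics(pred, gt, thresholds=(0.1, 0.2, 0.5)):
    """Compute MAE, RMSE, and threshold-based accuracy (fraction of pixels
    whose absolute error falls below each threshold, in metres) between a
    predicted and a ground-truth heightmap."""
    err = np.abs(pred - gt)
    mae = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    acc = {f"delta<{t}": (err < t).mean() for t in thresholds}
    return mae, rmse, acc

mae, rmse, acc = height_metrics(np.random.rand(100, 100), np.random.rand(100, 100))
```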
SFUOD: Source-Free Unknown Object Detection
Keon-Hee Park
Kyung Hee University
Seun-An Choe
Kyung Hee University
Gyeong-Moon Park
Korea University
Abstract
Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose Source-Free Unknown Object Detection (SFUOD), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axes-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performances on SFUOD benchmarks, and extensive experiments validate its effectiveness. Our code is available at SFUOD.
Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
Seongmin Park
Hanyang University
Hyungmin Kim
Hanyang University
Sangwoo Kim
Hanyang University
Wonseok Jeon
Hyundai Motor Company
Juyoung Yang
Hyundai Motor Company
Byeongwook Jeon
Hyundai Motor Company
Yoonseon Oh
Hanyang University
Jungwook Choi
Hanyang University
Abstract
Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning (SQIL), which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, SQIL preserves decision fidelity under low-bit precision. We validate SQIL's generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5x speedup and 2.5x energy savings on an edge GPU with minimal accuracy loss. These results underline SQIL's potential for efficiently deploying large IL-based policy models on resource-limited devices.
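The selective loss-weighting idea can be illustrated as a behaviour-cloning loss whose per-sample weight grows with a saliency score, as sketched below. The weighting rule, the squared-error objective, and the boost factor are assumptions for illustration; the saliency scoring and the quantization-aware training loop are not shown.

```python
import torch

def saliency_weighted_bc_loss(pred_actions, expert_actions, saliency, boost=2.0):
    """Behaviour-cloning loss with extra weight on mission-critical states:
    `saliency` is a per-sample score in [0, 1], and higher-saliency samples
    receive a larger multiplier so the low-bit policy keeps fidelity where
    decisions matter most."""
    per_sample = ((pred_actions - expert_actions) ** 2).mean(dim=-1)
    weights = 1.0 + boost * saliency
    return (weights * per_sample).mean()

loss = saliency_weighted_bc_loss(torch.randn(32, 7), torch.randn(32, 7), torch.rand(32))
```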
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
Byeongjun Park
KAIST
Hyojun Go
EverEx
Hyelin Nam
EverEx
Byung-Hoon Kim
Yonsei University
Hyungjin Chung
EverEx
Changick Kim
KAIST
Abstract
Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Chaitanya Patel
Stanford University
Hiroki Nakamura
Panasonic Holdings Corporation
Yuta Kyuragi
Panasonic R&D Company of America
Kazuki Kozuka
Panasonic Holdings Corporation
Juan Carlos Niebles
Stanford University
Ehsan Adeli
Stanford University
Abstract
Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene representations. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
Revisiting Point Cloud Completion: Are We Ready For The Real-World?
Stuti Pathak
UAntwerp
Prashant Kumar
IIT Delhi
Dheeraj Baiju
BITS Pilani
Nicholus Mboga
GIM
Gunther Steenackers
UAntwerp
Rudi Penne
UAntwerp
Abstract
Point clouds acquired in constrained, challenging, uncontrolled, and multi-sensor real-world settings are noisy, incomplete, and non-uniformly sparse. This presents acute challenges for the vital task of point cloud completion. Using tools from Algebraic Topology and Persistent Homology (PH), we demonstrate that current benchmark object point clouds lack the rich topological features that are an integral part of point clouds captured in realistic environments. To facilitate research in this direction, we contribute the first real-world industrial dataset for point cloud completion, RealPC, a diverse, rich and varied set of point clouds. It consists of ∼40,000 pairs across 21 categories of industrial structures in railway establishments. Benchmark results on several strong baselines reveal that existing methods fail in real-world scenarios. We make a striking observation: unlike current datasets, RealPC consists of multiple 0- and 1-dimensional PH-based topological features. We prove that integrating these topological priors into existing works helps improve completion. We present how 0-dimensional PH priors extract the global topology of a complete shape in the form of a 3D skeleton and assist a model in generating topologically consistent complete shapes. Since computing Homology is expensive, we present a simple, yet effective Homology Sampler guided network, BOSHNet, that bypasses the Homology computation by sampling proxy backbones akin to 0-dim PH. These backbones provide similar benefits of 0-dim PH right from the start of the training, unlike similar methods where accurate backbones are obtained only during later phases of the training. The code is available at https://github.com/stutipathak5/Point-CloudCompletion.
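For readers unfamiliar with 0-dimensional persistent homology, the sketch below relies on the standard fact that, under the usual Vietoris-Rips convention, 0-dimensional features of a point cloud are all born at radius 0 and die at the edge lengths of its Euclidean minimum spanning tree; this is generic background, not the paper's code.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def zero_dim_persistence(points: np.ndarray) -> np.ndarray:
    dists = squareform(pdist(points))     # dense pairwise distance matrix
    mst = minimum_spanning_tree(dists)    # sparse matrix holding the MST edges
    deaths = np.sort(mst.data)            # radii at which components merge
    return deaths                         # the single infinite bar is omitted

pts = np.random.rand(500, 3)
print(zero_dim_persistence(pts)[:5])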
MistSense: Versatile Online Detection of Procedural and Execution Mistakes
Constantin Patsch
Technical University of Munich
Yuankai Wu
Technical University of Munich
Marsil Zakour
Technical University of Munich
Driton Salihu
Technical University of Munich
Eckehard Steinbach
Technical University of Munich
Abstract
Online mistake detection is crucial across various domains, ranging from industrial automation to educational applications, since continuous inference on a video stream allows mistakes to be corrected by the human operator soon after they are detected. While prior research mainly addresses procedural errors that often relate to temporal and ordering information, identifying a broader range of error types is essential for real-world implementation. In this work, we present MistSense, a versatile approach for online mistake identification that considers both procedural errors, which involve incorrect action sequences, and execution errors, such as motor inaccuracies or improper equipment use. Our method integrates RGB and hand pose features to capture fine-grained contextual cues in order to detect a mistake. By jointly modeling spatial and sequential aspects of human actions, our framework enables robust and adaptive error detection in dynamic environments. Once a mistake has been detected, we leverage a large language model (LLM) which provides an error explanation that gives the user further insights into why an action has been identified as a mistake. The evaluation on common mistake detection benchmarks shows the effectiveness of our approach.
D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
Wenjie Pei
Harbin Institute of Technology, Shenzhen
Qizhong Tan
Harbin Institute of Technology, Shenzhen
Guangming Lu
Harbin Institute of Technology, Shenzhen
Jiandong Tian
Shenyang Institute of Automation, Chinese Academy of Sciences
Jun Yu
Harbin Institute of Technology, Shenzhen
Abstract
Adapting pre-trained image models to the video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning in image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D2ST-Adapter). It features a lightweight design, low adaptation overhead and powerful spatio-temporal feature adaptation capabilities. D2ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features within the adapter, seamlessly integrating into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of the D2ST-Adapter, incorporating the specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be individually tailored for the two pathways with anisotropic sampling densities along the spatial and temporal domains in 3D spatio-temporal space, enabling disentangled encoding of spatial and temporal features while maintaining a lightweight design. Extensive experiments by instantiating our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition. Code is available at https://github.com/qizhongtan/D2ST-Adapter.
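A minimal, hedged sketch of a dual-pathway spatio-temporal adapter is given below; it uses plain factorized 3D convolutions for the spatial and temporal branches, whereas the actual D2ST-Adapter employs anisotropic deformable spatio-temporal attention. All shapes and names are illustrative.

import torch
import torch.nn as nn

class DualPathAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # spatial branch mixes H and W; temporal branch mixes T
        self.spatial = nn.Conv3d(bottleneck, bottleneck, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(bottleneck, bottleneck, (3, 1, 1), padding=(1, 0, 0))
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, T, H, W, C) tokens from a frozen image backbone
        z = self.act(self.down(x)).permute(0, 4, 1, 2, 3)   # (B, b, T, H, W)
        z = self.spatial(z) + self.temporal(z)               # disentangled pathways
        z = z.permute(0, 2, 3, 4, 1)                          # back to (B, T, H, W, b)
        return x + self.up(self.act(z))                       # residual adapter output

adapter = DualPathAdapter(dim=768)
out = adapter(torch.randn(2, 8, 14, 14, 768))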
Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Muleilan Pei
HKUST
Shaoshuai Shi
Voyager Research, Didi Chuxing
Xuesong Chen
Voyager Research, Didi Chuxing
Xu Liu
Zhuoyu Technology
Shaojie Shen
HKUST
Abstract
Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a 'First Reasoning, Then Forecasting' strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent's behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
Simone Alberto Peirone
Politecnico di Torino
Francesca Pistilli
Politecnico di Torino
Giuseppe Averta
Politecnico di Torino
Abstract
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in a zero-shot setting. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision. Project page: sapeirone.github.io/HiERO.
A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
Jizong Peng
dConstruct Robotics
Tze Ho Elden Tse
National University of Singapore
Kai Xu
National University of Singapore
Wenchao Gao
dConstruct Robotics
Angela Yao
National University of Singapore
Abstract
3D Gaussian Splatting (3DGS) is a powerful reconstruction technique; however, it requires initialization from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate this, we propose two optimization constraints that are conditioned on the sensitivity of each parameter group and restrict the search space of each parameter. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks. Project webpage: https://eldentse.github.io/contrainedoptimization-3dgs.
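The pose decomposition can be illustrated with a few lines of NumPy (purely illustrative; the numbers are made up): a camera-to-world pose is composed from a tightly constrained camera-to-(device-)center transform and a loosely constrained (device-)center-to-world transform, so each parameter group can be optimized within its own restricted search space.

import numpy as np

def se3(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    # build a 4x4 rigid transform from rotation R and translation t
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# camera rigidly mounted on the device: small, tightly constrained correction
cam_to_center = se3(np.eye(3), np.array([0.1, 0.0, 0.05]))
# device trajectory in the world: the loosely constrained part
center_to_world = se3(np.eye(3), np.array([2.0, 0.0, 1.2]))

cam_to_world = center_to_world @ cam_to_center   # composed camera pose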
On the Provable Importance of Gradients for Autonomous Language-Assisted Image Clustering
Bo Peng
University of Technology Sydney
Jie Lu
University of Technology Sydney
Guangquan Zhang
University of Technology Sydney
Zhen Fang
University of Technology Sydney
Abstract
This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as special cases. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
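A simplified sketch of gradient-norm scoring follows (the uniform target and KL term here stand in for the paper's predicted target distribution and cross-entropy; shapes and the decision rule are assumptions): the magnitude of the gradient back-propagated to a noun embedding is used to rank candidate nouns.

import torch
import torch.nn.functional as F

def gradnorm_score(image_feats: torch.Tensor, noun_feat: torch.Tensor) -> float:
    # image_feats: (N, D) image embeddings; noun_feat: (D,) text embedding
    noun_feat = noun_feat.clone().detach().requires_grad_(True)
    logits = image_feats @ noun_feat                 # (N,) image-noun similarities
    probs = torch.softmax(logits, dim=0)
    target = torch.full_like(probs, 1.0 / probs.numel())   # uniform stand-in target
    loss = F.kl_div(probs.log(), target, reduction="sum")
    (grad,) = torch.autograd.grad(loss, noun_feat)
    return grad.norm().item()                        # score used to rank nouns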
DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
Emery Pierson
LIX, Ecole Polytechnique
Lei Li
Technical University of Munich
Angela Dai
Technical University of Munich
Maks Ovsjanikov
LIX, Ecole Polytechnique
Abstract
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches to scenarios where the assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong
University of North Carolina at Charlotte
Muhammad Usama Saleem
University of North Carolina at Charlotte
Korrawe Karunratanakul
ETH Zürich
Pu Wang
University of North Carolina at Charlotte
Hongfei Xue
University of North Carolina at Charlotte
Chen Chen
University of Central Florida
Chuan Guo
Snap Inc.
Junli Cao
Snap Inc.
Jian Ren
Snap Inc.
Sergey Tulyakov
Snap Inc.
Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to combat the non-differentiable distribution sampling process encountered by the logits regularizer and logit optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by 77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
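Inference-time logit optimization can be sketched as below (decode_motion, the shapes, and the hyperparameters are hypothetical, not the released code): the predicted token logits are treated as free variables and refined by gradient descent so that the decoded motion matches the controlled joint positions.

import torch

def optimize_logits(logits, target_joints, joint_mask, decode_motion, steps=50, lr=0.1):
    # logits: predicted token logits; decode_motion: a differentiable decoder (assumed)
    logits = logits.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # a soft token distribution keeps the pipeline differentiable
        motion = decode_motion(torch.softmax(logits, dim=-1))
        loss = ((motion - target_joints)[joint_mask] ** 2).mean()
        loss.backward()
        opt.step()
    return logits.detach()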
SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
Maximilian Pittner
Bosch Mobility Solutions, Robert Bosch GmbH
Joel Janai
Bosch Mobility Solutions, Robert Bosch GmbH
Mario Faigle
Bosch Mobility Solutions, Robert Bosch GmbH
Alexandru Paul Condurache
Institute of Neuro- and Bioinformatics, University of Lübeck
Abstract
3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense bird's-eye-view (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which have the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experiments demonstrate the benefits of our contributions and show state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.
Long-Context State-Space Video World Models
Ryan Po
Stanford University
Yotam Nitzan
Adobe Research
Richard Zhang
Adobe Research
Berlin Chen
Princeton University
Tri Dao
Princeton University
Eli Shechtman
Adobe Research
Gordon Wetzstein
Stanford University
Xun Huang
Adobe Research
Abstract
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
Matteo Poggi
University of Bologna
Fabio Tosi
University of Bologna
Abstract
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art SEA-RAFT, as well as on the Spring and LayeredFlow datasets.
Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
Miroslav Purkrabek
Czech Technical University in Prague
Jiri Matas
Czech Technical University in Prague
Abstract
Human pose estimation methods work well on isolated people but struggle with multiple-bodies-in-proximity scenarios. Previous work has addressed this problem by conditioning pose estimation on detected bounding boxes or keypoints, but overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox-Mask-Pose (BMP) method uses three specialized models that improve each other's output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi-body scenes. MaskPose, a new mask-conditioned pose estimation model, is the best among top-down approaches on OCHuman. BBox-Mask-Pose pushes the SOTA on the OCHuman dataset in all three tasks: detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially good in scenes with large instance overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human-centered foundational models. Code and models are available on the project website: MiraPurkrabek.github.io/BBox-Mask-Pose/
COVTrack: Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion
Zekun Qian
College of Intelligence and Computing, Tianjin University
Ruize Han
Shenzhen University of Advanced Technology
Zhixiang Wang
College of Intelligence and Computing, Tianjin University
Junhui Hou
City University of Hong Kong
Wei Feng
College of Intelligence and Computing, Tianjin University
Abstract
Open-Vocabulary Multi-Object Tracking (OVMOT) aims to detect and track diverse object categories in videos, including both seen (base) and unseen (novel) categories. Current methods rely on appearance features from generated image pairs or utilize the discontinuous annotations of the video dataset (TAO) for training, primarily due to the lack of available continuous annotated video datasets for OVMOT. This limitation affects their effectiveness, since continuous target trajectories are necessary for robust tracker learning. In this work, we propose the CTAO dataset, which provides a continuous version of TAO, thereby constructing the first continuous annotated training dataset for OVMOT. This addresses the previous limitations in training data availability. Additionally, we introduce COVTrack, a unified framework that effectively integrates motion and semantic features with appearance features, in which the multi-cue feature aggregation strategy dynamically aggregates and balances these features, based on the confidence estimation from both intra-frame and inter-frame contexts. Our proposed framework significantly improves OVMOT performance, establishing COVTrack as a state-of-the-art solution on OVMOT benchmarks.
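A small, hedged sketch of confidence-weighted multi-cue association (cue names, weights, and shapes are illustrative, not the paper's implementation): appearance, motion, and semantic similarity matrices are blended according to per-cue confidences before Hungarian matching.

import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_and_match(app_sim, motion_sim, sem_sim, conf):
    # each *_sim: (num_tracks, num_dets); conf: per-cue confidences in [0, 1]
    w = np.array([conf["appearance"], conf["motion"], conf["semantic"]])
    w = w / (w.sum() + 1e-8)
    fused = w[0] * app_sim + w[1] * motion_sim + w[2] * sem_sim
    rows, cols = linear_sum_assignment(-fused)       # maximize total similarity
    return list(zip(rows, cols)), fused

matches, _ = fuse_and_match(np.random.rand(3, 4), np.random.rand(3, 4),
                            np.random.rand(3, 4),
                            {"appearance": 0.6, "motion": 0.3, "semantic": 0.5})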
PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
Kangan Qian
School of Vehicle and Mobility, Tsinghua University
Jinyu Miao
School of Vehicle and Mobility, Tsinghua University
Xinyu Jiao
School of Vehicle and Mobility, Tsinghua University
Ziang Luo
School of Vehicle and Mobility, Tsinghua University
Zheng Fu
School of Vehicle and Mobility, Tsinghua University
Yining Shi
School of Vehicle and Mobility, Tsinghua University
Yunlong Wang
School of Vehicle and Mobility, Tsinghua University
Kun Jiang
School of Vehicle and Mobility, Tsinghua University
Diange Yang
School of Vehicle and Mobility, Tsinghua University
Abstract
Reliable spatial and motion perception is essential for safe autonomous navigation. Recently, class-agnostic motion prediction on bird's-eye view (BEV) cell grids derived from LiDAR point clouds has gained significant attention. However, existing frameworks typically perform cell classification and motion prediction on a per-pixel basis, neglecting important motion field priors such as rigidity constraints, temporal consistency, and future interactions between agents. These limitations lead to degraded performance, particularly in sparse and distant regions. To address these challenges, we introduce PriorMotion, an innovative generative framework designed for class-agnostic motion prediction that integrates essential motion priors by modeling them as distributions within a structured latent space. Specifically, our method captures structured motion priors using raster-vector representations and employs a variational autoencoder with distinct dynamic and static components to learn future motion distributions in the latent space. Experiments on the nuScenes dataset demonstrate that PriorMotion outperforms state-of-the-art methods across both traditional metrics and our newly proposed evaluation criteria. Notably, we achieve improvements of approximately 15.24% in accuracy for fast-moving objects, a 3.59% increase in generalization, a reduction of 0.0163 in motion stability, and a 31.52% reduction in prediction errors in distant regions. Further validation on FMCW LiDAR sensors confirms the robustness of our approach.
VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking
Zekun Qian
College of Intelligence and Computing, Tianjin University
Ruize Han
Shenzhen University of Advanced Technology
Junhui Hou
City University of Hong Kong
Linqi Song
City University of Hong Kong
Wei Feng
College of Intelligence and Computing, Tianjin University
Abstract
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, not fully leveraging the video information. In this work, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video analysis standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object tracking (association). Experimental results underscore that VOVTrack establishes itself as a state-of-the-art solution for the open-vocabulary tracking task.
Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments
Liang Qin
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, China
Min Wang
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Peiwei Li
University of Science and Technology of China
Wengang Zhou
University of Science and Technology of China
Houqiang Li
University of Science and Technology of China
Abstract
Object Goal Navigation (ObjectNav) in unknown environments presents significant challenges, particularly in Open-Vocabulary Mobile Manipulation (OVMM), where robots must efficiently explore large spaces, locate small objects, and accurately position themselves for subsequent manipulation. Existing approaches struggle to meet these demands: rule-based methods offer structured exploration but lack adaptability, while reinforcement learning (RL)-based methods enhance adaptability but fail to ensure effective long-term navigation. Moreover, both approaches often overlook precise stopping positions, which are critical for successful manipulation. To address these challenges, we propose APRR (Active Perception meets Rule-guided RL), a two-phase framework, which designs a new rule-guided RL policy for the exploration phase and a novel active target perception policy for the last-mile navigation phase. Inspired by human search behavior, our rule-guided RL policy enables efficient and adaptive exploration by combining structured heuristics with learning-based decision-making. In the last-mile navigation phase, we introduce an RL-based policy enhanced with active target perception, allowing the robot to refine its position dynamically based on real-time detection feedback. Experimental results demonstrate that APRR improves the success rate by 13%, significantly outperforming existing methods. Furthermore, real-world experiments validate the practicality and effectiveness of APRR in real-world mobile manipulation scenarios, offering a robust and adaptable solution for precise object navigation. The code is available at https://github.com/qinliangql/APRR.
Learning on the Go: A Meta-learning Object Navigation Model
Xiaorong Qin
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinhang Song
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Sixian Zhang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinyao Yu
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinmiao Zhang
University of Chinese Academy of Sciences, Beijing
Shuqiang Jiang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Abstract
Object navigation tasks require an agent to locate a target object using visual observations in unseen environments, where unfamiliar layouts and novel object appearances can hinder navigation. Most existing methods lack the adaptability needed to handle these uncertainties, as their navigation models remain fixed during testing. In this paper, we address this challenge by examining object-conditioned trajectory distribution shifts in navigation caused by changes in environmental dynamics. We propose learning a central conditional distribution as a prior that approximates the specific distributions of diverse environments. To retain environment-specific information during navigation, we allow each environment-specific distribution to approximate this central distribution rather than relying on it directly. To implement this, we introduce a meta-learning mechanism that integrates with traditional navigation methods, offering tailored solutions for various types of navigation approaches. Our approach, Learning on the Go (LOG), enables agents to learn on the go, allowing for flexible, adaptive, real-time learning during navigation. Our theoretical analysis highlights the benefits of learning a central distribution for effective generalization across environments, and empirical results confirm the proposed method's effectiveness, demonstrating superior performance compared to existing approaches.
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Yiran Qin
Sun Yat-sen University
Li Kang
Shanghai Jiao Tong University
Xiufeng Song
Shanghai Jiao Tong University
Zhenfei Yin
Oxford
Xiaohong Liu
Shanghai Jiao Tong University
Xihui Liu
HKU
Ruimao Zhang
Sun Yat-sen University
Lei Bai
Shanghai Artificial Intelligence Laboratory
Abstract
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows
Xianglin Qiu
XJTLU
Xiaoyang Wang
Shanghai AI Laboratory
Zhen Zhang
iHorry
Jimin Xiao
XJTLU
Abstract
Weakly supervised semantic segmentation (WSSS) aims to generate dense labels using sparse annotations, such as image-level labels. Existing class activation map (CAM) generation methods have been able to locate rough objects. However, due to the limited information provided by image-level labels, the biased activation problem, including over-activation, becomes another key obstacle in WSSS. To rectify such biased activation, we attempt to mine pixel-level class feature distribution information from the entire dataset. Specifically, we propose to use normalizing flow to model the class feature distribution of all pixels across the entire dataset and design a Bias-Resilient WSSS framework based on Normalizing Flow (BRNF). Normalizing flow has the ability to map complex distributions to normal distributions. Building upon it, we design an additional Gaussian mixture classifier which classifies pixels from the perspective of feature distributions, providing supplementary information to the conventional MLP-based classifier. In addition, we use this distribution to sample low-bias features as positive anchors for contrastive learning, thereby encouraging feature optimization toward the correct low-bias direction. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be available at https://github.com/DpDark/BRNF.
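The Gaussian mixture classifier can be approximated by the following sketch (a simplification under assumed shapes, not the released code): pixel features mapped by the normalizing flow are scored by their log-likelihood under per-class Gaussians, and the scores are used as logits alongside the conventional classifier.

import torch

def gaussian_classifier_logits(z, class_means, class_logvars):
    # z: (N, D) latent features mapped by the normalizing flow
    # class_means, class_logvars: (K, D) per-class diagonal Gaussian parameters
    diff = z.unsqueeze(1) - class_means.unsqueeze(0)              # (N, K, D)
    log_prob = -0.5 * ((diff ** 2) / class_logvars.exp().unsqueeze(0)
                       + class_logvars.unsqueeze(0)).sum(-1)      # up to a constant
    return log_prob                                               # used as class logits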
Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models
Chang Qiu
Southeast University
Feipeng Da
Southeast University
Zilei Zhang
Southeast University
Abstract
The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning the model for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, the use of pre-training methods on point cloud data can also enhance the working performance and generalization ability of the model. Therefore, in this paper, we propose a pre-training framework based on a diffusion model, called PreDifPoint. It is able to accomplish the pre-training of the model's backbone network through a diffusion process of gradual denoising. We aggregate the potential features extracted from the backbone network, input them as conditions into the subsequent diffusion model, and direct the point-to-point mapping relationship of the noisy point clouds at neighboring time steps, so as to generate high-quality point clouds and at the same time better perform various downstream tasks on the point clouds. We also introduce a bi-directional covariate attention (DXCA-Attention) mechanism for capturing complex feature interactions, fusing local and global features, and improving the detail recovery of point clouds. In addition, we propose a density-adaptive sampling strategy, which can help the model dynamically adjust the sampling strategy between different time steps, and guide the model to pay more attention to the denser regions in the point cloud, thus improving the effectiveness of the model in point cloud recovery. Our PreDifPoint framework achieves more competitive results on various real-world datasets. Specifically, PreDifPoint achieves an overall accuracy of 87.96%, which is 0.35% higher than PointDif, on the classification task on PB-T50-RS, a variant of the ScanObjectNN dataset.
LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds
Lingteng Qiu
Tongyi Lab, Alibaba Group
Xiaodong Gu
Tongyi Lab, Alibaba Group
Peihao Li
Tongyi Lab, Alibaba Group
Qi Zuo
Tongyi Lab, Alibaba Group
Weichao Shen
Tongyi Lab, Alibaba Group
Junfei Zhang
Tongyi Lab, Alibaba Group
Kejie Qiu
Tongyi Lab, Alibaba Group
Weihao Yuan
Tongyi Lab, Alibaba Group
Guanying Chen
Tongyi Lab, Alibaba Group
Zilong Dong
Tongyi Lab, Alibaba Group
Liefeng Bo
Tongyi Lab, Alibaba Group
Abstract
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feedforward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for faces and hands, outperforming existing methods in both reconstruction accuracy and generalization ability. Our code is available at https://github.com/aigc3d/LHM
Multi-View 3D Point Tracking
Frano Rajič
ETH Zürich
Haofei Xu
ETH Zürich
Marko Mihajlovic
ETH Zürich
Siyuan Li
ETH Zürich
Irem Demir
ETH Zürich
Emircan Gündoğdu
ETH Zürich
Lei Ke
Carnegie Mellon University
Sergey Prokudin
ETH Zürich
Marc Pollefeys
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feedforward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks, Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page: https://ethz-vlg.github.io/mvtracker.
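The k-nearest-neighbors correlation step might look roughly like the following sketch (shapes and names are assumptions, not the released code): for each query point, features of its k nearest neighbors in the fused point cloud are correlated against the query descriptor.

import numpy as np
from scipy.spatial import cKDTree

def knn_correlation(points, feats, queries, query_feats, k=16):
    # points: (N, 3), feats: (N, C); queries: (M, 3), query_feats: (M, C)
    tree = cKDTree(points)
    _, idx = tree.query(queries, k=k)                    # (M, k) neighbor indices
    neigh = feats[idx]                                   # (M, k, C) neighbor features
    corr = np.einsum("mkc,mc->mk", neigh, query_feats)   # (M, k) correlation values
    return corr, idx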
AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction
Bin Rao
State Key Laboratory of Internet of Things for Smart City, University of Macau
Haicheng Liao
State Key Laboratory of Internet of Things for Smart City, University of Macau
Yanchen Guan
State Key Laboratory of Internet of Things for Smart City, University of Macau
Chengyue Wang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Bonan Wang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Jiaxun Zhang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Zhenning Li
State Key Laboratory of Internet of Things for Smart City, University of Macau
Abstract
Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model's prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model's ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH/UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.
Beyond Perspective: Neural 360-Degree Video Compression
Andy Regensky
Friedrich-Alexander-Universität Erlangen-Nürnberg
Marc Windsheimer
Friedrich-Alexander-Universität Erlangen-Nürnberg
Fabian Brand
Friedrich-Alexander-Universität Erlangen-Nürnberg
André Kaup
Friedrich-Alexander-Universität Erlangen-Nürnberg
Abstract
Neural video codecs (NVCs) have seen fast-paced advancement in recent years and already perform close to state-of-the-art traditional video codecs like H.266/VVC. However, NVC investigations have so far focused on improving performance for classical perspective video, leaving the increasingly important 360-degree video format unexplored. In this paper, we address this issue and present how existing NVCs can be optimized for 360-degree video while also improving performance on perspective video. As no suitable datasets for neural 360-degree video compression exist, we publish a large-scale 360-degree video dataset consisting of more than 6000 user-generated 9-frame sequences with resolutions ranging from 0.5K to 8K. We propose a novel method for training data augmentation exploiting the spherical characteristics of 360-degree video, which proves crucial for achieving maximum compression performance. An additional positional feature encoding further supports the NVC in dynamic bitrate allocation, notably improving the performance for both 360-degree and perspective video. Overall, we achieve rate savings of almost 8% for 360-degree video and more than 3% for perspective video with minimal complexity overhead. The dataset is available at: https://huggingface.co/datasets/FAULMS/UGC360. Source code and pre-trained model weights are available at: https://github.com/FAU-LMS/NVC360.
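One spherical augmentation that is exactly valid for equirectangular frames is a random rotation about the vertical axis, which reduces to a circular shift along the width dimension; the sketch below shows this single case, and the paper's full augmentation scheme may differ.

import numpy as np

def random_yaw_roll(frames: np.ndarray, rng=np.random) -> np.ndarray:
    # frames: (T, H, W, 3) equirectangular video clip; a yaw rotation of the
    # sphere corresponds to a circular shift along the width axis, so no
    # resampling artifacts are introduced
    shift = rng.randint(0, frames.shape[2])
    return np.roll(frames, shift, axis=2)

clip = np.zeros((9, 512, 1024, 3), dtype=np.uint8)
augmented = random_yaw_roll(clip)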
GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination
Chengwei Ren
Tsinghua University
Fan Zhang
Shanghai AI Laboratory
Liangchao Xu
Nanjing University
Liang Pan
Shanghai AI Laboratory
Ziwei Liu
Nanyang Technological University
Wenping Wang
Texas A&M University
Xiao-Ping Zhang
Tsinghua University
Yuan Liu
The Hong Kong University of Science and Technology
Abstract
3D Gaussian Splatting (3DGS) is a prevailing technique to reconstruct large-scale 3D scenes from multiview images for novel view synthesis, like a room, a block, and even a city. Such large-scale scenes are not static, with changes constantly happening in these scenes, like a new building being built or a new decoration being set up. To keep the reconstructed 3D Gaussian fields up-to-date, a naive way is to reconstruct the whole scene after it changes, which is extremely costly and inefficient. In this paper, we propose a new method called GauUpdate that allows partially updating an old 3D Gaussian field with new objects from a new 3D Gaussian field. However, simply inserting the new objects leads to inconsistent appearances because the old and new Gaussian fields may have different lighting environments from each other. GauUpdate addresses this problem by applying inverse rendering techniques in 3DGS to recover both the materials and environmental lights. Based on the materials and lighting, we relight the new objects in the old 3D Gaussian field for consistent global illumination. For accurate estimation of the materials and lighting, we impose the additional constraint that the two fields share the same materials but have different environment lights, which improves the quality of both estimates. We conduct experiments on both synthetic scenes and real-world scenes, which demonstrate that GauUpdate achieves realistic object insertion in 3D Gaussian fields with consistent appearances.
Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
Guangyu Ren
Xi'an Jiaotong-Liverpool University
Hengyan Liu
Xi'an Jiaotong-Liverpool University
Michalis Lazarou
Imperial College London
Tania Stathaki
Imperial College London
Abstract
Camouflaged scenes, where objects blend seamlessly into their environments, pose significant challenges to both human observers and computer vision systems. To address this, we propose a novel framework that leverages off-the-shelf foundation models to generate multi-modal prompts for the Segment Anything Model (SAM), thus eliminating the need for manual prompts and significantly improving overall performance on this downstream task. First, we generate an image caption using the BLIP model and obtain its text embedding through the use of a text encoder. We then generate a visual embedding through the vision encoder of the BLIP model and use both as inputs to SAM to provide additional semantic information about the image. Finally, we propose two architectural novelties: a) we effectively integrate the multi-modal information in SAM through a multi-level adapter and b) we replace the dense embedding of SAM with the image embedding of its image encoder. Our method achieves new state-of-the-art performance in 11 out of 12 metrics on three benchmark datasets for camouflaged detection. Additionally, our method can be successfully adapted to other tasks such as medical image segmentation, performing on par with or even outperforming state-of-the-art methods. Our code is available at https://github.com/icqialanqian/Vision-Language-SAM.
Neural Compression for 3D Geometry Sets
Siyu Ren
City University of Hong Kong
Junhui Hou
City University of Hong Kong
Weiyao Lin
Shanghai Jiao Tong University
Wenping Wang
Texas A&M University
Abstract
We present NeCGS, the first neural compression paradigm, which can compress a geometry set encompassing thousands of detailed and diverse 3D mesh models by up to 900 times with high accuracy and preservation of detailed geometric structures. Specifically, we first propose TSDF-Def, a new implicit representation that is capable of accurately representing irregular 3D mesh models with various structures into regular 4D tensors of uniform and compact size, where 3D surfaces can be extracted through the deformable marching cubes. Then we construct a quantization-aware auto-decoder network architecture to regress these 4D tensors to explore the local geometric similarity within each shape and across different shapes for redundancy removal, resulting in more compact representations, including an embedded feature of a smaller size associated with each 3D model and a network parameter shared by all models. We finally encode the resulting features and network parameters into bitstreams through entropy coding. Besides, our NeCGS can handle the dynamic scenario well, where new 3D models are constantly added to a compressed set. Extensive experiments and ablation studies demonstrate the significant advantages of our NeCGS over state-of-the-art methods both quantitatively and qualitatively. The source code is publicly available at https://github.com/rsy6318/NeCGS.
Seeing the Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation
Peng Ren
College of Computer Science and Technology, Jilin University
Tian Bai
College of Computer Science and Technology, Jilin University
Jing Sun
School of Information and Communication Engineering, Dalian Minzu University
Fuming Sun
School of Information and Communication Engineering, Dalian Minzu University
Abstract
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects of any category based on text descriptions. Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate the above limitation, we propose a framework for OVCOS, named SuCLIP. Specifically, we design a context-aware prompt scheme that leverages the internal knowledge of the CLIP visual encoder to enrich the text prompt and align it with local visual features, thereby enhancing the text prompt. To better align the visual semantic space and the text semantic space, we design a class-aware feature selection module to dynamically adjust text and visual embeddings, making them better matched to camouflaged objects. Meanwhile, we introduce a semantic consistency loss to mitigate the semantic deviation between the text prompt and visual features, ensuring semantic consistency between the segmentation results and the text prompt. Finally, we design a text query decoder that precisely maps textual semantics to pixel-level segmentation results, thereby achieving semantically and spatially consistent decoding. Experimental results show that SuCLIP significantly outperforms the advanced method OVCoser on the OVCamo dataset.
TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion
Ziyang Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ping Wei
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Shangqi Deng
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Haowen Tang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Jiapeng Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Huan Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Pedestrian trajectory prediction is crucial for many intelligent tasks. While existing methods predict future trajectories from fixed-frame historical observations, they are limited by the observational perspective and the need for extensive historical information, resulting in prediction delays and inflexible generalization in real-time systems. In this paper, we propose a novel task called Transferable Online Pedestrian Trajectory Prediction (TOTP), which synchronously predicts future trajectories with variable observations and enables effective task transfer under different observation constraints. To advance TOTP modeling, we propose a Temporal-Adaptive Mamba Latent Diffusion (TAMLD) model. It utilizes the Social-Implicit Mamba Synthesizer to extract motion states with social interaction and refine temporal representations through Temporal-Aware Distillation. A Trend-Conditional Mamba Decomposer generates the motion latent distribution of the future motion trends and predicts future motion trajectories through sampling decomposition. We utilize Motion-Latent Mamba Diffusion to reconstruct the latent space disturbed by imbalanced temporal noise. Our method achieves state-of-the-art results on multiple datasets and tasks, showcasing temporal adaptability and generalization ability.
Fast Globally Optimal and Geometrically Consistent 3D Shape Matching
Paul Roetzer
University of Bonn
Florian Bernard
University of Bonn
Abstract
Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic graphs, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results. Our code is publicly available at https://github.com/paul0noah/geco.
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Korea University
Hwanhee Jung
Korea University
Jong Wook Kim
Korea University
Seunggwan Lee
Korea University
Innfarn Yoo
CNAPS.AI Inc.
Andreas Lugmayr
Google
Seunggeun Chi
Purdue University
Karthik Ramani
Purdue University
Sangpil Kim
Korea University
Abstract
Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. Unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from single-view image features. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under monocular settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
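As a rough sketch of the kind of text-to-image cross-attention the abstract describes, the following module (names such as TextConditionedFusion are placeholders, not CATSplat's actual layers) lets per-pixel image tokens attend to text embeddings from a vision-language model and adds the result back residually.

```python
import torch
import torch.nn as nn

class TextConditionedFusion(nn.Module):
    """Illustrative cross-attention block: image tokens (queries) attend to
    text embeddings (keys/values). Names and sizes are placeholders."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (B, H*W, C), text_tokens: (B, T, C)
        fused, _ = self.attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + fused)     # residual + norm

B, HW, T, C = 2, 1024, 32, 256
out = TextConditionedFusion()(torch.randn(B, HW, C), torch.randn(B, T, C))
print(out.shape)  # torch.Size([2, 1024, 256])
```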
HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Sara Rojas
KAUST
Matthieu Armando
NAVER LABS Europe
Bernard Ghanem
KAUST
Philippe Weinzaepfel
NAVER LABS Europe
Vincent Leroy
NAVER LABS Europe
Grégory Rogez
NAVER LABS Europe
Abstract
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feedforward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Meng Lan
Hong Kong University of Science and Technology
Qian Zhang
Horizon Robotics
Lefei Zhang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings along with multimodal class tokens. A mask prior generator is devised to utilize the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts, along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we propose a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of the proposed modules. The code is available at https://github.com/rongfu-dsb/MPG-SAM2.
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
Ciyu Ruan
Shenzhen International Graduate School, Tsinghua University
Ruishan Guo
Shenzhen International Graduate School, Tsinghua University
Zihang Gong
Harbin Institute of Technology
Jingao Xu
Carnegie Mellon University
Wenhan Yang
Pengcheng Laboratory
Xinlei Chen
Shenzhen International Graduate School, Tsinghua University
Abstract
Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions. Code and dataset: https://github.com/softword-tt/PRE-Mamba.
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Danila Rukhovich
SnT, University of Luxembourg
Elona Dupont
SnT, University of Luxembourg
Dimitrios Mallis
SnT, University of Luxembourg
Kseniya Cherenkova
Artec3D, Luxembourg
Anis Kacem
SnT, University of Luxembourg
Djamila Aouada
SnT, University of Luxembourg
Abstract
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and training dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained on a procedurally generated dataset of one million CAD sequences. CAD-Recode significantly outperforms existing methods across the DeepCAD, Fusion360, and real-world CC3D datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
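A hypothetical flavour of "CAD as Python code": the snippet below defines toy sketch and extrude helpers in plain numpy purely to illustrate how an executable program can stand in for a sketch-extrude sequence; it is not the library or code format CAD-Recode actually emits.

```python
import numpy as np

def sketch_rectangle(w, h, n=40):
    """Hypothetical sketch primitive: sample a closed rectangular profile in the XY plane."""
    xs = np.linspace(-w / 2, w / 2, n)
    ys = np.linspace(-h / 2, h / 2, n)
    top = np.stack([xs, np.full(n, h / 2)], axis=1)
    bottom = np.stack([xs[::-1], np.full(n, -h / 2)], axis=1)
    right = np.stack([np.full(n, w / 2), ys[::-1]], axis=1)
    left = np.stack([np.full(n, -w / 2), ys], axis=1)
    return np.concatenate([top, right, bottom, left])

def extrude(profile_2d, depth, steps=20):
    """Hypothetical extrude operation: sweep the 2D profile along +Z to get surface samples."""
    zs = np.linspace(0.0, depth, steps)
    return np.concatenate([np.c_[profile_2d, np.full(len(profile_2d), z)] for z in zs])

# "CAD as code": executing this program reconstructs surface samples of a plate-like solid.
points = extrude(sketch_rectangle(40.0, 20.0), depth=5.0)
print(points.shape)
```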
DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
Fatemeh Saleh
Microsoft, Cambridge
Sadegh Aliakbarian
Microsoft, Cambridge
Charlie Hewitt
Microsoft, Cambridge
Lohit Petikam
Microsoft, Cambridge
Xiao-Xian
Microsoft, Cambridge
Antonio Criminisi
Microsoft, Cambridge
Thomas J. Cashman
Microsoft, Cambridge
Tadas Baltrušaitis
Microsoft, Cambridge
Abstract
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates the accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAVi
MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
Mohammadreza Salehi
VIS Lab, UvA
Shashanka Venkataramanan
Valeo.ai
Ioana Simion
VIS Lab, UvA
Efstratios Gavves
VIS Lab, UvA
Cees G. M. Snoek
VIS Lab, UvA
Yuki M Asano
Fundamental AI Lab, UTN
Abstract
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve the state of the art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: github.com/SMSD75/MoSiC
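The optimal-transport clustering step can be approximated with a standard Sinkhorn-Knopp normalization of feature-to-prototype scores, as in the generic sketch below (hyperparameters and shapes are illustrative, not MoSiC's).

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp normalization of a feature-to-prototype score matrix,
    producing approximately doubly-stochastic soft cluster assignments."""
    Q = torch.exp(scores / eps)          # (N features, K prototypes)
    Q = Q / Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K   # equal mass per prototype
        Q = Q / Q.sum(dim=1, keepdim=True) / N   # equal mass per feature
    return Q * N                          # rows sum to ~1: soft assignment per feature

feats = torch.nn.functional.normalize(torch.randn(1024, 256), dim=1)
protos = torch.nn.functional.normalize(torch.randn(64, 256), dim=1)
assign = sinkhorn(feats @ protos.T)
print(assign.shape, assign.sum(dim=1)[:3])   # each row sums to ~1
```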
Correspondence-Free Fast and Robust Spherical Point Pattern Registration
Anik Sarker
Dept. of Mechanical Engineering, Virginia Tech
Alan T. Asbeck
Dept. of Mechanical Engineering, Virginia Tech
Abstract
Current methods to estimate the rotation between two spherical (S²) patterns typically rely on maximizing their spherical cross-correlation. However, these approaches exhibit computational complexities greater than cubic, O(n³), with respect to rotation space discretization. We propose a rotation estimation algorithm between two spherical patterns with linear time complexity O(n). Unlike existing methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., the Wahba problem for 3D unit vectors). We introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the S² domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the 'Robust Vector Alignment Dataset.' Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images. In the PCR task, our approach successfully registers point clouds exhibiting overlap ratios as low as 65%. In spherical image alignment, we show that our method robustly estimates rotations even under challenging conditions involving substantial clutter (over 19%) and large rotational offsets. Our results highlight the effectiveness and robustness of our algorithms in realistic, complex scenarios. Our dataset and code are available at: https://github.com/ARLab-VT/Robust-VectorSet-Alignment
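For context, the classical correspondence-based Wahba problem has a closed-form SVD solution; the sketch below implements that baseline in numpy (the paper's SPMC/FRS algorithms address the harder correspondence-free, outlier-contaminated setting).

```python
import numpy as np

def wahba_svd(a, b, w=None):
    """Classic SVD solution to the Wahba problem: find R minimizing
    sum_i w_i * ||b_i - R a_i||^2 for corresponding unit vectors a_i, b_i."""
    if w is None:
        w = np.ones(len(a))
    B = (w[:, None] * b).T @ a               # 3x3 attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt   # proper rotation (det = +1)

# toy check: recover a known rotation from noisy unit vectors
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 3)); a /= np.linalg.norm(a, axis=1, keepdims=True)
angle = np.deg2rad(30)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
b = a @ R_true.T + 0.01 * rng.normal(size=a.shape)
b /= np.linalg.norm(b, axis=1, keepdims=True)
print(np.allclose(wahba_svd(a, b), R_true, atol=1e-2))
```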
Lidar Waveforms are Worth 40x128x33 Words
Dominik Scheuble
Mercedes-Benz AG
Hanno Holzhüter
MicroVision
Steven Peters
Torc Robotics
Mario Bijelic
Torc Robotics
Felix Heide
Torc Robotics
Abstract
Lidar has become crucial for autonomous driving, providing high-resolution 3D scans that are key for accurate scene understanding. To this end, lidar sensors measure the time-resolved full waveforms from the returning laser light, which a subsequent digital signal processor (DSP) converts to point clouds by identifying peaks in the waveform. Conventional automotive lidar DSPs process each waveform individually, ignoring potentially valuable context from neighboring waveforms. As a result, lidar point clouds are prone to artifacts from low signal-to-noise ratio (SNR) regions, highly reflective objects, and environmental conditions like fog. While leveraging neighboring waveforms is investigated extensively in transient imaging, applications remain limited to scientific or experimental hardware. In this work, we propose a learned DSP that directly processes full waveforms using a transformer architecture, leveraging features from adjacent waveforms to generate high-fidelity multi-echo point clouds. To assess our method, we capture data in real-world driving scenarios and a weather chamber with a conventional automotive lidar. Trained on synthetic and real data, the method improves Chamfer distance by 32cm and 20cm compared to conventional peak finding and existing transient imaging approaches, respectively. This translates to maximum range improvements of up to 17m in fog and 14m in nominal real-world conditions.
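The conventional per-waveform DSP baseline mentioned above amounts to peak finding on each time-resolved waveform; a minimal sketch with scipy, using an assumed 1 ns bin width and a synthetic two-echo waveform, is shown below.

```python
import numpy as np
from scipy.signal import find_peaks

# Conventional per-waveform baseline: detect echo peaks in a single
# time-resolved waveform and convert time bins to range.
C = 3e8        # speed of light [m/s]
DT = 1e-9      # time bin width [s] (assumed 1 ns sampling)

t = np.arange(600)
waveform = (0.9 * np.exp(-0.5 * ((t - 120) / 4) ** 2)      # first echo
            + 0.4 * np.exp(-0.5 * ((t - 350) / 6) ** 2)    # second (multi-echo) return
            + 0.05 * np.random.default_rng(0).normal(size=t.size))

peaks, props = find_peaks(waveform, height=0.2, distance=10)
ranges_m = peaks * DT * C / 2.0        # two-way travel time -> distance
print(list(zip(peaks, np.round(ranges_m, 2))))
```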
Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation
Sebastian Schmidt
Technical University of Munich
Julius Körner
Technical University of Munich
Dominik Fuchsgruber
Technical University of Munich
Stefano Gasperini
Technical University of Munich
Federico Tombari
Technical University of Munich
Stephan Günnemann
Technical University of Munich
Abstract
In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects, enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, P2F demonstrates state-of-the-art performance across the board.
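As a minimal illustration of evidential uncertainty from a Beta prior (not the exact P2F head), per-pixel Beta parameters yield both a mask probability (the mean) and an uncertainty estimate (the variance):

```python
import numpy as np

# Per-pixel Beta(alpha, beta) over a binary mask assignment. High variance marks
# pixels with weak or conflicting evidence, usable for flagging novel/OOD regions.
alpha = np.array([[9.0, 1.2], [1.1, 4.0]])   # pseudo-counts for "in mask"
beta  = np.array([[1.0, 1.1], [1.0, 4.0]])   # pseudo-counts for "not in mask"

mean = alpha / (alpha + beta)                                     # mask probability
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))   # uncertainty

print(np.round(mean, 3))
print(np.round(var, 3))   # largest where the pseudo-counts are small or balanced
```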
SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Liam Schoneveld
Woven by Toyota
Zhe Chen
Woven by Toyota
Davide Davoli
Toyota Motor Europe NV/SA
Jiapeng Tang
Technical University of Munich
Saimon Terazawa
Woven by Toyota
Ko Nishino
Kyoto University
Matthias Nießner
Technical University of Munich
Abstract
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming the state of the art in emotion classification.
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu
Google
Marta Tintore Gazulla
Google
Yongqin Xian
Google
Luc Van Gool
INSAIT, Sofia University, St. Kliment Ohridski
Federico Tombari
Google
Abstract
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
Etai Sella
Tel Aviv University
Noam Atia
Tel Aviv University
Ron Mokady
BRIA AI
Hadar Averbuch-Elor
Cornell University
Abstract
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to also encourage identity preservation within the locally edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and adherence to the textual description.
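A heavily simplified sketch of the blending idea, with a stand-in denoising step and an invented noise schedule: at each inference step the edited sample is kept inside the edit mask, while the rest is replaced by a re-noised copy of the original shape so the unedited region retains its identity.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
alphas = np.linspace(0.999, 0.95, T)             # toy noise schedule (illustrative)
abar = np.cumprod(alphas)

original = rng.normal(size=(2048, 3))            # original point coordinates
mask = (original[:, 0] > 0.5).astype(float)[:, None]   # 1 = editable region

x = rng.normal(size=original.shape)              # edited sample starts from noise
for t in reversed(range(T)):
    # stand-in for the model's denoising step (real code would call the diffusion model here)
    x = x - 0.05 * x
    # re-noise the original shape to the current noise level t
    orig_t = np.sqrt(abar[t]) * original + np.sqrt(1 - abar[t]) * rng.normal(size=original.shape)
    # blend: keep the sample inside the edit mask, the re-noised original outside it
    x = mask * x + (1 - mask) * orig_t

print(np.abs(x - original)[mask[:, 0] == 0].mean())   # unedited region stays close to original
```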
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Minkyun Seo
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Hyungtae Lim
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology
Kanghee Lee
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Luca Carlone
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology
Jaesik Park
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Abstract
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
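Farthest point sampling, used here to bypass learned keypoint detectors, can be written in a few lines of numpy:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy farthest point sampling: iteratively pick the point farthest
    from the already selected set. Returns indices of k samples."""
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    selected = [rng.integers(n)]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)          # distance to nearest selected point
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

pts = np.random.default_rng(1).uniform(size=(5000, 3))
idx = farthest_point_sampling(pts, 256)
print(idx.shape)  # (256,)
```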
MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
Yunzhe Shao
School of Software and BNRist, Tsinghua University
Xinyu Yi
School of Software and BNRist, Tsinghua University
Lu Yin
School of Informatics, Xiamen University
Shihui Guo
School of Informatics, Xiamen University
Junhai Yong
School of Software and BNRist, Tsinghua University
Feng Xu
School of Software and BNRist, Tsinghua University
Abstract
This paper proposes a novel method, named MagShield, designed to address the issue of magnetic disturbances in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Units (IMUs) are prone to orientation estimation errors in magnetically disturbed environments, limiting the practical application of inertial MoCap systems in real-world scenarios. To address this problem, MagShield employs a 'detect-then-correct' strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems. Code and dataset are available at https://github.com/YZ-Shiao/MagShield.
DM-EFS: Dynamically Multiplexed Expanded Features Set Form for Robust and Efficient Small Object Detection
Aashish Sharma
KLASS Engineering and Solutions
Abstract
In this paper, we address the problem of small object detection (SOD) by introducing our novel approach - Dynamically Multiplexed Expanded Features Set (DM-EFS) form. Detecting small objects is challenging as they usually suffer from inadequate feature representation. Hence, to address this, we propose the Expanded Features Set (EFS) form - a simple yet effective idea to improve the feature representation of small objects by utilizing the untapped higher resolution features from the shallower layers of the backbone module. We observe that the EFS form improves the SOD performance. However, due to processing of additional features, it has a higher computational cost which reduces inference efficiency. Hence, to address this, we propose Dynamic Feature Multiplexing (DFM) - a novel design that optimizes the usage of the EFS form during inference by dynamically multiplexing it to create our aforementioned DM-EFS form. Since our DM-EFS form is a multiplexed (or subsampled) optimal version of the EFS form, it improves the SOD performance like the EFS form but with a lower computational cost. Extensive experiments confirm the efficacy of our DM-EFS approach. Integrated with the YOLOv7 base model, our DM-EFS achieves state-of-the-art results on diverse SOD datasets outperforming the base model and SOD baselines, with on-par or even better inference efficiency.
GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
David G. Shatwell
Center for Research in Computer Vision, University of Central Florida
Ishan Rajendrakumar Dave
Adobe
Sirnam Swetha
Center for Research in Computer Vision, University of Central Florida
Mubarak Shah
Center for Research in Computer Vision, University of Central Florida
Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
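A small sketch of soft targets from cyclic time differences (hours on a 24-cycle, months on a 12-cycle), loosely following the toroidal idea; the normalization and Gaussian kernel are illustrative choices, not GT-Loc's exact objective.

```python
import numpy as np

def toroidal_soft_targets(hours, months, sigma=1.0):
    """Soft similarity targets from pairwise time differences on a torus:
    hours wrap at 24, months wrap at 12 (illustrative sketch)."""
    def cyclic_diff(x, period):
        d = np.abs(x[:, None] - x[None, :])
        return np.minimum(d, period - d)          # wrap-around distance
    dh = cyclic_diff(hours, 24.0) / 12.0          # normalize to [0, 1]
    dm = cyclic_diff(months, 12.0) / 6.0
    dist = np.sqrt(dh ** 2 + dm ** 2)             # distance on the torus
    sim = np.exp(-dist ** 2 / (2 * sigma ** 2))   # soft target in (0, 1]
    return sim / sim.sum(axis=1, keepdims=True)   # rows as target distributions

targets = toroidal_soft_targets(np.array([0., 1., 12., 23.]),
                                np.array([1., 1., 7., 12.]))
print(np.round(targets, 3))
```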
STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries
Tahira Shehzadi
DFKI
Khurram Azeem Hashmi
DFKI
Shalini Sarode
DFKI
Didier Stricker
DFKI
Muhammad Zeshan Afzal
DFKI
Abstract
This paper addresses key limitations in current Semi-Supervised Object Detection (SSOD) frameworks, focusing on issues related to pseudo-label quality, confidence bias, and inefficient query generation. Traditional methods, including CNN-based and DETR-based architectures, often face challenges such as noisy pseudo-labels, overfitting to common object categories, and consequently face difficulty detecting rare objects. Specifically, recent DETR-based SSOD approaches struggle with the one-to-many assignment strategy, which produces noisy pseudo-labels and overlapping predictions, resulting in suboptimal performance. To address these challenges, we propose STEP-DETR, a transformer-based SSOD framework. STEP-DETR introduces Super Teacher to generate higher-quality pseudo-labels and improve the student's learning process. Furthermore, STEP-DETR proposes Pseudo-Label Text Queries, which incorporate text embeddings from Super Teacher, balancing the student's confidence across common and rare categories, thereby mitigating confidence bias and enhancing generalization. Moreover, Denoising Text Guided Object Queries synthesizes query-label pairs for foreground and background using contrastive learning, enabling the model to better distinguish objects from background noise. To further boost performance and training efficiency, a Query Refinement Module is incorporated to filter out redundant denoising queries. On MS-COCO and Pascal VOC benchmarks, STEP-DETR outperforms state-of-the-art methods, demonstrating its effectiveness in improving semi-supervised object detection. Notably, with just 10% labeled data, it achieves 45.4 mAP, surpassing the baseline Semi-DETR by 1.9 mAP.
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
Yi-Ting Shen
University of Maryland, College Park
Sungmin Eum
DEVCOM Army Research Laboratory
Doheon Lee
University of Maryland, College Park
Rohit Shete
University of Maryland, College Park
Chiao-Yi Wang
University of Maryland, College Park
Heesung Kwon
DEVCOM Army Research Laboratory
Shuvra S. Bhattacharyya
University of Maryland, College Park
Abstract
Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
BlinkTrack: Feature Tracking over 80 FPS via Events and Images
Yichen Shen
State Key Lab of CAD&CG, Zhejiang University
Yijin Li
State Key Lab of CAD&CG, Zhejiang University
Shuo Chen
State Key Lab of CAD&CG, Zhejiang University
Guanglin Li
State Key Lab of CAD&CG, Zhejiang University
Zhaoyang Huang
Avolution AI
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Zhaopeng Cui
State Key Lab of CAD&CG, Zhejiang University
Guofeng Zhang
State Key Lab of CAD&CG, Zhejiang University
Abstract
Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking and effectively solves data association and fusion for asynchronous event and image data. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing methods, exceeding 80 FPS with multi-modality data and 100 FPS with preprocessed event data. Codes and dataset are available at https://github.com/ColieShen/BlinkTrack.
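For reference, the classical constant-velocity Kalman filter that BlinkTrack turns into a learnable, differentiable module looks like this in its plain form (matrices and noise levels below are toy values):

```python
import numpy as np

# Constant-velocity Kalman filter for a 2D feature track; state = [x, y, vx, vy].
dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)          # observe position only
Q, R = 1e-3 * np.eye(4), 1e-1 * np.eye(2)                   # toy noise covariances

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.5]), np.array([2.1, 1.0]), np.array([2.9, 1.6])]:
    x, P = F @ x, F @ P @ F.T + Q                            # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                           # Kalman gain
    x = x + K @ (z - H @ x)                                  # update with measurement
    P = (np.eye(4) - K @ H) @ P
print(np.round(x, 2))   # fused position + velocity estimate
```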
Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision
Tianma Shen
Santa Clara University
Aditya Puranik
Santa Clara University
James Vong
Santa Clara University
Vrushabh Deogirikar
Santa Clara University
Ryan Fell
Santa Clara University
Julianna Dietrich
Santa Clara University
Maria Kyrarini
Santa Clara University
Christopher Kitts
Santa Clara University
David C. Jeong
Santa Clara University
Abstract
Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's firstperson perspective. Although pose estimation techniques have been used to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We address this gap with Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. Further, we augment egocentric camera data with a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms state-of-the-art 3D HMR models. Code and data are available on our website.
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Johns Hopkins University, Baltimore, MD, USA
Bohan Liu
Johns Hopkins University, Baltimore, MD, USA
Chenjia Li
Johns Hopkins University, Baltimore, MD, USA
Lalithkumar Seenivasan
Johns Hopkins University, Baltimore, MD, USA
Mathias Unberath
Johns Hopkins University, Baltimore, MD, USA
Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where, given an implicit query, an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as 'just-in-time' because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different reasoning chain complexities. Experimental results demonstrate that our method performs best across all reasoning categories, suggesting that our just-in-time digital twin can bridge the gap between high-level reasoning and low-level perception in embodied AI. Benchmark is available at https://github.com/yiqings/jitbench/.
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
Hongyu Shen
Beijing Institute of Technology
Junfeng Ni
Tsinghua University
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI
Weishuo Li
State Key Laboratory of General Artificial Intelligence, BIGAI
Mingtao Pei
Beijing Institute of Technology
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI
Abstract
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
Yu Sheng
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Xinran Zhang
University of Science and Technology of China
Yu Zhang
University of Science and Technology of China
Bei Hua
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Jianmin Ji
University of Science and Technology of China
Abstract
A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer
Qingyu Shi
PKU
Jianzong Wu
PKU
Jinbin Bai
NUS
Jiangning Zhang
ZJU
Lu Qi
UC Merced
Yunhai Tong
PKU-Wuhan Institute for Artificial Intelligence
Xiangtai Li
NTU
Abstract
The motion transfer task aims to transfer motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within the 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
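The temporal-smoothing idea can be sketched with a simple box kernel applied along the frame axis of DiT-style tokens; the low-pass output approximates appearance while the residual carries the temporal variation tied to motion (kernel choice and shapes here are illustrative, not DeT's exact design):

```python
import torch
import torch.nn.functional as F

def temporal_smooth(feats, kernel_size=5):
    """Smooth tokens along the temporal axis with a box kernel.
    feats: (B, T, N, C) = batch, frames, spatial tokens, channels."""
    B, T, N, C = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(B * N * C, 1, T)     # 1D conv over time
    k = torch.ones(1, 1, kernel_size) / kernel_size
    x = F.conv1d(x, k, padding=kernel_size // 2)
    smoothed = x.reshape(B, N, C, T).permute(0, 3, 1, 2)
    residual = feats - smoothed        # high-frequency part, closely tied to motion
    return smoothed, residual

smooth, motion = temporal_smooth(torch.randn(1, 16, 64, 128))
print(smooth.shape, motion.shape)
```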
DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
Chen Shi
The Chinese University of Hong Kong, Shenzhen
Shaoshuai Shi
Didi Chuxing, China
Kehua Sheng
Didi Chuxing, China
Bo Zhang
Didi Chuxing, China
Li Jiang
The Chinese University of Hong Kong, Shenzhen
Abstract
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
Junyu Shi
The Hong Kong University of Science and Technology (Guangzhou)
Lijiang Liu
The Hong Kong University of Science and Technology (Guangzhou)
Yong Sun
The Hong Kong University of Science and Technology (Guangzhou)
Zhiyuan Zhang
The Hong Kong University of Science and Technology (Guangzhou)
Jinni Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Qiang Nie
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM3), a comprehensive framework designed to learn unified motion representations. GenM3 comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment the result with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM3 achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform
Guowei Shi
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Zian Mao
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Peisen Huang
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Abstract
Ultra-precision estimation of 6DoF pose is essential in applications such as semiconductor manufacturing and nanoscale manipulation. Conventional vision-based techniques are often hampered by sensitivity to defocus and limited estimation accuracy. In this paper, we propose a novel two-dimensional interpolated Discrete Fourier Transform (2D-IpDFT) method for robust 6DoF pose estimation using periodic patterns. We further develop a mathematical framework that links image parameters (phase and frequency) to 6DoF pose, which is applicable to both orthographic and quasi-orthographic imaging systems. Extensive experiments on a low-cost setup, featuring an industrial camera and an etched checkerboard pattern, demonstrate translation estimation accuracy at the nanometer level and rotation estimation accuracy at the microradian level.
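A generic interpolated-DFT frequency estimator conveys the flavour of the approach: locate the FFT magnitude peak of the periodic pattern and refine it to sub-bin accuracy by parabolic interpolation (this is a textbook estimator, not necessarily the paper's exact 2D-IpDFT formulation):

```python
import numpy as np

def ipdft_peak_2d(img):
    """Estimate the dominant spatial frequency of a 2D periodic pattern with
    sub-bin accuracy: FFT peak plus parabolic interpolation of log-magnitude."""
    win = np.hanning(img.shape[0])[:, None] * np.hanning(img.shape[1])[None, :]
    S = np.abs(np.fft.rfft2((img - img.mean()) * win))
    S[0, 0] = 0.0
    ky, kx = np.unravel_index(np.argmax(S), S.shape)

    def refine(m_minus, m0, m_plus):
        l1, l2, l3 = np.log(m_minus + 1e-12), np.log(m0 + 1e-12), np.log(m_plus + 1e-12)
        return 0.5 * (l1 - l3) / (l1 - 2 * l2 + l3)   # sub-bin offset in (-0.5, 0.5)

    dy = refine(S[ky - 1, kx], S[ky, kx], S[ky + 1, kx])
    dx = refine(S[ky, kx - 1], S[ky, kx], S[ky, kx + 1])
    return (ky + dy) / img.shape[0], (kx + dx) / img.shape[1]   # cycles per pixel

y, x = np.mgrid[0:256, 0:256]
pattern = np.cos(2 * np.pi * (0.0743 * x + 0.0381 * y))
print(ipdft_peak_2d(pattern))   # approximately (0.0381, 0.0743)
```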
VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data
Jian Shi
KAUST
Peter Wonka
KAUST
Abstract
We present VoxelKP, a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data. The key challenge is that objects are distributed sparsely in 3D space, while human keypoint detection requires detailed local information wherever humans are present. First, we introduce a dual-branch fully sparse spatial-context block where the spatial branch focuses on learning the local spatial correlations between keypoints within each human instance, while the context branch aims to retain the global spatial information. Second, we use a spatially aware multi-scale BEV fusion technique to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view for better preservation of the global context of each human instance. We evaluate our method on the Waymo dataset and achieve an improvement of 27% on the MPJPE metric compared to the state-of-the-art, HUM3DIL, trained on the same data, and 12% against the state-of-the-art, GC-KPL, pretrained on a 25x larger dataset. To the best of our knowledge, VoxelKP is the first single-staged, fully sparse network that is specifically designed for addressing the challenging task of 3D keypoint estimation from LiDAR data, achieving state-of-the-art performance. Our code is available at https://github.com/shijianjian/VoxelKP.
Simultaneous Motion And Noise Estimation with Event Cameras
Shintaro Shiba
Keio University
Yoshimitsu Aoki
Keio University
Guillermo Gallego
Technische Universität Berlin
Abstract
Event cameras are emerging vision sensors whose noise is challenging to characterize. Existing denoising methods for event cameras are often designed in isolation and thus consider other tasks, such as motion estimation, separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the one-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while demonstrating effectiveness across motion estimation and intensity reconstruction tasks. Our approach advances event-data denoising theory and expands practical denoising use-cases via open-source code. Project page: https://github.com/tub-rip/ESMD
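The Contrast Maximization building block referenced above can be sketched in a few lines: warp events by a candidate velocity, accumulate an image of warped events, and score its variance; the true motion yields the sharpest (highest-contrast) image. The event data and grid search below are synthetic and purely illustrative.

```python
import numpy as np

def contrast(events, velocity, img_size=(64, 64)):
    """Contrast Maximization objective: warp events (x, y, t) by a candidate
    flow, accumulate an image of warped events, and score its variance."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    xw = np.clip(np.round(x - velocity[0] * t), 0, img_size[1] - 1).astype(int)
    yw = np.clip(np.round(y - velocity[1] * t), 0, img_size[0] - 1).astype(int)
    iwe = np.zeros(img_size)
    np.add.at(iwe, (yw, xw), 1.0)        # image of warped events
    return iwe.var()

# synthetic events: an edge moving at (vx, vy) = (20, 0) px/s
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 2000)
y0 = rng.uniform(0, 64, 2000)
events = np.stack([10 + 20 * t, y0, t], axis=1)

candidates = [(v, 0.0) for v in np.linspace(0, 40, 21)]
best = max(candidates, key=lambda v: contrast(events, v))
print(best)   # ~ (20.0, 0.0): the true motion maximizes contrast
```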
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
Kota Shimomura
Chubu University
Masaki Nambata
Elith Inc.
Atsuya Ishikawa
Honda R&D Co., Ltd.
Ryota Mimura
Honda R&D Co., Ltd.
Koki Inoue
Elith Inc.
Takayoshi Yamashita
Honda R&D Co., Ltd.
Takayuki Kawabuchi
Honda R&D Co., Ltd.
Abstract
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructure is designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce a baseline approach (the OD-RASE model), which leverages the LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection
Sangyun Shin
Department of Computer Science, University of Oxford
Yuhang He
Microsoft Research
Xinyu Hou
Department of Computer Science, University of Oxford
Samuel Hodgson
Department of Computer Science, University of Oxford
Andrew Markham
Department of Computer Science, University of Oxford
Niki Trigoni
Department of Computer Science, University of Oxford
Abstract
The robustness of 3D object detection in large-scale outdoor point clouds degrades significantly when deployed in an unseen environment due to domain shifts. To minimize the domain gap, existing works on domain adaptive detection focus on several factors, including point density, object shape and sizes, to reduce false negative detections. However, the adaptation results indicate that there are still remaining challenges. We argue that this is due to the difficulty of recognizing comparably less distinctive regions on object surfaces caused by sparsity, occlusion, etc. In this work, we aim to reinforce those features by generating points on the object surface to make them straightforwardly recognizable. We draw our motivation from a common observation that detection proposals already contain accurate bounding boxes, but with relatively low objectness score predictions, which lead to false negatives. Given these box proposals, we densify sparse object points with a diffusion approach. As a result, our model DiffRefine can act as a simple additional module before second-stage refinement, which most existing two-stage detection models can use. Experimental results on domain adaptive detection show competitive performance across various detection architectures, especially on points that vanish due to distance.
Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images
Changha Shin
Yonsei University
Woong Oh Cho
Yonsei University
Seon Joo Kim
Yonsei University
Abstract
360° visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360° images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings, even from imperfect images, and outperforms existing 360° rendering models.
AnimalClue: Recognizing Animals by their Traces
Risa Shinoda
The University of Osaka
Nakamasa Inoue
Institute of Science Tokyo
Iro Laina
Visual Geometry Group, University of Oxford
Christian Rupprecht
Visual Geometry Group, University of Oxford
Hirokatsu Kataoka
National Institute of Advanced Industrial Science and Technology (AIST)
Abstract
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/
Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency
Yejun Shou
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Haocheng Wang
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Lingfeng Shen
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Qian Zheng
College of Computer Science and Technology, Zhejiang University
Gang Pan
College of Computer Science and Technology, Zhejiang University
Yanlong Cao
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Abstract
Point cloud registration is a fundamental task in 3D vision, playing a crucial role in various fields. With the rapid advancement of RGB-D sensors, unsupervised point cloud registration methods based on RGB-D sequences have demonstrated excellent performance. However, existing methods struggle in scenes with low overlap and photometric inconsistency. Low overlap results in numerous correspondence outliers, while photometric inconsistency hinders the model's ability to extract discriminative features. To address these challenges, we first propose the Overlapping Constraint for Inliers Detection (OCID) module, which filters and optimizes the initial correspondence set using an overlapping constraint. This module robustly selects reliable correspondences within the overlapping region while maintaining a balance between accuracy and efficiency. Additionally, we introduce a novel scene representation, 3DGS, which integrates both geometric and texture information, making it particularly well-suited for RGB-D registration tasks. Building on this, we propose the Gaussian Rendering for Photometric Adaptation (GRPA) module, which refines the geometric transformation and enhances the model's adaptability to scenes with inconsistent photometric information. Extensive experiments on ScanNet and ScanNet1500 demonstrate that our method achieves state-of-the-art performance. The code will be released at OG-UPCR.
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
Xincheng Shuai
Fudan University
Henghui Ding
Fudan University
Zhenyuan Qin
Fudan University
Hao Luo
DAMO Academy, Alibaba group
Xingjun Ma
Fudan University
Dacheng Tao
Nanyang Technological University
Abstract
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and the camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.
You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Hao Si
The University of Tokyo
Ehsan Javanmardi
The University of Tokyo
Manabu Tsukada
The University of Tokyo
Abstract
Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle must store models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
Carter Sifferman
University of Wisconsin-Madison
Yiquan Li
University of Wisconsin-Madison
Yiming Li
University of Wisconsin-Madison
Fangzhou Mu
University of Wisconsin-Madison
Michael Gleicher
University of Wisconsin-Madison
Mohit Gupta
University of Wisconsin-Madison
Yin Li
University of Wisconsin-Madison
Abstract
We aim to recover the geometry of 3D parametric scenes using very few depth measurements from low-cost, commercially available time-of-flight sensors. These sensors offer very low spatial resolution (i.e., a single pixel), but image a wide field-of-view per pixel and capture detailed time-of-flight data in the form of time-resolved photon counts. This time-of-flight data encodes rich scene information and thus enables recovery of simple scenes from sparse measurements. We investigate the feasibility of using a distributed set of few measurements (e.g., as few as 15 pixels) to recover the geometry of simple parametric scenes with a strong prior, such as estimating the 6D pose of a known object. To achieve this, we design a method that utilizes both feed-forward prediction to infer scene parameters, and differentiable rendering within an analysis-by-synthesis framework to refine the scene parameter estimate. We develop hardware prototypes and demonstrate that our method effectively recovers object pose given an untextured 3D model in both simulations and controlled real-world captures, and show promising initial results for other parametric scenes. We additionally conduct experiments to explore the limits and capabilities of our imaging solution. Our project webpage is available at cpsiff.github.io/recovering parametric scenes
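The analysis-by-synthesis step above can be illustrated with a minimal, self-contained sketch (not the authors' pipeline): a toy differentiable transient renderer for a single plane whose depth is refined by gradient descent to match a measured histogram. The Gaussian pulse model, bin width, and all constants are illustrative assumptions.

```python
# Toy analysis-by-synthesis refinement for a single scene parameter (plane depth),
# illustrating only the "differentiable rendering + gradient refinement" idea.
# The Gaussian-pulse transient model and all constants are assumptions.
import torch

C = 3e8          # speed of light (m/s)
BIN = 1e-10      # 100 ps time bins (assumed)
N_BINS = 128
SIGMA = 3e-10    # assumed pulse width (s)

def render_transient(depth):
    """Differentiable toy renderer: photon counts vs. time for a plane at `depth`."""
    t = torch.arange(N_BINS) * BIN
    t_return = 2.0 * depth / C               # round-trip time
    return torch.exp(-0.5 * ((t - t_return) / SIGMA) ** 2)

# "Measured" histogram from an unknown depth, plus a coarse feed-forward guess.
with torch.no_grad():
    measured = render_transient(torch.tensor(1.30))
depth = torch.tensor(1.10, requires_grad=True)

opt = torch.optim.Adam([depth], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render_transient(depth), measured)
    loss.backward()
    opt.step()

print(f"refined depth: {depth.item():.3f} m")   # converges toward 1.30
```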
Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation
Andrea Simonelli
Meta Reality Labs Zürich
Norman Müller
Meta Reality Labs Zürich
Peter Kontschieder
Meta Reality Labs Zürich
Abstract
The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet [3], ScanNet++ [35], S3DIS [1], and KITTI-360 [17], and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting [12]. The project page is available here: https://simonelli-andrea.github.io/easy3d.
MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations
Jan Skvrna
Czech Technical University in Prague
Lukas Neumann
Czech Technical University in Prague
Abstract
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve accuracy comparable to using human labels from a single dataset. The source code and model are available at https://github.com/jskvrna/MonoSOWA.
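One common way to compensate for focal length differences when aggregating datasets is to rescale metric depth to a canonical focal length, so that equal apparent (pixel) size corresponds to equal canonical depth. The sketch below is a generic normalization under that assumption, not necessarily the paper's exact scheme; `to_canonical_depth` and `CANONICAL_FOCAL` are hypothetical names and values.

```python
import numpy as np

CANONICAL_FOCAL = 700.0  # assumed reference focal length in pixels

def to_canonical_depth(depth_m: np.ndarray, focal_px: float) -> np.ndarray:
    """Rescale metric depths so that data captured with different focal lengths
    becomes geometrically comparable (same pixel size <-> same canonical depth).
    Generic normalization for illustration, not necessarily the paper's scheme."""
    return depth_m * (CANONICAL_FOCAL / focal_px)

# Example: an object at 20 m appears larger through a 1200 px lens than a 700 px one;
# mapping both to the canonical focal length aligns their apparent scales.
print(to_canonical_depth(np.array([20.0]), focal_px=1200.0))  # ~11.7 m canonical
```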
DMesh++: An Efficient Differentiable Mesh for Complex Shapes
Sanghyun Son
University of Maryland
Matheus Gadelha
Adobe Research
Yang Zhou
Adobe Research
Matthew Fisher
Adobe Research
Zexiang Xu
Adobe Research
Yi-Ling Qiao
University of Maryland
Ming C. Lin
University of Maryland
Yi Zhou
Adobe Research
Abstract
Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method that addresses this challenge and efficiently handles meshes with intricate structures. Our method reduces time complexity from O(N) to O(log N) and requires significantly less memory than previous approaches. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images.
MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
Eunjin Son
Jeonbuk National University
HyungGi Jo
Jeonbuk National University
Wookyong Kwon
Electronics and Telecommunications Research Institute (ETRI)
Sang Jun Lee
Jeonbuk National University
Abstract
Omnidirectional stereo matching (OSM) estimates 360° depth by performing stereo matching on multi-view fisheye images. Existing methods assume a unimodal depth distribution, matching each pixel to a single object. However, this assumption constrains the sampling range, causing oversmoothed depth artifacts, especially at object boundaries. To address these limitations, we propose MDP-Omni, a novel OSM network that leverages parameter-free multimodal depth priors. Specifically, we design a sampling strategy that adaptively adjusts the sampling range based on a multimodal probability distribution, without introducing any additional parameters. Furthermore, we present the azimuth-based multi-view volume fusion module to build a single cost volume. It mitigates false matches caused by occlusions in warped multi-view volumes. Experimental results demonstrate that MDP-Omni significantly improves existing methods, particularly in capturing fine details.
CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
Rui Song
Fraunhofer IVI
Chenwei Liang
Fraunhofer IVI
Yan Xia
USTC
Walter Zimmer
TU Munich
Hu Cao
TU Munich
Holger Caesar
TU Delft
Andreas Festag
TH Ingolstadt
Alois Knoll
TU Munich
Abstract
Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.
OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
Yeon-Ji Song
Seoul National University
Jaein Kim
Seoul National University
Suhyung Choi
Seoul National University
Jin-Hwa Kim
NAVER AI Lab
Byoung-Tak Zhang
Seoul National University
Abstract
Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its potential applicability to vision-related dynamics learning tasks.
A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
Hang Su
ShanghaiTech University
Yunlong Feng
ShanghaiTech University
Daniel Gehrig
University of Pennsylvania
Panfeng Jiang
ShanghaiTech University
Ling Gao
Amap, Alibaba Group
Xavier Lagorce
ShanghaiTech University
Laurent Kneip
Shanghai Engineering Research Center of Intelligent Vision and Imaging
Abstract
Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the 5-point and 8-point algorithms. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views, each representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
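The computational core shared by linear n-point-style solvers is stacking linear constraints into a homogeneous system and reading the solution off the right singular vector with smallest singular value. The sketch below shows only that generic SVD recipe on synthetic constraint rows; the `solve_homogeneous` helper and the synthetic construction are illustrative stand-ins, not the paper's incidence relation.

```python
import numpy as np

def solve_homogeneous(A: np.ndarray) -> np.ndarray:
    """Return the unit vector x minimizing ||A x||: the SVD null-space recipe
    used by linear n-point style solvers (last right singular vector of A)."""
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

# Synthetic check: build constraint rows that all annihilate a known ground-truth
# vector (standing in for stacked incidence constraints from timestamped tracks).
rng = np.random.default_rng(0)
x_true = rng.standard_normal(6)
x_true /= np.linalg.norm(x_true)
B = rng.standard_normal((40, 6))
A = B - np.outer(B @ x_true, x_true)   # project rows orthogonal to x_true

x_est = solve_homogeneous(A)
if x_est @ x_true < 0:                 # solution defined up to sign
    x_est = -x_est
print(np.allclose(x_est, x_true, atol=1e-6))   # True
```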
Dense Policy: Bidirectional Autoregressive Learning of Actions
Yue Su
Shanghai Jiao Tong University
Xinyu Zhan
Shanghai Jiao Tong University
Hongjie Fang
Shanghai Jiao Tong University
Han Xue
Shanghai Jiao Tong University
Hao-Shu Fang
Shanghai Jiao Tong University
Yong-Lu Li
Shanghai Jiao Tong University
Cewu Lu
Shanghai Jiao Tong University
Lixin Yang
Shanghai Jiao Tong University
Abstract
Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our model, data, and code are available at: https://selen-suyue.github.io/DspNet/.
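The coarse-to-fine, logarithmic-time unfolding described above can be sketched as iteratively doubling the action sequence by interpolation followed by a lightweight refinement, so a horizon T needs about log2(T) rounds. The tiny MLP refiner, shapes, and class name below are assumptions for illustration, not the released model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDensePolicy(nn.Module):
    """Coarse-to-fine action decoding sketch: start from a single action token and
    double the sequence length each round, so horizon T needs ~log2(T) rounds."""
    def __init__(self, action_dim=7, hidden=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, first_action: torch.Tensor, horizon: int) -> torch.Tensor:
        seq = first_action.unsqueeze(1)                 # (B, 1, action_dim)
        for _ in range(math.ceil(math.log2(horizon))):
            # Double the temporal resolution by linear interpolation ...
            seq = F.interpolate(seq.transpose(1, 2), scale_factor=2,
                                mode="linear", align_corners=False).transpose(1, 2)
            # ... then refine every coarse action with a shared lightweight head.
            seq = seq + self.refine(seq)
        return seq[:, :horizon]                         # (B, horizon, action_dim)

policy = ToyDensePolicy()
actions = policy(torch.zeros(2, 7), horizon=16)
print(actions.shape)   # torch.Size([2, 16, 7])
```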
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Haisheng Su
Shanghai Jiao Tong University
Junjie Zhang
Xi'an Jiaotong University
Feixiang Song
SenseAuto Research
Sanping Zhou
Xi'an Jiaotong University
Wei Wu
SenseAuto Research
Junchi Yan
Shanghai Jiao Tong University
Nanning Zheng
Xi'an Jiaotong University
Abstract
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory, such as depth discontinuity at object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes
Mai Su
Peking University
Zhongtao Wang
Peking University
Huishan Au
Peking University
Yilong Li
Peking University
Xizhe Cao
Peking University
Chengwei Pan
Institute of Artificial Intelligence, BUAA
Yisong Chen
Peking University
Guoping Wang
Peking University
Abstract
3DGS is an emerging and increasingly popular technology in the field of novel view synthesis. Its highly realistic rendering quality and real-time rendering capabilities make it promising for various applications. However, when applied to large-scale aerial urban scenes, 3DGS methods suffer from issues such as excessive memory consumption, slow training times, prolonged partitioning processes, and significant degradation in rendering quality due to the increased data volume. To tackle these challenges, we introduce HUG, a novel approach that enhances data partitioning and reconstruction quality by leveraging a hierarchical neural Gaussian representation. We first propose a visibility-based data partitioning method that is simple yet highly efficient, significantly outperforming existing methods in speed. Then, we introduce a novel hierarchical weighted training approach, combined with other optimization strategies, to substantially improve reconstruction quality. Our method achieves state-of-the-art results on one synthetic dataset and four real-world datasets.
OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
Heng Su
Chongqing University
Mengying Xie
Chongqing University
Nieqing Cao
Xi'an Jiaotong-Liverpool University
Yan Ding
Shanghai AI Lab
Beichen Shao
Chongqing University
Xianlei Long
Chongqing University
Fuqiang Gu
Chongqing University
Chao Chen
Chongqing University
Abstract
In recent years, affordance detection has become essential for robotic manipulation in real-world scenes, where robots must autonomously interpret commands and perform actions. Current methods often focus on individual point cloud objects or simple semantic queries, limiting their effectiveness in diverse scenes and complex instructions. To address this, we introduce OVA-Fields, a framework for affordance detection in 3D scenes with complex semantics. By integrating multilevel geometric encoding and enhanced semantic affordance embeddings, OVA-Fields maps user commands directly to operational parts, embedding enriched affordance information into the 3D scene. Experimental results demonstrate that OVA-Fields achieves 52.4% mIoU on complex semantic real-world scenes and a 90% success rate in real-world robot manipulation tasks (e.g., 'take out some food from the refrigerator') using RGB-D sensing. Our approach enables the precise identification of operational parts, transforming natural language queries into targeted manipulations in real-world environments. Our codes are available at: https://github.com/vlasu19/OVA-Fields
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
Edgar Sucar
Visual Geometry Group (VGG), University of Oxford
Zihang Lai
Visual Geometry Group (VGG), University of Oxford
Eldar Insafutdinov
Visual Geometry Group (VGG), University of Oxford
Andrea Vedaldi
Visual Geometry Group (VGG), University of Oxford
Abstract
DUSt3R has recently demonstrated that many tasks in multiview geometry, including estimating camera intrinsics and extrinsics, reconstructing 3D scenes, and establishing image correspondences, can be reduced to predicting a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. While this formulation is elegant and powerful, it is limited to static scenes. To overcome this limitation, we introduce the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key insight is that, when time is introduced, several possible spatial and temporal references can be used to define the point maps. We identify a minimal subset of these combinations that can be regressed by a network to solve the aforementioned tasks. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks, including video depth prediction, dynamic point cloud reconstruction, 3D scene flow, and object pose tracking, achieving state-of-the-art performance. Additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-pointmaps/.
SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
Gencer Sumbul
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Chang Xu
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Emanuele Dalsasso
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Devis Tuia
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Abstract
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.
ARMO: Autoregressive Rigging for Multi-Category Objects
Mingze Sun
Tsinghua Shenzhen International Graduate School
Shiwei Mao
Tsinghua Shenzhen International Graduate School
Keyi Chen
Tsinghua Shenzhen International Graduate School
Yurun Chen
Tsinghua Shenzhen International Graduate School
Shunlin Lu
The Chinese University of Hong Kong, Shenzhen
Jingbo Wang
Shanghai AI Laboratory
Junting Dong
Shanghai AI Laboratory
Ruqi Huang
Tsinghua Shenzhen International Graduate School
Abstract
Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potential dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made available at https://armo-omnirig.github.io/.
AnnofreeOD: Detecting All Classes at Low Frame Rates Without Human Annotations
Boyi Sun
Institute of Automation, Chinese Academy of Sciences
Yuhang Liu
Institute of Automation, Chinese Academy of Sciences
Houxin He
Institute of Automation, Chinese Academy of Sciences
Yonglin Tian
Institute of Automation, Chinese Academy of Sciences
Fei-Yue Wang
Institute of Automation, Chinese Academy of Sciences
Abstract
Manual annotation of 3D bounding boxes in large-scale 3D scenes is expensive and time-consuming. This motivates the exploration of annotation-free 3D object detection using unlabeled point cloud data. Existing unsupervised 3D detection frameworks predominantly identify moving objects via scene flow, which has significant limitations: (1) limited detection classes (≤3), (2) difficulty in detecting stationary objects, and (3) reliance on high frame rates. To address these limitations, we propose AnnofreeOD, a novel Annotation-free Object Detection framework based on 2D-to-3D knowledge distillation. First, we explore an effective strategy to generate high-quality pseudo boxes using single-frame 2D knowledge. Second, we observe the noise from the previous step and introduce Noise-Resistant Regression (NRR) based on Box Augmentation (BA). AnnofreeOD achieves state-of-the-art performance across multiple experiments. On the nuScenes dataset, we established the first annotation-free 10-class object detection baseline, achieving 40% of fully supervised performance. Furthermore, in 3-class and class-agnostic object detection tasks, our approach surpasses prior state-of-the-art methods by +9.3% mAP (+12.2% NDS) and +6.0% AP (+4.1% NDS), significantly improving precision. Our codes will be released at https://github.com/sbysbysbys/AnnofreeAD.
Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations
Jianhua Sun
School of Artificial Intelligence, Shanghai Jiao Tong University
Yuxuan Li
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Jiude Wei
School of Artificial Intelligence, Shanghai Jiao Tong University
Longfei Xu
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Nange Wang
School of Artificial Intelligence, Shanghai Jiao Tong University
Yining Zhang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Cewu Lu
School of Artificial Intelligence, Shanghai Jiao Tong University
Abstract
The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose the Articulated Object Procedural Generation toolbox, a.k.a. the Arti-PG toolbox. The Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects' point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 categories of articulated objects and provides annotations across a wide range of both vision and manipulation tasks, and we provide exhaustive experiments which fully demonstrate its advantages. Our code is released at https://github.com/Analytic-Concept-Group/ArtiPG.
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Zeyi Sun
Shanghai Jiaotong University
Tong Wu
Stanford University
Pan Zhang
Shanghai Artificial Intelligence Laboratory
Yuhang Zang
Shanghai Artificial Intelligence Laboratory
Xiaoyi Dong
Shanghai Artificial Intelligence Laboratory
Yuanjun Xiong
Adobe
Dahua Lin
Shanghai Artificial Intelligence Laboratory
Jiaqi Wang
Shanghai Artificial Intelligence Laboratory
Abstract
Recent years have witnessed remarkable progress in multiview diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates filtered multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MVLLaVA for filtering data and rewriting inaccurate captions. Leveraging this pipeline, we have generated large-scale synthetic multi-view images with dense descriptive captions. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and view consistency.
CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Lin Sun
Tianjin University
Jiale Cao
Tianjin University
Jin Xie
Chongqing University
Xiaoheng Jiang
Zhengzhou University
Yanwei Pang
Tianjin University
Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on image-level tasks, motivating research on adapting CLIP for open-vocabulary semantic segmentation without training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map based on a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion and a fine-grained compensation. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on eight segmentation datasets. Our CLIPer achieves state-of-the-art performance on these datasets. With ViT-L and sliding-window inference, CLIPer achieves mIoU of 72.2% and 44.7% on VOC and Object, outperforming ProxyCLIP by 11.6% and 5.5%. Our code is available at https://github.com/linsun449/cliper.code.
Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts
Yanguang Sun
Nanjing University of Science and Technology
Jiawei Lian
Nanjing University of Science and Technology
Jian Yang
Nankai University
Lei Luo
Nanjing University of Science and Technology
Abstract
Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through full-parameter fine-tuning, the enormous number of updated parameters often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output the expert priors required for subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact with and restructure frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach, demonstrating excellent segmentation performance compared to 31 state-of-the-art (SOTA) methods and adaptability to multiple binary object segmentation tasks.
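The idea of mixing heterogeneous local priors with a gating network can be sketched as a few convolutional "experts" with different kernel sizes whose outputs are combined by a softmax gate. The kernel sizes, pooling, gate architecture, and class name below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyLocalPriorMoE(nn.Module):
    """Heterogeneous conv 'experts' produce local priors; a gating network predicts
    per-image weights and the priors are mixed accordingly (illustrative only)."""
    def __init__(self, channels=32, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x)                                    # (B, E)
        priors = torch.stack([e(x) for e in self.experts], 1)     # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * priors).sum(1)  # (B, C, H, W)

moe = ToyLocalPriorMoE()
out = moe(torch.randn(2, 32, 24, 24))
print(out.shape)   # torch.Size([2, 32, 24, 24])
```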
Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection
Jiachen Sun
Xidian University
De Cheng
Xidian University
Xi Yang
Xidian University
Nannan Wang
Xidian University
Abstract
Domain incremental object detection in remote sensing addresses the challenge of adapting to continuously emerging domains with distinct characteristics. Unlike natural images, remote sensing data vary significantly due to differences in sensors, altitudes, and geographic locations, leading to data distribution shifts and feature misalignments. These challenges make it difficult for models to generalize across domains while retaining knowledge from previous tasks, requiring effective adaptation strategies to mitigate catastrophic forgetting. To address these challenges, we propose the Dual Domain Control via Active Learning (Active-DDC) method, which integrates active learning strategies to handle data distribution and model feature shifts. The first component, the Data-based Active Learning Example Replay (ALER) module, combines a high-information sample selection strategy from active learning with the characteristic extreme foreground-background ratio in remote sensing images, enabling the selection of highly representative samples for storage in a memory bank. The second component, the Query-based Active Domain Shift Control (ADSC) module, leverages the query vector, a key element for DETR-based detectors, to implement query active preselection and optimal transport matching, thus facilitating effective cross-domain knowledge transfer. Our method achieves optimal performance in domain incremental tasks across four remote sensing datasets, and ablation studies further validate the effectiveness of both components.
EVDM: Event-based Real-world Video Deblurring with Mamba
Zhijing Sun
University of Science and Technology of China
Senyan Xu
University of Science and Technology of China
Kean Liu
University of Science and Technology of China
Runze Tian
University of Science and Technology of China
Xueyang Fu
University of Science and Technology of China
Zheng-Jun Zha
University of Science and Technology of China
Abstract
Existing event-based video deblurring methods face limitations in extracting and fusing long-range spatiotemporal motion information from events, primarily due to restricted receptive fields or low computational efficiency, resulting in suboptimal deblurring performance. To address these issues, we introduce the state space model, which leverages linear complexity and global receptive fields for long-range modeling, and propose EVDM, a novel Event-based Video Deblurring framework with Mamba. The framework consists of: (1) Motion Clue Extraction Mamba (MCEM), which employs an event self-reconstruction loss to ensure the completeness of details when extracting long-range motion information. (2) Motion-aware Intra-frame Fusion Mamba (MIFM) and Inter-frame Temporal Propagation Mamba (ITPM), which utilize the motion-aware state space to perform cross-modal fusion and inter-frame information exchange guided by motion clues. Consequently, EVDM achieves superior detail restoration in blurred regions while ensuring temporal motion consistency across frames. Additionally, to overcome the limitation of fixed exposure ratios in existing event-frame paired datasets, we introduce T-RED, a high-quality, high-resolution dataset with varying exposure time ratios. T-RED provides more realistic and complex data for event-based video deblurring research. Experiments on multiple datasets demonstrate that EVDM outperforms previous SOTA methods.
Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
Hongyang Sun
Zhejiang University
Qinglin Yang
Zhejiang University
Jiawei Wang
UESTC
Zhen Xu
Zhejiang University
Chen Liu
Li Auto Inc.
Yida Wang
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
Recent advances in differentiable rendering have significantly improved dynamic street scene reconstruction. However, the complexity of large-scale scenarios and dynamic elements, such as vehicles and pedestrians, remains a substantial challenge. Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, sub-scenes level, and primitive level, using Unified Gaussian Primitive (UGP) defined in 4D space as the representation. The root level serves as the entry point to the hierarchy. At the sub-scenes level, the scene is spatially divided into multiple sub-scenes, with various elements extracted. At the primitive level, each element is modeled with UGPs, and its global pose is controlled by a motion prior related to time. This hierarchical design greatly enhances the model's capacity, enabling it to model large-scale scenes. Additionally, our UGP allows for the reconstruction of both rigid and non-rigid dynamics. We conducted experiments on Dynamic City, our proprietary large-scale dynamic street scene dataset, as well as the public Waymo dataset. Experimental results demonstrate that our method achieves state-of-the-art performance. We plan to release the accompanying code and the Dynamic City dataset as open resources to further research within the community.
Low-Light Image Enhancement Using Event-Based Illumination Estimation
Lei Sun
INSAIT, Sofia University 'St. Kliment Ohridski'
Yuhan Bao
Zhejiang University
Jiajun Zhai
Zhejiang University
Jingyun Liang
Alibaba Group
Yulun Zhang
Shanghai Jiao Tong University
Kaiwei Wang
Zhejiang University
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Luc Van Gool
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., 'motion events' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using 'temporal-mapping' events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated for realistic training data synthesis. To address the lack of datasets under this regime, we construct a beamsplitter setup and collect the EvLowLight dataset that includes images, temporal-mapping events, and motion events. Experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RETINEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frames per second on a 640 x 480 image. Codes and datasets: https://github.com/AHupuJR/RetinEV.
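The timestamp-to-brightness conversion at the heart of this idea can be illustrated under an assumed linear transmittance ramp: a pixel fires an event once the ramped transmittance times its radiance reaches a threshold, so inverting the ramp at the event time yields a relative brightness. The ramp model, constants, and the `timestamps_to_brightness` helper below are assumptions, not the paper's calibrated model.

```python
import numpy as np

def timestamps_to_brightness(t_event: np.ndarray, ramp_rate: float = 1.0,
                             threshold: float = 0.1) -> np.ndarray:
    """Invert an assumed linear transmittance ramp T(t) = ramp_rate * t:
    an event fires when T(t) * L reaches `threshold`, so L = threshold / (ramp_rate * t).
    Earlier timestamps therefore map to brighter pixels (illustrative model only)."""
    t = np.clip(t_event, 1e-6, None)          # guard against zero timestamps
    return threshold / (ramp_rate * t)

# Per-pixel first-event timestamps (seconds) -> relative illumination map.
timestamps = np.array([[0.02, 0.20],
                       [0.05, 0.50]])
illum = timestamps_to_brightness(timestamps)
print(illum / illum.max())                    # brightest pixel normalized to 1
```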
Mitigating Geometric Degradation in Fast DownSampling via FastAdapter for Point Cloud Segmentation
Shuofeng Sun
Beijing University of Posts and Telecommunications
Haibin Yan
Beijing University of Posts and Telecommunications
Abstract
Farthest Point Sampling (FPS) is widely used in existing point-based models because it effectively preserves structural integrity during downsampling. However, it incurs significant computational overhead, severely impacting the model's inference efficiency. Random sampling and grid sampling are considered faster downsampling methods; however, these fast downsampling methods may lose geometric information during the downsampling process due to their overly simplistic and fixed rules, which can negatively affect model performance. To address this issue, we propose FastAdapter, which aggregates local contextual information through a small number of anchor points and facilitates interactions across spatial and layer dimensions, ultimately feeding this information back into the downsampled point cloud to mitigate the information degradation caused by fast downsampling methods. In addition to using FastAdapter to enhance model performance in methods that already employ fast downsampling, we aim to explore a more challenging yet valuable application scenario. Specifically, we focus on pre-trained models that utilize FPS, embedding FastAdapter and replacing FPS with random sampling for lightweight fine-tuning. This approach aims to significantly improve inference speed while keeping performance relatively unchanged. Experimental results on ScanNet, S3DIS, and SemanticKITTI demonstrate that our method effectively mitigates the geometric information degradation caused by fast downsampling.
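The anchor-point mechanism can be sketched as pooling features from the downsampled points onto a handful of anchors and scattering the aggregated context back as a residual. Random anchor selection, soft inverse-distance pooling, and the hypothetical `anchor_context_residual` helper below stand in for the paper's learned interactions.

```python
import torch

def anchor_context_residual(xyz: torch.Tensor, feats: torch.Tensor,
                            num_anchors: int = 16) -> torch.Tensor:
    """xyz: (N, 3) downsampled coordinates, feats: (N, C) their features.
    Aggregate features onto a few random anchors, then feed the pooled context
    back to every point via its nearest anchor (illustrative sketch only)."""
    idx = torch.randperm(xyz.shape[0])[:num_anchors]
    anchors = xyz[idx]                                       # (A, 3)
    dist = torch.cdist(anchors, xyz)                         # (A, N)
    weights = torch.softmax(-dist, dim=1)                    # soft local pooling
    anchor_feats = weights @ feats                           # (A, C)
    nearest = dist.argmin(dim=0)                             # (N,) nearest anchor id
    return feats + anchor_feats[nearest]                     # residual context

points = torch.randn(1024, 3)
features = torch.randn(1024, 32)
print(anchor_context_residual(points, features).shape)      # torch.Size([1024, 32])
```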
Moment Quantization for Video Temporal Grounding
Xiaolong Sun
Xi'an Jiaotong University
Le Wang
Xi'an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Liushuai Shi
Xi'an Jiaotong University
Kun Xia
Xi'an Jiaotong University
Mengnan Liu
Xi'an Jiaotong University
Yabing Wang
Xi'an Jiaotong University
Gang Hua
Amazon Alexa AI
Abstract
Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant from irrelevant moments. Previous methods, which focus on learning continuous features, exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process rather than hard assignment to discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination. Code is available at https://github.com/TensorsSun/MQVTG.
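The clustering-style matching against a learnable moment codebook can be sketched as a soft (softmax) assignment of each moment feature to codewords instead of a hard argmin quantization. The cosine similarity, temperature, sizes, and class name below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMomentCodebook(nn.Module):
    """Learnable codebook; each moment feature is matched to codewords softly
    (a clustering-style assignment) rather than by hard argmin quantization."""
    def __init__(self, num_codewords=64, dim=256, temperature=0.07):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codewords, dim))
        self.temperature = temperature

    def forward(self, moment_feats: torch.Tensor) -> torch.Tensor:
        # moment_feats: (B, T, dim) per-moment video features
        sim = F.normalize(moment_feats, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        assign = torch.softmax(sim / self.temperature, dim=-1)   # (B, T, K)
        return assign @ self.codebook                            # soft codeword mix

codebook = ToyMomentCodebook()
print(codebook(torch.randn(2, 75, 256)).shape)   # torch.Size([2, 75, 256])
```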
RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding
Baoli Sun
Dalian University of Technology
Ning Wang
Dalian University of Technology
Xinzhu Ma
The Chinese University of Hong Kong
Anqi Zou
Dalian University of Technology
Yihang Lu
Dalian University of Technology
Chuixuan Fan
Dalian University of Technology
Zhihui Wang
Dalian University of Technology
Kun Lu
Dalian University of Technology
Zhiyong Wang
The University of Sydney
Abstract
Understanding the behaviors of robotic arms is essential for various robotic applications such as logistics management and automated manufacturing. However, the lack of large-scale and diverse datasets significantly hinders progress in video-based robotic arm action understanding. To fill this gap, we introduce RobAVA, which contains 40k video sequences with video-level fine-grained annotations, covering basic actions such as picking, pushing, and placing, as well as their combinations in different orders and interactions with various objects. In contrast to existing action recognition benchmarks, RobAVA includes instances of both normal and anomalous executions for each action category. The main challenge in robotic arm action recognition is that a complete action is composed of fundamental, atomic behaviors, requiring models to learn their inter-relationships. To this end, we propose a novel baseline approach, AGPT-Net, which re-defines the problem of understanding robotic arm actions as a task of aligning video sequences with atomic attributes. To enhance AGPT-Net's ability to distinguish normal and anomalous action instances, we introduce a joint semantic space constraint between category and attribute semantics, thereby amplifying the separation between normal and anomalous attribute representations for each action. We conduct extensive experiments to demonstrate AGPT-Net's superiority over other mainstream recognition models. Please see the project page at https://github.com/Sunbaoli/RobAVA.
Towards Efficient General Feature Prediction in Masked Skeleton Modeling
Shengkai Sun
Hefei University of Technology
Zefan Zhang
Jilin University
Jianfeng Dong
Zhejiang Gongshang University
Zhiyong Cheng
Hefei University of Technology
Xiaojun Chang
University of Science and Technology of China
Meng Wang
Hefei University of Technology
Abstract
Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: Computational efficiency (with 6.2x faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.
Two Losses, One Goal: Balancing Conflict Gradients for Semi-supervised Semantic Segmentation
Rui Sun
Shenzhen International Graduate School, Tsinghua University
Huayu Mai
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Wangkai Li
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Yujia Chen
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Yuan Wang
University of Science and Technology of China
Abstract
Semi-supervised semantic segmentation has attracted considerable attention as it alleviates the need for extensive pixel-level annotations. However, existing methods often overlook the potential optimization conflict between supervised and unsupervised learning objectives, leading to suboptimal performance. In this paper, we identify this underexplored issue and propose a novel Pareto Optimization Strategy (POS) to tackle it. POS aims to find a descent gradient direction that benefits both learning objectives, thereby facilitating model training. By dynamically assigning weights to the gradients at each iteration based on the model's learning status, POS effectively reconciles the intrinsic tension between the two objectives. Furthermore, we analyze POS from the perspective of gradient descent in random batch sampling and propose the Magnitude Enhancement Operation (MEO) to further unleash its potential by considering both direction and magnitude during gradient integration. Extensive experiments on challenging benchmarks demonstrate that integrating POS into existing semi-supervised segmentation methods yields consistent improvements across different data splits and architectures (CNN, Transformer), showcasing its effectiveness.
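For two objectives, a standard way to obtain a direction that decreases both losses is the closed-form min-norm combination of their gradients. The sketch below shows that classic two-task weighting only as an illustration of dynamically balancing supervised and unsupervised gradients; the paper's exact weighting and the magnitude enhancement step are not reproduced, and `pareto_descent_direction` is a hypothetical helper.

```python
import torch

def pareto_descent_direction(g_sup: torch.Tensor, g_unsup: torch.Tensor) -> torch.Tensor:
    """Classic two-objective min-norm combination: returns a direction with
    non-negative alignment to both gradients whenever one exists (illustrative
    stand-in for the paper's dynamic weighting; the magnitude step is omitted)."""
    diff = g_sup - g_unsup
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = ((g_unsup - g_sup).dot(g_unsup) / denom).clamp(0.0, 1.0)
    return alpha * g_sup + (1.0 - alpha) * g_unsup

# Two conflicting gradients (negative inner product): the combined direction
# still has non-negative alignment with each objective's gradient.
g1 = torch.tensor([1.0, 0.2])
g2 = torch.tensor([-0.6, 1.0])
d = pareto_descent_direction(g1, g2)
print(d, d.dot(g1).item() >= 0, d.dot(g2).item() >= 0)
```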
Uncertainty-Aware Gradient Stabilization for Small Object Detection
Huixin Sun
School of Electronic Information Engineering, Beihang University
Yanjing Li
School of Electronic Information Engineering, Beihang University
Linlin Yang
State Key Laboratory of Media Convergence and Communication, CUC
Xianbin Cao
School of Electronic Information Engineering, Beihang University
Baochang Zhang
School of Artificial Intelligence, Beihang University
Abstract
Despite advances in generic object detection, there remains a performance gap in detecting small objects compared to normal-scale objects. We reveal that conventional object localization methods suffer from gradient instability in small objects due to sharper loss curvature, leading to a convergence challenge. To address the issue, we propose Uncertainty-Aware Gradient Stabilization (UGS), a framework that reformulates object localization as a classification task to stabilize gradients. UGS quantizes continuous labels into non-uniform discrete interval representations. Under a classification-based objective, the localization branch generates bounded and confidence-driven gradients, mitigating instability. Furthermore, UGS integrates an uncertainty minimization (UM) loss that reduces prediction variance and an uncertainty-guided refinement (UR) module that identifies and refines high-uncertainty regions via perturbations. Evaluated on four benchmarks, UGS consistently improves anchor-based, anchor-free, and leading small object detectors. Notably, UGS enhances DINO-5scale by 2.6 AP on VisDrone, surpassing prior state-of-the-art performance.
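As a rough illustration of recasting localization regression as classification over non-uniform intervals, the sketch below quantizes a normalized offset into discrete bins and trains with cross-entropy; the bin spacing, decoding rule, and all names are assumptions rather than the paper's UGS implementation.

```python
import torch
import torch.nn.functional as F

# Non-uniform bin edges for a normalized localization offset in [0, 1]:
# finer near small values, coarser for large ones (assumed spacing).
edges = torch.cat([torch.linspace(0.0, 0.2, 9), torch.linspace(0.25, 1.0, 16)])
centers = 0.5 * (edges[:-1] + edges[1:])          # one class per interval

def offsets_to_classes(offsets: torch.Tensor) -> torch.Tensor:
    """Quantize continuous offsets into discrete interval labels."""
    return torch.bucketize(offsets.clamp(0.0, 1.0), edges[1:-1])

def classification_loc_loss(logits: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Bounded, confidence-driven localization loss (cross-entropy over bins)."""
    return F.cross_entropy(logits, offsets_to_classes(offsets))

def decode(logits: torch.Tensor) -> torch.Tensor:
    """Expected offset under the predicted bin distribution."""
    return (logits.softmax(dim=-1) * centers).sum(dim=-1)

logits = torch.randn(4, centers.numel(), requires_grad=True)
offsets = torch.tensor([0.03, 0.12, 0.4, 0.9])
loss = classification_loc_loss(logits, offsets)
loss.backward()                                    # gradients bounded by softmax probabilities
print(loss.item(), decode(logits).shape)
```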
Visual Intention Grounding for Egocentric Assistants
Pengzhan Sun
National University of Singapore
Junbin Xiao
National University of Singapore
Tze Ho Elden Tse
National University of Singapore
Yicong Li
National University of Singapore
Arjun Akula
Google DeepMind
Angela Yao
National University of Singapore
Abstract
Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts - inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training on normal descriptions and egocentric intentions through a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions. Our code and model are available at https://github.com/pengzhansun/EgoIntention.
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
Wei Suo
Northwestern Polytechnical University
Ji Ma
Northwestern Polytechnical University
Mengyang Sun
Northwestern Polytechnical University
Lin Yuanbo Wu
Swansea University
Peng Wang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of acceleration scenarios. The code for this work is publicly available at https://github.com/ASGO-MM/Pruning-All-Rounder.
Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
Francesco Taioli
Polytechnic of Turin
Edoardo Zorzi
University of Verona
Gianni Franchi
U2IS, ENSTA Paris
Alberto Castellini
University of Verona
Alessandro Farinelli
University of Verona
Marco Cristani
University of Verona
Yiming Wang
Fondazione Bruno Kessler
Abstract
Language-driven instance object navigation assumes that a human initiates the task by providing a detailed description of the target to the embodied agent. While this description is crucial for distinguishing the target from other visually similar instances, providing it prior to navigation can be demanding for humans. We thus introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free and open-ended dialogues with the human, minimizing user input. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human-agent interaction reasoning using Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates internal self-dialogues within the agent to obtain a complete and accurate observation with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, whereas existing language-driven instance navigation methods struggle in multi-instance scenes.
ReTracker: Exploring Image Matching for Robust Online Any Point Tracking
Dongli Tan
Zhejiang University
Xingyi He
Zhejiang University
Sida Peng
Zhejiang University
Yiqing Gong
Zhejiang University
Xing Zhu
Ant Research
Jiaming Sun
Zhejiang University
Ruizhen Hu
Shenzhen University
Yujun Shen
Zhejiang University
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. Recent methods leverage future frames to achieve smooth point tracking at the current frame, but they still struggle to find points with significant viewpoint changes after long-term occlusions and inherently cannot achieve online tracking. To overcome these challenges, we develop a novel online tracking framework, named ReTracker, that integrates two advances in image matching with tracking-specific designs. First, a decoder network with a global receptive field is incorporated with a temporal attention module to robustly track points undergoing large location changes. Second, the decoder network is adapted to pretrain on large-scale two-view matching data, which offers significantly greater diversity and volume than tracking data, to learn general matching priors. This pretraining strategy effectively enhances our tracker's ability to handle viewpoint and appearance variations after long-term occlusions. Experiments demonstrate that our method outperforms recent online trackers across multiple benchmarks and achieves competitive or superior performance compared to offline methods. Furthermore, we collect an ego-centric, occlusion-heavy dataset to illustrate the retracking capabilities of our approach. Project page: re-tracker.github.io.
Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
Jieyi Tan
Wuhan University
Chengwei Zhang
University of Cambridge
Bo Dang
Wuhan University
Yansheng Li
Wuhan University
Abstract
Traditional Remote Sensing Foundation Models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking collaboration is a promising way to resolve this dilemma, where multiple institutions can collaboratively train RSFMs without sharing private data. In this paper, we propose FedSense, a novel privacy-preserved pre-training framework that enables such collaboration. However, this is a non-trivial task hindered by a vicious cycle, which results from model drift caused by remote sensing data heterogeneity and high communication overhead. To break this vicious cycle, we introduce federated mutual-guidance learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide clients' updates towards globally flat optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server by low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.
What You Have is What You Track: Adaptive and Robust Multimodal Tracking
Yuedong Tan
TeleAI, China Telecom
Jiawei Shao
TeleAI, China Telecom
Eduard Zamfir
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Ruanjun Li
ShanghaiTech University
Zhaochong An
University of Copenhagen
Chao Ma
AI Institute, Shanghai Jiao Tong University
Danda Paudel
INSAIT, Sofia University
Luc Van Gool
INSAIT, Sofia University
Radu Timofte
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Zongwei Wu
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Abstract
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness - critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be made publicly available at https://github.com/supertyd/FlexTrack.
RnGCam: High-speed video from rolling & global shutter measurements
Kevin Tandi
University of California, San Diego
Xiang Dai
University of California, San Diego
Chinmay Talegaonkar
University of California, San Diego
Gal Mishne
University of California, San Diego
Nick Antipa
University of California, San Diego
Abstract
Compressive video capture encodes a short high-speed video into a single measurement using a low-speed sensor, then computationally reconstructs the original video. Prior implementations rely on expensive hardware and are restricted to imaging sparse scenes with empty backgrounds. We propose RnGCam, a system that fuses measurements from low-speed consumer-grade rolling-shutter (RS) and global-shutter (GS) sensors into video at kHz frame rates. The RS sensor is combined with a pseudorandom optic, called a diffuser, which spatially multiplexes scene information. The GS sensor is coupled with a conventional lens. The RS-diffuser provides low spatial detail and high temporal detail, complementing the GS-lens system's high spatial detail and low temporal detail. We propose a reconstruction method using implicit neural representations (INR) to fuse the measurements into a high-speed video. Our INR method separately models the static and dynamic scene components, while explicitly regularizing dynamics. In simulation, we show that our approach significantly outperforms previous RS compressive video methods, as well as state-of-the-art frame interpolators. We validate our approach in a dual-camera hardware setup, which generates 230 frames of video at 4,800 frames per second for dense scenes, using hardware that costs 10x less than previous compressive video systems.
Closed-Loop Transfer for Weakly-supervised Affordance Grounding
Jiajin Tang
Zhengxuan Wei
Ge Zheng
Sibei Yang
Abstract
Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body. All models and codes will be made publicly available.
CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
Zongheng Tang
Hangzhou International Innovation Institute, Beihang University
Yi Liu
School of Artificial Intelligence, Beihang University
Yifan Sun
School of Artificial Intelligence, Beihang University
Yulu Gao
Hangzhou International Innovation Institute, Beihang University
Jinyu Chen
School of Artificial Intelligence, Beihang University
Runsheng Xu
University of California, Los Angeles
Si Liu
School of Artificial Intelligence, Beihang University
Abstract
Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception framework that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth. Code will be available at https://github.com/tzhhhh123/CoST.
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
Yingqi Tang
Nullmax
Zhuoran Xu
Nullmax
Zhaotie Meng
Nullmax
Erkang Cheng
Nullmax
Abstract
Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, there remains an unsatisfactory performance on closed-loop evaluation. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, enhancing precise closed-loop control for the ego vehicle. Additionally, we explicitly utilize the geometric properties of planning trajectories to effectively retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes.
G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
Chengyu Tao
The Hong Kong University of Science and Technology
Xuanming Cao
The Hong Kong University of Science and Technology (Guangzhou)
Juan Du
The Hong Kong University of Science and Technology
Abstract
Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of cross-modal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Dynamically evolving from Euclidean metrics, we propose a novel Geometry-Guided Score Fusion (G2SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and a score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD and Eyecandies datasets demonstrate the state-of-the-art detection performance of our method, and detailed ablation analysis validates each component's contribution. Our code is available at https://github.com/ctaoaa/G2SF.
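To make the isotropic-versus-anisotropic distinction concrete, the sketch below contrasts a plain memory-bank score with one that applies per-dimension, direction-aware scaling factors; the scales here are random placeholders standing in for a learned predictor such as the paper's LSPN, and the memory bank is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(512, 64))      # stored normal features for one modality

def isotropic_score(f: np.ndarray) -> float:
    """Conventional memory-bank score: Euclidean distance to the nearest normal feature."""
    d = np.linalg.norm(memory_bank - f, axis=1)
    return float(d.min())

def anisotropic_score(f: np.ndarray, scales: np.ndarray) -> float:
    """Direction-aware score: per-dimension scaling makes the local metric anisotropic.

    `scales` stands in for the output of a learned scale-prediction network
    (hypothetical here); stretching low-variance directions sharpens the
    separation between normal and anomalous patterns.
    """
    d = np.sqrt((((memory_bank - f) * scales) ** 2).sum(axis=1))
    return float(d.min())

f = rng.normal(size=64)
scales = np.abs(rng.normal(loc=1.0, scale=0.1, size=64))  # placeholder predictions
print(isotropic_score(f), anisotropic_score(f, scales))
```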
GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
Ye Tao
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Jiawei Zhang
SenseTime Research
Yahao Shi
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Dongqing Zou
SenseTime Research
Bin Zhou
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pretrained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. Our code is available at https://github.com/MOMOYATW/GSV3D.
MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
Peilin Tao
Institute of Automation, Chinese Academy of Sciences
Hainan Cui
Institute of Automation, Chinese Academy of Sciences
Diantao Tu
Institute of Automation, Chinese Academy of Sciences
Shuhan Shen
Institute of Automation, Chinese Academy of Sciences
Abstract
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.
RoboPearls: Editable Video Simulation for Robot Manipulation
Tang Tao
Shenzhen Campus of Sun Yat-sen University
Likui Zhang
Sun Yat-sen University
Youpeng Wen
Shenzhen Campus of Sun Yat-sen University
Kaidong Zhang
Sun Yat-sen University
Jia-Wang Bian
Bytedance Seed
Xia Zhou
Li Auto Inc.
Tianyi Yan
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Peng Jia
Li Auto Inc.
Hefeng Wu
Sun Yat-sen University
Liang Lin
Sun Yat-sen University
Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University
Abstract
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by proposed modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating satisfactory simulation performance. More information can be found on our Project Page.
Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection
Romain Thoreau
CNES
Valerio Marsocci
European Space Agency Φ-Lab
Dawa Derksen
CNES
Abstract
As large-scale heterogeneous data sets become increasingly available, adapting foundation models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low 'intrinsic rank' of parameter updates during adaptation. In this paper, we argue that incorporating stronger inductive biases on both the data and the models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. Specifically, the pretrained parameters of GFMs serve as a strong prior for the spatial structure of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environment-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10x fewer parameters for classification and segmentation tasks. The code is available at https://github.com/VMarsocci/DEFLECT.
DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception
Chengchang Tian
Southeast University
Jianwei Ma
Southeast University
Yan Huang
Southeast University
Zhanye Chen
Southeast University
Honghao Wei
Washington State University
Hui Zhang
Southeast University
Wei Hong
Southeast University
Abstract
Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at https://github.com/ChengchangTian/DATA.
DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation
Haitao Tian
University of Ottawa
Abstract
In this paper, a new contrastive representation learning framework is proposed to enhance action segmentation via pretraining using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that develop isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, 'Shuffle and Warp', which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pretrained on a trimmed skeleton dataset and evaluated on an untrimmed dataset where it demonstrates a significant boost over state-of-the-art methods in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
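One plausible reading of the 'Shuffle and Warp' augmentation is sketched below: trimmed single-action clips are time-warped and concatenated in random orders, producing multi-action permutations whose action order supplies targets for the two surrogate tasks. Clip shapes, warp ranges, and function names are assumptions, not the paper's specification.

```python
import numpy as np

def time_warp(clip: np.ndarray, factor: float) -> np.ndarray:
    """Linearly resample a (T, J, C) skeleton clip to round(T * factor) frames."""
    t_new = max(2, int(round(len(clip) * factor)))
    idx = np.linspace(0, len(clip) - 1, t_new)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(clip) - 1)
    w = (idx - lo)[:, None, None]
    return (1 - w) * clip[lo] + w * clip[hi]

def shuffle_and_warp(clips, rng):
    """Build one multi-action permutation from trimmed single-action clips.

    Returns the concatenated sequence and the action order, which supply the
    targets for cross-permutation contrasting and relative-order reasoning.
    """
    order = rng.permutation(len(clips))
    warped = [time_warp(clips[i], rng.uniform(0.7, 1.3)) for i in order]
    return np.concatenate(warped, axis=0), order

rng = np.random.default_rng(0)
clips = [rng.normal(size=(rng.integers(40, 80), 25, 3)) for _ in range(4)]
seq_a, order_a = shuffle_and_warp(clips, rng)
seq_b, order_b = shuffle_and_warp(clips, rng)   # a second permutation of the same actions
print(seq_a.shape, order_a, order_b)
```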
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Javier Tirado-Garín
University of Zaragoza
Javier Civera
University of Zaragoza
Abstract
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to pinhole, Brown-Conrady, and Kannala-Brandt. Our approach also applies to edited (cropped and stretched) images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at https://github.com/javrtg/AnyCalib.
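For the pinhole case, recovering intrinsics from per-pixel rays reduces to two 1D linear least-squares problems, which the sketch below verifies on synthetic data; this is only one instance of the general ray-to-intrinsics recovery, and other camera models require different solvers.

```python
import numpy as np

def pinhole_intrinsics_from_rays(pixels: np.ndarray, rays: np.ndarray):
    """Recover (fx, fy, cx, cy) from pixel coordinates and their unit rays.

    For a pinhole camera, u = fx * (x/z) + cx and v = fy * (y/z) + cy, so each
    axis is a linear least-squares fit in (f, c).
    """
    xn = rays[:, 0] / rays[:, 2]                 # normalized image coordinates
    yn = rays[:, 1] / rays[:, 2]
    Ax = np.stack([xn, np.ones_like(xn)], axis=1)
    Ay = np.stack([yn, np.ones_like(yn)], axis=1)
    (fx, cx), *_ = np.linalg.lstsq(Ax, pixels[:, 0], rcond=None)
    (fy, cy), *_ = np.linalg.lstsq(Ay, pixels[:, 1], rcond=None)
    return fx, fy, cx, cy

# Synthetic check with a known camera.
fx, fy, cx, cy = 500.0, 480.0, 320.0, 240.0
uv = np.random.default_rng(0).uniform([0, 0], [640, 480], size=(1000, 2))
rays = np.stack([(uv[:, 0] - cx) / fx, (uv[:, 1] - cy) / fy, np.ones(len(uv))], axis=1)
rays /= np.linalg.norm(rays, axis=1, keepdims=True)   # unit-norm, as a network would output
print(pinhole_intrinsics_from_rays(uv, rays))          # ~ (500, 480, 320, 240)
```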
GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
Shaowen Tong
ShanghaiTech University
Zimin Xia
École Polytechnique Fédérale de Lausanne (EPFL)
Alexandre Alahi
École Polytechnique Fédérale de Lausanne (EPFL)
Xuming He
ShanghaiTech University
Yujiao Shi
ShanghaiTech University
Abstract
Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with aerial images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self Distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a full-view image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at https://github.com/tongshw/GeoDistill.
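A minimal sketch of FoV-based masking plus teacher-student alignment follows, assuming the localizer outputs a probability map over aerial locations; the equirectangular column masking and the KL objective are assumptions rather than the paper's exact losses, and `localizer` is a placeholder for whichever cross-view model is used.

```python
import torch
import torch.nn.functional as F

def fov_mask(pano: torch.Tensor, fov_deg: float, center_deg: float) -> torch.Tensor:
    """Zero out panorama columns outside a limited horizontal field of view."""
    _, _, _, W = pano.shape
    yaw = torch.linspace(-180.0, 180.0, W, device=pano.device)
    delta = (yaw - center_deg + 180.0) % 360.0 - 180.0     # wrapped angular offset
    keep = (delta.abs() <= fov_deg / 2).to(pano.dtype)
    return pano * keep.view(1, 1, 1, W)

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between student and (detached) teacher location distributions."""
    t = teacher_logits.detach().flatten(1).log_softmax(dim=1)
    s = student_logits.flatten(1).log_softmax(dim=1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Tiny demo with random tensors; in practice the logits come from the localizer.
pano = torch.rand(2, 3, 64, 256)
masked = fov_mask(pano, fov_deg=90.0, center_deg=30.0)
teacher_logits = torch.randn(2, 1, 32, 32)                 # heat map over the aerial image
student_logits = torch.randn(2, 1, 32, 32, requires_grad=True)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
print(masked.shape, float(loss))
```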
EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
Dmitrii Torbunov
Brookhaven National Laboratory
Yihui Ren
Brookhaven National Laboratory
Animesh Ghose
Brookhaven National Laboratory
Odera Dim
Brookhaven National Laboratory
Yonggang Cui
Brookhaven National Laboratory
Abstract
Event-based cameras (EBCs) have emerged as a bioinspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for EBCs. The current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. We introduce I2EvDet (Image-to-Event Detection), a novel adaptation framework that bridges mainstream object detection with temporal event data processing. First, we demonstrate that a Real-Time DEtection TRansformer, or RT-DETR, a state-of-the-art natural image detector, trained on a simple image-like representation of the EBC data achieves performance comparable to specialized EBC methods. Next, as part of our framework, we develop an efficient adaptation technique that transforms image-based detectors into event-based detection models by modifying their frozen latent representation space via minimal architectural additions. The resulting EvRT-DETR model reaches state-of-the-art performance on the standard benchmark datasets Gen1 (mAP +2.3) and 1Mpx/Gen4 (mAP +1.4). These results demonstrate a fundamentally new approach to EBC object detection through principled adaptation of mainstream architectures, offering an efficient alternative with potential applications to other temporal visual domains. The code is available at: https://github.com/realtimeintelligence/evrt-detr.
Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering
Siddharth Tourani
IIIT Hyderabad
Jayaram Reddy
IIIT Hyderabad
Akash Kumbar
IIIT Hyderabad
Satyajit Tourani
IIIT Hyderabad
Nishant Goyal
IIT Kharagpur
Madhava Krishna
IIIT Hyderabad
N Dinesh Reddy
VLM Run
Muhammad Haris Khan
MBZUAI
Abstract
Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object-agnostic priors in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects, can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics on urban scenes even without LiDAR data. When incorporating LiDAR, our approach further improves in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and composition.
Head2Body: Body Pose Generation from Multi-sensory Head-mounted Inputs
Minh Tran
University of Southern California
Hongda Mao
Amazon
Qingshuang Chen
Amazon
Yelin Kim
Amazon
Abstract
Generating body pose from head-mounted, egocentric inputs is essential for immersive VR/AR and assistive technologies, as it supports more natural interactions. However, the task is challenging due to limited visibility of body parts in first-person views and the sparseness of sensory data, with only a single device placed on the head. To address these challenges, we introduce Head2Body, a novel framework for body pose estimation that effectively combines head IMU and egocentric visual data. First, we introduce a pretrained IMU encoder, trained on over 1,700 hours of Ego4D IMU data from head-mounted devices, to better capture detailed temporal motion cues given limited labeled egocentric pose data. For visual processing, we leverage large vision-language models (LVLMs) to segment body parts that appear sporadically in video frames to improve visual feature extraction. To better guide pose generation from sparse head-mounted signals, we incorporate a residual Vector Quantized Variational Autoencoder (VQ-VAE) to represent poses with discrete tokens, capturing high-frequency motion patterns and improving over direct continuous regression, which often lacks structure and temporal consistency. Our experiments demonstrate the effectiveness of the proposed approach, yielding 6-13% gains over state-of-the-art baselines on three datasets: AMASS, KinPoly, and EgoExo4D. By capturing subtle temporal dynamics and leveraging complementary sensory data, our approach advances accurate egocentric body pose estimation and sets a new benchmark for multi-modal, first-person motion tracking.
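The residual VQ step can be illustrated generically: each stage quantizes the residual left by the previous codebooks, yielding a short sequence of discrete pose tokens per feature vector. Codebook sizes and feature dimensions below are placeholders, not the paper's configuration.

```python
import torch

def residual_vq(z: torch.Tensor, codebooks):
    """Quantize pose features with a stack of codebooks (residual vector quantization).

    z: (B, D) continuous pose features; each codebook: (K, D) code vectors.
    Returns the discrete tokens per stage and the reconstructed quantized vector.
    """
    residual, tokens, quantized = z, [], torch.zeros_like(z)
    for cb in codebooks:
        dist = torch.cdist(residual, cb)          # (B, K) distances to code vectors
        idx = dist.argmin(dim=1)                  # nearest-code token for this stage
        picked = cb[idx]
        tokens.append(idx)
        quantized = quantized + picked
        residual = residual - picked              # the next stage models what is left
    return tokens, quantized

torch.manual_seed(0)
codebooks = [torch.randn(256, 48) for _ in range(4)]   # 4 stages, 256 codes each (assumed)
z = torch.randn(8, 48)
tokens, z_q = residual_vq(z, codebooks)
print([t.shape for t in tokens], float((z - z_q).norm() / z.norm()))
```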
More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Luong Tran
FPT Software AI Center
Thieu Vo
National University of Singapore
Anh Nguyen
University of Liverpool
Sang Dinh
Hanoi University of Science and Technology
Van Nguyen
FPT Software AI Center
Abstract
Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.
PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
Fu-Jen Tsai
National Tsing Hua University
Yan-Tsung Peng
National Chengchi University
Yen-Yu Lin
National Yang Ming Chiao Tung University
Chia-Wen Lin
National Tsing Hua University
Abstract
Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models' performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage loss to enhance PHATNet's disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets. The source code is available at https://github.com/pp00704831/PHATNet.
Auto-Vocabulary Semantic Segmentation
Osman Ülger
University of Amsterdam
Maksymilian Kulicki
Institute of Fundamental Technological Research, Polish Academy of Science
Yuki Asano
University of Technology Nuremberg
Martin R. Oswald
University of Amsterdam
Abstract
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated classes and their corresponding segments. With AVS, our method sets new benchmarks on datasets PASCAL VOC, Context, ADE20K, and Cityscapes, while showing competitive performance to OVS methods that require specified class names. All code is released here.
Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation
Maximilian Ulmer
German Aerospace Center (DLR)
Wout Boerdijk
German Aerospace Center (DLR)
Rudolph Triebel
German Aerospace Center (DLR)
Maximilian Durner
Technical University of Munich
Abstract
This paper presents Object-Conditioned Diffusion Transformer (OC-DiT), a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks. Code is available at https://github.com/DLR-RM/oc-dit.
Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
Yuki Urakawa
Institute of Science Tokyo
Yoshihiro Watanabe
Institute of Science Tokyo
Abstract
Among structured-light methods, the phase-shifting approach enables high-resolution and high-accuracy measurements using a minimum of three patterns. However, its performance is significantly affected when dynamic and complex-shaped objects are measured, as motion artifacts and phase inconsistencies can degrade accuracy. In this study, we propose an enhanced phase-shifting method that incorporates neural inverse rendering to enable the 3D measurement of moving objects. To effectively capture object motion, we introduce a displacement field into the rendering model, which accurately represents positional changes and mitigates motion-induced distortions. Additionally, to achieve high-precision reconstruction with fewer phase-shifting patterns, we design a multi-view rendering framework that utilizes multiple cameras in conjunction with a single projector. Comparisons with state-of-the-art methods and various ablation studies demonstrated that our method accurately reconstructs the shapes of moving objects, even with a small number of patterns, using only simple, well-known phase-shifting patterns.
Uncalibrated Structure from Motion on a Sphere
Jonathan Ventura
California Polytechnic State University
Viktor Larsson
Lund University
Fredrik Kahl
Chalmers University of Technology
Abstract
Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemispherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion. Code and data for our method are available at https://github.com/jonathanventura/spherical-sfm.
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
Zengyu Wan
University of Science and Technology of China
Wei Zhai
University of Science and Technology of China
Yang Cao
University of Science and Technology of China
Zhengjun Zha
University of Science and Technology of China
Abstract
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from depth variation induced spatio-temporal motion inconsistencies, disrupting the assumptions of local spatial or temporal motion smoothness in previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce Event Kymograph - an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime
Zhexiong Wan
Northwestern Polytechnical University
Jianqin Luo
Northwestern Polytechnical University
Yuchao Dai
Northwestern Polytechnical University
Gim Hee Lee
National University of Singapore
Abstract
Recent point tracking methods have made great strides in recovering the trajectories of any point (especially key points) in long video sequences associated with large motions. However, the spatial and temporal granularities of point trajectories remain constrained by limited motion estimation accuracy and video frame rate. Leveraging the high temporal resolution and motion sensitivity of event cameras, we introduce event data for the first time to recover spatially dense and temporally continuous trajectories of every point at any time. Specifically, we define the dense and continuous point trajectory representation as estimating multiple control points of curves for each pixel and model the movement of sparse events triggered along continuous point trajectories. Building on this, we propose a novel multi-frame iterative streaming framework that first estimates local inter-frame motion representations from two consecutive frames with inter-frame events, then aggregates them into a global long-term motion representation to utilize the full input video and event data with an arbitrary number of frames. Extensive experiments on simulated and real data demonstrate the significant improvement of our framework over state-of-the-art methods and the crucial role of introducing events to model continuous point trajectories.
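To illustrate how per-pixel control points give trajectories queryable at any continuous time, the sketch below evaluates Bezier curves from control points; the paper's actual curve family and control-point estimation may differ, and the shapes are assumed.

```python
import numpy as np
from math import comb

def bezier_eval(ctrl: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Evaluate Bezier trajectories at continuous times t in [0, 1].

    ctrl: (N, K, 2) control points for N pixels, K per curve; returns (N, T, 2).
    """
    K = ctrl.shape[1]
    basis = np.stack(
        [comb(K - 1, k) * t**k * (1 - t) ** (K - 1 - k) for k in range(K)], axis=0
    )  # (K, T) Bernstein basis
    return np.einsum("kt,nkd->ntd", basis, ctrl)

rng = np.random.default_rng(0)
ctrl = rng.uniform(0, 256, size=(4, 5, 2))     # 4 pixels, 5 control points each (assumed)
t = np.linspace(0.0, 1.0, 11)                  # query at arbitrary intermediate times
traj = bezier_eval(ctrl, t)
print(traj.shape, np.allclose(traj[:, 0], ctrl[:, 0]), np.allclose(traj[:, -1], ctrl[:, -1]))
```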
AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
Zhaonan Wang
Shandong University
Manyi Li
Shandong University
Changhe Tu
Shandong University
Abstract
3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG2aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
Authentic 4D Driving Simulation with a Video Generation Model
Lening Wang
Beihang University
Wenzhao Zheng
Tsinghua University
Dalong Du
PhiGent Robotics
Yunpeng Zhang
unknown
Yilong Ren
Beihang University
Han Jiang
Beihang University
Zhiyong Cui
Beihang University
Haiyang Yu
Beihang University
Jie Zhou
Tsinghua University
Shanghang Zhang
Peking University
Abstract
Simulating driving environments in 4D is crucial for developing accurate and immersive autonomous driving systems. Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. This approach builds continuous 4D point cloud scenes by leveraging surround-view data from autonomous vehicles. By separating the spatial and temporal elements, it creates smooth keyframe sequences. Furthermore, video generation techniques are employed to produce lifelike 4D simulation videos from any given perspective. To extend the range of possible viewpoints, we incorporate training using decomposed camera poses, which allows for enhanced modeling of distant scenes. Additionally, we merge camera trajectory data to synchronize 3D points across consecutive frames, fostering a richer understanding of the evolving scene. With training across multiple scene levels, our method is capable of simulating scenes from any viewpoint and offers deep insight into the evolution of scenes over time in a consistent spatial-temporal framework. In comparison with current methods, this approach excels in maintaining consistency across views, background coherence, and overall accuracy, significantly contributing to the development of more realistic autonomous driving simulations.
C4D: 4D Made from 3D through Dual Correspondences
Shizun Wang
National University of Singapore
Zhenxiang Jiang
National University of Singapore
Xingyi Yang
The Hong Kong Polytechnic University
Xinchao Wang
National University of Singapore
Abstract
Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D
Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence
Weihao Wang
Tongji University
Yu Lan
Tongji University
Mingyu You
Tongji University
Bin He
Tongji University
Abstract
3D assembly completion represents a fundamental task in 3D computer vision and robotics. This task aims to retrieve the missing parts from a set of candidates and predict their 6DoF poses to make the partial assembly complete. However, due to the inherent uncertainty in completion and the similarity among candidates, even humans struggle to achieve precise completion without external guidance. To address this challenge, we introduce an auxiliary image depicting the complete assembly from a specific view. The primary challenge lies in the lack of correspondence or grounding between the partial assembly and the image, leading to ambiguities in identifying missing parts and ineffective guidance for completion. Moreover, this correspondence heavily depends on the view of the image, which, unfortunately, is often unknown in real-world scenarios. To this end, we propose a novel cross-modal 3D assembly completion framework. At its core is missing-oriented feature fusion augmented by self-supervised view alignment to establish view-consistent 2D-3D correspondence between the image and the partial assembly, which effectively captures clues of missing parts from the image and provides targeted guidance for completion. Extensive experiments demonstrate our state-of-the-art performance on the PartNet dataset and show the framework's generalization capabilities in two downstream applications: component suggestion and furniture restoration.
Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
Weida Wang
Tongji University
Changyong He
Tongji University
Jin Zeng
Tongji University
Di Qiu
Google
Abstract
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code is available at https://github.com/davidweidawang/GIGA-ToF.
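For orientation only, a graph-regularized maximum a posteriori objective of the general form described above can be written as

    \hat{x} \;=\; \arg\min_{x} \; (y - x)^{\top} W \, (y - x) \;+\; \mu \, x^{\top} L \, x,

where y is the noisy ToF depth, W is a diagonal fidelity weighting (assumed here to come from the sensor noise model), L is the Laplacian of the fused cross-frame graph, and \mu balances fidelity against smoothness. The paper's exact fidelity term, graph construction, and unrolling scheme are not reproduced here.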
Continuous-Time Human Motion Field from Event Cameras
Ziyun Wang
University of Pennsylvania
Ruijun Zhang
University of Pennsylvania
Zi-Yan Liu
University of Pennsylvania
Yufu Wang
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
This paper addresses the challenges of estimating a continuous-time human motion field from a stream of events. Existing human motion estimation methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field directly from events, by leveraging a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. Prior state-of-the-art event-based methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, we present the first work that replaces traditional discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. Despite the promises of event cameras, few benchmarks have tested the limits of high-speed human motion estimation. We introduce the Beam-splitter Event Agile Human Motion Dataset, a hardware-synchronized high-speed human dataset, to fill this gap. (Work completed while Ziyun Wang was at the University of Pennsylvania.) On this new data, our method improves joint errors by 23.8% compared to previous event-based human motion methods, while reducing computation time by 69%. More details of the work can be found on the project page: ziyunclaudewang.github.io/evhuman.html.
Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
Haoran Wang
Nanjing University
Zekun Li
Nanjing University
Jian Zhang
Nanjing University
Lei Qi
Southeast University
Yinghuan Shi
Nanjing University
Abstract
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting large vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Our implementation is available at https://github.com/wanghr64/cav-sam.
DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
Rui Wang
ETH Zürich
Quentin Lohmeyer
ETH Zürich
Mirko Meboldt
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments.
Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
Xingjian Wang
Zhejiang University
Li Chai
Zhejiang University
Jiming Chen
Zhejiang University
Abstract
The leak of anomalous information from the input condition poses a great challenge to reconstruction-based anomaly detection. Recent diffusion-based methods respond to this issue by suppressing anomaly information for condition injection or in-sampling inversion. However, since they treat conditions as a time-invariant prior, they fall into a trade-off problem between anomaly suppression and normal pattern consistency. To address this problem, we propose the Debiasing Trace Guidance (DTG) framework based on Flow Matching towards debiasing generation for more accurate unsupervised multi-class anomaly detection. Generally, DTG distills a low-dimensional generation sub-trace robust to anomalies by Top-down Trace Distillation, and then utilizes its time-varying velocity features to guide a debiasing generation by Bottom-up Velocity Alignment. The trace distillation filters out high-frequency anomalies via learnable wavelet filters and preserves structural information by keeping global consistency across samples using the Sinkhorn Distance. Subsequently, the velocity field of the original trace is aligned with that of the sub-trace through a KV-Injection Attention mechanism. The model is forced to generate normal details from corresponding low-dimensional contexts via an Alignment Mask. Experimental results on several benchmarks and corresponding ablation studies have demonstrated the effectiveness of the proposed method.
Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
Zhichuan Wang
Huazhong Agricultural University
Yang Zhou
Shenzhen University
Zhe Liu
The University of Hong Kong
Rui Yu
University of Louisville
Song Bai
ByteDance
Yulong Wang
Huazhong Agricultural University
Xinwei He
Huazhong Agricultural University
Xiang Bai
Huazhong University of Science and Technology
Abstract
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior art by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
Deterministic Object Pose Confidence Region Estimation
Jinghao Wang
National University of Defense Technology
Zhang Li
National University of Defense Technology
Zi Wang
National University of Defense Technology
Banglei Guan
National University of Defense Technology
Yang Shang
National University of Defense Technology
Qifeng Yu
National University of Defense Technology
Abstract
6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases; 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling. It provides compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
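A minimal sketch of the split (inductive) conformal calibration step described above, assuming Gaussian 2D keypoint predictions with per-axis standard deviations; the paper's exact nonconformity score and the propagation to 6D pose regions via the implicit function theorem are not shown, and all names below are illustrative.

    import numpy as np

    def conformal_keypoint_threshold(residuals, sigmas, alpha=0.1):
        """Calibrate a score threshold q on held-out data so that the region
        {z : max_k |z_k - mu_k| / sigma_k <= q} covers the true keypoint with
        probability about 1 - alpha (finite-sample corrected).
        residuals: (N, 2) prediction errors; sigmas: (N, 2) predicted stds."""
        scores = np.max(np.abs(residuals) / sigmas, axis=1)   # nonconformity scores
        n = len(scores)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return np.sort(scores)[min(k, n) - 1]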
DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
Youzhuo Wang
ShanghaiTech University
Jiayi Ye
ShanghaiTech University
Chuyang Xiao
ShanghaiTech University
Yiming Zhong
ShanghaiTech University
Heng Tao
ShanghaiTech University
Hang Yu
ShanghaiTech University
Yumeng Liu
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Abstract
Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. Additionally, to ensure natural and human-like dexterous motions, we utilize teleoperation for data collection, enabling the robot's movements to align with human behaviors and habits, which is a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advancements in human-to-robot handover research by offering a high-quality dataset, effective solutions, and comprehensive evaluation metrics. Project is at dexh2r.github.io/.
End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation
Liwei Wang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Yanduo Zhang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Tao Lu
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Fang Liu
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Huiqin Zhang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Jiayi Ma
Wuhan University
Huabing Zhou
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Abstract
Dynamic Scene Graph Generation (DSGG) aims to comprehensively understand videos by abstracting them into visual triplets <subject, predicate, object>. Most existing methods focus on capturing temporal dependencies, but overlook crucial visual relationship dependencies between entities and predicates, as well as among predicate subclasses. These dependencies are essential for a deeper contextual understanding of scenarios. Additionally, current approaches do not support end-to-end training and instead rely on a two-stage pipeline, which incurs higher computational costs. To address these issues, we propose an end-to-end Association Reasoning Network (ARN) for DSGG. ARN leverages CLIP's semantic priors to model fine-grained triplet cues to generate scene graphs. In addition, we design a Predicate Association Parsing (PAP) module that employs a conditional weight mapping mechanism to structure entity and predicate representations. We further introduce a Hierarchical Attention (HA) mechanism to integrate spatio-temporal context with entity and predicate representations, enabling effective associative reasoning. Extensive experiments on the Action Genome dataset demonstrate significant performance improvements over existing methods. The source code is available at https://github.com/wlw951226/ARN.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
Taowen Wang
Rochester Institute of Technology
Cheng Han
University of Missouri - Kansas City
James Liang
U.S. Naval Research Laboratory
Wenhao Yang
Lamar University
Dongfang Liu
Rochester Institute of Technology
Luna Xinyu Zhang
Rochester Institute of Technology
Qifan Wang
Meta AI
Jiebo Luo
University of Rochester
Ruixiang Tang
Rutgers University
Abstract
Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. Despite their significant capabilities, VLA models introduce new attack surfaces. This paper systematically evaluates their robustness. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, we advance both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for continuously developing robust defense strategies prior to physical-world deployments.
Faster and Better 3D Splatting via Group Training
Chengbo Wang
School of Design, Hunan University
Guozheng Ma
Nanyang Technological University
Yifei Xue
School of Design, Hunan University
Yizhen Lao
School of Design, Hunan University
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, demonstrating remarkable capability in high-fidelity scene reconstruction through its Gaussian primitive representations. However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. This approach shows universal compatibility with existing 3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently achieving accelerated training while maintaining superior synthesis quality. Extensive experiments reveal that our straightforward Group Training strategy achieves up to 30% faster convergence and improved rendering quality across diverse scenarios. Project Website: https://chengbo-wang.github.io/3DGSwith-Group-Training/.
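A minimal sketch of the grouping idea, assuming a plain random partition of primitive indices that is cycled through during optimization; the paper's actual grouping criterion and schedule may differ, and all names below are hypothetical.

    import numpy as np

    def make_groups(num_gaussians, num_groups, seed=0):
        """Randomly partition Gaussian primitive indices into groups."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_gaussians), num_groups)

    # Hypothetical training loop: optimize one group of primitives per iteration.
    # groups = make_groups(num_gaussians=500_000, num_groups=4)
    # active = groups[iteration % len(groups)]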
From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
Sen Wang
East China Normal University
Shao Zeng
Tencent Youtu Lab
Tianjun Gu
East China Normal University
Zhizhong Zhang
East China Normal University
Ruixin Zhang
Tencent Youtu Lab
Shouhong Ding
Tencent Youtu Lab
Jingyun Zhang
Tencent WeChat Pay Lab
Jun Wang
Tencent WeChat Pay Lab
Xin Tan
East China Normal University
Yuan Xie
East China Normal University
Lizhuang Ma
East China Normal University
Abstract
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation. The code is available at GEFU.
HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
Yulin Wang
Southeast University
Mengting Hu
Southeast University
Hongli Li
Purdue University
Chen Luo
Southeast University
Abstract
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms them across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
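An illustrative sketch of densifying 2D-3D correspondences by sampling object-space coordinates between predicted front and back surfaces before running PnP; the HCCE encoding itself is not shown, and the function name, layer count, and inputs are assumptions.

    import numpy as np
    import cv2

    def dense_pnp(front_xyz, back_xyz, mask, uv, K, n_layers=8):
        """front_xyz, back_xyz: (H, W, 3) predicted object-space coordinates;
        mask: (H, W) bool foreground mask; uv: (H, W, 2) pixel coordinates;
        K: (3, 3) camera intrinsics."""
        obj, img = [], []
        for t in np.linspace(0.0, 1.0, n_layers):        # interpolate front -> back
            obj.append((1.0 - t) * front_xyz[mask] + t * back_xyz[mask])
            img.append(uv[mask])
        obj = np.concatenate(obj).astype(np.float32)
        img = np.concatenate(img).astype(np.float32)
        ok, rvec, tvec, _ = cv2.solvePnPRansac(obj, img, K, None)
        return (rvec, tvec) if ok else None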
Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection
Hanshi Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Jin Gao
School of Artificial Intelligence, University of Chinese Academy of Sciences
Weiming Hu
School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Abstract
We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retaining complete scene information. Inspired by recent advances in state-space models (SSMs) [8] and linear attention [35, 43], we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with the top-tier NDS score of 75.0 on the nuScenes [2] validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods. Code is available at https://github.com/AutoLab-SAI-SJTU/MambaFusion
HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
Yida Wang
Li Auto Inc.
Xueyang Zhang
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Peng Jia
Li Auto Inc.
Xianpeng Lang
Li Auto Inc.
Abstract
Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. To resolve these issues through a unified pipeline, we introduce: 1) Differential visibility verification through SDF-guided ray tracing, resolving reflection ambiguities via continuous occlusion modeling; 2) Planar-conformal regularization via ray-aligned geometry patches that enforce local surface coherence while preserving sharp edges through adaptive appearance weighting; and 3) Physically-grounded Eikonal relaxation that dynamically modulates geometric constraints based on local radiance gradients, enabling detail preservation without sacrificing global regularity. Unlike prior methods that handle these aspects through sequential optimizations or isolated modules, our approach achieves cohesive integration where appearance-geometry constraints evolve synergistically throughout training. Comprehensive evaluations across synthetic and real-world datasets demonstrate SotA performance, including a 21.4% reduction in Chamfer distance over reflection-aware baselines and a 2.32 dB PSNR improvement against neural rendering counterparts. Qualitative analyses reveal superior capability in recovering specular instruments, urban layouts with centimeter-scale infrastructure, and low-textured surfaces without local patch collapse. The method's generalizability is further validated through successful application to inverse rendering tasks, including material decomposition and view-consistent relighting. The project is hosted here; the urban and vehicle reconstruction modules are excluded from the open-sourced code due to legal concerns.
HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
Yu Wang
School of Remote Sensing and Information Engineering, Wuhan University
Bo Dang
School of Remote Sensing and Information Engineering, Wuhan University
Wanchun Li
School of Remote Sensing and Information Engineering, Wuhan University
Wei Chen
School of Remote Sensing and Information Engineering, Wuhan University
Yansheng Li
School of Remote Sensing and Information Engineering, Wuhan University
Abstract
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at github.com/vvangfaye/HoliTracer
LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association
Peng Wang
School of Information, Renmin University of China
Yongcai Wang
School of Information, Renmin University of China
Hualong Cao
School of Information, Renmin University of China
Wang Chen
School of Information, Renmin University of China
Deying Li
School of Information, Renmin University of China
Abstract
This paper proposes LA-MOTR, a novel Tracking-by-Learnable-Association framework that resolves the competing optimization objectives between detection and association in end-to-end Tracking-by-Attention (TbA) Multi-Object Tracking. Current TbA methods employ shared decoders for simultaneous object detection and tracklet association, often resulting in task interference and suboptimal accuracy. By contrast, our end-to-end framework decouples these tasks into two specialized modules: Separated Object-Tracklet Detection (SOTD) and Spatial-Guided Learnable Association (SGLA). This decoupled design offers flexibility and explainability. In particular, SOTD independently detects new objects and existing tracklets in each frame, while SGLA associates them via a Spatial-Weighted Learnable Attention module guided by relative spatial cues. Temporal coherence is further maintained through a Tracklet Updates Module. The learnable association mechanism resolves the inherent suboptimal association issues in decoupled frameworks, avoiding the task interference commonly observed in joint approaches. Evaluations on the DanceTrack, MOT17, and SportMOT datasets demonstrate state-of-the-art performance. Extensive ablation studies validate the effectiveness of the designed modules. Code is available at https://github.com/PenK1nG/LA-MOTR.
LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
Zijie Wang
Sun Yat-sen University
Weiming Zhang
Baidu Inc.
Wei Zhang
Baidu Inc.
Xiao Tan
Baidu Inc.
Hongxing Liu
Baidu Inc.
Yaowei Wang
Harbin Institute of Technology, Shenzhen
Abstract
Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task. Code will be available at: https://github.com/ZJWang9928/LaneDiffusion.
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang
City University of Hong Kong
Longguang Wang
Shenzhen Campus, Sun Yat-sen University
Chenghao Zhang
Chinese Academy of Sciences
Yongjian Zhang
Shenzhen Campus, Sun Yat-sen University
Zhanjie Zhang
Zhejiang University
Ao Ma
JD.com
Chenyou Fan
South China Normal University
Tin Lun Lam
The Chinese University of Hong Kong, Shenzhen
Junjie Hu
The Chinese University of Hong Kong, Shenzhen
Abstract
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
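As a rough PyTorch analogue of the MoE-LoRA idea (mixing low-rank experts of different ranks on top of a frozen layer via a learned router), the sketch below is an assumption-laden illustration; the ranks, routing, decision network, and placement used in SMoEStereo may differ.

    import torch
    import torch.nn as nn

    class MoELoRALinear(nn.Module):
        """Frozen linear layer plus a softmax-weighted mixture of LoRA experts."""
        def __init__(self, base: nn.Linear, ranks=(2, 4, 8), alpha=1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                 # keep the VFM weights frozen
            d_in, d_out = base.in_features, base.out_features
            self.downs = nn.ModuleList(nn.Linear(d_in, r, bias=False) for r in ranks)
            self.ups = nn.ModuleList(nn.Linear(r, d_out, bias=False) for r in ranks)
            for up in self.ups:
                nn.init.zeros_(up.weight)               # experts start as a no-op
            self.router = nn.Linear(d_in, len(ranks))
            self.alpha = alpha

        def forward(self, x):                           # x: (B, N, d_in) tokens
            gates = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (B, E)
            out = self.base(x)
            for e, (down, up) in enumerate(zip(self.downs, self.ups)):
                out = out + self.alpha * gates[:, e, None, None] * up(down(x))
            return out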
LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
Jingjing Wang
State Key Lab of CAD&CG, Zhejiang University
Qirui Hu
State Key Lab of CAD&CG, Zhejiang University
Chong Bao
State Key Lab of CAD&CG, Zhejiang University
Yuke Zhu
State Key Lab of CAD&CG, Zhejiang University
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Zhaopeng Cui
State Key Lab of CAD&CG, Zhejiang University
Guofeng Zhang
State Key Lab of CAD&CG, Zhejiang University
Abstract
Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, over 50K images at varying scales spanning street-level and aerial perspectives, and rich properties such as depth, normals, material components, light and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research. Project page: https://zju3dv.github.io/lightcity/.
MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration
Tao Wang
Nanjing University
Peiwen Xia
Nanjing University
Bo Li
vivo Mobile Communication Co., Ltd
Peng-Tao Jiang
vivo Mobile Communication Co., Ltd
Zhe Kong
Shenzhen Campus of Sun Yat-sen University
Kaihao Zhang
Harbin Institute of Technology (Shenzhen)
Tong Lu
Nanjing University
Wenhan Luo
The Hong Kong University of Science and Technology
Abstract
Adverse weather conditions, such as rain, snow, and haze, introduce complex degradations that present substantial challenges for effective image restoration. Existing all-in-one models often rely on fixed network structures, limiting their ability to adapt to the varying characteristics of different weather conditions. Moreover, these models typically lack the iterative refinement process that human experts use for progressive image restoration. In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions. Our method incorporates two core types of experts, i.e., channel-wise modulation and spatial modulation experts, to address task-specific degradation characteristics while minimizing task interference. In addition, inspired by human expertise, we frame the optimization process as a sequential, progressive problem, allowing the network to refine its parameters progressively and adapt to specific weather conditions. Extensive experiments demonstrate the efficacy and superiority of our proposed method.
MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Shibo Wang
The Hong Kong University of Science and Technology (Guangzhou)
Haonan He
The Hong Kong University of Science and Technology (Guangzhou)
Maria Parelli
ETH Zürich
Christoph Gebhardt
The Hong Kong University of Science and Technology (Guangzhou)
Zicong Fan
ETH Zürich
Jie Song
The Hong Kong University of Science and Technology
Abstract
Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align the hand to the object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction. († Prior to joining the University of Tübingen and Tübingen AI Center.)
Mamba-3VL: Taming State Space Model for 3D Vision Language Learning
Yuan Wang
Tsinghua University
Yuxin Chen
ARC Lab, Tencent PCG
Zhongang Qi
ARC Lab, Tencent PCG
Lijun Liu
UCAS
Jile Jiao
Deepeleph
Xuetao Feng
Deepeleph
Yujia Liang
HUST
Ying Shan
ARC Lab, Tencent PCG
Zhipeng Zhang
School of Artificial Intelligence, SJTU
Abstract
3D vision-language (3D-VL) reasoning, connecting natural language with the 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limit effective modeling of long-range 3D-VL dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSM) have emerged as promising linear-complexity alternatives for sequential data processing, while their inherent selection mechanism offers notable capability for spatial modeling. Despite this potential, a straightforward adoption of Mamba for 3D-VL tasks encounters two obstacles: (1) how to perceive the position of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergies of multi-modal features. (This work was done during an internship at Tencent. The code is released at https://github.com/wangyuan123ac/Mamba-3VL.) In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework to model complex intra- and inter-modality correlations and enhance spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, the Mamba Mixer explicitly models 3D-VL interaction via channel twisting and a relation-prioritized spatial scanning policy. It maximally retains the spatial relations of object-centric features. To further provide precise spatial encoding for Mamba, we develop an Instance-aware Dynamic Position Adapter (IDPA) to dynamically adjust instance-specific positional embeddings and enhance local spatial relations of 3D objects. Extensive results validate that Mamba-3VL outperforms other competitors on seven 3D-VL benchmarks and showcases versatile potential for challenging Embodied AI tasks.
MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Zihan Wang
Carnegie Mellon University
Jeff Tan
Carnegie Mellon University
Tarasha Khurana
Carnegie Mellon University
Neehar Peri
Carnegie Mellon University
Deva Ramanan
Carnegie Mellon University
Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on Github.
Monocular Semantic Scene Completion via Masked Recurrent Networks
Xuzhi Wang
Tianjin Normal University
Xinran Wu
Tianjin Normal University
Song Wang
Zhejiang University
Lingdong Kong
National University of Singapore
Ziping Zhao
Tianjin Normal University
Abstract
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available at: https://github.com/alanWXZ/MonoMRN.
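A small sketch in the spirit of a recurrence restricted to masked (occupied) positions; the actual MS-GRU gating, sparsity, and mask-update rule are not reproduced, and the class below is purely illustrative.

    import torch
    import torch.nn as nn

    class MaskedGRU(nn.Module):
        """Update hidden states only at positions selected by a mask."""
        def __init__(self, dim):
            super().__init__()
            self.cell = nn.GRUCell(dim, dim)

        def forward(self, h, x, mask):
            # h, x: (N, dim) hidden/input features over N voxels; mask: (N,) bool
            h_new = h.clone()
            if mask.any():
                h_new[mask] = self.cell(x[mask], h[mask])
            return h_new                                # unmasked states pass through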
Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang
Northwestern Polytechnical University
Yifei Su
University of Chinese Academy of Sciences
Chenhui Li
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Yan Huang
University of Chinese Academy of Sciences
Xuelong Li
TeleAI
Bin Zhao
Northwestern Polytechnical University
Abstract
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely used datasets, demonstrating the versatility and effectiveness of our method. Code is available here.
PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
Haotian Wang
Xi'an Jiaotong University
Aoran Xiao
Nanyang Technological University
Xiaoqin Zhang
Zhejiang University of Technology
Meng Yang
Xi'an Jiaotong University
Shijian Lu
Nanyang Technological University
Abstract
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
Precise Action-to-Video Generation Through Visual Action Prompts
Yuang Wang
Zhejiang University
Chao Wen
Fudan University
Haoyu Guo
Zhejiang University
Sida Peng
Zhejiang University
Minghan Qin
Tsinghua University
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Ruizhen Hu
Xiangjiang Lab
Abstract
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to 'render' actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight finetuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid [64], RT-1 [11] and DROID [35] demonstrate the effectiveness of our proposed approach.
ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts
Xiaoqi Wang
Bosch Research North America
Clint Sebastian
Bosch Center for Artificial Intelligence (BCAI)
Wenbin He
Bosch Research North America
Liu Ren
Bosch Research North America
Abstract
The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at the boundaries of target regions due to a suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5i and COCO-20i datasets, providing a more robust solution for visual reference segmentation.
Recognizing Actions from Robotic View for Natural Human-Robot Interaction
Ziyi Wang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Peiming Li
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Zhichao Deng
Sun Yat-sen University
Can Wang
Kiel University
Jun Liu
Lancaster University
Junsong Yuan
State University of New York at Buffalo
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Abstract
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
Liuyi Wang
Tongji University
Xinyuan Xia
Shanghai AI Laboratory
Hui Zhao
Shanghai AI Laboratory
Hanqing Wang
Shanghai AI Laboratory
Tai Wang
Shanghai AI Laboratory
Yilun Chen
Shanghai AI Laboratory
Chengju Liu
Tongji University
Qijun Chen
Tongji University
Jiangmiao Pang
Shanghai AI Laboratory
Abstract
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io.
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang
NLPR, MAIS, CASIA
Yucheng Zhao
Dexmal
Tiancai Wang
Dexmal
Haoqiang Fan
Dexmal
Xiangyu Zhang
MEGVII Technology
Zhaoxiang Zhang
NLPR, MAIS, CASIA
Abstract
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (ROSS3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, ROSS3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
S3E: Self-Supervised State Estimation for Radar-Inertial System
Shengpeng Wang
Huazhong University of Science and Technology
Yulong Xie
Huazhong University of Science and Technology
Qing Liao
Harbin Institute of Technology
Wei Wang
Wuhan University
Abstract
Millimeter-wave radar for state estimation is gaining significant attention for its affordability and reliability in harsh conditions. Existing localization solutions typically rely on post-processed radar point clouds as landmark points. Nonetheless, the inherent sparsity of radar point clouds, ghost points from multi-path effects, and the limited angle resolution of single-chirp radar severely degrade state estimation performance. To address these issues, we propose S3E, a Self-Supervised State Estimator that employs more richly informative radar signal spectra to bypass sparse points and fuses complementary inertial information to achieve accurate localization. S3E fully explores the association between the exteroceptive radar and proprioceptive inertial sensors to achieve complementary benefits. To deal with limited angle resolution, we introduce a novel cross-fusion technique that enhances spatial structure information by exploiting subtle rotational shift correlations across heterogeneous data. The experimental results demonstrate that our method achieves robust and accurate performance without relying on localization ground truth supervision. To the best of our knowledge, this is the first attempt to achieve state estimation by fusing radar spectra and inertial data in a complementary self-supervised manner.
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Xiyao Wang
University of Maryland, College Park
Zhengyuan Yang
Microsoft
Linjie Li
Microsoft
Hongjin Lu
University of Maryland, College Park
Yuancheng Xu
University of Maryland, College Park
Chung-Ching Lin
Microsoft
Kevin Lin
Microsoft
Furong Huang
University of Maryland, College Park
Lijuan Wang
Microsoft
Abstract
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the generated sentence at the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods guided by other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
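To make the search procedure described above concrete, the following is a minimal sketch of sentence-level, value-guided decoding under stated assumptions: sample_candidates and value_model are hypothetical stand-ins for the VLM sampler and the learned value model, not the authors' code.

import random

def sample_candidates(prefix, k):
    # Stand-in for a VLM that proposes k candidate next sentences given a prefix.
    return [f"{prefix}<candidate sentence {i}> " for i in range(k)]

def value_model(image, response):
    # Stand-in for a learned value model scoring long-term response quality.
    return random.random()

def guided_decode(image, num_steps=3, k=4):
    response = ""
    for _ in range(num_steps):
        candidates = sample_candidates(response, k)
        # Keep the candidate with the highest predicted long-term value.
        response = max(candidates, key=lambda c: value_model(image, c))
    return response

print(guided_decode(image=None))

In practice, the value model would score image-conditioned partial captions rather than random placeholders; the loop structure is the point of the sketch.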
Shape of Motion: 4D Reconstruction from a Single Video
Qianqian Wang
UC Berkeley
Vickie Ye
UC Berkeley
Hang Gao
UC Berkeley
Weijia Zeng
UC San Diego
Jake Austin
UC Berkeley
Zhengqi Li
Adobe Research
Angjoo Kanazawa
UC Berkeley
Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
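As a rough illustration of expressing per-point motion as a weighted combination of a small set of rigid motion bases, here is a minimal numpy sketch; the linear blend of transformed points and all tensor shapes are expository assumptions, not the paper's exact SE(3) parameterization.

import numpy as np

def blend_motion(points, rotations, translations, weights):
    """points: (N,3); rotations: (B,3,3); translations: (B,3); weights: (N,B)."""
    # Transform every point by every rigid basis: (B, N, 3)
    transformed = np.einsum('bij,nj->bni', rotations, points) + translations[:, None, :]
    # Soft-blend the B rigid candidates per point with the per-point weights.
    return np.einsum('nb,bni->ni', weights, transformed)

N, B = 5, 3
points = np.random.randn(N, 3)
rotations = np.stack([np.eye(3)] * B)          # identity rotations for simplicity
translations = np.random.randn(B, 3) * 0.1
weights = np.random.rand(N, B)
weights /= weights.sum(axis=1, keepdims=True)  # convex combination per point
print(blend_motion(points, rotations, translations, weights).shape)  # (5, 3)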
StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
Chuxin Wang
University of Science and Technology of China
Yixin Zha
University of Science and Technology of China
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Abstract
Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging the State Space Model (SSM) with its efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: destroying the adjacency of 3D points during SSM processing, and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains SOTA accuracy of 95.1% on ModelNet40 and 92.75% on the most challenging split of ScanObjectNN without a voting strategy.
The Source Image is the Best Attention for Infrared and Visible Image Fusion
Song Wang
School of Computer Science and Technology, North University of China
Xie Han
School of Computer Science and Technology, North University of China
Liqun Kuang
School of Computer Science and Technology, North University of China
Boying Wang
School of Computer Science and Technology, North University of China
Zhongyu Chen
School of Computer Science and Technology, North University of China
Zherui Qiao
School of Computer Science and Technology, North University of China
Fan Yang
School of Computer Science and Technology, North University of China
Xiaoxia Liu
School of Computer Science and Technology, North University of China
Bingyu Zhang
School of Computer Science and Technology, North University of China
Zhixun Wang
School of Computer Science and Technology, North University of China
Abstract
Infrared and visible image fusion (IVF) endeavors to engineer composite outputs by blending the optimal virtues of divergent modalities. This paper is the first to reveal the intrinsic 'attention properties' of infrared images, which arise directly from their physical characteristics (i.e., heat distribution) and can be linked naturally to attention mechanisms, as observed in gradient-weighted class activation mapping (Grad-CAM) visualizations of image classification models. To incorporate this property into IVF for better fusion, we propose source infrared cross attention (I-SCA) and further extend it to the visible modality, introducing source visible cross attention (V-SCA). The joint use of I-SCA and V-SCA greatly alleviates longstanding issues in IVF, such as insufficient and incomplete multimodal feature interaction and fusion. Moreover, an auxiliary component for I-SCA and V-SCA, termed CBSM, is employed to boost the channel and spatial map space and to suppress redundant and misleading information in the source images. Specifically, we treat the CBSM-processed raw image directly as the query, while the intermediate features of the other modality serve as keys and values in I-SCA and V-SCA. Unlike attention mechanisms that divide images into patches or limit computation to local windows, our cross attention modules achieve smoother and more robust IVF through true global modeling across the entire image space with linear complexity. Comparisons with current SOTA methods on three popular public datasets confirm its superiority.
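For readers unfamiliar with cross attention in which the raw source image supplies the queries, the torch sketch below illustrates the idea; it uses standard softmax attention for brevity (the paper targets a linear-complexity global formulation), and the module names and shapes are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class SourceCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, raw_tokens, other_feats):
        """raw_tokens: (B, N, D) from the raw source image; other_feats: (B, M, D)."""
        q, k, v = self.q(raw_tokens), self.k(other_feats), self.v(other_feats)
        # Scaled dot-product attention: the raw-image queries attend to the
        # intermediate features of the other modality (keys and values).
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                            # (B, N, D)

layer = SourceCrossAttention(dim=32)
print(layer(torch.randn(2, 64, 32), torch.randn(2, 49, 32)).shape)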
TopicGeo: An Efficient Unified Framework for Geolocation
Xin Wang
Xidian University
Xinlin Wang
Xidian University
Shuiping Gou
Xidian University
Abstract
Vision-based geolocation techniques that establish spatial correspondences between smaller query images and larger georeferenced images have gained significant attention. Existing approaches typically employ a separate 'retrieve-then-match' paradigm, but such paradigms suffer from either computational inefficiency or precision limitations. To this end, we propose TopicGeo, a unified framework for direct and precise query-to-reference image matching via three key innovations. Textual object semantics, called topics, distilled from CLIP prompt learning are embedded into the geolocation framework to eliminate intra-class and inter-class distribution discrepancies while also enhancing processing efficiency. Center-based adaptive label assignment and outlier rejection mechanisms serve as a joint retrieval-matching optimization strategy, ensuring task-coherent feature learning and precise spatial correspondences. A multi-level fine matching pipeline is introduced to refine matching in terms of both quality and quantity. Evaluations on large-scale synthetic and real-world datasets illustrate that TopicGeo achieves state-of-the-art performance in retrieval recall and matching accuracy while remaining computationally efficient.
TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
Mengmeng Wang
Zhejiang University of Technology
Haonan Wang
Zhejiang University of Technology
Yulong Li
Zhejiang University of Technology
Xiangjie Kong
Zhejiang University of Technology
Jiaxin Du
Zhejiang University of Technology
Guojiang Shen
Zhejiang University of Technology
Feng Xia
RMIT University
Abstract
3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.
UAVScenes: A Multi-Modal Dataset for UAVs
Sijie Wang
Nanyang Technological University
Siqi Li
Nanyang Technological University
Yawei Zhang
Nanyang Technological University
Shangshu Yu
School of Computer Science and Engineering, Northeastern University
Shenghai Yuan
Nanyang Technological University
Rui She
Beihang University
Quanjiang Guo
University of Electronic Science and Technology of China
Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Yuping Wang
University of California, Riverside
Xiangyu Huang
University of Wisconsin, Madison
Xiaokang Sun
University of Michigan
Mingxuan Yan
University of California, Riverside
Shuo Xing
Texas A&M University
Zhengzhong Tu
Texas A&M University
Jiachen Li
University of California, Riverside
Abstract
We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and novel per-voxel flow annotations. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment of additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.
V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception
Bowen Wang
Shanghai Jiao Tong University
Yafei Wang
Shanghai Jiao Tong University
Wei Gong
Shanghai Jiao Tong University
Siheng Chen
Shanghai AI Laboratory
Genjia Liu
Shanghai Jiao Tong University
Minhao Xiong
Shanghai Jiao Tong University
Chin Long Ng
Shanghai Jiao Tong University
Abstract
Whether autonomous driving can effectively handle challenging scenarios such as bad weather and complex traffic environments remains in doubt. One critical difficulty is that single-view perception struggles to obtain complementary perceptual information in multi-condition scenes, for example under occlusion and congestion. To investigate the advantages of collaborative perception in high-risk driving scenarios, we construct a multiple-challenging-conditions dataset for large-range vehicle-infrastructure cooperative perception, called V2XScenes, which includes seven typical multi-modal layouts along successive road sections. In particular, each selected scene is labeled with a specific condition description, and we provide unique object tracking numbers across the entire road section and sequential frames to ensure consistency. Comprehensive cooperative perception benchmarks for 3D object detection and tracking in large-range roadside scenes are summarized, and quantitative results based on state-of-the-art methods demonstrate the effectiveness of collaborative perception in challenging scenes. The data and benchmark code of V2XScenes will be released.
VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference
Meiqi Wang
Tsinghua University
Han Qiu
Tsinghua University
Abstract
In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language models (VLMs) to enhance open-vocabulary capability. However, adopting them on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs are insufficient for complex models' inference. We reveal their lack of a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. Also, VISO suppresses irrelevant regions, enabling highly sparse inference to accelerate speed on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, improving AP by 34.1% while reducing FLOPs by 27x, and surpasses specialist models in supervised object detection and object referring, improving AP by 2.3%. When sparsifying VISO to a comparable AP, FLOPs can be further reduced by up to 8.5x. Real-world tests reveal that VISO achieves a 2.8-4.8x FPS speed-up on satellites' embedded GPUs.
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University
Haoyi Zhu
Shanghai AI Lab
Mingyu Liu
USTC
Jiange Yang
ZJU
Hao-Shu Fang
NJU
Tong He
SJTU
Abstract
In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly; most notably, it achieves up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains.
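As a minimal illustration of vector-quantized action tokenization, the sketch below maps continuous action chunks to discrete codebook indices by nearest-neighbour lookup; the codebook size, action dimensionality, and function names are assumptions for exposition, not the released tokenizer.

import numpy as np

def quantize_actions(actions, codebook):
    """actions: (T, D) continuous actions; codebook: (K, D) code vectors.
    Returns integer tokens (T,) and the reconstructed (de-tokenized) actions."""
    # Squared distances between every action and every code vector: (T, K)
    d2 = ((actions[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.random.randn(256, 7)      # e.g. 7-DoF actions, 256 codes (assumed)
actions = np.random.randn(20, 7)        # a short action trajectory
tokens, recon = quantize_actions(actions, codebook)
print(tokens[:5], np.abs(actions - recon).mean())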
VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders
Qi Wang
School of Mathematics and Computer Sciences, Nanchang University
Zeyu Zhang
School of Mathematics and Computer Sciences, Nanchang University
Dong Wang
School of Software, Nanchang University
Di Gai
School of Mathematics and Computer Sciences, Nanchang University
Xin Xiong
The First Affiliated Hospital, Jiangxi Medical College, Nanchang University
Jiyang Xu
School of Mathematics and Computer Sciences, Nanchang University
Ruihua Zhou
School of Software, Nanchang University
Abstract
Large-scale pre-training technology has achieved remarkable performance in diversified object re-identification (Re-ID) downstream tasks. Nevertheless, to the best of our knowledge, a pre-training model specifically for vehicle Re-ID, which focuses on tackling the challenge of multi-view variations, has not been fully investigated. In this paper, we first leverage a diffusion model to build a large-scale vehicle Re-ID benchmark dataset, dubbed 'DiffVERI', containing over 1700K images with abundant multi-view annotations. Based on this dataset, we further present VehicleMAE, a novel masked image modeling pre-training paradigm that learns view-invariant representations by performing mutual distillation in a self-supervised manner. To be specific, the pipeline of VehicleMAE comprises two core modules, i.e., view-asymmetry masked image modeling (VMIM) and past-to-present mutual-distillation (PPMD). Technically, VMIM consists of two homogeneous masked autoencoders (MAE) that simultaneously reconstruct the RGB pixels and multi-view semantic information of the specific vehicle body region via paired asymmetric mask sampling strategies. To progressively distill the knowledge of the model itself, PPMD treats the two MAEs in the current epoch and the previous one as the student models and the teacher models, respectively, and leverages the knowledge learned by the current student and the historical teacher for mutual feature-level distillation. Extensive experimental results verify that the proposed pre-training paradigm on DiffVERI yields compelling downstream task performance for vehicle Re-ID.
YOLOE: Real-Time Seeing Anything
Ao Wang
School of Software, Tsinghua University
Lihao Liu
School of Software, Tsinghua University
Hui Chen
BNRist, Tsinghua University
Zijia Lin
School of Software, Tsinghua University
Jungong Han
Department of Automation, Tsinghua University
Guiguang Ding
School of Software, Tsinghua University
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose a Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present a Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce a Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3x less training cost and 1.4x inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP^b and 0.4 AP^m gains over closed-set YOLOv8-L with nearly 4x less training time. Code and models are available here.
You Think, You ACT: The New Task of Arbitrary Text to Motion Generation
Runqi Wang
National Engineering Research Center for Multimedia Software, Wuhan University
Caoyuan Ma
School of Computer Science, Wuhan University
Guopeng Li
StepFun
Hanrui Xu
School of Computer Science, Wuhan University
Yuke Li
University of Maryland College Park
Zheng Wang
National Engineering Research Center for Multimedia Software, Wuhan University
Abstract
Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include action labels (e.g., 'walk, bend'), which limits flexibility and practicability in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, newly introduced Scene Texts may yield multiple reasonable output results, causing significant challenges for existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HUMANML3D++, by extending the texts of the well-annotated dataset HUMANML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction. More details are available at https://github.com/RunqiWang77/TAAT.github.io.
ZeroStereo: Zero-shot Stereo Matching from Single Images
Xianqi Wang
Huazhong University of Science and Technology
Hao Yang
Huazhong University of Science and Technology
Gangwei Xu
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Min Lin
Huazhong University of Science and Technology
Yong Deng
Autel Robotics
Jinliang Zang
Autel Robotics
Yurui Chen
Autel Robotics
Xin Yang
Optics Valley Laboratory
Abstract
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets, with only a dataset volume comparable to Scene Flow. Code: https://github.com/Windsrain/ZeroStereo.
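The core warping step, synthesizing a right view from a single image and a pseudo-disparity map while leaving occluded holes for a subsequent inpainting model, can be sketched as follows; this naive forward warp ignores depth ordering and is purely illustrative, not the authors' pipeline.

import numpy as np

def forward_warp_right(left, disparity):
    """left: (H, W, 3); disparity: (H, W) non-negative, in pixels."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    hole_mask = np.ones((H, W), dtype=bool)   # True where nothing lands (occlusions)
    for y in range(H):
        for x in range(W):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < W:
                right[y, xr] = left[y, x]
                hole_mask[y, xr] = False
    return right, hole_mask

left = np.random.rand(8, 16, 3)
disp = np.full((8, 16), 2.0)
right, holes = forward_warp_right(left, disp)
print(right.shape, holes.sum(), "hole pixels to inpaint")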
MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation
Syed Talal Wasim
University of Bonn
Hamid Suleman
University of Bonn
Olga Zatsarynna
University of Bonn
Muzammal Naseer
Khalifa University
Juergen Gall
University of Bonn
Abstract
We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. While recent State Space Models (SSMs) like Mamba have shown promise through input-dependent selectivity on three key parameters, the critical forget-gate (A matrix) controlling temporal memory remains static. We address this limitation by introducing a mixture of experts approach that dynamically selects contextually relevant A matrices based on input features, enhancing representational capacity without sacrificing computational efficiency. Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets demonstrate that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Our results highlight the importance of input-dependent forget-gate mechanisms for reliable prediction of human behavior in diverse real-world scenarios. The project page is available at https://talalwasim.github.io/MixANT/.
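A minimal sketch of routing over several candidate forget-gate parameter sets is given below; the plain leaky recurrence, the sigmoid squashing, and the tensor shapes are illustrative assumptions and do not reproduce the Mamba selective-scan machinery.

import torch
import torch.nn as nn

class MixtureForgetGate(nn.Module):
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_experts, dim) * 0.1)  # per-expert decay params
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        """x: (B, T, D) input features; returns a simple gated running state."""
        B, T, D = x.shape
        h = torch.zeros(B, D)
        for t in range(T):
            gate = torch.softmax(self.router(x[:, t]), dim=-1)      # (B, E) routing weights
            A_t = torch.einsum('be,ed->bd', gate, self.A)           # input-dependent mixed A
            decay = torch.sigmoid(A_t)                              # keep the gate in (0, 1)
            h = decay * h + (1.0 - decay) * x[:, t]                 # leaky recurrence
        return h

layer = MixtureForgetGate(dim=16)
print(layer(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 16])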
3D Test-time Adaptation via Graph Spectral Driven Point Shift
Xin Wei
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Qin Yang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Yijie Fang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Mingrui Zhu
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Nannan Wang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Abstract
While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by the irregular and unordered structure of the data. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in the target domain are represented as outlier-aware graphs and transformed into the graph spectral domain by the Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud's energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, which outperforms existing TTA methods for 3D point cloud classification.
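The sketch below illustrates the spectral pipeline described above: build a point-cloud graph, take the graph Fourier transform, perturb only the lowest 10% of frequency components, and invert. The kNN graph construction and the random update standing in for the learned optimization are assumptions for exposition.

import numpy as np

def knn_laplacian(points, k=8):
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]          # k nearest neighbours (skip self)
    W = np.zeros((len(points), len(points)))
    rows = np.repeat(np.arange(len(points)), k)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                            # symmetrize the adjacency
    return np.diag(W.sum(1)) - W                      # combinatorial graph Laplacian

points = np.random.randn(64, 3)
L = knn_laplacian(points)
eigvals, U = np.linalg.eigh(L)                        # GFT basis (columns of U)
spec = U.T @ points                                   # graph Fourier coefficients, (N, 3)

low = max(1, int(0.1 * len(points)))                  # lowest 10% of frequencies
spec[:low] += 0.01 * np.random.randn(low, 3)          # stand-in for the learned update
adapted = U @ spec                                    # inverse GFT -> shifted point cloud
print(adapted.shape)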
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance
Yi-Lin Wei
School of Computer Science and Engineering, Sun Yat-sen University
Mu Lin
School of Computer Science and Engineering, Sun Yat-sen University
Yuhao Lin
School of Computer Science and Engineering, Sun Yat-sen University
Jian-Jian Jiang
School of Computer Science and Engineering, Sun Yat-sen University
Xiao-Ming Wu
School of Computer Science and Engineering, Sun Yat-sen University
Ling-An Zeng
School of Computer Science and Engineering, Sun Yat-sen University
Wei-Shi Zheng
School of Computer Science and Engineering, Sun Yat-sen University
Abstract
Language-guided robot dexterous grasp generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intention and to execute grasping for unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose an Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging the gap with a new generalizable-instructive affordance representation. This affordance can generalize to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon the affordance, our framework introduces Affordance Flow Matching (AFM) for affordance generation with language as input, and Grasp Flow Matching (GFM) for generating dexterous grasps with affordance as input. To evaluate our framework, we build an open-set table-top language-guided dexterous grasp dataset. Extensive experiments in both simulation and the real world show that our framework surpasses all previous methods in open-set generalization.
EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
Xiaobao Wei
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Qingpo Wuwu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhongyu Zhao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhuangzhe Wu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Nan Huang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ming Lu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ningning Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shanghang Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians either in a supervised manner (e.g., with 3D bounding boxes) or in a self-supervised manner (e.g., without 3D bounding boxes). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed plug-and-play EMD module compensates for the lack of motion modeling in self-supervised street Gaussian splatting methods. We also introduce tailored training strategies to extend EMD to supervised approaches. Comprehensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art novel view synthesis performance in self-supervised settings. The code is available at: https://qingpowuwu.github.io/emd.
Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
Xiangbin Wei
Shenzhen University
Yuanfeng Wang
Quantum Science Center of Guangdong-Hong Kong-Macao Greater Bay Area
Ao XU
Research Institute of Tsinghua University in Shenzhen
Lingyu Zhu
YunJi Intelligent Engineering Co., Ltd.
Dongyong Sun
YunJi Intelligent Engineering Co., Ltd.
Keren Li
Shenzhen University
Yang Li
Nanjing University
Qi Qin
City University of Hong Kong
Abstract
Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie's formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods and thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance among unsupervised learning methods on standard benchmarks in terms of Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond training datasets. By addressing the generalization issue and the challenge of the absence of clean data in learning-based methods, our method paves the way for learning-based point cloud denoising in real-world applications.
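Single-step denoising with Tweedie's formula, x_hat = y + sigma^2 * s_theta(y), can be sketched as follows; the score function here is a stand-in for the learned network, and sigma is the assumed Gaussian noise level.

import numpy as np

def score_fn(noisy_points):
    # Stand-in for a learned score network s_theta(y) approximating grad_y log p(y).
    return -0.1 * noisy_points

def tweedie_denoise(noisy_points, sigma):
    # One-step posterior mean estimate under Gaussian noise via Tweedie's formula.
    return noisy_points + (sigma ** 2) * score_fn(noisy_points)

noisy = np.random.randn(100, 3)
print(tweedie_denoise(noisy, sigma=0.05).shape)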
PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
Yu Wei
Nanyang Technological University
Jiahui Zhang
Nanyang Technological University
Xiaoqin Zhang
Zhejiang University of Technology
Ling Shao
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Shijian Lu
Nanyang Technological University
Abstract
COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories, characterized by drastic rotation and translation across adjacent camera views, leading to degraded camera pose estimation and to local minima in the joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization, which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization, which exploits discrepancies in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
Passing the Driving Knowledge Test
Maolin Wei
Boston University
Wanzhou Liu
Washington University in St. Louis
Eshed Ohn-Bar
Boston University
Abstract
If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
Shenxing Wei
The Hong Kong Polytechnic University
Jinxi Li
The Hong Kong Polytechnic University
Yafei Yang
The Hong Kong Polytechnic University
Siyuan Zhou
The Hong Kong Polytechnic University
Bo Yang
The Hong Kong Polytechnic University
Abstract
In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or from 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods, which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called the raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass on unseen datasets at test time. Our code and datasets are available at https://github.com/vLAR-group/RayletDF.
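A raylet distance field turns a predicted distance along each query ray into a surface point, and multiple raylet predictions can then be blended; the sketch below illustrates this with a simple confidence-weighted average, which is an assumption for exposition rather than the paper's multi-raylet blender.

import numpy as np

def raylet_surface_points(origins, directions, distances, weights):
    """origins, directions: (R, M, 3) for M raylets per ray (unit directions);
    distances, weights: (R, M) predicted distances and confidences."""
    candidates = origins + distances[..., None] * directions      # candidate surface points
    w = weights / weights.sum(axis=1, keepdims=True)              # normalize confidences
    return (w[..., None] * candidates).sum(axis=1)                # blended points, (R, 3)

R, M = 4, 3
origins = np.zeros((R, M, 3))
directions = np.tile(np.array([0.0, 0.0, 1.0]), (R, M, 1))
distances = np.random.rand(R, M) + 1.0
weights = np.random.rand(R, M)
print(raylet_surface_points(origins, directions, distances, weights))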
Object-level Correlation for Few-Shot Segmentation
Chunlin Wen
School of Computer Science and Engineering, Southeast University
Yu Zhang
School of Computer Science and Engineering, Southeast University
Jie Fan
Samsung Electronics (China) R&D Centre
Hongyuan Zhu
Institute for Infocomm Research (I2R), A*STAR Singapore
Xiu-Shen Wei
School of Computer Science and Engineering, Southeast University
Yijun Wang
School of Computer Science and Engineering, Southeast University
Zhiqiang Kou
School of Computer Science and Engineering, Southeast University
Shuzhou Sun
Shanghai AI Laboratory
Abstract
Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build an image-level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to overfitting of the background. To address this limitation, we imitate the biological vision process by identifying novel objects from object-level information. Identifying the target among general objects is more reliable than searching the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) that establishes an object-level correlation between the support target object and query general objects, mainly composed of a General Object Mining Module (GOMM) and a Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include both irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-5^i and COCO-20^i show that our model achieves state-of-the-art performance.
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
Tianci Wen
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Zhiang Liu
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Yongchun Fang
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Abstract
3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages highly structured point clouds to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of 19.86% in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras.
Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation
Hongyu Wen
Department of Computer Science, Princeton University
Yiming Zuo
Department of Computer Science, Princeton University
Venkat Subramanian
Department of Computer Science, Princeton University
Patrick Chen
Department of Computer Science, Princeton University
Jia Deng
Department of Computer Science, Princeton University
Abstract
Transparent objects are common in daily life, and understanding their multi-layer depth information, i.e., perceiving both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset produce good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increasing from 55.14% to 75.20%. All images and validation annotations are available under CC0 at https://layereddepth.cs.princeton.edu.
ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
Lena Wild
KTH Royal Institute of Technology
Rafael Valencia
TRATON
Patric Jensfelt
KTH Royal Institute of Technology
Abstract
Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://KTH-RPL.github.io/ArgoTweak/.
Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
Conghao Wong
Huazhong University of Science and Technology
Ziqian Zou
Huazhong University of Science and Technology
Beihao Xia
Huazhong University of Science and Technology
Abstract
Learning to forecast trajectories of intelligent agents has attracted increasing attention recently. However, it remains a challenge to accurately account for agents' intentions and social behaviors when forecasting, and in particular, to simulate the unique randomness within each of those components in an explainable and decoupled way. Inspired by vibration systems and their resonance properties, we propose the Resonance (Re for short) model to encode and forecast pedestrian trajectories in the form of 'co-vibrations'. It decomposes trajectory modifications and randomness into multiple vibration portions to simulate agents' reactions to each single cause, and forecasts trajectories as the superposition of these independent vibrations. Also, benefiting from such vibrations and their spectral properties, representations of social interactions can be learned by emulating the resonance phenomenon, further enhancing explainability. Experiments on multiple datasets have verified its usefulness both quantitatively and qualitatively.
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
Tianhao Wu
Nanyang Technological University
Chuanxia Zheng
University of Oxford
Frank Guan
Singapore Institute of Technology
Andrea Vedaldi
University of Oxford
Tat-Jen Cham
Nanyang Technological University
Abstract
Most existing image-to-3D models assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional image-to-3D model designed to reconstruct plausible 3D geometry and appearance from partial observations. We extend a 'foundation' 3D generator by introducing a visible mask-weighted attention mechanism and an occlusion-aware attention layer that explicitly leverage visible and occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms state-of-the-art methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Jianyu Wu
Shanghai Jiao Tong University
Yizhou Wang
The Chinese University of Hong Kong
Xiangyu Yue
The Chinese University of Hong Kong
Xinzhu Ma
Shanghai Artificial Intelligence Laboratory
Jinyang Guo
Beihang University
Dongzhan Zhou
Shanghai Artificial Intelligence Laboratory
Wanli Ouyang
Shanghai Artificial Intelligence Laboratory
Shixiang Tang
Shanghai Artificial Intelligence Laboratory
Abstract
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset aspects. First, we propose a cascade MAR [25] with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the 'edge-counters-surface' priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC [21] in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC.
Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling
Qirui Wu
Simon Fraser University
Denys Iliash
Simon Fraser University
Daniel Ritchie
Brown University
Manolis Savva
Simon Fraser University
Angel X. Chang
Simon Fraser University
Abstract
Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce better solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to real-world internet images and the text-to-scene task.
Efficient Spiking Point Mamba for Point Cloud Analysis
Peixi Wu
University of Science and Technology of China
Bosong Chai
Zhejiang University
Menghua Zheng
Tsingmao Intelligence
Wei Li
University of Science and Technology of China
Zhangchi Hu
University of Science and Technology of China
Jie Chen
University of Science and Technology of China
Zheyu Zhang
University of Science and Technology of China
Hebei Li
University of Science and Technology of China
Xiaoyan Sun
University of Science and Technology of China
Abstract
Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Naively adapting Mamba to 3D SNNs, though, is hindered by temporal dynamics mismatch and spike-induced information loss. Thus, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism. Then, we propose the Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing the information loss caused by spikes. Finally, to further boost performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with previous state-of-the-art SNN models, SPM improves overall accuracy by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIoU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at most 12.6x lower than that of its ANN counterpart. Code: https://github.com/PeppaWu/SPM.
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu
Tsinghua University
Wenzhao Zheng
Tsinghua University
Sicheng Zuo
Tsinghua University
Yuanhui Huang
Tsinghua University
Jie Zhou
Tsinghua University
Jiwen Lu
Tsinghua University
Abstract
3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update the local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our EmbodiedOcc outperforms existing methods by a large margin and accomplishes embodied occupancy prediction with high accuracy and efficiency. Code: https://github.com/YkiWu/EmbodiedOcc.
FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
Abstract
Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets - short segments of human motion sequences - and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the body part movements of entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text.
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
Wenhan Wu
University of North Carolina at Charlotte
Zhishuai Guo
Northern Illinois University
Chen Chen
University of Central Florida
Hongfei Xue
University of North Carolina at Charlotte
Aidong Lu
University of North Carolina at Charlotte
Abstract
Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition. Our project is publicly available at: https://github.com/wenhanwu95/FS-VAE.
Human-Object Interaction from Human-Level Instructions
Zhen Wu
Stanford University
Jiaman Li
Stanford University
Pei Xu
Stanford University
C. Karen Liu
Stanford University
Abstract
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.
LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
Jiahao Wu
Peking University
Rui Peng
Peking University
Jianbo Jiao
University of Birmingham
Jiayu Yang
Peking University
Luyang Tang
Peking University
Kaiqiang Xiong
Peking University
Jie Liang
Peking University
Jinbo Yan
Peking University
Runling Liu
Peking University
Ronggang Wang
Peking University
Abstract
Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.
Measuring the Impact of Rotation Equivariance on Aerial Object Detection
Xiuyu Wu
Xidian University
Xinhao Wang
Xidian University
Xiubin Zhu
Xidian University
Lan Yang
Xidian University
Jiyuan Liu
National University of Defense Technology
Xingchen Hu
National University of Defense Technology
Abstract
Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count.
Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer
Hai Wu
Xiamen University
Hongwei Lin
Xiamen University
Xusheng Guo
Xiamen University
Xin Li
Texas A&M University
Mingming Wang
Tsinghua University
Cheng Wang
Xiamen University
Chenglu Wen
Xiamen University
Abstract
The performance of unsupervised 3D object classification and bounding box regression relies heavily on the quality of initial pseudo-labels. Traditionally, the labels for classification and regression are represented by a single set of candidate boxes generated by motion or geometry heuristics. However, because many objects resemble the background in shape or lack motion, these labels often fail to achieve high accuracy in both tasks simultaneously. Using such labels to directly train the network results in decreased detection performance. To address this challenge, we introduce Motal, which performs unsupervised 3D object detection by modality- and task-specific knowledge transfer. Motal decouples the pseudo-labels into two sets of candidates, from which it discovers classification knowledge from motion and image appearance priors, and box regression knowledge from geometry priors. Motal finally transfers all knowledge to a single student network via a TMT (Task-specific Masked Training) scheme, attaining high performance in both classification and regression. Motal can greatly enhance various unsupervised methods, roughly doubling their mAP. For example, on the WOD test set, Motal improves the state-of-the-art CPD by 21.56% mAP L1 (from 20.54% to 42.10%) and 19.90% mAP L2 (from 18.18% to 38.08%). These achievements highlight the significance of our method.
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation
Yiming Wu
The University of Hong Kong
Huan Wang
Westlake University
Zhenghao Chen
The University of Newcastle
Jianxin Pang
UBTech Robotics Corp.
Dong Xu
The University of Hong Kong
Abstract
Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and a large memory footprint. In this paper, we propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis of existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome the performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline that explicitly optimizes the model's post-pruning recoverability. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on the standard datasets, i.e., PushT, Robomimic, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments. Extensive real-world experiments also show that the proposed LightDP can achieve performance comparable to state-of-the-art Diffusion Policies.
Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
Mingxuan Wu
University of California, Berkeley
Huang Huang
University of California, Berkeley
Justin Kerr
University of California, Berkeley
Chung Min Kim
University of California, Berkeley
Anthony Zhang
University of California, Berkeley
Brent Yi
University of California, Berkeley
Angjoo Kanazawa
University of California, Berkeley
Abstract
Humans can resort to long-form inspection to build intuition for predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multi-view mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Dongming Wu
The Chinese University of Hong Kong
Yanping Fu
Institute of Computing Technology, Chinese Academy of Sciences
Saike Huang
Dexmal
Yingfei Liu
Dexmal
Fan Jia
Dexmal
Nian Liu
Mohamed bin Zayed University of Artificial Intelligence
Feng Dai
Institute of Computing Technology, Chinese Academy of Sciences
Tiancai Wang
Dexmal
Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence
Fahad Shahbaz Khan
Mohamed bin Zayed University of Artificial Intelligence
Jianbing Shen
University of Macau
Abstract
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of reasoning-based, large-scale affordance prediction data, raising considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with affordance maps, while the difficulty of the language instructions is greatly increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pretrained on our massive affordance data and a grasping network conditioned on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at this link.
TARS: Traffic-Aware Radar Scene Flow Estimation
Jialong Wu
University of Wuppertal
Marco Braun
Aptiv Services Deutschland GmbH
Dominic Spata
Aptiv Services Deutschland GmbH
Matthias Rottmann
Osnabrück University
Abstract
Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel Traffic-Aware Radar Scene Flow (TARS) estimation method, which utilizes motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. From this, we construct a Traffic Vector Field (TVF) in the feature space to achieve holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.
UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
Yan Wu
ETH Zurich
Korrawe Karunratanakul
ETH Zurich
Zhengyi Luo
Carnegie Mellon University
Siyu Tang
ETH Zurich
Abstract
Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.
Visual Textualization for Image Prompted Object Detection
Yongjian Wu
Beihang University
Yang Zhou
Beihang University
Jiya Saiyin
Beihang University
Bingzheng Wei
ByteDance Inc.
Yan Xu
Beihang University
Abstract
We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pretraining data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of the OVLM, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with the OVLM's pre-training data and achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at VisTex-OVLM.
Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
Philipp Wulff
Technical University of Munich
Felix Wimbauer
Technical University of Munich
Dominik Muhle
Technical University of Munich
Daniel Cremers
Technical University of Munich
Abstract
Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes. For more details and code, please check out our project page.
ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
Chong Xia
Tsinghua University
Shengjun Zhang
Tsinghua University
Fangfu Liu
Tsinghua University
Chang Liu
Tsinghua University
Khodchaphun Hirunyaratsameewong
Yueqi Duan
Tsinghua University
Abstract
Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable to long-term video synthesis and 3D scene reconstruction. Existing methods follow a 'navigate-and-imagine' fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from a semantic drift issue arising from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which guides the outpainter toward consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences.
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes
Yan Xia
University of Science and Technology of China
Yunxiang Lu
Technical University of Munich
Rui Song
Technical University of Munich
Oussema Dhaouadi
Technical University of Munich
João F. Henriques
University of Oxford
Daniel Cremers
Technical University of Munich
Abstract
We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to better separate 2D patch and 3D group features within each modality, and introduce Dense Training Alignment (DTA) with soft-argmax for improving position regression. Extensive experiments show that our TrafficLoc greatly improves the performance over SOTA I2P methods (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on the KITTI and NuScenes datasets, demonstrating its superiority across both in-vehicle and traffic cameras. Our project page is publicly available at https://tumluk.github.io/projects/trafficloc/.
DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Jijun Xiang
Huazhong University of Science and Technology
Xuan Zhu
Huazhong University of Science and Technology
Xianqi Wang
Huazhong University of Science and Technology
Yu Wang
Honor Device Co., Ltd
Hong Zhang
Honor Device Co., Ltd
Fei Guo
Honor Device Co., Ltd
Xin Yang
Huazhong University of Science and Technology
Abstract
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our code is available at https://github.com/ShadowBbBb/Depthor
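As a rough illustration of the kind of dToF simulation such a training strategy relies on, the snippet below degrades a dense ground-truth depth map into a coarse zone grid with range noise and random dropouts; the 8x8 zone layout, noise model, and function name are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def simulate_dtof(depth_gt, zones=8, noise_std=0.03, dropout_p=0.1):
    """Degrade dense ground-truth depth into a coarse, noisy dToF-like signal:
    per-zone average depth, multiplicative range noise, and random zone dropouts.
    Zone count and noise model are placeholders, not the paper's simulation."""
    zone = F.adaptive_avg_pool2d(depth_gt, (zones, zones))      # [B, 1, zones, zones]
    zone = zone * (1.0 + noise_std * torch.randn_like(zone))    # range noise
    keep = (torch.rand_like(zone) > dropout_p).float()          # anomalous / missing zones
    return zone * keep

depth_gt = 2.0 + torch.rand(1, 1, 192, 256)                     # metres, dense GT depth
print(simulate_dtof(depth_gt).shape)                            # torch.Size([1, 1, 8, 8])
```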
ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
Binbin Xiang
Norwegian Institute of Bioeconomy Research (NIBIO)
Maciej Wielgosz
Norwegian Institute of Bioeconomy Research (NIBIO)
Stefano Puliti
Norwegian Institute of Bioeconomy Research (NIBIO)
Kamil Král
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Martin Krůček
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Azim Missarov
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Rasmus Astrup
Norwegian Institute of Bioeconomy Research (NIBIO)
Abstract
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.
SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion
Zhengkang Xiang
The University of Melbourne
Zizhao Li
The University of Melbourne
Amir Khodabandeh
The University of Melbourne
Kourosh Khoshelham
The University of Melbourne
Abstract
Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models, and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Lixing Xiao
Zhejiang University
Shunlin Lu
The Chinese University of Hong Kong (Shenzhen)
Huaijin Pi
The University of Hong Kong
Ke Fan
Shanghai Jiao Tong University
Liang Pan
The University of Hong Kong
Yueer Zhou
Zhejiang University
Ziyong Feng
DeepGlint
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Jingbo Wang
Shanghai AI Laboratory
Abstract
This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
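As a toy illustration of the streaming loop described above, the sketch below autoregressively predicts continuous motion latents one step at a time, conditioning each step on the text embedding and the causal history carried in a recurrent state. The GRU and linear head stand in for the paper's causal transformer and diffusion head; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class StreamingLatentAR(nn.Module):
    """Toy autoregressive predictor over continuous motion latents: each step
    sees only the text condition and the causal history (no future frames).
    Purely illustrative; not the MotionStreamer architecture."""

    def __init__(self, latent_dim=64, text_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + text_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    @torch.no_grad()
    def stream(self, text_emb, num_steps=30):
        # text_emb: [B, text_dim]; returns [B, num_steps, latent_dim]
        B = text_emb.size(0)
        latent = torch.zeros(B, 1, self.head.out_features)
        state, outputs = None, []
        for _ in range(num_steps):
            inp = torch.cat([latent, text_emb.unsqueeze(1)], dim=-1)
            h, state = self.rnn(inp, state)      # causal: only past history in `state`
            latent = self.head(h)                # next continuous motion latent
            outputs.append(latent)
        return torch.cat(outputs, dim=1)

model = StreamingLatentAR()
motion_latents = model.stream(torch.randn(2, 128), num_steps=16)
print(motion_latents.shape)                      # torch.Size([2, 16, 64])
```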
RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
Baihui Xiao
Meituan
Chengjian Feng
Meituan
Zhijian Huang
Meituan
Feng Yan
Meituan
Yujie Zhong
Meituan
Lin Ma
Shenzhen Campus of Sun Yat-sen University
Abstract
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim, which improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn challenging real-world driving skills from HASS by adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by ∼50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios.
SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
Liwen Xiao
Huazhong University of Science and Technology
Zhiyu Pan
Huazhong University of Science and Technology
Zhicheng Wang
Huazhong University of Science and Technology
Zhiguo Cao
Huazhong University of Science and Technology
Wei Li
Nanyang Technological University
Abstract
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at 'soft intersection points'. Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement. Codes are available at https://github.com/LiwenXiao/SRefiner.
SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion
Yuxi Xiao
Zhejiang University
Jianyuan Wang
Oxford
Nan Xue
Ant Group
Nikita Karaev
Pixelwise AI
Yuri Makarov
Pixelwise AI
Bingyi Kang
Bytedance Seed
Xing Zhu
Ant Group
Hujun Bao
Zhejiang University
Yujun Shen
Ant Group
Xiaowei Zhou
Zhejiang University
Abstract
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50x faster.
AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
Liuyue Xie
Carnegie Mellon University
Jiancong Guo
Google
Ozan Cakmakci
Google
Andre Araujo
Google DeepMind
László A. Jeni
Carnegie Mellon University
Zhiheng Jia
Google
Abstract
Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ∼8.2° and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie
University of California, Irvine
Lingdong Kong
Shanghai AI Laboratory
Yuhao Dong
Shanghai AI Laboratory
Chonghao Sima
Shanghai AI Laboratory
Wenwei Zhang
Shanghai AI Laboratory
Qi Alfred Chen
University of California, Irvine
Ziwei Liu
Shanghai AI Laboratory
Liang Pan
Shanghai AI Laboratory
Abstract
Recent advancements in Vision-Language Models (VLMs) have fueled interest in autonomous driving applications, particularly for interpretable decision-making. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that existing VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs possess inherent corruption-awareness but only explicitly acknowledge these issues when directly prompted. Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs' corruption awareness and agentic planning with external tools to enhance perception reliability for a diverse set of downstream tasks. Our study challenges existing evaluation paradigms and provides a road map toward more robust and interpretable autonomous driving systems.
GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
Yusen Xie
The Hong Kong University of Science and Technology (Guangzhou)
Zhenmin Huang
The Hong Kong University of Science and Technology
Jin Wu
The Hong Kong University of Science and Technology
Jun Ma
The Hong Kong University of Science and Technology
Abstract
In this paper, we introduce GS-LIVM, a real-time photo-realistic LiDAR-Inertial-Visual mapping framework with Gaussian Splatting tailored for outdoor scenes. Compared to existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), our approach enables real-time photo-realistic mapping while ensuring high-quality image rendering in large-scale unbounded outdoor environments. In this work, Gaussian Process Regression (GPR) is employed to mitigate the issues resulting from sparse and unevenly distributed LiDAR observations. The voxel-based 3D Gaussian map representation facilitates real-time dense mapping in large outdoor environments with acceleration governed by custom CUDA kernels. Moreover, the overall framework is designed in a covariance-centered manner, where the estimated covariance is used to initialize the scale and rotation of 3D Gaussians, as well as update the parameters of the GPR. We evaluate our algorithm on several outdoor datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of mapping efficiency and rendering quality. The source code is available on GitHub.
Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
Binjian Xie
Institute of Automation, Chinese Academy of Sciences
Pengju Zhang
Institute of Automation, Chinese Academy of Sciences
Hao Wei
Institute of Automation, Chinese Academy of Sciences
Yihong Wu
Institute of Automation, Chinese Academy of Sciences
Abstract
Single-view 3D reconstruction is a fundamental problem in computer vision, having a significant impact on downstream tasks such as autonomous driving, virtual reality and augmented reality. However, existing single-view reconstruction methods are unable to reconstruct the regions outside the input field-of-view or the areas occluded by visible parts. In this paper, we propose Hi-Gaussian, which employs feed-forward 3D Gaussians for efficient and generalizable single-view 3D reconstruction. A Normalized Spherical Projection module is introduced following an Encoder-Decoder network in our model, assigning a larger range to the transformed spherical coordinates, which can enlarge the field of view during scene reconstruction. Besides, to reconstruct occluded regions behind the visible part, we introduce a novel Hierarchical Gaussian Sampling strategy, utilizing two layers of Gaussians to hierarchically represent 3D scenes. We first use a pre-trained monocular depth estimation model to provide depth initialization for leader Gaussians, and then leverage the leader Gaussians to estimate the distribution followed by follower Gaussians, which can flexibly move into occluded areas. Extensive experiments show that our method outperforms other methods for scene reconstruction and novel view synthesis, on both outdoor and indoor datasets.
Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling
Christopher Xie
Meta Reality Labs
Armen Avetisyan
Meta Reality Labs
Henry Howard-Jenkins
Meta Reality Labs
Yawar Siddiqui
Meta Reality Labs
Julian Straub
Meta Reality Labs
Richard Newcombe
Meta Reality Labs
Vasileios Balntas
Meta Reality Labs
Jakob Engel
Meta Reality Labs
Abstract
We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript [3], a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as 'infilling', a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction 'one-click fix' workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.
PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation
Fei Xie
Shanghai Jiao Tong University
Zhongdao Wang
Huawei Noah's Ark Lab
Weijia Zhang
Shanghai Jiao Tong University
Chao Ma
Shanghai Jiao Tong University
Abstract
Mamba, an architecture with RNN-like sequence modeling based on the State Space Model (SSM), has demonstrated promising capabilities in long-range modeling with high efficiency. However, Mamba models struggle with structured 2D visual data due to their sequential computing, thereby lagging behind their attention-based counterparts. In this paper, we propose Parallel Vision Mamba (PVMamba), a novel SSM architecture tailored for visual data. PVMamba encompasses two key designs: 1) Based on the sparsity and adjacency of visual signals, we parallelize the sequential computing through three core steps, termed Dynamic State Aggregation (DSA), i.e., parallelization, alignment, and aggregation. DSA generates the hidden state in SSM by a feasible spatial aggregation, thereby overcoming the inherent sequential constraints. 2) Along with maintaining linear computational complexity, we apply a dynamic operator to learn the spatial samplings for each hidden state. To further boost the local modeling capability, we restrict the dynamic operator to the neighboring pixels in shallow layers. We also devise a layer multiplexing technique to stabilize the training and reduce learning redundancy. PVMamba is a versatile backbone network with dynamic operators for various vision tasks, such as image classification and dense prediction. Extensive experiments show that PVMamba achieves state-of-the-art performance on a range of benchmarks. The code is available at https://github.com/VISION-SJTU/PVMamba.
SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
Mengwei Xie
Alibaba Group
Shuang Zeng
Alibaba Group
Xinyuan Chang
Alibaba Group
Xinran Liu
Alibaba Group
Zheng Pan
Alibaba Group
Mu Xu
Alibaba Group
Xing Wei
Xi'an Jiaotong University
Abstract
Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures, such as loops and bidirectional lanes, that are prevalent in real-world road networks. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph G = (V, E), with intersections (V) and centerlines (E), SeqGrowGraph incrementally constructs this graph by introducing one vertex at a time. At each step, an adjacency matrix (A) expands from n x n to (n+1) x (n+1) to encode connectivity, while a geometric matrix (M) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on the nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
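The per-step bookkeeping behind the chain of expansions is simple to sketch: adding a vertex pads the adjacency matrix by one row and one column that record edges to and from the existing vertices. The numpy helper below is illustrative only (the function name and 0/1 edge encoding are assumptions); the paper's transformer predicts such expansions autoregressively and additionally emits a geometric matrix of Bézier control points, which is omitted here.

```python
import numpy as np

def expand_adjacency(A, new_row, new_col, self_loop=0):
    """Grow an n x n adjacency matrix to (n+1) x (n+1) when a vertex is added.

    new_row[i] = 1 if an edge leaves the new vertex toward existing vertex i,
    new_col[i] = 1 if an edge enters the new vertex from existing vertex i.
    (Hypothetical helper illustrating one expansion step of the chain.)"""
    n = A.shape[0]
    A_new = np.zeros((n + 1, n + 1), dtype=A.dtype)
    A_new[:n, :n] = A                # keep the existing connectivity
    A_new[n, :n] = new_row           # edges out of the new vertex
    A_new[:n, n] = new_col           # edges into the new vertex
    A_new[n, n] = self_loop
    return A_new

# Start from a single intersection and grow the lane graph vertex by vertex.
A = np.zeros((1, 1), dtype=np.int64)
A = expand_adjacency(A, new_row=[1], new_col=[0])        # 2 vertices, edge 1 -> 0
A = expand_adjacency(A, new_row=[0, 1], new_col=[1, 0])  # 3 vertices
print(A)
```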
Efficient Track Anything
Yunyang Xiong
Meta AI Research
Chong Zhou
Meta AI Research
Xiaoyu Xiang
Meta AI Research
Lemeng Wu
Meta AI Research
Chenchen Zhu
Meta AI Research
Zechun Liu
Meta AI Research
Saksham Suri
Meta AI Research
Balakrishnan Varadarajan
Meta AI Research
Ramya Krishna Akula
Meta AI Research
Forrest Iandola
Meta AI Research
Raghuraman Krishnamoorthi
Meta AI Research
Bilge Soran
Meta AI Research
Vikas Chandra
Meta AI Research
Abstract
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. Our idea is based on adopting a lightweight Vision Transformer (ViT) as the image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity of both frame feature extraction and memory computation for current frame segmentation. We build EfficientTAMs from vanilla lightweight ViTs and the efficient memory module, and train the models on the SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with lightweight ViT performs comparably to the SAM 2 model (SAM 2-HieraB+) with ∼1.6x speedup on A100 and ∼2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over the original SAM with ∼20x speedup on A100 and ∼20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAM can run at ∼28 FPS for near real-time video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories
Jingqiao Xiu
National University of Singapore
Yicong Li
National University of Singapore
Na Zhao
Singapore University of Technology and Design
Han Fang
National University of Singapore
Xiang Wang
University of Science and Technology of China
Angela Yao
National University of Singapore
Abstract
View-Guided Point Cloud Completion (VG-PCC) aims to reconstruct complete point clouds from partial inputs by referencing single-view images. While existing VG-PCC models perform well on in-class predictions, they exhibit significant performance drops when generalizing to unseen categories. We identify two key limitations underlying this challenge: (1) Current encoders struggle to bridge the substantial modality gap between images and point clouds. Consequently, their learned representations often lack robust cross-modal alignment and over-rely on superficial class-specific patterns. (2) Current decoders refine global structures holistically, overlooking local geometric patterns that are class-agnostic and transferable across categories. To address these issues, we present a novel generalizable VG-PCC framework for unseen categories based on Geometric Alignment and Prior Modulation (GAPM). First, we introduce a Geometry Aligned Encoder that lifts reference images into 3D space via depth maps for natural alignment with partial point clouds. This reduces dependency on class-specific RGB patterns that hinder generalization to unseen classes. Second, we propose a Prior Modulated Decoder that incorporates class-agnostic local priors to reconstruct shapes on a regional basis. This allows the adaptive reuse of learned geometric patterns that promote generalization to unseen classes. Extensive experiments validate that GAPM outperforms existing models on both seen and, notably, unseen categories, establishing a new benchmark for unseen-category generalization in VG-PCC.
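The geometric alignment step, lifting the reference image into the same 3D space as the partial point cloud through its depth map, reduces to standard pinhole backprojection; the minimal sketch below shows that lift (the function name and intrinsics are made up for illustration).

```python
import torch

def backproject_depth(depth, K):
    """Lift a depth map into a 3D point cloud with pinhole intrinsics K.
    Illustrates the image-to-3D alignment idea only; the paper's encoder
    then processes this aligned geometry together with the partial cloud."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    uv1 = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)  # [H, W, 3]
    rays = uv1 @ torch.linalg.inv(K).T                                   # pixel -> camera rays
    return (rays * depth.unsqueeze(-1)).reshape(-1, 3)                   # [H*W, 3] points

K = torch.tensor([[320., 0., 160.], [0., 320., 120.], [0., 0., 1.]])     # assumed intrinsics
points = backproject_depth(torch.full((240, 320), 2.0), K)
print(points.shape)                                                      # torch.Size([76800, 3])
```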
A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu
Spatialtemporal AI
Jian Zhang
MBZUAI
Minghao Guo
MBZUAI
Youpeng Wen
Sun Yat-sen University
Haoting Yang
Southern University of Science and Technology
Min Lin
Sun Yat-sen University
Jianzheng Huang
Southern University of Science and Technology
Zhe Li
Southern University of Science and Technology
Kaidong Zhang
Southern University of Science and Technology
Liqiong Wang
Southern University of Science and Technology
Yuxuan Kuang
MBZUAI
Meng Cao
MBZUAI
Feng Zheng
Spatialtemporal AI
Xiaodan Liang
MBZUAI
Abstract
Robotic manipulation faces critical challenges in understanding spatial affordances, the 'where' and 'how' of object interactions, which are essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
Jiawei Xu
Nankai University
Kai Deng
Nankai University
Zexin Fan
Nankai University
Shenlong Wang
University of Illinois Urbana-Champaign
Jin Xie
Nanjing University
Jian Yang
Nankai University
Abstract
Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately or decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches. Project Page: https://jiaweixu8.github.io/AD-GS-web/
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
Lizhen Xu
Xi'an Jiaotong University
Xiuxiu Bai
Xi'an Jiaotong University
Xiaojun Jia
Nanyang Technological University
Jianwu Fang
Xi'an Jiaotong University
Shanmin Pang
Xi'an Jiaotong University
Abstract
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient execution on edge devices. Existing pruning and distillation methods either require retraining or are designed for ViT models, making them hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification scores and multiply them with the attention map to obtain an importance score for each key, then prune keys after each transformer layer according to these scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at https://github.com/iseri27/tg_gbc.
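One straightforward reading of this pruning rule: weight each query's attention row by that query's classification score, sum over queries to score every key, and keep only the top-ranked keys for the next decoder layer. The sketch below uses hypothetical shapes and a fixed keep ratio and is not the released tgGBC code.

```python
import torch

def prune_keys_by_classification(attn, cls_scores, keep_ratio=0.5):
    """Rank keys by a classification-weighted attention mass and keep the top ones.

    attn:       [num_queries, num_keys] attention map from one decoder layer.
    cls_scores: [num_queries] max classification score of each query.
    Illustrative reading of the "classification score x attention map"
    importance in the abstract, with an assumed fixed keep ratio."""
    importance = (cls_scores.unsqueeze(1) * attn).sum(dim=0)    # [num_keys]
    num_keep = max(1, int(importance.numel() * keep_ratio))
    keep_idx = torch.topk(importance, num_keep).indices.sort().values
    return keep_idx

attn = torch.rand(900, 4000)          # queries x keys (made-up sizes)
cls_scores = torch.rand(900)
keep = prune_keys_by_classification(attn, cls_scores, keep_ratio=0.25)
print(keep.shape)                     # indices of keys retained for the next layer
```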
BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Gangwei Xu
Huazhong University of Science and Technology
Jiaxin Liu
Huazhong University of Science and Technology
Xianqi Wang
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Yong Deng
Autel Robotics
Jinliang Zang
Autel Robotics
Yurui Chen
Autel Robotics
Xin Yang
Optics Valley Laboratory
Abstract
State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. Code: https://github.com/gangweix/BANet.
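To make the bilateral split concrete, here is a skeletal PyTorch sketch under simplifying assumptions (a precomputed correlation cost volume, illustrative layer widths and kernel sizes, made-up module names): a spatial attention map predicted from the left image divides the cost volume into a detail branch and a smooth branch, each aggregated with plain 2D convolutions before fusion and a soft-argmin readout.

```python
import torch
import torch.nn as nn

class BilateralAggregation2D(nn.Module):
    """Minimal sketch of the bilateral idea: an attention map splits the cost
    volume into 'detailed' and 'smooth' parts, each aggregated with 2D convs,
    then fused. Layer sizes are illustrative, not the paper's architecture."""

    def __init__(self, max_disp=64):
        super().__init__()
        self.attn = nn.Sequential(                 # spatial attention from the left image
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        self.detail_agg = nn.Sequential(           # small receptive field: keep edges
            nn.Conv2d(max_disp, max_disp, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(max_disp, max_disp, 3, padding=1))
        self.smooth_agg = nn.Sequential(           # large receptive field: textureless areas
            nn.Conv2d(max_disp, max_disp, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(max_disp, max_disp, 7, padding=3))
        self.max_disp = max_disp

    def forward(self, cost_volume, left_image):
        # cost_volume: [B, max_disp, H, W], left_image: [B, 3, H, W]
        a = self.attn(left_image)                          # [B, 1, H, W] in (0, 1)
        detailed = self.detail_agg(cost_volume * a)
        smooth = self.smooth_agg(cost_volume * (1.0 - a))
        fused = detailed + smooth                          # fused aggregated cost
        prob = torch.softmax(-fused, dim=1)                # soft-argmin over disparities
        idx = torch.arange(self.max_disp, device=fused.device).view(1, -1, 1, 1)
        return (prob * idx).sum(dim=1)                     # [B, H, W] disparity map

vol = torch.randn(1, 64, 96, 160)
img = torch.randn(1, 3, 96, 160)
print(BilateralAggregation2D()(vol, img).shape)            # torch.Size([1, 96, 160])
```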
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
Xiang Xu
Nanjing University of Aeronautics and Astronautics
Lingdong Kong
National University of Singapore
Song Wang
Zhejiang University
Chuanwei Zhou
Nanjing University of Posts and Telecommunications
Qingshan Liu
Nanjing University of Posts and Telecommunications
Abstract
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer-range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
DAA*: Deep Angular A Star for Image-based Path Planning
Zhiwei Xu
The University of Melbourne
Abstract
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), which incorporates the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF explores the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness in imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
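As a rough illustration of trading off path shortening against smoothing, the toy grid A* below adds a penalty on the change of move angle to the usual cost; the penalty form and weight are assumptions for illustration, not the paper's PAF formulation.

```python
import heapq, itertools, math

def angular_astar(grid, start, goal, w_angle=0.3):
    """Tiny 8-connected grid A* with an angular penalty on direction changes,
    a loose sketch of balancing heuristic distance (shortening) and smoothness."""
    H, W = len(grid), len(grid[0])
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    tie = itertools.count()                      # heap tie-breaker
    open_set = [(h(start), next(tie), 0.0, start, None, [start])]
    best = {}
    while open_set:
        _, _, g, pos, din, path = heapq.heappop(open_set)
        if pos == goal:
            return path
        if best.get((pos, din), float("inf")) <= g:
            continue
        best[(pos, din)] = g
        for dx, dy in moves:
            nx, ny = pos[0] + dx, pos[1] + dy
            if not (0 <= nx < H and 0 <= ny < W) or grid[nx][ny]:
                continue                          # out of bounds or obstacle
            turn = 0.0
            if din is not None:                   # penalize change of move angle (smoothness)
                a = math.atan2(dy, dx) - math.atan2(din[1], din[0])
                turn = abs(math.atan2(math.sin(a), math.cos(a)))
            ng = g + math.hypot(dx, dy) + w_angle * turn
            heapq.heappush(open_set,
                           (ng + h((nx, ny)), next(tie), ng, (nx, ny), (dx, dy), path + [(nx, ny)]))
    return None

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(angular_astar(grid, (0, 0), (2, 3)))
```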
Diffusion-Based Imaginative Coordination for Bimanual Manipulation
Huilin Xu
Fudan University
Jian Ding
King Abdullah University of Science and Technology
Jiakun Xu
ETH Zurich
Ruixiang Wang
The Chinese University of Hong Kong, Shenzhen
Jun Chen
King Abdullah University of Science and Technology
Jinjie Mai
King Abdullah University of Science and Technology
Yanwei Fu
Fudan University
Bernard Ghanem
King Abdullah University of Science and Technology
Feng Xu
Fudan University
Mohamed Elhoseiny
King Abdullah University of Science and Technology
Abstract
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at Diffusion based imaginative Coordination.
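The unidirectional conditioning can be pictured as an attention mask in which video tokens may attend to action tokens but not vice versa, which is what allows the video branch to be dropped at inference; the sketch below is a generic illustration with made-up token counts, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def unidirectional_mask(n_action, n_video):
    """Illustrative attention mask: video tokens may attend to action tokens, while
    action tokens never attend to video tokens. True = allowed to attend."""
    n = n_action + n_video
    allow = torch.ones(n, n, dtype=torch.bool)
    allow[:n_action, n_action:] = False      # action queries cannot see video keys
    return allow

mask = unidirectional_mask(n_action=4, n_video=6)
print(mask.int())

# usage with PyTorch's scaled dot-product attention (boolean mask, True = participate)
q = k = v = torch.randn(1, 1, 10, 32)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)
```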
Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation
Xiaolong Xu
Sichuan University
Lei Zhang
Sichuan University
Jiayi Li
Sichuan University
Lituan Wang
Sichuan University
Yifan Guan
Sichuan University
Yu Yan
Sichuan University
Leyi Zhang
Sichuan University
Hao Song
Sichuan University
Abstract
Video semantic segmentation aims to assign a class label to each pixel in every video frame. Existing methods predominantly follow the reference-target interaction paradigm, focusing on extracting local temporal contexts while neglecting the integration of global temporal information. Moreover, complex dynamics and varying lighting conditions introduce inter-frame intra-class discrepancies in feature representations, leading to unstable predictions. In this paper, we propose a novel framework, the Dual-Temporal Exemplar Representation Network (DTERN), which utilizes the strong representational capability of cluster centers, i.e., exemplars, to effectively model both local and global temporal information. DTERN consists of two core modules: 1) the Local Temporal Exemplar Module (LTEM), which constructs local exemplars to capture local temporal contexts, ensuring stable and reliable predictions; and 2) the Global Temporal Exemplar Module (GTEM), which introduces learnable global exemplars to dynamically model global temporal information, thereby improving the effective consistency of segmentation. Furthermore, we observe that the existing Video Consistency (VC) metric fails to evaluate segmentation accuracy and lacks sensitivity to small-object segmentation. To this end, we propose Video Effective Consistency (VEC) to comprehensively evaluate temporal consistency and segmentation effectiveness. Experiments on VSPW and Cityscapes demonstrate that DTERN outperforms state-of-the-art methods. The code is available at https://github.com/zlxilo/DTERN.
Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
Wenhao Xu
University of Science and Technology of China
Wenming Weng
University of Science and Technology of China
Yueyi Zhang
MiroMind
Ruikang Xu
University of Science and Technology of China
Zhiwei Xiong
University of Science and Technology of China
Abstract
Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
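For readers unfamiliar with why threshold modeling matters, the snippet below implements the standard event-generation model: a pixel emits an event when its log-intensity change exceeds a contrast threshold, so a mis-estimated threshold directly corrupts the motion signal the reconstruction relies on. The threshold value here is arbitrary and not taken from the paper.

```python
import numpy as np

def simulate_events(log_I_prev, log_I_curr, threshold=0.2):
    """Standard event-camera generation model, used only as a toy illustration:
    a pixel fires +1/-1 when its log-intensity change exceeds the contrast threshold."""
    diff = log_I_curr - log_I_prev
    polarity = np.sign(diff) * (np.abs(diff) >= threshold)   # +1 / -1 / 0 per pixel
    return polarity.astype(np.int8)

prev = np.log(np.random.rand(4, 4) + 1e-3)
curr = prev + np.random.randn(4, 4) * 0.3
print(simulate_events(prev, curr))
```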
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
Jiale Xu
ARC Lab, Tencent PCG
Shenghua Gao
The University of Hong Kong
Abstract
Sparse-view reconstruction models typically require precise camera poses, yet obtaining these parameters from sparse-view images remains challenging. We introduce FreeSplatter, a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images while estimating camera parameters within seconds. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange among multi-view image tokens, decoding them into pixel-aligned 3D Gaussian primitives within a unified reference frame. This representation enables both high-fidelity 3D modeling and efficient camera parameter estimation using off-the-shelf solvers. We develop two specialized variants, for object-centric and scene-level reconstruction, trained on comprehensive datasets. Remarkably, FreeSplatter outperforms several pose-dependent Large Reconstruction Models (LRMs) by a notable margin while achieving comparable or even better pose estimation accuracy than the state-of-the-art pose-free reconstruction approach MASt3R on challenging benchmarks. Beyond technical benchmarks, FreeSplatter streamlines text/image-to-3D content creation pipelines, eliminating the complexity of camera pose management while delivering exceptional visual fidelity.
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Tian-Xing Xu
Tsinghua University
Xiangjun Gao
HKUST
Wenbo Hu
ARC Lab, Tencent PCG
Xiaoyu Li
ARC Lab, Tencent PCG
Song-Hai Zhang
Qinghai University
Ying Shan
ARC Lab, Tencent PCG
Abstract
Despite remarkable advancements in video depth estimation, existing methods fall short in geometric fidelity due to their affine-invariant predictions, restricting their applicability in reconstruction and other metrically grounded downstream tasks. We propose a novel point map Variational Autoencoder (VAE) for encoding and decoding unbounded point maps. Notably, its latent space is agnostic to video latent distributions of video diffusion models, allowing us to leverage generation priors to model the distribution of point map sequences conditioned on the input videos. Thus, we can recover high-fidelity point map sequences with temporal coherence from open-world videos, facilitating accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. Extensive evaluations on diverse datasets demonstrate that our method achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
Yunjiang Xu
School of Computer Science and Technology, Soochow University
Lingzhi Li
School of Computer Science and Technology, Soochow University
Jin Wang
School of Future Science and Engineering, Soochow University
Yupeng Ouyang
School of Computer Science and Technology, Soochow University
Benyuan Yang
School of Future Science and Engineering, Soochow University
Abstract
Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous works prove that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method improves accuracy by 13.23%/33.08% on DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.
Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
Xiaogang Xu
The Chinese University of Hong Kong
Jiafei Wu
The University of Hong Kong
Qingsen Yan
Northwestern Polytechnical University
Jiequan Cui
Hefei University of Technology
Richang Hong
Hefei University of Technology
Bei Yu
The Chinese University of Hong Kong
Abstract
A major challenge in Low-Light Image Enhancement (LLIE) is its ill-posed nature: low-light images often lack sufficient information to align with normal-light ones (e.g., not all training data can be fully fitted to the ground truth). Numerous studies have attempted to bridge the gap between low- and normal-light data by introducing effective additional information, which is called 'references' in this paper. However, existing methods overlook the valuable references hidden within the training dataset itself. In this work, we propose a novel LLIE strategy that simultaneously learns image-specific features by neural networks while formulating effective common features from the training data as the reference. These common features are correlated with the samples that are not fully fitted by the LLIE network itself, and they are represented as a set of Learnable Feature Patches and Vectors (LFPVs) in the hidden feature space. LFPVs are updated through two mechanisms: the sample-updater, which extracts useful features from training samples to refine LFPVs, and the mutual-updater, which propagates information across LFPVs to mutually update them. LFPVs can be adaptively aligned with image-specific features via our designed query-and-fusion procedure, boosting the LLIE performance. Our proposed method can be integrated into any LLIE framework, improving both enhancement quality and downstream task performance. Extensive experiments on various benchmarks demonstrate the effectiveness of our approach.
MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction
Zikun Xu
Tsinghua University
Shaobing Xu
Tsinghua University
Abstract
LiDAR-based 3D occupancy prediction algorithms have evolved rapidly with the advent of large-scale datasets. However, the full potential of the existing diverse datasets remains underutilized, as they are typically employed in isolation. Models trained on a single dataset often suffer considerable performance degradation when deployed to real-world scenarios or datasets involving disparate LiDARs. To address this limitation, we introduce MergeOcc, a generalized pipeline designed to handle different LiDARs by leveraging multiple datasets concurrently. The gaps among LiDAR datasets primarily manifest in geometric disparities and semantic inconsistencies, which correspond to the fundamental components of datasets: data and labels. In response, MergeOcc incorporates a novel model architecture that features a geometric realignment and a semantic label mapping to facilitate multi-dataset training (MDT). The effectiveness of MergeOcc is validated through extensive experiments on two prominent datasets for autonomous vehicles: OpenOccupancy-nuScenes and SemanticKITTI. The results demonstrate its enhanced robustness and performance improvements across both types of LiDARs, outperforming several SOTA methods. Additionally, despite using an identical model architecture and hyper-parameter set, MergeOcc can significantly surpass the baselines thanks to its ability to learn from diverse datasets. To the best of our knowledge, this work presents the first cross-dataset 3D occupancy prediction pipeline that effectively bridges the domain gap for seamless deployment across heterogeneous platforms.
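The semantic label mapping component can be illustrated as a simple lookup-table remap from each dataset's label space onto a shared one; the class ids and table entries below are made up for illustration and do not reflect the paper's actual taxonomy.

```python
import numpy as np

# Hypothetical per-dataset label maps onto a shared label space (ids are made up).
SHARED = {"vehicle": 0, "pedestrian": 1, "vegetation": 2, "ground": 3, "other": 4}
NUSC_TO_SHARED = np.array([0, 0, 1, 2, 3, 4, 4])      # nuScenes-style ids -> shared ids
KITTI_TO_SHARED = np.array([4, 0, 1, 1, 2, 3, 4, 4])  # SemanticKITTI-style ids -> shared ids

def remap_labels(labels, table):
    """Map a dataset-specific semantic label volume onto the shared label space."""
    return table[labels]

occ = np.random.randint(0, len(NUSC_TO_SHARED), size=(8, 8, 4))   # toy occupancy labels
print(remap_labels(occ, NUSC_TO_SHARED).shape)
```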
NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
Peiran Xu
Peking University
Xicheng Gong
Peking University
Yadong Mu
Peking University
Abstract
In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn general knowledge regarding the layout and object relations within indoor scenes. This model generates a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking that action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original history-based scores, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding
Tianrun Xu
Department of Automation, Tsinghua University
Guanyu Chen
Department of Automation, Tsinghua University
Ye Li
School of Software, Xinjiang University
Yuxin Xi
School of Artificial Intelligence, Beijing Normal University
Zeyu Mu
Department of Automation, Tsinghua University
Ruichen Wang
Department of Automation, Tsinghua University
Tianren Zhang
Department of Automation, Tsinghua University
Haichuan Gao
Department of Automation, Tsinghua University
Feng Chen
Department of Automation, Tsinghua University
Abstract
Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotations or on high-performing closed-source models, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a dataset of 1.4M image-task instances. We further utilize this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform various vision-related tasks. Our OURO model, fine-tuned on Qwen2-VL-7B-Instruct using LoRA, achieves substantial improvements over both the base model and similarly sized counterparts across multiple multimodal benchmarks. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets are available at https://github.com/tinnel123666888/OURO.git.
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
Liang Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Chengqun Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Zili Lin
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Fei Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yifan Liu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Congsheng Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yiyi Zhang
MoE Key Lab of AI, School of Computer Science, Shanghai Jiao Tong University
Jie Qin
Nanjing University of Aeronautics and Astronautics
Xingdong Sheng
Lenovo
Yunhui Liu
Lenovo
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Yichao Yan
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Abstract
Learning action models from real-world human-centric interaction datasets is important for building general-purpose intelligent assistants efficiently. However, most existing datasets offer only specialist interaction categories and ignore that AI assistants perceive and act from a first-person perspective. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions, and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future work on building AI agents in the physical world.
SAM4D: Segment Anything in Camera and LiDAR Streams
Jianyun Xu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Song Wang
Zhejiang University
Ziqian Ni
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Chunyong Hu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Sheng Yang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Jianke Zhu
Zhejiang University
Qiang Li
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
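One way to picture a unified multi-modal positional encoding is to lift camera pixels into 3D and then apply the same sinusoidal encoding used for LiDAR points, as in the hedged sketch below; the encoding form, frequencies, and unprojection details are assumptions, not the actual UMPE design.

```python
import torch

def sin_pos_enc_3d(xyz, num_freqs=8):
    """Sinusoidal encoding of 3D coordinates shared by both modalities (illustrative)."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * torch.pi   # (F,)
    ang = xyz[..., None] * freqs                                             # (..., 3, F)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)             # (..., 6F)

def unproject_pixels(uv, depth, K):
    """Lift camera pixels to 3D (camera frame) so they can share the LiDAR encoding.
    uv: (N, 2) pixel coords, depth: (N,), K: (3, 3) intrinsics."""
    ones = torch.ones(len(uv), 1)
    rays = (torch.linalg.inv(K) @ torch.cat([uv, ones], dim=1).T).T          # (N, 3)
    return rays * depth[:, None]

lidar_xyz = torch.randn(100, 3) * 10
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
cam_xyz = unproject_pixels(torch.rand(50, 2) * 640, torch.rand(50) * 20, K)
pe_lidar, pe_cam = sin_pos_enc_3d(lidar_xyz), sin_pos_enc_3d(cam_xyz)
print(pe_lidar.shape, pe_cam.shape)   # both live in the same positional-encoding space
```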
Sequential Gaussian Avatars with Hierarchical Motion Context
Wangze Xu
Shanghai Artificial Intelligence Laboratory
Yifan Zhan
Shanghai Artificial Intelligence Laboratory
Zhihang Zhong
Shanghai Artificial Intelligence Laboratory
Xiao Sun
Shanghai Artificial Intelligence Laboratory
Abstract
The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, with the recently popular 3DGS technique enabling real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex mapping from pose to appearance during fitting. In this paper, we propose SeqAvatar, which exploits the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of the proposed motion conditions, we adopt a spatiotemporal multi-scale sampling strategy to hierarchically integrate more motion cues to model human avatars. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, while delivering performance that is at least comparable and often superior. Project page: https://zezeaaa.github.io/projects/SeqAvatar/
Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
Mutian Xu
SSE, CUHKSZ
Chongjie Ye
FNii-Shenzhen
Haolin Liu
Tencent Hunyuan3D
Yushuang Wu
ByteDance Games
Jiahao Chang
SSE, CUHKSZ
Xiaoguang Han
SSE, CUHKSZ
Abstract
3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress on this solution has stagnated in recent studies. This work explores a new solution path for data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage fine-tunes Stable Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where the diffusion loss is adjusted to prioritize the distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training a network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: mutianxu.github.io/stable-sim2real.
TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu
Wangxuan Institute of Computer Technology, Peking University
Ting Lei
Wangxuan Institute of Computer Technology, Peking University
Zhimin Li
Tencent Inc.
Guan Wang
Baidu Inc.
Qingchao Chen
National Institute of Health Data Science, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Yang Liu
Wangxuan Institute of Computer Technology, Peking University
Abstract
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow between neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git
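The Inter-frame Attention Augmentation step can be approximated by warping a neighboring frame's attention maps to the current frame with optical flow and fusing them with the current maps; the sketch below uses bilinear warping and an element-wise maximum as a stand-in fusion rule, which is an assumption rather than the paper's exact strategy.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(attn_nbr, flow):
    """Warp a neighboring frame's attention maps to the current frame with optical flow.
    attn_nbr: (B, C, H, W) attention maps of a neighboring frame
    flow:     (B, 2, H, W) flow from current frame to the neighbor, in pixels
    """
    B, _, H, W = attn_nbr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(attn_nbr.device)      # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                    # sampling positions
    # normalize to [-1, 1] for grid_sample (x then y)
    coords[:, 0] = 2 * coords[:, 0] / (W - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    grid = coords.permute(0, 2, 3, 1)                                    # (B, H, W, 2)
    return F.grid_sample(attn_nbr, grid, align_corners=True)

attn_cur = torch.rand(1, 5, 32, 32)
attn_nbr = torch.rand(1, 5, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                                         # zero flow for the demo
fused = torch.maximum(attn_cur, warp_with_flow(attn_nbr, flow))          # motion-aware fusion
print(fused.shape)
```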
Training-Free Industrial Defect Generation with Diffusion Models
Ruyi Xu
National Taiwan University
Yen-Tzu Chiu
National Taiwan University
Tai-I Chen
National Taiwan University
Oscar Chew
ASUS
Yung-Yu Chuang
National Taiwan University
Wen-Huang Cheng
National Taiwan University
Abstract
Anomaly generation has become essential in addressing the scarcity of defective samples in industrial anomaly inspection. However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. We propose a Feature Alignment strategy that provides fine-grained appearance guidance by minimizing the distributional gap between generated and real defects with high complexity. Additionally, we introduce an Adaptive Anomaly Mask mechanism to mitigate the issue of defects with small regions being ignored during the generation process, enhancing consistency between synthetic defects and their corresponding masks. Finally, we incorporate a Texture Preservation module that extracts background information from anomaly-free images, ensuring that the visual properties of synthetic defects are seamlessly integrated into the image. Extensive experiments demonstrate the effectiveness of our method in generating accurate and diverse anomalies, further leading to superior performance in downstream anomaly inspection tasks. Our code is available at https://github.com/rubymiaomiao/TF-IDG.
ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
Xiwei Xuan
University of California, Davis
Ziquan Deng
University of California, Davis
Kwan-Liu Ma
University of California, Davis
Abstract
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or by the suboptimal quality of their reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available here.
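A hedged sketch of similarity-based retrieval in this spirit: each query segment embedding retrieves its nearest reference segments and is scored against class-name embeddings through the paired text embeddings. The aggregation rule (top-k mean) and all shapes are illustrative assumptions, not necessarily ReME's exact formulation.

```python
import numpy as np

def assign_labels(seg_emb, ref_emb, ref_text_emb, class_text_emb, k=5):
    """Each query segment retrieves its k nearest reference segments, averages their
    paired text embeddings, and is classified against class-name embeddings.
    All embeddings are assumed L2-normalized."""
    sims = seg_emb @ ref_emb.T                          # (n_seg, n_ref)
    topk = np.argsort(-sims, axis=1)[:, :k]             # indices of nearest references
    retrieved = ref_text_emb[topk].mean(axis=1)         # (n_seg, d)
    retrieved /= np.linalg.norm(retrieved, axis=1, keepdims=True)
    class_scores = retrieved @ class_text_emb.T         # (n_seg, n_class)
    return class_scores.argmax(axis=1)

d, n_seg, n_ref, n_cls = 64, 10, 200, 7
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
seg = norm(np.random.randn(n_seg, d)); ref = norm(np.random.randn(n_ref, d))
ref_txt = norm(np.random.randn(n_ref, d)); cls_txt = norm(np.random.randn(n_cls, d))
print(assign_labels(seg, ref, ref_txt, cls_txt))
```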
Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
Ying Xue
ETH Zürich
Jiaxi Jiang
ETH Zürich
Rayan Armani
ETH Zürich
Dominik Hollidt
ETH Zürich
Yi-Chi Liao
ETH Zürich
Christian Holz
ETH Zürich
Abstract
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individual people, as inertial cues are inherently self-referential and provide no direct spatial reference about others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors, both on each individual and across different people. Our method, Group Inertial Poser, estimates these absolute distances between pairs of sensors from ultra-wideband (UWB) ranging and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world captures, showing the promise of IMU+UWB-based multi-human motion capture in the wild. [Code & Dataset]
SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer
Yujie Xue
Hunan University
Huilong Pi
Hunan University
Jiapeng Zhang
Hunan University
Yunchuan Qin
Hunan University
Zhuo Tang
Hunan University
Kenli Li
Hunan University
Ruihui Li
Hunan University
Abstract
Vision-based semantic scene completion (SSC) is able to predict complex scene information from limited 2D images, which has attracted widespread attention. Current SSC methods typically construct unified voxel features containing both geometry and semantics, which leads to different depth positions in occluded regions sharing the same 2D semantic information, resulting in ambiguous semantic segmentation. To address this problem, we propose SDFormer, a novel SAM-assisted Dual-channel Voxel Transformer framework for SSC. We decouple the task based on its multi-objective nature and construct two parallel sub-networks: a semantic constructor (SC) and a geometric refiner (GR). The SC utilizes the Segment Anything Model (SAM) to construct dense semantic voxel features from reliable visible semantic information in the image. The GR accurately predicts depth positions and then further adjusts the semantic output of SAM. Additionally, we design a Semantic Calibration Affinity to enhance semantic-aware transformations in the SC. Within the GR, a Shape Segments Interactive and Learnable Mask Generation module emphasizes the spatial location of semantics to obtain fine-grained voxel information. Extensive qualitative and quantitative results on the SemanticKITTI and SSCBench-KITTI-360 datasets show that our method outperforms state-of-the-art approaches.
Adversarial Attention Perturbations for Large Object Detection Transformers
Zachary Yahn
Georgia Institute of Technology
Selim Furkan Tekin
Georgia Institute of Technology
Fatih Ilhan
Georgia Institute of Technology
Sihao Hu
Georgia Institute of Technology
Tiansheng Huang
Georgia Institute of Technology
Yichang Xu
Georgia Institute of Technology
Margaret Loper
Georgia Tech Research Institute
Ling Liu
Georgia Institute of Technology
Abstract
Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at: Link.
MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost
Taiga Yamane
NTT Human Informatics Laboratories, NTT Corporation
Ryo Masumura
NTT Human Informatics Laboratories, NTT Corporation
Satoshi Suzuki
NTT Human Informatics Laboratories, NTT Corporation
Shota Orihashi
NTT Human Informatics Laboratories, NTT Corporation
Abstract
Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird's eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces trajectory motion cost and trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs calculate which pedestrians at the current and each past timestamp are likely identical based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that current pedestrian with the correct past trajectory based on other past timestamps. In addition, MVTrajecter effectively captures the relationships between multiple timestamps leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms the previous state-of-the-art methods.
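A simplified, non-learned version of combining trajectory motion and appearance costs over multiple past timestamps is sketched below, with a Hungarian assignment standing in for the model's association mechanism; weights, distances, and shapes are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cur_pos, cur_app, traj_pos, traj_app, w_motion=1.0, w_app=1.0):
    """Illustrative association mixing motion and appearance cues over T past timestamps.
    cur_pos:  (N, 2)     current BEV positions     cur_app:  (N, D)    appearance embeddings
    traj_pos: (M, T, 2)  past positions per track  traj_app: (M, T, D) past appearance per track
    """
    motion = np.linalg.norm(cur_pos[:, None, None, :] - traj_pos[None], axis=-1).mean(-1)  # (N, M)
    cur_n = cur_app / np.linalg.norm(cur_app, axis=-1, keepdims=True)
    trk_n = traj_app / np.linalg.norm(traj_app, axis=-1, keepdims=True)
    app = 1.0 - np.einsum("nd,mtd->nmt", cur_n, trk_n).mean(-1)                            # (N, M)
    cost = w_motion * motion + w_app * app
    rows, cols = linear_sum_assignment(cost)          # detection rows[i] -> track cols[i]
    return list(zip(rows, cols))

N, M, T, D = 4, 3, 5, 16
print(associate(np.random.rand(N, 2), np.random.rand(N, D),
                np.random.rand(M, T, 2), np.random.rand(M, T, D)))
```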
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan
Meituan
Fanfan Liu
Meituan
Yiyang Huang
Meituan
Zechao Guan
Meituan
Liming Zheng
Meituan
Yufeng Zhong
Meituan
Chengjian Feng
Meituan
Lin Ma
Meituan
Abstract
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani, on one hand, enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it further incorporates Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers a complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
Shaocheng Yan
Wuhan University
Pengcheng Shi
Wuhan University
Zhenjun Zhao
University of Zaragoza
Kaixin Wang
Beijing University of Technology
Kuang Cao
Wuhan University
Ji Wu
Wuhan University
Jiayuan Li
Wuhan University
Abstract
Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC2 scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates 208.22x faster than 3DMAC while also achieving higher recall. Our code is accessible at TurboReg.
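The 3-clique idea can be sketched as follows: build a length-consistency compatibility graph over correspondences, pick high-scoring pairs as pivots, and close each pivot edge into 3-cliques via common neighbors. In this sketch the pivot score is a plain common-neighbor count rather than the paper's SC2 measure, and transformation estimation is omitted, so it is only a rough illustration.

```python
import numpy as np

def turbo_cliques(src, dst, tau=0.1, n_pivots=20):
    """Rough sketch of pivot-guided 3-clique search on a correspondence compatibility graph."""
    N = len(src)
    # pairwise length consistency: |d_src(i,j) - d_dst(i,j)| is small for inlier pairs
    d_src = np.linalg.norm(src[:, None] - src[None], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None], axis=-1)
    compat = (np.abs(d_src - d_dst) < tau) & ~np.eye(N, dtype=bool)
    # score every compatible pair by its number of common neighbors, keep the top pivots
    common = (compat.astype(int) @ compat.astype(int)) * compat
    ii, jj = np.unravel_index(np.argsort(-common, axis=None)[:n_pivots], common.shape)
    cliques = []
    for i, j in zip(ii, jj):
        for k in np.flatnonzero(compat[i] & compat[j]):   # third vertex closes the 3-clique
            cliques.append((int(i), int(j), int(k)))
    return cliques

src = np.random.rand(50, 3)
R = np.linalg.qr(np.random.randn(3, 3))[0]                # random orthogonal transform
dst = src @ R.T + np.array([0.5, -0.2, 0.1])              # rigidly moved copy (all inliers)
print(len(turbo_cliques(src, dst)))
```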
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
Yung-Hsu Yang
ETH Zürich
Luigi Piccinelli
ETH Zürich
Mattia Segu
ETH Zürich
Siyuan Li
ETH Zürich
Rui Huang
ETH Zürich
Yuqian Fu
INSAIT
Marc Pollefeys
ETH Zürich
Hermann Blum
ETH Zürich
Zuria Bauer
ETH Zürich
Abstract
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior to improve the generalization of 3D estimation across diverse scenes. To further improve performance, we design a canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D → Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.
AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Dejie Yang
Peking University
Zijing Zhao
Peking University
Yang Liu
Peking University
Abstract
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multimodal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity. Code available at https://github.com/idejie/ar.
Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions
Mengyu Yang
Georgia Institute of Technology
Yiming Chen
Georgia Institute of Technology
Haozheng Pei
Georgia Institute of Technology
Siddhant Agarwal
Georgia Institute of Technology
Arun Balajee Vasudevan
Carnegie Mellon University
James Hays
Georgia Institute of Technology
Abstract
Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.
CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
Feng Yang
Southeast University
Yichao Cao
Southeast University
Xiu Su
Central South University
Dan Niu
Southeast University
Xuanpeng Li
Southeast University
Abstract
Understanding real-world 3D point clouds is challenging due to domain shifts. The key challenge is disentangling domain-invariant semantics from domain-specific geometric variations, as point clouds exhibit local inconsistency and global redundancy, making direct alignment ineffective. To address this, we propose CounterPC, a counterfactual intervention-based framework, which formulates domain adaptation within a causal latent space, identifying category-discriminative features entangled with intra-class geometric variation confounders. Through counterfactual interventions, we generate counterfactual target samples that retain domain-specific characteristics while improving class separation, mitigating domain bias for optimal feature transfer. To achieve this, we introduce two key modules: i) Joint Distribution Alignment, which leverages 3D foundation models (3D-FMs) and a self-supervised autoregressive generative prediction task to unify feature alignment, and ii) Counterfactual Feature Realignment, which employs Optimal Transport to align category-relevant and category-irrelevant feature distributions, ensuring robust sample-level adaptation while preserving domain properties. CounterPC outperforms current methods on PointDA and GraspNetPC-10 with significant improvements.
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Xuemeng Yang
Shanghai Artificial Intelligence Laboratory
Licheng Wen
Shanghai Artificial Intelligence Laboratory
Tiantian Wei
Technical University of Munich
Yukai Ma
Zhejiang University
Jianbiao Mei
Zhejiang University
Xin Li
Shanghai Artificial Intelligence Laboratory
Wenjie Lei
Zhejiang University
Daocheng Fu
Shanghai Artificial Intelligence Laboratory
Pinlong Cai
Shanghai Artificial Intelligence Laboratory
Min Dou
Shanghai Artificial Intelligence Laboratory
Liang He
East China Normal University
Yong Liu
Zhejiang University
Botian Shi
Shanghai Artificial Intelligence Laboratory
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Abstract
This paper introduces DRIVEARENA, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. DRIVEARENA comprises two core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any global street map, and World Dreamer, a high-fidelity conditional generative model with infinite auto-regression. DRIVEARENA supports closed-loop simulation using road networks from cities worldwide, enabling the generation of diverse traffic scenarios with varying styles. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DRIVEARENA's simulated environment. Furthermore, DRIVEARENA features a flexible, modular architecture, allowing for multiple implementations of its core components and driving agents. Serving as a highly realistic arena for these players, our work provides a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DRIVEARENA takes a significant leap forward in leveraging generative models for driving simulation platforms, opening new avenues for closed-loop evaluation of autonomous driving systems.
Driving View Synthesis on Free-form Trajectories with Generative Prior
Zeyu Yang
Fudan University
Zijie Pan
Fudan University
Yuankun Yang
Fudan University
Xiatian Zhu
University of Surrey
Li Zhang
Fudan University
Abstract
Driving view synthesis along free-form trajectories is essential for realistic driving simulations, enabling closed-loop evaluation of end-to-end driving policies. Existing methods excel at view interpolation along recorded paths but struggle to generalize to novel trajectories due to the limited viewpoints in driving videos. To tackle this challenge, we propose DriveX, a novel free-form driving view synthesis framework that progressively distills generative priors into the 3D Gaussian model during its optimization. Within this framework, we utilize a video diffusion model to refine the degraded novel-trajectory renderings from the in-training Gaussian model, while the restored videos in turn serve as additional supervision for optimizing the 3D Gaussians. Concretely, we craft an inpainting-based video restoration task, which disentangles the identification of degraded regions from the generative capability of the diffusion model and removes the need to simulate specific degradation patterns when training the diffusion model. To further enhance the consistency and fidelity of the generated content, the pseudo ground truth is progressively updated with the gradually improved novel-trajectory renderings, allowing both components to co-adapt and reinforce each other while minimizing disruption to the optimization. By tightly integrating 3D scene representation with generative priors, DriveX achieves high-quality view synthesis beyond recorded trajectories in real time, unlocking new possibilities for flexible and realistic driving simulations on free-form trajectories.
GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views
Hang Yang
Nanjing University of Science and Technology
Le Hui
Northwestern Polytechnical University
Jianjun Qian
Nanjing University of Science and Technology
Jin Xie
Nanjing University
Jian Yang
Nanjing University of Science and Technology
Abstract
Generalizable surface reconstruction aims to recover the surface of a scene from a sparse set of images in a feed-forward manner. Existing volume rendering-based methods evaluate numerous points along camera rays to infer the geometry, resulting in inefficient reconstruction. Recently, 3D Gaussian Splatting has offered an alternative efficient scene representation and inspired a series of surface reconstruction methods. However, these methods require dense views and cannot generalize to new scenes. In this paper, we propose a novel surface reconstruction method with Gaussian splatting, named GSRecon, which leverages the advantages of rasterization-based rendering to achieve efficient reconstruction. To obtain an accurate geometry representation, we propose a geometry-aware cross-view enhancement module to improve the unreliable geometry estimation in the current view by incorporating accurate geometric information from other views. To generate fine-grained Gaussian primitives, we propose a hybrid cross-view feature aggregation module that integrates an efficient voxel branch and a fine-grained point branch to jointly capture cross-view geometric information. Subsequently, per-view depth maps are rendered using these Gaussian primitives and fused to obtain the final 3D surface. Extensive experiments on the DTU, BlendedMVS, and Tanks and Temples datasets validate that GSRecon achieves state-of-the-art performance efficiently. Code is available at https://github.com/hyangwinter/GSRecon.
HFD-Teacher: High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion
Zhiyuan Yang
Nanyang Technological University
Anqi Cheng
Nanyang Technological University
Haiyue Zhu
SIMTech, A*STAR
Tianjiao Li
Nanyang Technological University
Pey Yuen Tao
SIMTech, A*STAR
Kezhi Mao
Nanyang Technological University
Abstract
Depth completion, the task of reconstructing dense depth maps from sparse depth and RGB images, plays a critical role in 3D scene understanding. However, existing methods often struggle to recover high-frequency details, such as regions with fine structures or weak signals, since depth sensors may fail to capture accurate depth maps in those regions, leading to imperfect supervision ground truth. To overcome this limitation, it is essential to introduce an alternative training source for the models. Emerging depth foundation models excel at producing high-frequency details from RGB images, yet their depth maps suffer from inconsistent scaling. Therefore, we propose a novel teacher-student framework that enhances depth completion by distilling high-frequency knowledge from depth foundation models across multiple scales. Our approach introduces two key innovations: Adaptive Local Wavelet Decomposition, which dynamically adjusts the wavelet decomposition level based on local complexity for efficient feature extraction, and Topological Constraints, which apply persistent homology to enforce structural coherence and suppress spurious depth edges. Experimental results demonstrate that our method outperforms state-of-the-art methods, preserving high-frequency details and overall depth fidelity.
InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
Zhuoran Yang
University of Science and Technology of China
Xi Guo
SenseAuto
Chenjing Ding
SenseAuto
Chiyu Wang
SenseAuto
Wei Wu
SenseAuto
Yanyong Zhang
University of Science and Technology of China
Abstract
Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems.
InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
Zesong Yang
Zhejiang University
Bangbang Yang
ByteDance
Wenqi Dong
Zhejiang University
Chenxuan Cao
Zhejiang University
Liyuan Cui
Zhejiang University
Yuewen Ma
Zhejiang University
Zhaopeng Cui
Zhejiang University
Hujun Bao
Zhejiang University
Abstract
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
Xiuyu Yang
UT Austin
Shuhan Tan
UT Austin
Philipp Krähenbühl
UT Austin
Abstract
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for the initial agents in a scene. This is problematic for long-term simulation, since agents enter and exit the scene as the ego vehicle moves into new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation modes, enabling stable long-term rollout simulation. InfGen achieves state-of-the-art performance in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen.
PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
ChangHe Yang
LG Electronics
Hyeonseop Song
LG Electronics
Seokhun Choi
LG Electronics
Seungwoo Lee
LG Electronics
Jaechul Kim
LG Electronics
Hoseok Do
LG Electronics
Abstract
Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. PoseSyn comprises two key components: the Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model, aligned with challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.
RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness
Yuyang Yang
Xiamen University
Wen Li
Xiamen University
Sheng Ao
Xiamen University
Qingshan Xu
Nanyang Technological University
Shangshu Yu
Northeastern University
Yu Guo
Xiamen University
Yin Zhou
GAC R&D Center
Siqi Shen
Xiamen University
Cheng Wang
Xiamen University
Abstract
LiDAR localization is a fundamental task in autonomous driving and robotics. Scene Coordinate Regression (SCR) exhibits leading pose accuracy, achieving impressive results in learning-based localization. We observe that real-world LiDAR scans captured from different viewpoints usually cause a catastrophic collapse of SCR. However, existing LiDAR localization methods have largely overlooked the issue of rotation sensitivity in SCR. In this paper, we present RALoc, an outdoor LiDAR localization method with rotation awareness to achieve accurate localization. The key to our approach is a Point Cloud Canonicalization module, which leverages powerful equivariant key feature aggregation to transform the input LiDAR scan towards a consistent orientation, effectively eliminating the adverse effects of rotation. This module has promising scalability and can be seamlessly integrated with existing LiDAR localization networks. Moreover, we propose the Bidirectional LiDAR Localization (BiLiLo) dataset as a benchmark to evaluate the performance of various methods in large outdoor scenes with significant rotation changes. Extensive experiments show that RALoc significantly improves localization performance in scenarios with large rotation changes, and also achieves competitive performance on the Oxford Radar RobotCar dataset. Our project is available at https://etheryangyy.github.io/raloc.github.io.
STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
Xiaohang Yang
Queen Mary University of London
Qing Wang
Queen Mary University of London
Jiahao Yang
Queen Mary University of London
Gregory Slabaugh
Queen Mary University of London
Shanxin Yuan
Queen Mary University of London
Abstract
Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration, while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code page: https://github.com/XiaohangYang829/StaR.
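As a rough illustration of a multi-level trajectory smoothness term, the sketch below penalizes first- and second-order temporal differences of a joint-position sequence; the actual temporal consistency constraint in STaR is not specified in the abstract, so the weights and form here are assumptions.

```python
import torch

def temporal_consistency_loss(joints: torch.Tensor,
                              w_vel: float = 1.0,
                              w_acc: float = 0.5) -> torch.Tensor:
    """Penalize jittery motion on a (T, J, 3) joint-position sequence.

    A minimal stand-in for a multi-level smoothness constraint:
    first-order (velocity) and second-order (acceleration) differences.
    """
    vel = joints[1:] - joints[:-1]          # (T-1, J, 3)
    acc = vel[1:] - vel[:-1]                # (T-2, J, 3)
    return w_vel * vel.pow(2).mean() + w_acc * acc.pow(2).mean()

# Toy usage: a retargeted sequence of 60 frames and 24 joints.
seq = torch.randn(60, 24, 3, requires_grad=True)
loss = temporal_consistency_loss(seq)
loss.backward()
```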
SpikeDiff: Zero-shot High-Quality Video Reconstruction from Chromatic Spike Camera and Sub-millisecond Spike Streams
Siqi Yang
Institute for Artificial Intelligence, Peking University
Jinxiu Liang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhaojun Huang
National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
Yeliduosi Xiaokaiti
National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
Yakun Chang
Institute of Information Science, Beijing Jiaotong University
Zhaofei Yu
Institute for Artificial Intelligence, Peking University
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
High-speed video reconstruction from neuromorphic spike cameras offers a promising alternative to traditional frame-based imaging, providing superior temporal resolution and dynamic range with reduced power consumption. Nevertheless, reconstructing high-quality colored videos from spikes captured in ultra-short time intervals (sub-millisecond) remains challenging due to the inherently noisy nature of spikes. While some existing methods extend the temporal capture window to improve reconstruction quality, they inevitably compromise the temporal resolution advantages of spike cameras. In this paper, we introduce SpikeDiff, the first zero-shot framework that leverages pretrained diffusion models to reconstruct high-quality colored videos from sub-millisecond (0.5ms) chromatic spike streams. By incorporating physics-based guidance into the diffusion sampling process, SpikeDiff bridges the domain gap between chromatic spikes and conventional images, enabling high-fidelity reconstruction without requiring domain-specific training data. Extensive experiments demonstrate that SpikeDiff achieves impressive reconstruction quality while maintaining ultra-high temporal resolution, outperforming existing methods across diverse challenging scenarios in both perceptual quality and structural preservation.
Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
Songru Yang
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Zhenwei Shi
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Zhengxia Zou
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Abstract
Understanding movements in multi-agent scenarios is a fundamental problem in intelligent systems. Previous research assumes complete and synchronized observations. However, real-world partial observation caused by occlusions leads to inevitable model failure, which demands a unified framework for coexisting trajectory prediction, imputation, and recovery. Unlike previous attempts that handled observed and unobserved behaviors in a coupled manner, we explore a decoupled denoising diffusion modeling paradigm with a unidirectional information valve to separate out the interference from uncertain behaviors. Building on this, we propose a Unified Masked Trajectory Diffusion model (UniMTD) for arbitrary levels of missing observations. We design a unidirectional attention mechanism as a valve unit to control the direction of information flow between the observed and masked areas, gradually refining the missing observations toward the real-world distribution. We further organize it into a unidirectional MoE structure to handle varying proportions of missing observations. A Cached Diffusion model is also designed to improve generation quality while reducing computation and time overhead. Our method achieves substantial gains on both human motion and vehicle traffic data. UniMTD efficiently achieves a 74% improvement in minADE20 and reaches SOTA with advantages of 91%, 66%, 69%, and 58% across four fidelity metrics covering out-of-boundary, velocity, and trajectory length.
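One plausible reading of the unidirectional valve is an attention mask that lets masked (unobserved) tokens attend to observed ones but not the reverse. The sketch below builds such a mask and plugs it into standard scaled dot-product attention; it is illustrative only and not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def valve_attention_mask(observed: torch.Tensor) -> torch.Tensor:
    """Return an (N, N) boolean mask where True marks *blocked* attention edges.

    `observed` is a (N,) bool tensor: True for observed trajectory tokens,
    False for masked/unobserved ones. Observed queries are blocked from
    attending to unobserved keys, so uncertainty cannot leak back, while
    unobserved queries may attend everywhere (an assumed reading of the valve).
    """
    obs_q = observed[:, None]   # queries
    obs_k = observed[None, :]   # keys
    return obs_q & ~obs_k       # block: observed query -> unobserved key

# Toy usage with PyTorch's scaled dot-product attention.
N, d = 6, 16
observed = torch.tensor([True, True, True, False, False, False])
q = k = v = torch.randn(1, 1, N, d)
blocked = valve_attention_mask(observed)
# For a boolean attn_mask, True means "allowed to attend", so invert `blocked`.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=~blocked)
```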
Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Chengtang Yao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Lidong Yu
NVIDIA
Zhidan Liu
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Jiaxi Zeng
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Yuwei Wu
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Yunde Jia
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Abstract
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from vision foundation models (VFMs) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Second, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth into the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching effectively and efficiently. Experiments show a significant improvement in performance when generalizing from SceneFlow to the Middlebury and Booster datasets, with barely any loss in efficiency.
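To see why a binary local ordering map can unify relative and absolute depth, the sketch below converts a monocular depth map into per-neighbor ordering bits; since only the sign of local differences is kept, the map is unchanged under affine rescaling. The neighbor choice and hard thresholding are illustrative simplifications, not the paper's exact construction.

```python
import numpy as np

def binary_local_ordering(depth: np.ndarray) -> np.ndarray:
    """Binary relative-ordering maps from a (relative) monocular depth map.

    For each pixel, record whether it is deeper than its right and bottom
    neighbors. Keeping only the sign of local differences makes the map
    invariant to affine rescaling of the monocular depth.
    """
    right = np.roll(depth, -1, axis=1)   # neighbor at (i, j+1)
    down = np.roll(depth, -1, axis=0)    # neighbor at (i+1, j)
    order_r = (depth > right).astype(np.uint8)
    order_d = (depth > down).astype(np.uint8)
    order_r[:, -1] = 0                   # last column wraps around; mark invalid
    order_d[-1, :] = 0                   # last row wraps around; mark invalid
    return np.stack([order_r, order_d], axis=0)

# Toy usage: the map should match for a depth map and an affine transform of it.
d = np.random.rand(8, 8).astype(np.float32)
print(np.array_equal(binary_local_ordering(d), binary_local_ordering(2.5 * d + 1.0)))
```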
MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
Xingbo Yao
Hong Kong University of Science and Technology (Guangzhou)
Xuanmin Wang
Tianjin University
Hao Wu
Hong Kong University of Science and Technology
Chengliang Ping
Hong Kong University of Science and Technology
Doudou Zhang
Hong Kong University of Science and Technology
Hui Xiong
Hong Kong University of Science and Technology
Abstract
Directly generating 3D cities from satellite imagery opens up new possibilities for gaming and mapping services. However, this task remains challenging due to the limited information in satellite views, making it difficult for existing methods to achieve both photorealistic textures and geometric accuracy. To address these challenges, we propose MagicCity, a novel large-scale generative model for photorealistic 3D city generation with geometric consistency. Given a satellite image, our framework first extracts 3D geometric information and encodes it alongside textural features using a dual encoder. These features then guide a multi-branch diffusion model to generate city-scale, geometrically consistent multi-view images. To further enhance texture consistency across different viewpoints, we propose an Inter-Frame Cross Attention mechanism that enables feature sharing across different frames. Additionally, we incorporate a Hierarchical Geometric-Aware Module and a Consistency Evaluator to improve overall scene consistency. Finally, the generated images are fed into our robust 3D reconstruction pipeline to produce visually high-quality and geometrically consistent 3D cities. Moreover, we contribute CityVista, a high-quality dataset comprising 500 3D city scenes along with corresponding multi-view images and satellite imagery to advance research in 3D city generation. Experimental results demonstrate that MagicCity surpasses state-of-the-art methods in both geometric consistency and visual quality. Our project page: https://github.com/YaoXingbo/MagicCity
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Xuan Yao
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA)
Junyu Gao
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)
Changsheng Xu
Peng Cheng Laboratory
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that NavMorph achieves notable performance improvements on popular VLN-CE benchmarks. Our code is available at https://github.com/Feliciaxyao/NavMorph.
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
Siyuan Yao
Sun Yat-sen University
Rui Zhu
Beijing University of Posts and Telecommunications
Ziqi Wang
Beijing University of Posts and Telecommunications
Wenqi Ren
Sun Yat-sen University
Yanyang Yan
University of Chinese Academy of Sciences
Xiaochun Cao
Sun Yat-sen University
Abstract
Visual object tracking has gained promising progress in past decades. Most existing approaches focus on learning target representations from well-conditioned daytime data, while in unconstrained real-world scenarios with adverse weather conditions, e.g., nighttime or foggy environments, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% of the frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and sets new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.
Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings
Haoyu Yao
School of Computer Science, Wuhan University
Bin Yang
School of Computer Science, Wuhan University
Wenke Huang
School of Computer Science, Wuhan University
Bo Du
School of Computer Science, Wuhan University
Mang Ye
School of Computer Science, Wuhan University
Abstract
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to train a cross-modality retrieval model without labels, reducing the reliance on expensive cross-modality manual annotation. However, existing USL-VI-ReID methods rely on artificially paired cross-modality data as implicit supervision, which is also expensive to annotate and contrary to the setting of unsupervised tasks. In addition, this full alignment of identities across modalities is inconsistent with real-world scenarios, where unpaired settings are prevalent. To this end, we study the USL-VI-ReID task under unpaired settings, which uses cross-modality unpaired and unlabeled data to train a VI-ReID model. We propose a novel Mapping and Collaborative Learning (MCL) framework. Specifically, we first design a simple yet effective Cross-modality Feature Mapping (CFM) module to map and generate fake cross-modality positive feature pairs, constructing a cross-modal pseudo-identity space for feature alignment. Then, a Static-Dynamic Collaborative (SDC) learning strategy is proposed to align cross-modality correspondences through a collaborative approach, eliminating inter-modality discrepancies across different aspects, i.e., cluster-level and instance-level, in scenarios with cross-modal identity mismatches. Extensive experiments on the SYSU-MM01 and RegDB benchmarks under paired and unpaired settings demonstrate that our proposed MCL significantly outperforms existing unsupervised methods, facilitating the real-world deployment of USL-VI-ReID.
GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
Shunsuke Yasuki
Rikkyo University
Taiki Miyanishi
The University of Tokyo
Nakamasa Inoue
Institute of Science Tokyo
Shuhei Kurita
National Institute of Informatics
Koya Sakamoto
The University of Tokyo
Daichi Azuma
The University of Tokyo
Masato Taki
Rikkyo University
Yutaka Matsuo
The University of Tokyo
Abstract
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.
Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Cloud Classification via Token Purging
Moslem Yazdanpanah
LIVIA, ÉTS Montréal
Ali Bahri
International Laboratory on Learning Systems (ILLS)
Mehrdad Noori
International Laboratory on Learning Systems (ILLS)
Sahar Dastani
International Laboratory on Learning Systems (ILLS)
Gustavo Adolfo Vargas Hakim
International Laboratory on Learning Systems (ILLS)
David Osowiechi
International Laboratory on Learning Systems (ILLS)
Ismail Ben Ayed
International Laboratory on Learning Systems (ILLS)
Christian Desrosiers
International Laboratory on Learning Systems (ILLS)
Abstract
Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4x faster and 5.5x more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at https://github.com/MosyMosy/Purge-Gate
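A minimal sketch of source-statistics-based token purging (in the spirit of PG-SP) is shown below: tokens whose features deviate most from the source-domain mean and standard deviation are dropped before attention. The z-score deviation measure and the fixed purge ratio are assumptions for illustration, not the paper's exact criterion.

```python
import torch

def purge_tokens(tokens: torch.Tensor,
                 source_mean: torch.Tensor,
                 source_std: torch.Tensor,
                 purge_ratio: float = 0.2) -> torch.Tensor:
    """Drop the tokens that deviate most from source statistics.

    tokens:      (B, N, D) point-cloud tokens entering the attention layers.
    source_mean: (D,) per-dimension mean of source-domain token features.
    source_std:  (D,) per-dimension std  of source-domain token features.
    """
    b, n, _ = tokens.shape
    z = (tokens - source_mean) / (source_std + 1e-6)                 # (B, N, D)
    deviation = z.abs().mean(dim=-1)                                  # (B, N)
    n_keep = max(1, int(n * (1.0 - purge_ratio)))
    keep_idx = deviation.topk(n_keep, dim=1, largest=False).indices   # least deviant
    keep_idx = keep_idx.sort(dim=1).values                            # preserve token order
    return torch.gather(tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Toy usage: 2 clouds, 128 tokens, 64-dim features -> 102 tokens kept per cloud.
tok = torch.randn(2, 128, 64)
kept = purge_tokens(tok, tok.mean(dim=(0, 1)), tok.std(dim=(0, 1)))
print(kept.shape)
```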
ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection
Sheng Ye
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xin Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yan Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xianming Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Camouflaged object detection (COD) faces unique challenges where target boundaries are intrinsically ambiguous due to their textural similarity to backgrounds. Existing methods relying on single-modality features often produce fragmented predictions due to insufficient boundary constraints. To address this, we propose ESCNet with dynamically coupled edge-texture perception. Our framework introduces three core innovations that work in concert: 1) an Adaptive Edge-Texture Perceptor (AETP), which makes edge and texture predictions mutually reinforcing by combining multi-scale image features with the global semantic context of the Transformer; 2) a Dual-Stream Feature Augmentor (DSFA), which dynamically adjusts kernel sampling positions according to local texture complexity and edge orientation, thereby accurately enhancing features at fractal boundaries and amorphous texture regions; and 3) a Multi-Feature Modulation Module (MFMM), which establishes incremental fine-grained improvements for feature calibration and model prediction through enhanced edge-aware characterisation and hierarchical integration of multiple textures. This interconnected system forms a feedback loop in which enhanced edge-aware representations improve texture prediction and vice versa. ESCNet demonstrates significant performance advantages on all three authoritative datasets.
GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting
Baijun Ye
IIIS, Tsinghua University
Minghui Qin
IIIS, Tsinghua University
Saining Zhang
AIR, Tsinghua University
Moonjun Goon
IIIS, Tsinghua University
Shaoting Zhu
IIIS, Tsinghua University
Hao Zhao
AIR, Tsinghua University
Hang Zhao
IIIS, Tsinghua University
Abstract
Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and require additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) the ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. This highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for scalable auto-labeling. Project Page.
Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Chongjie Ye
SSE, CUHKSZ
Yushuang Wu
ByteDance Games
Ziteng Lu
SSE, CUHKSZ
Jiahao Chang
SSE, CUHKSZ
Xiaoyang Guo
ByteDance Games
Jiaqing Zhou
ByteDance Games
Hao Zhao
AIR, Tsinghua University
Xiaoguang Han
SSE, CUHKSZ
Abstract
With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples low- and high-frequency image patterns with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
Junyan Ye
Sun Yat-Sen University
Jun He
Sun Yat-Sen University
Weijia Li
Sun Yat-Sen University
Zhutao Lv
Sun Yat-Sen University
Yi Lin
Sun Yat-Sen University
Jinhua Yu
Sun Yat-Sen University
Haote Yang
Shanghai AI Laboratory
Conghui He
Shanghai AI Laboratory
Abstract
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street-view images, utilizing a diffusion model and the Bird's-Eye View (BEV) paradigm. The CurvedBEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. SkyDiffusion then uses a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster-scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets, and more information on this work can be found at https://opendatalab.github.io/skydiffusion/.
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Sun Yat-Sen University
Honglin Lin
Shanghai AI Laboratory
Leyan Ou
Sun Yat-Sen University
Dairong Chen
Wuhan University
Zihao Wang
Sun Yat-Sen University
Qi Zhu
Sun Yat-Sen University
Conghui He
Shanghai AI Laboratory
Weijia Li
Sun Yat-Sen University
Abstract
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More can be found at https://github.com/yejy53/CVG-Text.
Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
Qi Xun Yeo
Department of Computer Science, National University of Singapore
Yanyan Li
Department of Computer Science, National University of Singapore
Gim Hee Lee
Department of Computer Science, National University of Singapore
Abstract
Modern 3D semantic scene graph estimation methods utilize ground-truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground-truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy pseudo point-based geometry reconstructed from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and with neighboring relations. We obtain semantic masks to guide feature aggregation to filter out background features and design a novel method to incorporate neighboring node information to aid the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors computed from training-set summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods that purely use multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.
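As an illustration of statistical confidence rescoring, the sketch below blends a node's predicted class distribution with a prior row-averaged from a training-set co-occurrence matrix over its one-hop neighbors; the exact form of the priors and the blending rule in the paper may differ.

```python
import numpy as np

def rescore_node(pred_probs: np.ndarray,
                 neighbor_labels: list,
                 cooccurrence: np.ndarray,
                 alpha: float = 0.7) -> np.ndarray:
    """Blend a node's predicted class distribution with a statistical prior.

    pred_probs:    (C,) softmax scores for the node.
    cooccurrence:  (C, C) row-normalized matrix estimating P(node class | neighbor class),
                   counted on the training set (an assumed form of the priors).
    The prior is the average of the rows selected by the one-hop neighbors'
    current labels; alpha controls how much the network prediction is trusted.
    """
    if not neighbor_labels:
        return pred_probs
    prior = cooccurrence[neighbor_labels].mean(axis=0)   # (C,)
    rescored = alpha * pred_probs + (1.0 - alpha) * prior
    return rescored / rescored.sum()

# Toy usage with 4 classes and two neighbors labeled 2 and 3.
rng = np.random.default_rng(0)
cooc = rng.random((4, 4)); cooc /= cooc.sum(axis=1, keepdims=True)
probs = np.array([0.1, 0.2, 0.3, 0.4])
print(rescore_node(probs, neighbor_labels=[2, 3], cooccurrence=cooc))
```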
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Chandan Yeshwanth
Technical University of Munich
Dávid Rozenberszki
Technical University of Munich
Angela Dai
Technical University of Munich
Abstract
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail and do not capture fine-grained details of the parts of objects. In order to produce varying levels of detail capturing both coarse object-level information and detailed part-level descriptions, we propose the task of expressive 3D captioning. Given an input 3D scene, the task is to describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage consistency between the multiple levels of descriptions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level details generated by ExCap3D are more expressive than those produced by state-of-the-art methods, with a CIDEr score improvement of 17% and 124% for object- and part-level details, respectively. Our code, dataset and models will be made publicly available.
LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
Xunpeng Yi
Electronic Information School, Wuhan University
Yibing Zhang
Electronic Information School, Wuhan University
Xinyu Xiang
Electronic Information School, Wuhan University
Qinglong Yan
Electronic Information School, Wuhan University
Han Xu
School of Automation, Southeast University
Jiayi Ma
Electronic Information School, Wuhan University
Abstract
Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability on real-time fusion devices. In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. First, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well suited to multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization-based LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.
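To show why look-up-table fusion is fast at inference, here is a toy learnable 2D LUT in which the quantized (visible, infrared) intensity pair indexes a fusion weight; the real LUT-Fuse structure additionally uses a joint contextual scene encoding and is distilled from MM-Net, so this is only a schematic stand-in with assumed bin counts and a placeholder teacher.

```python
import torch
import torch.nn as nn

class TinyFusionLUT(nn.Module):
    """A minimal learnable 2D look-up table for IR/visible fusion.

    The quantized (visible, infrared) intensity pair indexes a table of fusion
    weights, so inference is a memory lookup plus a blend.
    """
    def __init__(self, bins: int = 17):
        super().__init__()
        self.bins = bins
        self.table = nn.Parameter(torch.zeros(bins, bins))  # raw logits of the weight

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, 1, H, W) intensities in [0, 1]
        iv = (vis * (self.bins - 1)).round().long().clamp(0, self.bins - 1)
        ii = (ir * (self.bins - 1)).round().long().clamp(0, self.bins - 1)
        w = torch.sigmoid(self.table)[iv, ii]                # (B, 1, H, W) in (0, 1)
        return w * vis + (1.0 - w) * ir

# Toy usage: distill toward a placeholder "teacher" fusion output with an L1 loss.
lut = TinyFusionLUT()
vis, ir = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
teacher = 0.6 * vis + 0.4 * ir                               # stand-in for MM-Net output
loss = (lut(vis, ir) - teacher).abs().mean()
loss.backward()
```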
ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
Xiangyu Yin
University of Pittsburgh
Boyuan Yang
University of Pittsburgh
Weichen Liu
University of Pittsburgh
Qiyao Xue
University of Pittsburgh
Abrar Alamri
University of Pittsburgh
Goeran Fiedler
University of Pittsburgh
Wei Gao
University of Pittsburgh
Abstract
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. The ProGait dataset is available at https://huggingface.co/datasets/ericyxy98/ProGait, and the source code of our benchmark tasks is available at https://github.com/pittisl/ProGait.
MOVE: Motion-Guided Few-Shot Video Object Segmentation
Kaining Ying
Fudan University
Hengrui Hu
Fudan University
Henghui Ding
Fudan University
Abstract
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion-Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
Haiyang Ying
University of Maryland, College Park
Matthias Zwicker
University of Maryland, College Park
Abstract
Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image loss can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations to reduce redundant edges and apply them along with the sketch optimization, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
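The core primitive, sampling points along a parametric sketch so that image-space gradients can reach its control points, can be sketched as follows for a cubic Bezier curve; the scale and opacity attributes and the Gaussian rasterizer itself are omitted, so this is only an illustrative fragment of the pipeline.

```python
import torch

def sample_cubic_bezier(ctrl: torch.Tensor, n_samples: int = 32) -> torch.Tensor:
    """Sample 3D points along a cubic Bezier sketch.

    ctrl: (4, 3) control points defining one parametric edge. In a
    SketchSplat-style pipeline each sampled point would be splatted as a
    Gaussian onto 2D edge images, so image-space gradients can flow back
    to the control points.
    """
    t = torch.linspace(0.0, 1.0, n_samples, dtype=ctrl.dtype).unsqueeze(1)  # (S, 1)
    b0 = (1 - t) ** 3
    b1 = 3 * (1 - t) ** 2 * t
    b2 = 3 * (1 - t) * t ** 2
    b3 = t ** 3
    return b0 * ctrl[0] + b1 * ctrl[1] + b2 * ctrl[2] + b3 * ctrl[3]        # (S, 3)

# Toy usage: gradients reach the control points through the sampled points.
ctrl = torch.randn(4, 3, requires_grad=True)
pts = sample_cubic_bezier(ctrl)
pts.sum().backward()
```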
Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Seungju Yoo
Yonsei University
Hyuk Kwon
Yonsei University
Joong-Won Hwang
ETRI
Kibok Lee
Yonsei University
Abstract
Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
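A minimal reading of the two PCR ingredients can be sketched as follows: spatial consistency is measured as the IoU between each kept (post-NMS) box and its overlapping pre-NMS candidates, and reliability as the mean confidence of those candidates. The final aggregation into a single score is an assumption, not the paper's exact formula.

```python
import numpy as np

def iou_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) boxes in xyxy format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (a[:, 2:] - a[:, :2]).prod(axis=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def pcr_score(pre_nms_boxes, pre_nms_scores, post_nms_boxes, iou_thr: float = 0.5):
    """Label-free detection-quality proxy in the spirit of PCR (illustrative)."""
    ious = iou_matrix(post_nms_boxes, pre_nms_boxes)          # (K, N)
    consistencies, reliabilities = [], []
    for k in range(post_nms_boxes.shape[0]):
        overlap = ious[k] >= iou_thr
        if overlap.any():
            consistencies.append(ious[k, overlap].mean())     # spatial consistency
            reliabilities.append(pre_nms_scores[overlap].mean())  # confidence reliability
    if not consistencies:
        return 0.0
    return float(np.mean(consistencies) * np.mean(reliabilities))

# Toy usage: three candidates collapse to one kept box after NMS.
pre = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [200, 200, 240, 240]], float)
scores = np.array([0.9, 0.8, 0.3])
post = np.array([[10, 10, 50, 50]], float)
print(pcr_score(pre, scores, post))
```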
S4M: Boosting Semi-Supervised Instance Segmentation with SAM
Heeji Yoon
KAIST AI
Heeseong Shin
KAIST AI
Eunbeen Hong
KAIST AI
Hyunwook Choi
Korea University
Hansang Cho
Samsung Electro-Mechanics
Daun Jeong
Samsung Electro-Mechanics
Seungryong Kim
KAIST AI
Abstract
Semi-supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher-student frameworks still suffer from performance constraints due to unreliable pseudo-label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM to this task introduces challenges such as class-agnostic predictions and potential over-segmentation. To address these complexities, we carefully integrate SAM into the semi-supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo-label refinement as well as a specialized data augmentation with the refined pseudo-labels, resulting in superior performance. We establish state-of-the-art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.
MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction
Yusuke Yoshiyasu
National Institute of Advanced Industrial Science and Technology (AIST)
Leyuan Sun
National Institute of Advanced Industrial Science and Technology (AIST)
Ryusuke Sagawa
National Institute of Advanced Industrial Science and Technology (AIST)
Abstract
In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
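The serialization step can be illustrated in a few lines: vertices are grouped by body-part label and ordered within each part by template coordinates, producing a fixed permutation applied to every mesh. The concrete sort keys below are assumptions consistent with the description above, not the paper's exact recipe.

```python
import numpy as np

def serialize_vertices(template_xyz: np.ndarray, part_labels: np.ndarray) -> np.ndarray:
    """Return a permutation serializing mesh vertices into an SSM-friendly order.

    Vertices are grouped by body-part label and, within each part, ordered by
    the template's z, then y, then x coordinate, so nearby vertices of the same
    part become contiguous tokens.
    """
    z, y, x = template_xyz[:, 2], template_xyz[:, 1], template_xyz[:, 0]
    # np.lexsort sorts by the *last* key first, so part_labels is the primary key.
    return np.lexsort((x, y, z, part_labels))

# Toy usage: the same permutation is applied to every mesh in the dataset.
template = np.random.rand(10_000, 3).astype(np.float32)
parts = np.random.randint(0, 24, size=10_000)            # e.g. SMPL-style part ids
order = serialize_vertices(template, parts)
serialized = template[order]                              # (10000, 3) tokens in SSM order
```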
BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
Yuanhong Yu
Zhejiang University
Xingyi He
Zhejiang University
Chen Zhao
EPFL
Junhao Yu
Chongqing University
Jiaqi Yang
Northwestern Polytechnical University
Ruizhen Hu
Shenzhen University
Yujun Shen
Ant Group
Xing Zhu
Ant Group
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce the corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As semantic points of the object, the corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.
DADet: Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection
Hongwei Yu
University of Science and Technology Beijing
Xinlong Ding
University of Science and Technology Beijing
Jiawei Li
University of Science and Technology Beijing
Jinlong Wang
University of Science and Technology Beijing
Yudong Zhang
Tsinghua University
Rongquan Wang
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Abstract
While image conditional diffusion models demonstrate impressive generation capabilities, they exhibit high vulnerability when facing backdoor and adversarial attacks. In this paper, we define a scenario named diffusion anomaly where the generated results of a reverse process under attack deviate significantly from the normal ones. By analyzing the underlying formation mechanism of the diffusion anomaly, we reveal how perturbations are amplified during the reverse process and accumulated in the results. Based on the analysis, we reveal the phenomena of divergence and homogeneity, which cause the diffusion process to deviate significantly from the normal process and to decline in diversity. Leveraging these two phenomena, we propose a method named Diffusion Anomaly Detection (DADet) to effectively detect both backdoor and adversarial attacks. Extensive experiments demonstrate that our proposal achieves excellent defense performance against backdoor and adversarial attacks. Specifically, for the backdoor attack detection, our method achieves an F1 score of 99% on different datasets, including MS COCO and CIFAR-10. For the detection of adversarial samples, the F1 score exceeds 84% across three adversarial attacks and two different tasks, evaluated on the MS COCO and Places365 datasets, respectively.
DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
Rui Yu
East China University of Science and Technology
Xianghang Zhang
SenseAuto Research
Runkai Zhao
The University of Sydney
Huaicheng Yan
East China University of Science and Technology
Meng Wang
East China University of Science and Technology
Abstract
End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge-distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at https://github.com/YuruiAI/DistillDrive
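One simple way to use diversified teacher planning instances as multi-objective targets is a winner-takes-all imitation loss, sketched below; DistillDrive's full training additionally involves reinforcement learning and generative latent modeling, so this covers only the imitation part and its details are assumed.

```python
import torch

def multi_mode_distillation_loss(student_modes: torch.Tensor,
                                 teacher_plans: torch.Tensor) -> torch.Tensor:
    """Winner-takes-all imitation of diversified teacher planning instances.

    student_modes: (B, M, T, 2) M candidate ego trajectories from the student.
    teacher_plans: (B, K, T, 2) K diversified plans from the teacher model.
    Each teacher plan supervises its closest student mode (an assumed reading
    of "diversified planning instances as multi-objective learning targets").
    """
    # Pairwise average displacement error between student modes and teacher plans.
    diff = student_modes[:, :, None] - teacher_plans[:, None]   # (B, M, K, T, 2)
    ade = diff.norm(dim=-1).mean(dim=-1)                        # (B, M, K)
    return ade.min(dim=1).values.mean()                         # best mode per teacher plan

# Toy usage: 6 student modes distilled toward 3 teacher plans over 8 time steps.
student = torch.randn(4, 6, 8, 2, requires_grad=True)
teacher = torch.randn(4, 3, 8, 2)
loss = multi_mode_distillation_loss(student, teacher)
loss.backward()
```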
Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
Abstract
We present ViTaM-D, a novel visual-tactile framework for reconstructing dynamic hand-object interaction with distributed tactile sensing to enhance contact modeling. Existing methods, relying solely on visual inputs, often fail to capture occluded interactions and object deformation. To address this, we introduce DF-Field, a distributed force-aware contact representation leveraging kinetic and potential energy in hand-object interactions. ViTaM-D first reconstructs interactions using a visual network with a contact constraint, then refines contact details through force-aware optimization, improving object deformation modeling. To evaluate deformable object reconstruction, we introduce the HOT dataset, featuring 600 hand-object interaction sequences in a high-precision simulation environment. Experiments on the DexYCB and HOT datasets show that ViTaM-D outperforms state-of-the-art methods in reconstruction accuracy for both rigid and deformable objects. DF-Field also proves more effective in refining hand poses and enhancing contact modeling than previous refinement methods.
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Chuang Yu
Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences
Jinmiao Zhao
Shenyang Institute of Automation, Chinese Academy of Sciences
Yunpeng Liu
Shenyang Institute of Automation, Chinese Academy of Sciences
Sicheng Zhao
Tsinghua University
Yimian Dai
Nankai University
Xiangyu Yue
MMLab, The Chinese University of Hong Kong
Abstract
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting the embedded network's performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework, which drives existing SIRST detection networks to progressively and actively recognize and learn harder samples. Specifically, to avoid an early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model acquire basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework achieve state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code is available at https://github.com/YuChuang1205/PAL
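The decay factor balancing expansion and contraction of pseudo-labels could look roughly like the sketch below; the update rule, thresholds, and soft-label representation are illustrative assumptions and not the authors' PAL code.

```python
# An illustrative pseudo-label update with a decay factor that balances
# expansion (adding confident new pixels) against contraction (fading stale ones).
import numpy as np

def update_pseudo_label(prev_label, pred_prob, decay=0.9, expand_thr=0.7, keep_thr=0.3):
    """prev_label: (H, W) soft pseudo-label in [0, 1].
       pred_prob:  (H, W) current network prediction in [0, 1]."""
    # Decay the previous label so regions no longer supported by the model fade out.
    decayed = decay * prev_label
    # Expand with confidently predicted pixels.
    expanded = np.maximum(decayed, (pred_prob > expand_thr) * pred_prob)
    # Contract: drop pixels whose support has fallen below the keep threshold.
    expanded[expanded < keep_thr] = 0.0
    return expanded
```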
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Kelin Yu
University of Maryland, College Park
Sheng Zhang
University of Maryland, College Park
Harshit Soora
University of Maryland, College Park
Furong Huang
University of Maryland, College Park
Heng Huang
University of Maryland, College Park
Pratap Tokekar
University of Maryland, College Park
Ruohan Gao
University of Maryland, College Park
Abstract
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GENFLOWRL, which derives shaped rewards from generated flow trained on diverse cross-embodiment datasets. This enables learning generalizable and robust policies from diverse demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GENFLOWRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios. Our Project Page: https://colinyu1.github.io/genflowrl/.
Language Driven Occupancy Prediction
Zhu Yu
Zhejiang University
Bowen Pang
Zhejiang University
Lizhe Liu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Runmin Zhang
Zhejiang University
Qiang Li
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Si-Yuan Cao
Ningbo Global Innovation Center, Zhejiang University
Maochun Luo
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Mingxia Chen
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Sheng Yang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Hui-Liang Shen
Zhejiang University
Abstract
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset.
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Yating Yu
Northwestern Polytechnical University
Congqi Cao
Northwestern Polytechnical University
Yifan Zhang
Institute of Automation, Chinese Academy of Sciences
Yanning Zhang
Northwestern Polytechnical University
Abstract
Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for in-context generalization in open-vocabulary action recognition. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective on generalization, Open-MeDe adopts a meta-learning approach to improve 'known-to-open generalizing' and 'image-to-video debiasing' in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, the absence of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios. Code is released at https://github.com/Mia-YatingYu/Open-MeDe.
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Mark Yu
ARC Lab, Tencent PCG
Wenbo Hu
The Chinese University of Hong Kong
Jinbo Xing
The Chinese University of Hong Kong
Ying Shan
The Chinese University of Hong Kong
Abstract
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets through our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.
ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching
Yuxuan Yuan
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Luyao Tang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yixin Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Chaoqi Chen
Shenzhen University
Yue Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xinghao Ding
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Although existing Single-Domain Generalized Object Detection (Single-DGOD) methods enable models to generalize to unseen domains, most assume that the training and testing data share the same label space. In real-world scenarios, unseen domains often introduce previously unknown objects, a challenge that has been largely overlooked. In this paper, we tackle the practical problem of Single-domain Generalizable Open-Set Object Detection (SG-OSOD), which addresses both unseen domains and unknown classes. We identify two key challenges: (1) detecting unknown classes with only known-class data, and (2) learning robust features to mitigate domain shift. To address these challenges, we propose the framework termed ASGS, which leverages adaptive subgraph structures to enhance the understanding of unknown scenes and classes. ASGS consists of Subgraph-wise Unknown-class Learning (SUL) and Class-wise Embedding Compaction (CEC). SUL employs non-parametric methods to detect unknown samples and performs Adaptive Subgraph Searching (ASS) for high-order structural feature extraction, enabling domain-robust unknown class learning. Moreover, the CEC module enhances class discrimination robustness through contrastive learning, which results in more compact class clusters in unknown scenarios. Experimental results demonstrate the effectiveness of the proposed ASGS.
CAT: A Unified Click-and-Track Framework for Realistic Tracking
Yongsheng Yuan
Dalian University of Technology, China
Jie Zhao
Dalian University of Technology, China
Dong Wang
Dalian University of Technology, China
Huchuan Lu
Dalian University of Technology, China
Abstract
Modern visual trackers have achieved robust performance with precisely initialized target bounding boxes. However, providing high-precision initial annotations is a process both labor-intensive and error-prone in real-world scenarios. Interactive initialization (e.g., click-based, scribble-based) presents a more practical alternative. In this paper, we introduce a unified Click-and-Track (CAT) framework for full-process tracking, eliminating the need for auxiliary models or complex initialization pipelines. We present a novel fine-tuning paradigm that bridges the information gap inherent in click-based initialization through two key innovations: 1) The proposed click-based localization and joint spatial-visual prompt refinement are sequentially performed to compensate for the geometric information loss (e.g., boundary ambiguity, shape uncertainty) inherent in click-based initialization. 2) We design a parameter-efficient module called CTMoE to leverage the tracker's inherent capabilities when fine-tuning. The proposed CTMoE enables the foundation model to learn different matching patterns, unifying click-based initialization and tracking within a unified architecture. Extensive experimental results demonstrate state-of-the-art performance of our click-based tracking method on the LaSOT benchmark (70.5% AUC) while maintaining parameter efficiency, surpassing existing click-based tracking frameworks by a large margin and even outperforming some bounding-box-initialized trackers. The code and models are available at https://github.com/ysyuann/CAT.
Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
Zhensheng Yuan
Jinan University
Haozhi Huang
Jinan University
Zhen Xiong
Jinan University
Di Wang
Jinan University
Guanghua Yang
Jinan University
Abstract
We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.
Scaling 3D Compositional Models for Robust Classification and Pose Estimation
Xiaoding Yuan
Johns Hopkins University
Guofeng Zhang
Johns Hopkins University
Prakhar Kaushik
Johns Hopkins University
Artur Jesslen
University of Freiburg
Adam Kortylewski
University of Freiburg
Alan Yuille
Johns Hopkins University
Abstract
Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. Recently, a class of Neural Mesh Models have been developed where objects are represented in terms of 3D meshes with learned features at the vertices. These models have shown robustness in small-scale settings, involving 10 objects, but it is unclear whether they can be scaled up to hundreds of object classes. The main problem is that their training involves contrastive learning among the vertices of all object classes, which scales quadratically with the number of classes. We present a strategy which exploits the compositionality of the objects, i.e., the independence of the feature vectors of the vertices, which greatly reduces the training time while also improving the performance of the algorithms. We first restructure the per-vertex contrastive learning into contrasting within class and between classes. Then we propose a process that dynamically decouples the contrast between classes which are rarely confused, and enhances the contrast between the vertices of classes that are most confused. Our large-scale 3D compositional model not only achieves state-of-the-art performance on the task of predicting classification and pose estimation simultaneously, surpassing Neural Mesh Models and standard DNNs, but is also more robust to out-of-distribution testing including occlusion, weather conditions, synthetic data, and generalization to unknown classes.
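A hedged sketch of the restructured contrastive scheme is shown below: within-class contrast keeps a class's vertex features distinct, while between-class contrast is applied only to class pairs whose estimated confusion exceeds a threshold; the specific penalties, tensor shapes, and confusion threshold are assumptions for illustration.

```python
# Illustrative within-class / confused-pair-only between-class contrast.
import torch
import torch.nn.functional as F

def compositional_contrast(vertex_feats, confusion, conf_thr=0.05):
    """vertex_feats: (C, V, D) learned features for V vertices of each of C classes.
       confusion:    (C, C) estimated class-confusion rates."""
    C, V, _ = vertex_feats.shape
    f = F.normalize(vertex_feats, dim=-1)
    eye = torch.eye(V)

    # Within-class: discourage different vertices of the same class from collapsing together.
    within = 0.0
    for c in range(C):
        sim = f[c] @ f[c].t()
        within = within + ((sim - eye).clamp(min=0.0) ** 2).mean()

    # Between-class: only contrast vertices of class pairs that are frequently confused.
    between, n_pairs = 0.0, 0
    for a in range(C):
        for b in range(C):
            if a != b and confusion[a, b] > conf_thr:
                between = between + (f[a] @ f[b].t()).clamp(min=0.0).pow(2).mean()
                n_pairs += 1
    return within / C + between / max(n_pairs, 1)
```

Restricting the between-class term to confused pairs is what avoids the quadratic blow-up in the number of class comparisons as C grows.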
Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos
Chengbo Yuan
Institute for Interdisciplinary Information Sciences, Tsinghua University
Geng Chen
Shanghai Qi Zhi Institute
Li Yi
Institute for Interdisciplinary Information Sciences, Tsinghua University
Yang Gao
Institute for Interdisciplinary Information Sciences, Tsinghua University
Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce EgoMono4D, a novel model that unifies the estimation of multiple variables necessary for Egocentric Monocular 4D reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point-cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point-cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. Interactive visualizations, code, and trained models are released at https://egomono4d.github.io/.
WalkVLM: Aid Visually Impaired People Walking by Vision Language Model
Abstract
Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), applying VLMs to offer walking guidance has become popular. However, the existing methods of walking guidance are mainly based on self-curated question-answering datasets that are not publicly accessible, without a standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, making VLMs struggle due to excessive responses and low inference efficiency. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems that help visually impaired individuals walk. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for the blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs.
MOSCATO: Predicting Multiple Object State Change Through Actions
Parnian Zameni
Northeastern University
Yuhan Shen
Northeastern University
Ehsan Elhamifar
Northeastern University
Abstract
We introduce MOSCATO: a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. While prior work in object state prediction has typically focused on a single object undergoing one or a few state changes, real-world tasks require tracking many objects whose states evolve over multiple actions. Given the high cost of gathering framewise object-state labels for many videos, we develop a weakly-supervised multiple object state prediction framework, which only uses action labels during training. Specifically, we propose a novel Pseudo-Label Acquisition (PLA) pipeline that integrates large language models, vision-language models, and action segment annotations to generate fine-grained, per-frame object-state pseudo-labels for training a Multiple Object State Prediction (MOSP) network. We further devise a State-Action Interaction (SAI) module that explicitly models the correlations between actions and object states, thereby improving MOSP. To facilitate comprehensive evaluation, we create the MOSCATO benchmark by augmenting three egocentric video datasets with framewise object-state annotations. Experiments show that our multi-stage pseudo-labeling approach and SAI module significantly boost performance over zero-shot VLM baselines and naive extensions of existing methods, underscoring the importance of holistic action-state modeling for fine-grained procedural video understanding.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Tatiana Zemskova
AIRI
Dmitry Yudin
AIRI
Abstract
A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
Yuhui Zeng
Xiamen University
Haoxiang Wu
Xiamen University
Wenjie Nie
Xiamen University
Guangyao Chen
Peking University
Xiawu Zheng
Peking University
Yunhang Shen
Tencent Youtu Lab
Jun Peng
Xiamen University
Yonghong Tian
Peking University
Rongrong Ji
Xiamen University
Abstract
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism exploring relationship patterns among detected entities and (ii) an LLM-guided strategy that guides the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains. We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities (75% AUROC, +8.36% improvement), construction safety violations (+15.77%), and abnormal crowd behaviors (+23.16%). Code is available here.
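To illustrate how a discovered symbolic rule can turn detector outputs into an event decision, here is a toy sketch with a hand-written rule for the illegal-fishing example; in the paper such rules are found via LLM-guided symbolic regression, so the predicate, labels, and distance threshold below are purely hypothetical.

```python
# Toy example: evaluate a symbolic rule over open-vocabulary detector outputs.
from dataclasses import dataclass

@dataclass
class Det:
    label: str
    box: tuple  # (x1, y1, x2, y2)

def center(b):
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def near(a: Det, b: Det, max_dist=100.0):
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < max_dist

def illegal_fishing_rule(dets):
    """Fires when a person and a fishing net are both close to the same boat (hypothetical rule)."""
    boats = [d for d in dets if d.label == "boat"]
    people = [d for d in dets if d.label == "person"]
    nets = [d for d in dets if d.label == "net"]
    return any(near(p, b) and near(n, b) for b in boats for p in people for n in nets)

dets = [Det("boat", (0, 0, 80, 40)), Det("person", (60, 10, 90, 60)), Det("net", (30, 20, 70, 50))]
print(illegal_fishing_rule(dets))  # True
```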
GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
Lin Zeng
Zhejiang University
Boming Zhao
Zhejiang University
Jiarui Hu
Zhejiang University
Xujie Shen
Zhejiang University
Ziqiang Dang
Zhejiang University
Hujun Bao
Zhejiang University
Zhaopeng Cui
Zhejiang University
Abstract
Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. The experiments on the benchmark dataset demonstrate that our method achieves superior and real-time rendering with the capability of visualizing changes over different times. Please refer to our project webpage for more information: https://zju3dv.github.io/GaussianUpdate.
AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction
Xuying Zhang
Nankai University
Yupeng Zhou
Nankai University
Kai Wang
Nankai University
Yikai Wang
Tsinghua University
Zhen Li
Nankai University
Shaohui Jiao
ByteDance Inc.
Daquan Zhou
ByteDance Inc.
Qibin Hou
Nankai University
Ming-Ming Cheng
Nankai University
Abstract
Novel view synthesis (NVS) is a cornerstone for image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority, based on our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
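A schematic sketch of the next-view prediction loop follows: target poses are ordered by angular distance to the input view and synthesized one at a time, with previously generated views appended to the conditioning context; `generate_view` is a placeholder for the conditional diffusion call and is not a real API.

```python
# Near-to-far autoregressive view generation (schematic).
import numpy as np

def angular_distance(R_a, R_b):
    """Rotation angle between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def generate_views_near_to_far(input_img, input_pose, target_poses, generate_view):
    """target_poses: list of 3x3 rotations; generate_view(context, pose) is a placeholder."""
    order = sorted(range(len(target_poses)),
                   key=lambda i: angular_distance(input_pose, target_poses[i]))
    context = [(input_img, input_pose)]          # generated views accumulate as context
    outputs = [None] * len(target_poses)
    for i in order:
        view = generate_view(context, target_poses[i])
        outputs[i] = view
        context.append((view, target_poses[i]))  # closer views condition farther ones
    return outputs
```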
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
Youliang Zhang
Tsinghua University
Ronghui Li
Tsinghua University
Yachao Zhang
Xiamen University
Liang Pan
The University of Hong Kong
Jingbo Wang
Shanghai AI Laboratory
Yebin Liu
Tsinghua University
Xiu Li
Tsinghua University
Abstract
Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to some flawed motion clips in video-based motion capture results and the inherent complexity in modeling high-difficulty motions. Therefore, recognizing the advantage of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video masks to repair flawed motions, and we propose a physics-based motion transfer module (PTM), which employs a prior-injected pretrain-and-adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine video motion capture results, and it also excels in motion generation tasks. Finally, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public datasets. Our project page is: https://physicalmotionrestoration.github.io/
AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
Ruifei Zhang
The Chinese University of Hong Kong, Shenzhen
Junlin Xie
The Chinese University of Hong Kong, Shenzhen
Wei Zhang
Baidu Inc.
Weikai Chen
Guangdong Key Laboratory of Big Data Analysis and Processing
Xiao Tan
Baidu Inc.
Xiang Wan
Shenzhen Research Institute of Big Data
Guanbin Li
Sun Yat-sen University
Abstract
Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency. Code is available at https://github.com/ReaFly/AdaDrive.
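The slow-fast gating idea can be sketched roughly as below, where a scalar gate derived from scene complexity and planner confidence decides whether the LLM branch is invoked and how strongly its plan is blended in; the gating features, sigmoid form, and blend rule are illustrative assumptions, not AdaDrive's learned activation loss.

```python
# A hedged sketch of adaptive LLM activation and continuous plan fusion.
import torch

def adaptive_fuse(fast_plan, scene_complexity, planner_conf, llm_fn, act_thr=0.5):
    """fast_plan: (T, 2) trajectory from the conventional (fast) planner.
       scene_complexity, planner_conf: floats in [0, 1].
       llm_fn: placeholder callable for the slow LLM branch."""
    gate = torch.sigmoid(torch.as_tensor(4.0 * (scene_complexity - planner_conf),
                                         dtype=torch.float32))  # higher = more LLM help
    if gate.item() < act_thr:
        return fast_plan                        # skip the LLM entirely in easy scenes
    llm_plan = llm_fn(fast_plan)                # slow branch, invoked only when needed
    return (1.0 - gate) * fast_plan + gate * llm_plan
```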
Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
Xiaojie Zhang
Beijing University of Posts and Telecommunications
Yuanfei Wang
Peking University
Ruihai Wu
Peking University
Kunqi Xu
Peking University
Yu Li
Beijing University of Posts and Telecommunications
Liuyu Xiang
Beijing University of Posts and Telecommunications
Hao Dong
Peking University
Zhaofeng He
Beijing University of Posts and Telecommunications
Abstract
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG's strong generalization ability across novel articulated object categories.
Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
Runmin Zhang
Zhejiang University
Zhu Yu
Zhejiang University
Si-Yuan Cao
Ningbo Global Innovation Center, Zhejiang University
Lingyu Zhu
City University of Hong Kong
Guangyi Zhang
Zhejiang University
Xiaokai Bai
Zhejiang University
Hui-Liang Shen
Zhejiang University
Abstract
This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
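As a minimal sketch of the sparse volume construction step, one could predict a per-voxel occupancy probability and refine only the top-scoring fraction of voxels; the shapes, linear occupancy head, and keep ratio below are assumptions for illustration.

```python
# Select high-occupancy voxels for refinement; free space is skipped.
import torch

def select_voxels_for_refinement(voxel_feats, occ_head, keep_ratio=0.1):
    """voxel_feats: (N, C) features of N voxels in the 3D volume.
       occ_head:    module mapping (N, C) -> (N, 1) occupancy logits."""
    occ_prob = torch.sigmoid(occ_head(voxel_feats)).flatten()       # (N,)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    keep = torch.topk(occ_prob, k).indices
    return keep, voxel_feats[keep]                                  # refine only these voxels

# Example with a simple linear occupancy head.
feats = torch.randn(1000, 64)
head = torch.nn.Linear(64, 1)
idx, sparse_feats = select_voxels_for_refinement(feats, head)
```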
Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
Qingwang Zhang
Shenzhen University
Yingying Zhu
Shenzhen University
Abstract
This paper addresses the limitations of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These 'rectangular shackles' inherently struggle to precisely define objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which are critical for applications like urban planning and agricultural monitoring. We introduce the CVOGL-Seg dataset specifically to support and evaluate the new CVOS scheme. To tackle CVOS challenges, we propose Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained by learning a heterogeneous task. Second, the SAM Prompt Stage (SPS) utilizes SAM's zero-shot segmentation capability, guided by HTTS outputs, to generate precise masks. Extensive experiments on both the CVOGL and CVOGL-Seg datasets demonstrate that our approach achieves state-of-the-art performance, effectively breaking the rectangular shackles and unlocking new possibilities for fine-grained object geo-localization. Our project page: https://zqwlearning.github.io/CVOS.
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang
South China University of Technology
Fagui Liu
Pengcheng Laboratory
Quan Tang
Pengcheng Laboratory
Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features' spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvements across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Codes are available at: https://github.com/zdk258/CorrCLIP.
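The idea of reconstructing the scope and value of patch correlations can be sketched as follows: SAM-style region ids restrict which patches may interact, while self-supervised features supply the similarity values; this is an illustrative re-implementation of the idea, not the released CorrCLIP code.

```python
# Restrict patch correlations to SAM regions, with values from self-supervised features.
import torch
import torch.nn.functional as F

def restricted_patch_correlation(ssl_feats, region_ids, tau=0.1):
    """ssl_feats:  (N, D) self-supervised (e.g., DINO-like) patch features.
       region_ids: (N,)   long tensor, id of the SAM region each patch falls in."""
    f = F.normalize(ssl_feats, dim=-1)
    sim = f @ f.t() / tau                                 # similarity values
    same_region = region_ids.unsqueeze(0) == region_ids.unsqueeze(1)
    sim = sim.masked_fill(~same_region, float('-inf'))    # restrict the interaction scope
    return torch.softmax(sim, dim=-1)                     # row-normalized correlation

# The resulting correlation matrix can then aggregate CLIP value features,
# e.g. out = restricted_patch_correlation(dino_feats, ids) @ clip_values.
```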
Detect Anything 3D in the Wild
Hanxue Zhang
OpenDriveLab at Shanghai AI Laboratory
Haoran Jiang
Fudan University
Qingsong Yao
Stanford University
Yanan Sun
OpenDriveLab at Shanghai AI Laboratory
Renrui Zhang
CUHK MMLab
Hao Zhao
Tsinghua University
Hongyang Li
OpenDriveLab at Shanghai AI Laboratory
Hongzi Zhu
Shanghai Jiao Tong University
Zetong Yang
OpenDriveLab at Shanghai AI Laboratory
Abstract
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which stabilizes early training in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at our code repository.
DiffPCI: Large Motion Point Cloud frame Interpolation with Diffusion Model
Tianyu Zhang
Nankai University
Haobo Jiang
Nanyang Technological University
Jian Yang
Nankai University
Jin Xie
Nanjing University
Abstract
Point cloud interpolation aims to recover intermediate frames for temporally smoothing a point cloud sequence. However, real-world challenges, such as uneven or large scene motions, cause existing methods to struggle with limited interpolation precision. To address this, we introduce DiffPCI, a novel diffusion interpolation model that formulates the frame interpolation task as a progressive denoising diffusion process. Training DiffPCI involves two key stages: a forward interpolation diffusion process and a reverse interpolation denoising process. In the forward process, the clean intermediate frame is progressively transformed into a noisy one through continuous Gaussian noise injection. The reverse process then focuses on training a denoiser to gradually refine this noisy frame back to the ground-truth frame. In particular, we derive a point cloud interpolation-specific variational lower bound as our optimization objective for denoiser training. Furthermore, to alleviate the interpolation error, especially in highly dynamic scenes, we develop a novel full-scale, dual-branch denoiser that enables more comprehensive front-back frame information fusion for robust bi-directional interpolation. Extensive experiments demonstrate that DiffPCI significantly outperforms current state-of-the-art frame interpolation methods (e.g., 27% and 860% reductions in the Chamfer Distance and Earth Mover's Distance on nuScenes).
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
Shengyuan Zhang
Zhejiang University
An Zhao
Zhejiang University
Ling Yang
Peking University
Zejian Li
Zhejiang University
Chenye Meng
Zhejiang University
Haoran Xu
Zhejiang Green Zhixing Technology co., ltd
Tianrun Chen
Zhejiang University
AnYang Wei
Zhejiang Green Zhixing Technology co., ltd
Perry Pengyun GU
Zhejiang Green Zhixing Technology co., ltd
Lingyun Sun
Zhejiang University
Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame (>5x) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our model and code are publicly available on https://github.com/happyw1nd/ScoreLiDAR.
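A hedged sketch of a two-term structural loss is given below, with a scene-wise Chamfer term over the full completion and a point-wise term matching the pairwise configuration of landmark points; the landmark selection, Chamfer form, and weights are assumptions rather than the exact ScoreLiDAR loss.

```python
# Illustrative scene-wise + point-wise structural loss for distillation.
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets (Na, 3) and (Nb, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def structural_loss(student_pts, teacher_pts, landmark_idx, w_scene=1.0, w_point=0.5):
    """student_pts, teacher_pts: (N, 3) completed scenes from distilled / teacher model.
       landmark_idx: indices of key landmark points (shared ordering assumed)."""
    scene_term = chamfer(student_pts, teacher_pts)          # holistic structure
    s, t = student_pts[landmark_idx], teacher_pts[landmark_idx]
    # Relative configuration: match pairwise distances among landmarks.
    point_term = (torch.cdist(s, s) - torch.cdist(t, t)).abs().mean()
    return w_scene * scene_term + w_point * point_term
```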
EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
Pengjie Zhang
Beijing Institute of Technology
Lin Zhu
Beijing Institute of Technology
Xiao Wang
Anhui University
Lizhi Wang
Beijing Normal University
Hua Huang
Beijing Normal University
Abstract
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching, with many specialized architectures. However, existing works only focus on event data within the confines of task-specific domains, overlooking the correlations between tasks across the temporal and spatial domains. In this paper, we propose a novel matching-based framework for event cameras to estimate flow and disparity simultaneously in a shared representation space, reformulating them as a unified pixel-wise correspondence matching problem. Specifically, our method utilizes a Temporal Recurrent Network to aggregate asynchronous event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared pixel-wise feature similarities module, our network performs optical flow estimation from temporal event segments and stereo matching from spatial event segments simultaneously. Our unified model inherently supports multi-task unification and cross-task transfer, which facilitates training and streamlines deployment. Without the need for retraining on specific tasks, our model can effectively handle both event-based flow and stereo estimation, achieving state-of-the-art performance on both tasks. Our code is publicly available at https://github.com/BIT-Vision/EMatch.
Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration
Sitao Zhang
The Pennsylvania State University
Hongda Mao
Amazon
Qingshuang Chen
Amazon
Yelin Kim
Amazon
Abstract
Visual place recognition is crucial for autonomous navigation and robotic mapping. Current methods struggle with perceptual aliasing and computational inefficiency. We present SemVPR, a novel approach integrating multimodal semantic knowledge into VPR. By leveraging a pre-trained vision-language model as a teacher during the training phase, SemVPR learns local visual and semantic descriptors simultaneously, effectively mitigating perceptual aliasing through semantic-aware aggregation without extra inference cost. The proposed nested descriptor learning strategy generates a series of ultra-compact global descriptors, reduced by approximately 66x compared to state-of-the-art methods, in a coarse-to-fine manner, eliminating the need for offline dimensionality reduction or training multiple models. Extensive experiments across various VPR benchmarks demonstrate that SemVPR consistently outperforms state-of-the-art methods with significantly lower computational costs, rendering its feasibility for latency-sensitive scenarios in real-world applications.
Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
Mingfang Zhang
The University of Tokyo
Ryo Yonetani
CyberAgent AI Lab
Yifei Huang
The University of Tokyo
Liangyang Ouyang
The University of Tokyo
Ruicong Liu
The University of Tokyo
Yoichi Sato
The University of Tokyo
Abstract
This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions captured by the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment with vision-language guidance. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. The learning process is enhanced using concurrently collected vision and language signals to improve multimodal alignment. The learned encoders are then used to reason over the IMU data and the point cloud across time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines. Project page: https://github.com/mf-zhang/Ego-Inertial-Localization.
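The contrastive alignment step could be sketched as a symmetric InfoNCE between paired IMU action-window embeddings and local point-cloud embeddings, as below; the encoders, batching, and temperature are placeholders, not the authors' implementation.

```python
# Symmetric InfoNCE aligning IMU action windows with local point-cloud regions.
import torch
import torch.nn.functional as F

def imu_pointcloud_infonce(imu_emb, pc_emb, tau=0.07):
    """imu_emb, pc_emb: (B, D) paired embeddings
       (the i-th IMU window was recorded at the i-th local region)."""
    a = F.normalize(imu_emb, dim=-1)
    b = F.normalize(pc_emb, dim=-1)
    logits = a @ b.t() / tau                        # (B, B) similarity matrix
    target = torch.arange(a.shape[0])               # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))
```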
Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
Haihao Zhang
Institute of Information Engineering, Chinese Academy of Sciences
Yunjian Zhang
Tsinghua University
Jianing Li
Tsinghua University
Lin Zhu
Beijing Institute of Technology
Meng Lv
Beijing Institute of Technology
Yao Zhu
Tsinghua University
Yanwei Liu
Tsinghua University
Xiangyang Ji
Tsinghua University
Abstract
Accurate stereo matching under fast motion and extreme lighting conditions is a challenge for many vision applications. Event cameras have the advantages of low latency and high dynamic range, thus providing a reliable solution to this challenge. However, since events are sparse, obtaining dense disparity using only events is an ill-posed problem. In this work, we propose a novel framework for event-based dense stereo via cross-sensor knowledge distillation. Specifically, a multi-level intensity-to-event distillation strategy is designed to maximize the potential of long-range information, local texture details, and task-related knowledge of the intensity images. Simultaneously, to enforce cross-view consistency, an intensity-event joint left-right consistency module is proposed. With our framework, extensive dense and structural information contained in intensity images is distilled to the event branch. Therefore, retaining only the events can predict dense disparities during inference, preserving the low latency characteristics of the events. Adequate experiments conducted on the MVSEC and DSEC datasets demonstrate that our method exhibits superior stereo matching performance compared to baselines, both quantitatively and qualitatively.
Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention
Shiwei Zhang
Xi'an Jiaotong University
Qi Zhou
Xi'an Jiaotong University
Wei Ke
Pengcheng Laboratory
Abstract
Text-guided zero-shot object counting leverages vision-language models (VLMs) to count objects of an arbitrary class given by a text prompt. Existing approaches for this challenging task only utilize local patch-level features to fuse with the text feature, ignoring the important influence of the global image-level feature. In this paper, we propose a universal strategy that can exploit both local patch-level features and the global image-level feature simultaneously. Specifically, to improve the localization ability of VLMs, we propose Text-guided Local Ranking. Based on the prior knowledge that foreground patches have higher similarity with the text prompt, a new local-text rank loss is designed to increase the differences between the similarity scores of foreground and background patches, pushing foreground and background patches apart. To enhance the counting ability of VLMs, Number-evoked Global Attention is introduced to first align the global image-level feature with multiple number-conditioned text prompts. Then, the one with the highest similarity is selected to compute cross-attention with the global image-level feature. Through extensive experiments on widely used datasets and methods, the proposed approach has demonstrated superior advancements in performance, generalization, and scalability. Furthermore, to better evaluate text-guided zero-shot object counting methods, we propose a dataset named ZSC-8K, which is larger and more challenging, to establish a new benchmark. Codes and dataset are released at https://github.com/zaqai/LGCount.
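The two ingredients can be sketched as follows: a margin-based local-text rank loss that pushes foreground patch-text similarities above background ones, and a selection of the number-conditioned prompt most similar to the global feature; the margin value and the foreground/background split are assumptions, not the paper's exact formulation.

```python
# Illustrative local-text ranking loss and number-prompt selection.
import torch
import torch.nn.functional as F

def local_text_rank_loss(patch_feats, text_feat, fg_mask, margin=0.2):
    """patch_feats: (N, D) patch-level features; text_feat: (D,); fg_mask: (N,) bool."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)   # (N,)
    fg, bg = sim[fg_mask], sim[~fg_mask]
    # Every foreground patch should beat every background patch by the margin.
    diff = margin - (fg.unsqueeze(1) - bg.unsqueeze(0))
    return diff.clamp(min=0.0).mean()

def select_number_prompt(global_feat, number_text_feats):
    """Pick the number-conditioned prompt closest to the global image feature."""
    sim = F.normalize(number_text_feats, dim=-1) @ F.normalize(global_feat, dim=-1)
    return int(sim.argmax())
```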
Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation
Shaobo Zhang
Northwest University
Yuhang Huang
National University of Defense Technology
Wanqing Zhao
Northwest University
Wei Zhao
Xidian University
Ziyu Guan
Northwest University
Jinye Peng
Northwest University
Abstract
This paper introduces EA6D, a novel diffusion-based framework for 6D pose estimation that operates effectively in any environment. Traditional pose estimation methods struggle with the variability and complexity of real-world scenarios, often leading to overfitting on controlled datasets and poor generalization to new scenes. To address these challenges, we propose a generative pose estimation paradigm that generates environment-independent object representations for pose estimation, which are robust to environmental variations such as illumination, occlusion, and background clutter. Specifically, we propose the novel Environment Decoupling Diffusion Model (EDDM), which separates object representations from environmental factors while enabling efficient few-step sampling by leveraging input image priors instead of pure noise initialization. We validate our approach on four standard benchmarks and a self-made dataset, DiverseScenes. The results demonstrate that EA6D, trained using only synthetic data, can outperform the state-of-the-art methods with both synthetic and realistic data. In particular, for fair comparisons with synthetic data, we exceed the previous SOTA by 18.1% and 33.5% on the Linemod and Linemod-Occluded datasets respectively. Project page: https://github.com/acmff22/EA6D
Epona: Autoregressive Diffusion World Model for Autonomous Driving
Kaiwen Zhang
Horizon Robotics
Zhenyu Tang
Tsinghua University
Xiaotao Hu
Tsinghua University
Xingang Pan
Nanyang Technological University
Xiaoyang Guo
Horizon Robotics
Yuan Liu
Hong Kong University of Science and Technology
Jingwei Huang
Tencent Hunyuan
Li Yuan
Shenzhen Graduate School, Peking University
Qian Zhang
Horizon Robotics
Xiao-Xiao Long
Nanjing University
Xun Cao
Nanjing University
Wei Yin
Horizon Robotics
Abstract
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with a 7.4% FVD improvement and minutes-longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks.
Function-centric Bayesian Network for Zero-Shot Object Goal Navigation
Sixian Zhang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Xinyao Yu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Xinhang Song
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Yiyao Wang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Shuqiang Jiang
Institute of Intelligent Computing Technology, Suzhou
Abstract
Object goal navigation requires an agent to navigate to a specified target in unseen environments without an explicit map, which demands an understanding of object-scene context to infer the target's location based on partial observations. The function of an object plays a crucial role in its categorization and naming. Analyzing an object's functional role within a given scene enhances the understanding of its contextual relationships, thereby aiding in goal inference. In this paper, we propose the Function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. FBN is designed to uncover the functions that observed objects afford individually or collaboratively with other objects, as well as the functional semantics contained within the observed scenes. The probabilistic directed edges in FBN describe the object-function and scene-function relationships, which are derived by prompting LLMs with the proposed CounterfactCoT. Leveraging FBN with Bayesian inference, the probability of each function group and the probability map of goal occurrence are computed. The waypoint is then selected based on the obtained probability map. Experiments on MP3D and HM3D demonstrate that FBN effectively captures object-scene-function relationships and improves zero-shot ObjectNav performance.
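A much-simplified sketch of the Bayesian aggregation is shown below: object-function and scene-function evidence is combined into a posterior over function groups, which then scores how likely the goal is to co-occur at a candidate location; all probability tables here are hypothetical placeholders (in the paper they come from prompting LLMs with CounterfactCoT).

```python
# Simplified function-posterior and goal-scoring sketch.
def function_posterior(obj_func, scene_func, observed_objs, observed_scene):
    """obj_func:   dict object -> {function: p(function | object)}.
       scene_func: dict scene  -> {function: p(function | scene)}."""
    scores = {}
    for obj in observed_objs:
        for fn, p in obj_func.get(obj, {}).items():
            scores[fn] = scores.get(fn, 0.0) + p
    for fn, p in scene_func.get(observed_scene, {}).items():
        scores[fn] = scores.get(fn, 0.0) + p
    total = sum(scores.values()) or 1.0
    return {fn: s / total for fn, s in scores.items()}  # posterior over function groups

def goal_score(posterior, goal_func):
    """goal_func: {function: p(goal co-occurs with function)}."""
    return sum(posterior.get(fn, 0.0) * p for fn, p in goal_func.items())

post = function_posterior({"oven": {"cooking": 0.9}},
                          {"kitchen": {"cooking": 0.8, "washing": 0.2}},
                          ["oven"], "kitchen")
print(goal_score(post, {"cooking": 0.7}))
```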
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Mengchen Zhang
Zhejiang University
Tong Wu
Stanford University
Jing Tan
The Chinese University of Hong Kong
Ziwei Liu
Nanyang Technological University
Gordon Wetzstein
Stanford University
Dahua Lin
The Chinese University of Hong Kong
Abstract
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.
Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
Ruiyang Zhang
University of Macau, China
Hu Zhang
CSIRO Data61, Australia
Zhedong Zheng
University of Macau, China
Abstract
Unsupervised 3D object detection aims to identify objects of interest from unlabeled raw data, such as LiDAR points. Recent approaches usually adopt pseudo 3D bounding boxes (3D bboxes) from clustering algorithms to initialize the model training. However, pseudo bboxes inevitably contain noise, and such inaccuracies accumulate in the final model, compromising the performance. Therefore, in an attempt to mitigate the negative impact of inaccurate pseudo bboxes, we introduce a new uncertainty-aware framework for unsupervised 3D object detection, dubbed UA3D. In particular, our method consists of two phases: uncertainty estimation and uncertainty regularization. (1) In the uncertainty estimation phase, we incorporate an extra auxiliary detection branch alongside the original primary detector. The prediction disparity between the primary and auxiliary detectors could reflect fine-grained uncertainty at the box coordinate level. (2) Based on the assessed uncertainty, we adaptively adjust the weight of every 3D bbox coordinate via uncertainty regularization, refining the training process on pseudo bboxes. For pseudo bbox coordinates with high uncertainty, we assign a relatively low loss weight. Extensive experiments verify that UA3D is robust against the noisy pseudo bboxes, yielding substantial improvements on nuScenes and Lyft compared to existing approaches, with increases of +3.9% AP_BEV and +1.5% AP_3D on nuScenes, and +2.3% AP_BEV and +1.8% AP_3D on Lyft.
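The following is a minimal sketch, under stated assumptions, of the uncertainty-regularized loss the abstract describes: the per-coordinate gap between the primary and auxiliary detectors is turned into a soft weight that down-weights noisy pseudo-box coordinates. The function name and tensor layout are illustrative, not the official UA3D implementation.

```python
# Minimal sketch (assumptions, not the official UA3D code): per-coordinate
# uncertainty from the primary/auxiliary prediction gap, used to down-weight
# the regression loss on noisy pseudo-box coordinates.
import torch
import torch.nn.functional as F

def uncertainty_weighted_box_loss(primary_pred, auxiliary_pred, pseudo_boxes,
                                  temperature=1.0):
    """
    primary_pred, auxiliary_pred, pseudo_boxes: (N, 7) tensors of
    (x, y, z, w, l, h, yaw) box parameters.
    """
    # Coordinate-level uncertainty: disagreement between the two detectors.
    uncertainty = (primary_pred - auxiliary_pred).abs().detach()
    # Higher uncertainty -> lower weight (softly, via a negative exponential).
    weights = torch.exp(-uncertainty / temperature)
    per_coord_loss = F.smooth_l1_loss(primary_pred, pseudo_boxes, reduction="none")
    return (weights * per_coord_loss).mean()
```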
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Wenyao Zhang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Hongsi Liu
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Bohan Li
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Jiawei He
CASIA
Zekun Qi
Tsinghua University
Yunnan Wang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Shengyang Zhao
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xinqiang Yu
CASIA
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Abstract
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.
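A minimal sketch of the close-vs-distant proxy task mentioned above, assuming a CLIP-style contrastive formulation in which patch features are aligned with "close"/"far" text embeddings; the exact loss used in Hybrid-depth may differ.

```python
# Minimal sketch (assumed formulation per the abstract, not the official code):
# a close-vs-distant proxy task that aligns patch features with "close"/"far"
# text embeddings from CLIP-style encoders.
import torch
import torch.nn.functional as F

def close_distant_proxy_loss(close_patch_feat, far_patch_feat,
                             text_close, text_far, temperature=0.07):
    """close/far patch features: (B, D); text embeddings: (D,)."""
    feats = torch.stack([close_patch_feat, far_patch_feat], dim=1)   # (B, 2, D)
    texts = torch.stack([text_close, text_far], dim=0)               # (2, D)
    feats = F.normalize(feats, dim=-1)
    texts = F.normalize(texts, dim=-1)
    logits = feats @ texts.t() / temperature                         # (B, 2, 2)
    # Close patches should match the "close" prompt (index 0), far ones index 1.
    target = torch.arange(2, device=logits.device).expand(logits.shape[0], 2)
    return F.cross_entropy(logits.reshape(-1, 2), target.reshape(-1))
```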
HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration
Xiyu Zhang
Northwestern Polytechnical University
Jiayi Ma
Wuhan University
Jianwei Guo
Chinese Academy of Sciences
Wei Hu
Peking University
Zhaoshuai Qi
Northwestern Polytechnical University
Fei Hui
Chang'an University
Jiaqi Yang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, posing great challenges for handcrafted geometric constraints to render consistency. To overcome this, we propose HyperGCT, a flexible dynamic Hyper-GNN-learned geometric ConstrainT that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, HyperGCT is robust to graph noise, demonstrating a significant advantage in terms of generalization.
KinMo: Kinematic-aware Human Motion Understanding and Generation
Pengfei Zhang
University of California, Irvine
Pinxin Liu
University of Rochester
Pablo Garrido
Flawless AI
Hyeongwoo Kim
Imperial College, London
Bindita Chaudhuri
Flawless AI
Abstract
Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as 'run', fails to capture details such as variations in speed, limb positioning, and kinematic dynamics, leading to ambiguities between text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical describable motion representation that extends beyond global actions by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset and offering a scalable and cost-efficient solution for dataset enrichment. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment that progressively integrates additional motion details, thereby improving semantic motion understanding. Furthermore, we introduce a coarse-to-fine motion generation procedure to leverage enhanced spatial understanding to improve motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance and enabling more fine-grained motion generation and editing capabilities. Project Page: https://andypinxinliu.github.io/KinMo
Learning Beyond Still Frames: Scaling Vision-Language Models with Video
Yiyuan Zhang
MMLab, CUHK
Handong Li
School of Artificial Intelligence, UCAS
Jing Liu
School of Artificial Intelligence, UCAS
Xiangyu Yue
MMLab, CUHK
Abstract
High-quality image-text data is critical for Vision-Language Models (VLMs), yet traditional image-based pretraining is resource-intensive and fails to capture the temporal dynamics needed for video understanding. To address this, we introduce video pretraining to enhance VLMs with temporal reasoning. We propose Causal Hierarchical Aggregation, a novel method that efficiently processes video by separating computationally heavy spatial encoding from lightweight temporal propagation. This technique builds hierarchical receptive fields, enabling effective learning from large-scale video data. Scaling our method to over 100 billion video tokens, we achieve state-of-the-art performance and high throughput on both image and video understanding tasks (Figure 1). Our approach offers a scalable solution to advance multimodal learning for dynamic contexts. Our code and pretrained models will be released at https://github.com/invictus717/LLaVAPrime.
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
Xinjie Zhang
iComAI Lab, The Hong Kong University of Science and Technology
Zhening Liu
iComAI Lab, The Hong Kong University of Science and Technology
Yifan Zhang
Skywork AI
Xingtong Ge
iComAI Lab, The Hong Kong University of Science and Technology
Dailan He
The Chinese University of Hong Kong
Tongda Xu
Institute for AI Industry Research (AIR), Tsinghua University
Yan Wang
Institute for AI Industry Research (AIR), Tsinghua University
Zehong Lin
iComAI Lab, The Hong Kong University of Science and Technology
Shuicheng Yan
National University of Singapore
Jun Zhang
iComAI Lab, The Hong Kong University of Science and Technology
Abstract
4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190x and 125x on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field. Code is available at https://github.com/Xinjie-Q/MEGA.
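As a sketch of the opacity-based entropy regularizer described above (an assumed formulation, not the released MEGA code), binary entropy on per-Gaussian opacities pushes them toward 0 or 1, so near-transparent Gaussians can be pruned to limit the Gaussian count.

```python
# Minimal sketch, assuming an opacity-entropy regularizer of the kind the
# abstract describes (not the authors' exact formulation): push per-Gaussian
# opacities toward 0 or 1 so that near-transparent Gaussians can be pruned.
import torch

def opacity_entropy_loss(opacity_logits, eps=1e-6):
    """opacity_logits: (N,) raw per-Gaussian opacity parameters."""
    a = torch.sigmoid(opacity_logits).clamp(eps, 1.0 - eps)
    # Binary entropy is maximal at 0.5 and zero at {0, 1}; minimizing it
    # discourages "half-transparent" Gaussians that inflate the model size.
    return -(a * a.log() + (1.0 - a) * (1.0 - a).log()).mean()

def prune_mask(opacity_logits, threshold=0.05):
    """Gaussians whose opacity collapsed below the threshold can be dropped."""
    return torch.sigmoid(opacity_logits) > threshold
```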
Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
Jiahao Zhang
The Australian National University
Anoop Cherian
Mitsubishi Electric Research Labs
Cristian Rodriguez
The Australian Institute for Machine Learning
Weijian Deng
The Australian National University
Stephen Gould
The Australian National University
Abstract
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, thus making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order, and infers the 6D pose of each part by relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Pingrui Zhang
Fudan University
Xianqiang Gao
Shanghai AI Laboratory
Yuhan Wu
University of Science and Technology of China
Kehui Liu
Northwestern Polytechnical University
Dong Wang
Shanghai AI Laboratory
Zhigang Wang
Shanghai AI Laboratory
Bin Zhao
Northwestern Polytechnical University
Yan Ding
Shanghai AI Laboratory
Xuelong Li
TeleAI, China Telecom Corp Ltd
Abstract
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: https://momakitchen.github.io/.
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
Songyan Zhang
Nanyang Technological University, Singapore
Yongtao Ge
Zhejiang University, China
Jinyuan Tian
Zhejiang University, China
Guangkai Xu
Zhejiang University, China
Hao Chen
Zhejiang University, China
Chen Lv
Nanyang Technological University, Singapore
Chunhua Shen
Zhejiang University, China
Abstract
Recent approaches to 3D reconstruction in dynamic scenes primarily rely on the integration of separate geometry estimation and matching modules, where the latter plays a critical role in distinguishing dynamic regions and mitigating the interference caused by moving objects. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the pointmap representation proposed in DUSt3R has suggested a potential solution to unify both geometry estimation and matching in 3D space, effectively reducing computational overhead by eliminating the need for redundant auxiliary modules. However, it still struggles with ambiguous correspondences in dynamic regions, which limits reconstruction performance in such scenarios. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying POintmap MAtching with Temporal mOtion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in 3D reconstruction tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of our proposed POMATO by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Yan Zhang
Meshcapade
Yao Feng
Meshcapade
Alpár Cseke
Meshcapade
Nitin Saini
Meshcapade
Nathan Bajandas
Meshcapade
Nicolas Heron
Meshcapade
Michael J. Black
Max Planck Institute for Intelligent Systems, Tübingen
Abstract
We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
Jinhua Zhang
University of Electronic Science and Technology of China
Hualian Sheng
Independent Researcher
Sijia Cai
Independent Researcher
Bing Deng
Independent Researcher
Qiao Liang
Independent Researcher
Wen Li
University of Electronic Science and Technology of China
Ying Fu
Beijing Institute of Technology
Jieping Ye
Independent Researcher
Shuhang Gu
University of Electronic Science and Technology of China
Abstract
Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the innovative integration of controlling information and introduce PerLDiff (Perspective-Layout Diffusion Models), a novel method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results show that our PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.
PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
Hao Zhang
University of Illinois Urbana Champaign
Haolan Xu
University of Illinois Urbana Champaign
Chun Feng
University of Illinois Urbana Champaign
Varun Jampani
Stability AI
Narendra Ahuja
University of Illinois Urbana Champaign
Abstract
Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse [5], The Amazing Animals Zoo [35], and MixaMo [1], covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling. This project is available at https://physrig.github.io/.
PlaneRAS: Learning Planar Primitives for 3D Plane Recovery
Fang Zhang
Beijing University of Posts and Telecommunications, China
Wenzhao Zheng
Tsinghua University, China
Linqing Zhao
Tsinghua University, China
Zelan Zhu
Beijing University of Posts and Telecommunications, China
Jiwen Lu
Tsinghua University, China
Xiuzhuang Zhou
Beijing University of Posts and Telecommunications, China
Abstract
3D plane recovery from monocular images constitutes a fundamental task in indoor scene understanding. Recent methods formulate this problem as 2D pixel-level segmentation through convolutional networks or query-based architectures, which purely rely on 2D pixel features while neglecting the inherent 3D spatial nature of planar surfaces. To address this limitation, we propose an end-to-end Plane Reconstruction, Aggregation, and Splatting (PlaneRAS) framework that explicitly leverages 3D geometric reasoning combined with online planar primitive reconstruction. Our framework introduces two core components: 1) a reconstruction module utilizing customized planar primitives to compactly represent the 3D scene, and 2) a recovery module that aggregates local primitives to derive globally consistent plane instances. The proposed 3D-aware representation enables direct integration of pretrained geometric priors, significantly enhancing performance beyond conventional 2D-centric approaches. Extensive experiments on ScanNet and NYUv2 datasets demonstrate state-of-the-art results across various evaluation metrics, resulting from our explicit 3D geometric modeling and effective fusion of cross-dimensional features.
Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives
Ziyu Zhang
CASIA
Binbin Huang
The University of Hong Kong
Hanqing Jiang
SenseTime Research
Liyang Zhou
SenseTime Research
Xiaojun Xiang
SenseTime Research
Shuhan Shen
CASIA
Abstract
We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipses, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling, a metric misaligned with surface geometry under deformation, QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining efficient rendering via fast ray-quadric intersection. Experiments on DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.
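To make the geodesic-distance density idea concrete, the sketch below illustrates it on a 1D parabolic cross-section y = k x^2, whose arc length has a closed form; the full QGS formulation on general quadric surfaces is more involved, so this is only an assumption-level analogy, not the paper's implementation.

```python
# Minimal sketch under stated assumptions (not the QGS implementation):
# density weighting by geodesic rather than Euclidean distance, illustrated on
# a 1D parabolic section y = k * x**2, whose arc length has a closed form.
import numpy as np

def parabola_arc_length(x, k):
    """Closed-form arc length of y = k x^2 from the vertex to x."""
    if k == 0:
        return np.abs(x)
    u = 2.0 * k * x
    return (x * np.sqrt(1.0 + u * u)) / 2.0 + np.arcsinh(u) / (4.0 * k)

def splat_density(x, k, sigma):
    """Gaussian falloff in geodesic (surface) distance instead of |x|."""
    d_geo = parabola_arc_length(x, k)
    return np.exp(-0.5 * (d_geo / sigma) ** 2)

# For a flat primitive (k = 0) this reduces to the usual planar-surfel falloff.
```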
Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Shi-Chen Zhang
VCIP, CS, Nankai University
Yunheng Li
VCIP, CS, Nankai University
Yu-Huan Wu
IHPC, A*STAR, Singapore
Qibin Hou
Nankai University
Ming-Ming Cheng
Nankai University
Abstract
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters.
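A minimal sketch of the coupled dual-branch offset idea, with hypothetical module names rather than the OffSeg release: one branch offsets the spatial features, the other offsets the class embeddings conditioned on a global image summary, and the refined pair is classified by dot products.

```python
# Minimal sketch (hypothetical module, not the OffSeg code) of dual-branch
# offset learning: refine spatial features and class embeddings, then classify.
import torch
import torch.nn as nn

class OffsetSegHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.class_embed = nn.Parameter(torch.randn(num_classes, channels))
        self.feat_offset = nn.Conv2d(channels, channels, kernel_size=1)
        # Class offsets are conditioned on a global summary of the image.
        self.class_offset = nn.Linear(channels, num_classes * channels)
        self.num_classes = num_classes

    def forward(self, feats):                      # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        refined_feats = feats + self.feat_offset(feats)
        global_ctx = feats.mean(dim=(2, 3))        # (B, C)
        cls_off = self.class_offset(global_ctx).view(b, self.num_classes, c)
        refined_cls = self.class_embed.unsqueeze(0) + cls_off      # (B, K, C)
        # Per-pixel logits as dot products between refined classes and features.
        return torch.einsum("bkc,bchw->bkhw", refined_cls, refined_feats)
```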
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Kaidong Zhang
Sun Yat-sen University
Rongtao Xu
MBZUAI
Pengzhen Ren
Peng Cheng Laboratory
Junfan Lin
Peng Cheng Laboratory
Hefeng Wu
Sun Yat-sen University
Liang Lin
Sun Yat-sen University
Xiaodan Liang
Sun Yat-sen University
Abstract
Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a guided embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
Qi Zhang
College of Intelligence and Computing, Tianjin University, China
Chi Huang
College of Intelligence and Computing, Tianjin University, China
Qian Zhang
College of Intelligence and Computing, Tianjin University, China
Nan Li
College of Intelligence and Computing, Tianjin University, China
Wei Feng
College of Intelligence and Computing, Tianjin University, China
Abstract
The latest advancements in scene relighting have been predominantly driven by inverse rendering with 3D Gaussian Splatting (3DGS). However, existing methods remain overly reliant on densely sampled images under static illumination conditions, which is prohibitively expensive and even impractical in real-world scenarios. In this paper, we propose SU-RGS, a novel relightable 3D Gaussian Splatting framework that learns from Sparse views under Unconstrained illuminations, to address this challenge by jointly optimizing 3DGS representations, surface materials, and environment illuminations (i.e., unknown and various lighting conditions in training) using only sparse input views. Firstly, SU-RGS presents a varying appearance rendering strategy, enabling each 3D Gaussian to exhibit inconsistent colors under various lightings. Next, SU-RGS establishes multi-view semantic consistency by constructing hierarchical semantic pseudo-labels across inter-views, providing extra supervision and facilitating sparse inverse rendering under unconstrained illuminations. Additionally, we introduce an adaptive transient object perception component that integrates the scene geometry and semantics in a fine-grained manner to quantify and eliminate the uncertainty of the foreground. Extensive experiments on both synthetic and real-world challenging datasets demonstrate the effectiveness of SU-RGS, achieving state-of-the-art performance for scene inverse rendering by learning 3DGS from only sparse views under unconstrained illuminations.
Semantic-guided Camera Ray Regression for Visual Localization
Yesheng Zhang
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
Xu Zhao
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
Abstract
This work presents a novel framework for Visual Localization (VL), that is, regressing camera rays from query images to derive camera poses. As an overparameterized representation of the camera pose, camera rays possess superior robustness in optimization. Of particular importance, Camera Ray Regression (CRR) is privacy-preserving, rendering it a viable VL approach for real-world applications. Thus, we introduce DINO-based Multi-Mappers, coined DIMM, to achieve VL by CRR. DIMM utilizes DINO as a scene-agnostic encoder to obtain powerful features from images. To mitigate ambiguity, the features integrate both local and global perception, as well as potential geometric constraints. Then, a scene-specific mapper head regresses camera rays from these features. It incorporates a semantic attention module for soft fusion of multiple mappers, utilizing the rich semantic information in DINO features. In extensive experiments on both indoor and outdoor datasets, our method showcases impressive performance, revealing a promising direction for advancements in VL.
SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
Yingying Zhang
Ant Group
Lixiang Ru
Ant Group
Kang Wu
Wuhan University
Lei Yu
Ant Group
Lei Liang
Ant Group
Yansheng Li
Wuhan University
Jingdong Chen
Ant Group
Abstract
The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches generally require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.
SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations
Songchun Zhang
ZJU
Huiyao Xu
ZJU
Sitong Guo
ZJU
Zhongwei Xie
HKUST
Hujun Bao
ZJU
Weiwei Xu
ZJU
Changqing Zou
ZJU
Abstract
Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, despite recent progress, rely on dense multi-view observations, restricting their application. We tackle the task of reconstructing photorealistic 3D scenes from only one or a few input views. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes. Project page: https://franklinz233.github.io/projects/spatialcrafter/.
StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth
Zheng Zhang
The University of Hong Kong
Lihe Yang
The University of Hong Kong
Tianyu Yang
DAMO Academy, Alibaba Group
Chaohui Yu
DAMO Academy, Alibaba Group
Xiaoyang Guo
The Chinese University of Hong Kong
Yixing Lao
The University of Hong Kong
Hengshuang Zhao
The University of Hong Kong
Abstract
Recent advances in monocular depth estimation significantly improve robustness and accuracy. However, relative depth models exhibit flickering and 3D inconsistency in video data, limiting 3D reconstruction applications. We introduce StableDepth, a scene-consistent and scale-invariant depth estimation method achieving scene-level 3D consistency. Our dual-decoder architecture learns from large-scale unlabeled video data, enhancing generalization and reducing flickering. Unlike previous methods requiring full video sequences, StableDepth enables online inference at 13x faster speed, achieving significant improvements across benchmarks with comparable temporal consistency to video diffusion-based estimators.
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
Xuying Zhang
VCIP, CS, Nankai University
Yutong Liu
USTC
Yangguang Li
CUHK MMLab
Renrui Zhang
CUHK MMLab
Yufei Liu
Shanghai AI Lab
Kai Wang
VCIP, CS, Nankai University
Wanli Ouyang
CUHK MMLab
Zhiwei Xiong
USTC
Peng Gao
Shanghai AI Lab
Qibin Hou
VCIP, CS, Nankai University
Ming-Ming Cheng
VCIP, CS, Nankai University
Abstract
We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on several public 3D datasets demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
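The sketch below illustrates the next-part (next-token) prediction setup in a generic decoder-only transformer over VQ codebook indices with prefilled condition tokens; the positional embedding here is a plain learned table standing in for TriPE, and nothing else is taken from the TAR3D code.

```python
# Minimal sketch, assuming a generic decoder-only transformer over VQ codebook
# indices (an illustration of next-token training, not the TAR3D code).
import torch
import torch.nn as nn

class NextPartGPT(nn.Module):
    def __init__(self, vocab_size, dim=512, depth=6, heads=8, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)      # plain stand-in for TriPE
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx, cond):                  # idx: (B, T), cond: (B, Tc, dim)
        t = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(t, device=idx.device))
        x = torch.cat([cond, x], dim=1)            # prefill condition tokens
        s = x.shape[1]
        # Causal mask so each token only attends to earlier positions.
        mask = torch.triu(torch.full((s, s), float("-inf"), device=x.device), 1)
        x = self.blocks(x, mask=mask)
        return self.head(x[:, cond.shape[1]:])     # logits for shape tokens only

# Training: shift targets by one and apply cross-entropy on the shape tokens.
```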
Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Xiangdong Zhang
School of AI, Shanghai Jiao Tong University
Shaofeng Zhang
School of AI, Shanghai Jiao Tong University
Junchi Yan
School of AI, Shanghai Jiao Tong University
Abstract
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Recognizing that a two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% on three variants of ScanObjectNN with the MLP-LINEAR evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.
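A hypothetical sketch of the cross-reconstruction setup, following the abstract rather than the released code: two decoupled crops are generated, and each view is reconstructed from the other; the relative-position encoding and masking details are omitted for brevity.

```python
# Minimal sketch (hypothetical, not the Point-PQAE release): generate two
# decoupled crops of a point cloud and form the cross-reconstruction objective.
import torch

def random_crop(points, ratio=0.5):
    """points: (N, 3). Keep the `ratio` fraction nearest to a random center."""
    n = points.shape[0]
    center = points[torch.randint(n, (1,))]
    dist = (points - center).norm(dim=-1)
    keep = dist.argsort()[: int(n * ratio)]
    return points[keep]

def cross_reconstruction_step(points, encoder, decoder, chamfer):
    """Reconstruct view B from view A and vice versa (callables are user-supplied)."""
    view_a, view_b = random_crop(points), random_crop(points)
    # The relative position between the two crops would be encoded here (omitted).
    pred_b = decoder(encoder(view_a))
    pred_a = decoder(encoder(view_b))
    return chamfer(pred_b, view_b) + chamfer(pred_a, view_a)
```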
Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
Jiahao Zhang
College of Computer Science, Beijing University of Technology
Zongli Jiang
College of Computer Science, Beijing University of Technology
Jinli Zhang
College of Computer Science, Beijing University of Technology
Yixin Wei
College of Computer Science, Beijing University of Technology
Liang Li
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Yizheng Wang
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Gang Wang
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Abstract
Tracking flying drones in infrared videos is a crucial yet challenging task. Existing drone trackers and datasets have limitations in dealing with and characterizing tiny targets (≤20x20 pixels) against highly complex backgrounds. To tackle this issue, we have developed a large-scale benchmark for tiny drone tracking in infrared videos (TDTIV), which comprises 290k frames and 280k manually annotated bounding boxes. Unlike traditional trackers that primarily rely on appearance matching, we introduce a novel method called Motion-Centric Adaptive Tracking (MCATrack), which initially employs a magnocell-inspired motion response to enhance the local signal-to-noise ratio of tiny target regions while suppressing complex clutter. Moreover, we design a Dynamic Cross-Guided module that integrates both initial and updated target features to address pose variations in long-term tracking. This module captures the latest target information to generate highly relevant candidate regions and refines them through precise optimization to achieve more accurate tracking results. Extensive experiments performed on the TDTIV and the well-recognized Anti-UAV 410 datasets have demonstrated the superiority of MCATrack over state-of-the-art competing trackers. Code and dataset are available at https://github.com/zhangjiahao02/MCATrack.
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Xiao Zhang
Dalian University of Technology
Fei Wei
AMAP, Alibaba Group
Yong Wang
AMAP, Alibaba Group
Wenda Zhao
Dalian University of Technology
Feiyi Li
Dalian University of Technology
Xiangxiang Chu
AMAP, Alibaba Group
Abstract
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
Shiduo Zhang
Fudan University
Zhe Xu
Fudan University
Peiju Liu
Fudan University
Xiaopeng Yu
Fudan University
Yuan Li
Fudan University
Qinghui Gao
Fudan University
Zhaoye Fei
Fudan University
Zhangyue Yin
Fudan University
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Fudan University
Xipeng Qiu
Fudan University
Abstract
General-purpose embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream fine-tuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks. Codes and more videos are available at https://vlabench.github.io/.
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang
The Chinese University of Hong Kong, Shenzhen
Wei Zhang
Baidu Inc.
Xiao Tan
Baidu Inc.
Sibei Yang
Sun Yat-sen University
Xiang Wan
Shenzhen Research Institute of Big Data
Xiaonan Luo
Guilin University of Electronic Technology
Guanbin Li
Sun Yat-sen University
Abstract
Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
VertexRegen: Mesh Generation with Continuous Level of Detail
Xiang Zhang
UC San Diego
Yawar Siddiqui
Meta Reality Labs Research
Armen Avetisyan
Meta Reality Labs Research
Chris Xie
Meta Reality Labs Research
Jakob Engel
Meta Reality Labs Research
Henry Howard-Jenkins
Meta Reality Labs Research
Abstract
We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e. vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
Yafei Zhang
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Lingqi Kong
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Huafeng Li
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Jie Wen
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Abstract
To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model's ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method. Code is available at https://github.com/KongLingqi2333/WSL-VIReID.
DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion
Qingcheng Zhao
ShanghaiTech University
Xiang Zhang
UC San Diego
Haiyang Xu
UC San Diego
Zeyuan Chen
UC San Diego
Jianwen Xie
Lambda, Inc.
Yuan Gao
Stanford University
Zhuowen Tu
UC San Diego
Abstract
We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. Instead of reconstructing the entire scene holistically, DepR generates individual objects and subsequently composes them into a coherent 3D layout. Unlike previous methods that use depth solely for object layout estimation during inference and therefore fail to fully exploit its rich geometric information, DepR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into diffusion models. During inference, depth further guides DDIM sampling and layout optimization, enhancing alignment between the reconstruction and the input image. Despite being trained on limited synthetic data, DepR achieves state-of-the-art performance and demonstrates strong generalization in single-view scene reconstruction, as shown through evaluations on both synthetic and real-world datasets.
HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
Jiahe Zhao
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Ruibing Hou
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Zejie Tian
Communication University of China
Hong Chang
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Shiguang Shan
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Abstract
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models. Codes and data will be available at https://github.com/ZJHTerry18/HumanInScene.
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Guosheng Zhao
Institute of Automation, Chinese Academy of Sciences
Xiaofeng Wang
Institute of Automation, Chinese Academy of Sciences
Chaojun Ni
GigaAI
Zheng Zhu
GigaAI
Wenkang Qin
GigaAI
Guan Huang
GigaAI
Xingang Wang
Institute of Automation, Chinese Academy of Sciences
Abstract
Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
Tianyi Zhao
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Boyang Liu
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yanglei Gao
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yiming Sun
Southeast University
Maoxun Yuan
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xingxing Wei
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, it neglects the mono-modality insufficient-learning problem, which arises from the decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M2D-LIF, which consists of the Mono-Modality Distillation (M2D) method and the Local Illumination-aware Fusion (LIF) module. The M2D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M2D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.
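A minimal sketch, under assumptions, of the two ingredients named above: a mono-modality distillation loss against frozen single-modality teachers, and a toy local-illumination gate for fusion weights. The function names and the gating rule are illustrative, not the released M2D-LIF code.

```python
# Minimal sketch (hypothetical losses named after the abstract, not the
# released M2D-LIF code): distill each frozen mono-modality teacher into the
# corresponding branch of the jointly trained multi-modal detector.
import torch
import torch.nn.functional as F

def mono_modality_distillation(rgb_feat, ir_feat, rgb_teacher_feat, ir_teacher_feat):
    """All tensors: (B, C, H, W) backbone features; teachers are frozen."""
    loss_rgb = F.mse_loss(rgb_feat, rgb_teacher_feat.detach())
    loss_ir = F.mse_loss(ir_feat, ir_teacher_feat.detach())
    return loss_rgb + loss_ir

def illumination_weights(rgb_patch_luma):
    """Toy local-illumination gate: brighter regions lean on RGB, darker on IR."""
    w_rgb = torch.sigmoid((rgb_patch_luma - 0.5) * 10.0)
    return w_rgb, 1.0 - w_rgb
```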
Toward Material-Agnostic System Identification from Videos
Yizhou Zhao
Carnegie Mellon University
Haoyu Chen
Carnegie Mellon University
Chunjiang Liu
Carnegie Mellon University
Zhenyang Li
University of Alabama at Birmingham
Charles Herrmann
Google
Junhwa Hur
Google
Yinxiao Li
Google
Ming-Hsuan Yang
Google
Bhiksha Raj
Carnegie Mellon University
Min Xu
Carnegie Mellon University
Abstract
System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
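As an illustration of a learnable neural constitutive model, the sketch below maps a per-particle deformation gradient to a predicted stress tensor with a small MLP; the architecture, activation choices, and how such a module would be embedded in MASIV's differentiable simulator are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NeuralConstitutiveModel(nn.Module):
    """Hypothetical learnable constitutive law: maps a particle's deformation
    gradient (3x3) to a predicted stress tensor (3x3), replacing a hand-crafted
    material model inside a differentiable simulator."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 9),
        )

    def forward(self, defgrad):                 # defgrad: (N, 3, 3)
        stress = self.net(defgrad.flatten(1))   # (N, 9)
        return stress.view(-1, 3, 3)

model = NeuralConstitutiveModel()
F_rest = torch.eye(3).repeat(1024, 1, 1)        # particles at rest configuration
P = model(F_rest)                               # predicted per-particle stress
```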
Learning 4D Embodied World Models
Haoyu Zhen
UMass Amherst
Qiao Sun
UMass Amherst
Hongxin Zhang
UMass Amherst
Junyan Li
UMass Amherst
Siyuan Zhou
HKUST
Yilun Du
Harvard University
Chuang Gan
UMass Amherst
Abstract
This paper presents an effective approach for learning novel 4D embodied world models, TesserAct, which predicts the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamics models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that yields policies significantly outperforming those derived from prior video-based world models.
Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
Bozhong Zheng
ShanghaiTech University
Jinye Gan
ShanghaiTech University
Xiaohao Xu
University of Michigan, Ann Arbor
Xintao Chen
ShanghaiTech University
Wenqiao Li
ShanghaiTech University
Xiaonan Huang
University of Michigan, Ann Arbor
Na Ni
ShanghaiTech University
Yingna Wu
ShanghaiTech University
Abstract
3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and an SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. Our code will be released to drive further research.
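A hedged sketch of SDF-based anomaly scoring in the spirit of PASDF: after pose alignment, observed points are evaluated against a continuous SDF template and |SDF| serves as the per-point anomaly score. A unit-sphere SDF stands in for the learned pose-aware network, and the threshold value is illustrative.

```python
import numpy as np

def sdf_unit_sphere(points):
    """Stand-in for a learned, pose-aware SDF network: signed distance to a
    unit sphere (negative inside, positive outside)."""
    return np.linalg.norm(points, axis=1) - 1.0

def anomaly_scores(points, sdf_fn):
    """Per-point anomaly score: deviation of observed points from the learned
    zero level set; large |SDF| flags geometry that differs from the template."""
    return np.abs(sdf_fn(points))

pts = np.random.randn(2048, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # points on the sphere surface
pts[:100] *= 1.2                                    # simulate a bump anomaly
scores = anomaly_scores(pts, sdf_unit_sphere)
mask = scores > 0.05                                # point-level localization
```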
Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
Yuanhong Zheng
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai
Ruixuan Yu
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai
Jian Sun
School of Mathematics and Statistics, Xi'an Jiaotong University
Abstract
3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate a spatial inter-person distance embedding. With the above efficient temporal and spatial designs, we achieve state-of-the-art performance on multiple metrics on the standard CMU-Mocap, MuPoTS-3D, and 3DPW datasets, while significantly reducing the computational cost. Code is available at https://github.com/Yuanhong-Zheng/EMPMP.
Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Minghang Zheng
Wangxuan Institute of Computer Technology, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Benyuan Sun
Central Media Technology Institute, Huawei
Yi Yang
Central Media Technology Institute, Huawei
Yang Liu
Wangxuan Institute of Computer Technology, Peking University
Abstract
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. Existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. Specifically, we design an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur in the near future and further regresses its start time. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
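One plausible (hypothetical) realization of a hierarchical event memory: recent event features are kept at full granularity, while older ones are folded into a coarser long-term tier so the memory stays bounded. The merge-by-averaging rule and all sizes are assumptions, not the paper's design.

```python
from collections import deque
import numpy as np

class HierarchicalEventMemory:
    """Illustrative two-tier memory: a fine-grained recent tier plus a coarse
    long-term tier built by averaging groups of old event features."""
    def __init__(self, recent_size=32, longterm_size=64, merge=4):
        self.recent = deque()
        self.longterm = deque(maxlen=longterm_size)
        self.recent_size, self.merge = recent_size, merge

    def add(self, event_feat):
        self.recent.append(event_feat)
        if len(self.recent) > self.recent_size:
            chunk = [self.recent.popleft() for _ in range(self.merge)]
            self.longterm.append(np.mean(chunk, axis=0))   # coarse summary

    def read(self):
        return list(self.longterm) + list(self.recent)

mem = HierarchicalEventMemory()
for t in range(200):
    mem.add(np.random.randn(256))       # per-proposal event feature
context = mem.read()                    # recent + long-term history
```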
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
University of Washington
Jieyu Zhang
University of Washington
Mohammadreza Salehi
University of Washington
Ziqi Gao
University of Washington
Vishnu Iyengar
University of Washington
Norimasa Kobori
Woven by Toyota, Inc
Quan Kong
Woven by Toyota, Inc
Ranjay Krishna
University of Washington
Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% in average top-5 recall on video-text retrieval with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
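A hedged sketch of turning trajectories into tokens: given per-frame features and one binary mask tube per sub-object trajectory, masked average pooling yields one token per trajectory. TrajViT's actual trajectory extraction and token encoder are more involved; shapes here are illustrative.

```python
import torch

def trajectory_tokens(frame_feats, traj_masks, eps=1e-6):
    """One token per panoptic sub-object trajectory via masked average pooling.
    frame_feats: (T, C, H, W) per-frame features; traj_masks: (N, T, H, W) in {0,1}."""
    summed = torch.einsum('tchw,nthw->nc', frame_feats, traj_masks)   # (N, C)
    area = traj_masks.sum(dim=(1, 2, 3)).clamp_min(eps).unsqueeze(1)  # (N, 1)
    return summed / area                                              # (N, C) tokens

tokens = trajectory_tokens(torch.randn(8, 192, 16, 16),
                           (torch.rand(20, 8, 16, 16) > 0.9).float())
```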
RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
Chengyu Zheng
Nanjing University of Aeronautics and Astronautics
Jin Huang
Nanjing University of Aeronautics and Astronautics
Honghua Chen
Lingnan University
Mingqiang Wei
Nanjing University of Aeronautics and Astronautics
Abstract
Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.
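The first step the abstract describes, projecting the point cloud into depth maps from multiple perspectives, can be sketched as a simple z-buffer projection; the camera intrinsics and extrinsics below are illustrative, and the diffusion-feature extraction and correspondence fusion that follow are omitted.

```python
import numpy as np

def render_depth(points, K, R, t, hw=(256, 256)):
    """Z-buffer projection of a point cloud into a depth map from one viewpoint.
    points: (N, 3) world coordinates; K: (3, 3) intrinsics; R, t: extrinsics."""
    H, W = hw
    cam = points @ R.T + t                      # world -> camera frame
    z = cam[:, 2]
    valid = z > 1e-3                            # keep points in front of the camera
    uvw = cam[valid] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.full((H, W), np.inf)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    np.minimum.at(depth, (v[inside], u[inside]), z[valid][inside])
    depth[np.isinf(depth)] = 0.0
    return depth

K = np.array([[200., 0., 128.], [0., 200., 128.], [0., 0., 1.]])
depth = render_depth(np.random.randn(5000, 3) + [0, 0, 4], K, np.eye(3), np.zeros(3))
```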
Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
Junhao Zheng
Xi'an Jiaotong University
Jiahao Sun
Xi'an Jiaotong University
Chenhao Lin
Xi'an Jiaotong University
Zhengyu Zhao
Xi'an Jiaotong University
Chen Ma
Xi'an Jiaotong University
Chong Zhang
Xi'an Jiaotong University
Cong Wang
City University of Hong Kong
Qian Wang
Wuhan University
Chao Shen
Xi'an Jiaotong University
Abstract
Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This yields a large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.
S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
Guangting Zheng
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Xiaomeng Chu
University of Science and Technology of China
Yu Yuan
University of Science and Technology of China
Houqiang Li
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Abstract
Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM while reducing reconstruction time to below 50%, and even 20%, of that of competing methods. Code is available at https://github.com/Tom-zgt/S3RGS.
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Yupeng Zheng
Institute of Automation, Chinese Academy of Sciences
Pengxuan Yang
School of Artificial Intelligence, UCAS
Zebin Xing
School of Artificial Intelligence, UCAS
Qichao Zhang
Institute of Automation, Chinese Academy of Sciences
Yuhang Zheng
National University of Singapore
Yinfeng Gao
Institute of Automation, Chinese Academy of Sciences
Pengfei Li
Tsinghua University
Teng Zhang
School of Artificial Intelligence, UCAS
Zhongpu Xia
Institute of Automation, Chinese Academy of Sciences
Peng Jia
School of Artificial Intelligence, UCAS
XianPeng Lang
School of Artificial Intelligence, UCAS
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Abstract
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.0% relative reduction in L2 error, 46.7% lower collision rate, and 3.75x faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.
iManip: Skill-Incremental Learning for Robotic Manipulation
Zexin Zheng
Sun Yat-sen University
Jia-Feng Cai
Sun Yat-sen University
Xiao-Ming Wu
Sun Yat-sen University
Yi-Lin Wei
Sun Yat-sen University
Yu-Ming Tang
Sun Yat-sen University
Ancong Wu
Sun Yat-sen University
Wei-Shi Zheng
Sun Yat-sen University
Abstract
The development of a generalist agent with multiple adaptive manipulation skills has been a long-standing goal in the robotics community. In this paper, we explore a crucial task, skill-incremental learning, in robotic manipulation, which is to endow robots with the ability to learn new manipulation skills based on previously learned knowledge without re-training. First, we build a skill-incremental environment based on the RLBench benchmark and explore how traditional incremental methods perform in this setting. We find that they suffer from severe catastrophic forgetting because these methods, designed for classification, overlook the temporality and action complexity inherent to robotic manipulation tasks. To this end, we propose an incremental Manipulation framework, termed iManip, to mitigate the above issues. We first design a temporal replay strategy to maintain the integrity of old skills when learning new ones. Moreover, we propose the Extendable PerceiverIO, consisting of an action prompt with extendable weights to adapt to new action primitives in new skills. Extensive experiments show that our framework performs well in skill-incremental learning.
CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
Hanzhi Zhong
Zhejiang University
Zhiyu Xiang
Zhejiang University
Ruoyu Xu
Zhejiang University
Jingyun Fu
Zhejiang University
Peng Xu
Zhejiang University
Shaohong Wang
Zhejiang University
Zhihao Yang
Zhejiang University
Tianyu Pu
Zhejiang University
Eryun Liu
Zhejiang University
Abstract
4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Due to the sparse points and noisy measurements of 4D radar, most research performs 3D object detection by integrating camera images and fusing the modalities in BEV space. However, the potential of the radar and of the fusion mechanism is still largely unexplored, hindering performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar-guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views, including points, image, and BEV, for each proposal. These comprehensive instance-level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10% and 3.68% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be available at https://github.com/zhzhzhzhzhz/CVFusion.
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Jiaru Zhong
Tsinghua University
Jiahao Wang
Tsinghua University
Jiahui Xu
The University of Hong Kong
Xiaofan Li
Baidu Inc.
Zaiqing Nie
Tsinghua University
Haibao Yu
The University of Hong Kong
Abstract
Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.
OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Ding Zhong
AI Thrust, HKUST(GZ)
Xu Zheng
INSAIT, Sofia University
Chenfei Liao
AI Thrust, HKUST(GZ)
Yuanhuiyi Lyu
AI Thrust, HKUST(GZ)
Jialei Chen
Nagoya University
Shengyang Wu
UMich
Linfeng Zhang
SJTU
Xuming Hu
AI Thrust, HKUST(GZ)
Abstract
Segment Anything Model 2 (SAM2) has emerged as a strong base model for various pinhole imaging segmentation tasks. However, when applying it to the 360° domain, the significant field-of-view (FoV) gap between pinhole (70°×70°) and panoramic images (180°×360°) poses unique challenges. Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding, which the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a similar manner as in video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization across different sizes of source models. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods by large margins, e.g., 79.06% (10.22%↑) on SPin8-to-SPan8 and 62.46% (6.58%↑) on CS13-to-DP13.
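A minimal sketch of dividing a panorama into an ordered patch sequence, assuming a sliding window over the longitude axis with wrap-around; OmniSAM's actual FoV sampling and the SAM2 memory interaction are not reproduced, and the window and stride sizes are illustrative.

```python
import numpy as np

def panorama_to_patch_sequence(pano, fov_w=512, stride=256):
    """Slide a pinhole-FoV-sized window around the panorama's longitude axis
    (with 360° wrap-around) to obtain an ordered patch sequence that a
    video-style memory can process. pano: (H, W, 3) equirectangular image."""
    H, W, _ = pano.shape
    wrapped = np.concatenate([pano, pano[:, :fov_w]], axis=1)   # handle the wrap
    return [wrapped[:, s:s + fov_w] for s in range(0, W, stride)]

pano = np.zeros((1024, 2048, 3), dtype=np.uint8)
patches = panorama_to_patch_sequence(pano)       # ordered like video frames
```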
RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
Yufeng Zhong
Meituan
Chengjian Feng
Meituan
Feng Yan
Meituan
Fanfan Liu
Meituan
Liming Zheng
Meituan
Lin Ma
Meituan
Abstract
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents should possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we propose RoboTron-Nav, a unified framework that integrates perception, planning, and prediction capabilities through multi-task collaboration on navigation and embodied question answering tasks, thereby enhancing navigation performance. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling strategy to effectively and efficiently utilize historical observations. By leveraging a large language model, RoboTron-Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the CHORES-S benchmark, setting a new state of the art.
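The adaptive 3D-aware history sampling strategy is not specified in detail in the abstract; one plausible stand-in is farthest-point sampling over past camera positions, so that revisited areas contribute fewer redundant frames. The sketch below assumes that interpretation.

```python
import numpy as np

def sample_history_3d(positions, k):
    """Pick k past observations whose camera positions are spatially diverse
    (farthest-point sampling). positions: (T, 3) camera positions over time."""
    T = len(positions)
    chosen = [T - 1]                               # always keep the latest frame
    dists = np.linalg.norm(positions - positions[chosen[0]], axis=1)
    for _ in range(min(k, T) - 1):
        idx = int(np.argmax(dists))                # farthest from current set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(positions - positions[idx], axis=1))
    return sorted(chosen)

trajectory = np.cumsum(np.random.randn(200, 3) * 0.1, axis=0)  # dummy camera path
keep = sample_history_3d(trajectory, k=16)                     # indices of kept frames
```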
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
Fangwei Zhong
Beijing Normal University
Kui Wu
Beihang University
Churan Wang
Peking University
Hao Chen
City University of Macau
Hai Ci
National University of Singapore
Zhoujun Li
Peking University
Yizhou Wang
Peking University
Abstract
We introduce UnrealZoo, a collection of over 100 photorealistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles, for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These optimizations significantly improve the efficiency of rendering and communication, enabling advanced applications such as multi-agent interaction. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open-world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in closed-loop control systems when interacting with highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real-world deployment.
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
Xiaoyu Zhou
Peking University
Jingqi Wang
Peking University
Yongtao Wang
Peking University
Yufei Wei
Chongqing Changan Automobile Co., Ltd
Nan Dong
Peking University
Ming-Hsuan Yang
University of California, Merced
Abstract
Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic 3D occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios.
Event-based Visual Vibrometry
Xinyu Zhou
Peking University
Peiqi Duan
Peking University
Yeliduosi Xiaokaiti
Peking University
Chao Xu
Peking University
Boxin Shi
Peking University
Abstract
Visual vibrometry has emerged as a powerful technique for remote acquisition of audio and the physical properties of materials. To capture high-frequency vibrations, frame-based approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. By leveraging the high temporal resolution and low bandwidth characteristics of event cameras, event-based visual vibrometry enables high-speed vibration sensing under ambient lighting conditions with improved data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach based on the event generation model and a motion refinement network. We demonstrate our method by capturing vibration caused by audio sources and estimating material properties for various objects.
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Xin Zhou
Huazhong University of Science and Technology
Dingkang Liang
Huazhong University of Science and Technology
Sifan Tu
Huazhong University of Science and Technology
Xiwu Chen
The University of Hong Kong
Yikang Ding
MEGVII Technology
Dingyuan Zhang
Huazhong University of Science and Technology
Feiyang Tan
The University of Hong Kong
Hengshuang Zhao
The University of Hong Kong
Abstract
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou
University of Chinese Academy of Sciences
Zhanning Gao
DeepRoute.AI
Zhili Chen
The Hong Kong University of Science and Technology
Maosheng Ye
DeepRoute.AI
Qifeng Chen
The Hong Kong University of Science and Technology
Tongyi Cao
DeepRoute.AI
Honggang Qi
University of Chinese Academy of Sciences
Abstract
In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training
Zhenghong Zhou
University of Rochester
Jie An
University of Rochester
Jiebo Luo
University of Rochester
Abstract
Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and may disrupt the model's distribution learned from the training data. We introduce Latent-Reframe, which enables camera control in a pretrained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the distribution learned during pretraining. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model's latent space, ensuring high-quality video generation. Latent-Reframe can be applied to both DiT- and UNet-based video diffusion models. Experimental results demonstrate that Latent-Reframe can achieve comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.
MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Qingyuan Zhou
Fudan University
Yuehu Gong
Fudan University
Weidong Yang
Fudan University
Jiaze Li
Nanyang Technological University
Yeqi Luo
Fudan University
Baixin Xu
Nanyang Technological University
Shuhao Li
Fudan University
Ben Fei
The Chinese University of Hong Kong
Ying He
Nanyang Technological University
Abstract
Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3DGS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian Splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches: one based on 2DGS and the other on 3DGS. The 2DGS branch excels in surface reconstruction, providing precise geometry information to the 3DGS branch. Leveraging this geometry, the 3DGS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2DGS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2DGS and 3DGS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction. Code is available at https://github.com/TsingyuanChou/MGSR.
MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
Hongyi Zhou
National University of Defense Technology
Yulan Guo
Sun Yat-sen University
Xiaogang Wang
Southwest University
Kai Xu
National University of Defense Technology
Abstract
Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes using only a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes by combining depth estimation, optical flow analysis, and point cloud registration, and then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle 'rotation', 'translation', and even complex movements ('rotation+translation'), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence applications. The project page is at: https://monomobility.github.io/MonoMobility.
OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance
Mingquan Zhou
Chinese Academy of Sciences
Chen He
Chinese Academy of Sciences
Ruiping Wang
Chinese Academy of Sciences
Xilin Chen
Chinese Academy of Sciences
Abstract
Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, integrating a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. The representations are processed using Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Notably, our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation. Our project page is at: https://viplvsu.github.io/OV3D-CG/.
Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
Zhangjun Zhou
Huazhong University of Science and Technology
Yiping Li
Huazhong University of Science and Technology
Chunlin Zhong
Huazhong University of Science and Technology
Jianuo Huang
Huazhong University of Science and Technology
Jialun Pei
The Chinese University of Hong Kong
Hua Li
Hainan University
He Tang
Huazhong University of Science and Technology
Abstract
While the human visual system employs distinct mechanisms to perceive salient and camouflaged objects, existing models struggle to disentangle these tasks. Specifically, salient object detection (SOD) models frequently misclassify camouflaged objects as salient, while camouflaged object detection (COD) models conversely misinterpret salient objects as camouflaged. We hypothesize that this can be attributed to two factors: (i) the specific annotation paradigm of current SOD and COD datasets, and (ii) the lack of explicit aspect relationship modeling in current models. Prevalent SOD/COD datasets enforce a mutual exclusivity constraint, assuming scenes contain either salient or camouflaged objects, which poorly aligns with the real world. Furthermore, current SOD/COD methods are primarily designed for these highly constrained datasets and lack explicit modeling of the relationship between salient and camouflaged objects. In this paper, to promote the development of unconstrained salient and camouflaged object detection, we construct a large-scale dataset, USC12K, which features comprehensive labels and four different scenes that cover all possible logical existence scenarios of both salient and camouflaged objects. To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample aspect relationships. Additionally, we design CSCS to evaluate the model's ability to distinguish salient and camouflaged objects. Our method achieves SOTA performance across all scenes. Code and dataset: GitHub.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou
The University of Adelaide
Yicong Hong
Adobe Research
Zun Wang
UNC, Chapel Hill
Chongyang Zhao
UNSW Sydney
Mohit Bansal
UNC, Chapel Hill
Qi Wu
The University of Adelaide
Abstract
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework: we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously, achieving highly comparable performance to task-specific agents.
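A hedged sketch of a state-adaptive mixture of experts: a router conditioned on a pooled state vector (e.g., combined language and observation features) weights several expert feed-forward networks. Dimensions, expert count, and the soft-routing choice are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StateAdaptiveMoE(nn.Module):
    """Route tokens through expert FFNs with weights predicted from a state vector."""
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x, state):                    # x: (B, N, D); state: (B, D)
        weights = torch.softmax(self.router(state), dim=-1)         # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)    # (B, N, D, E)
        return (outs * weights[:, None, None, :]).sum(-1)           # (B, N, D)

moe = StateAdaptiveMoE()
y = moe(torch.randn(2, 100, 512), torch.randn(2, 512))
```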
STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene
Hanyu Zhou
Huazhong University of Science and Technology
Haonan Wang
Huazhong University of Science and Technology
Haoyue Liu
Huazhong University of Science and Technology
Yuxing Duan
Huazhong University of Science and Technology
Luxin Yan
Huazhong University of Science and Technology
Gim Hee Lee
National University of Singapore
Abstract
High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of the dynamic scene from a frame camera. However, this unified paradigm fails to capture the potentially discontinuous temporal features of objects caused by frame imaging and the heterogeneous spatial features between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we observe that background and objects exhibit an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features of background and objects via clustering. For dynamic objects, we discover that Gaussian representations and event data share a consistent spatiotemporal characteristic, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement improves the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments are performed to verify the superiority of our method.
TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
Jiale Zhou
Zhejiang University
Wenhan Wang
Beihang University
Shikun Li
Westlake University
Xiaolei Qu
Beihang University
Xin Guo
Beihang University
Yizhong Liu
Beihang University
Wenzhong Tang
Beihang University
Xun Lin
Beihang University
Yefeng Zheng
Westlake University
Abstract
Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-breaks. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.
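The clDice metric cited in the abstract is a standard topology-aware score computed from skeletons of the prediction and the ground truth; a 2D version is sketched below, with a toy broken-tube example.

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred, gt, eps=1e-8):
    """centerline Dice (clDice) for 2D binary masks: topology precision is the
    fraction of the predicted skeleton covered by the ground truth, topology
    sensitivity the fraction of the ground-truth skeleton covered by the
    prediction; clDice is their harmonic mean."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    s_pred, s_gt = skeletonize(pred), skeletonize(gt)
    tprec = (s_pred & gt).sum() / (s_pred.sum() + eps)
    tsens = (s_gt & pred).sum() / (s_gt.sum() + eps)
    return 2 * tprec * tsens / (tprec + tsens + eps)

gt = np.zeros((128, 128), dtype=bool); gt[60:68, :] = True    # a full tube
pred = np.zeros_like(gt); pred[60:68, :100] = True            # a broken tube
print(cl_dice(pred, gt))
```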
TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
Zewei Zhou
University of California, Los Angeles
Seth Z. Zhao
University of California, Los Angeles
Tianhui Cai
University of California, Los Angeles
Zhiyu Huang
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Jiaqi Ma
University of California, Los Angeles
Abstract
End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
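The exact gradient conflict suppression rule is not given in the abstract; a PCGrad-style projection, which removes the conflicting component between two task gradients before they are combined, is a common stand-in and is sketched here under that assumption.

```python
import torch

def suppress_conflict(g1, g2):
    """If two flattened task gradients conflict (negative dot product), project
    each onto the normal plane of the other (PCGrad-style)."""
    def project(a, b):
        dot = torch.dot(a, b)
        if dot < 0:
            a = a - dot / (b.norm() ** 2 + 1e-12) * b
        return a
    return project(g1.clone(), g2), project(g2.clone(), g1)

g_det = torch.randn(1000)        # flattened detection-task gradient
g_pred = torch.randn(1000)       # flattened prediction-task gradient
g_det, g_pred = suppress_conflict(g_det, g_pred)
combined = g_det + g_pred        # gradient applied to shared parameters
```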
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
Zewei Zhou
University of California, Los Angeles
Hao Xiang
University of California, Los Angeles
Zhaoliang Zheng
University of California, Los Angeles
Seth Z. Zhao
University of California, Los Angeles
Mingyue Lei
University of California, Los Angeles
Yun Zhang
University of California, Los Angeles
Tianhui Cai
University of California, Los Angeles
Xinyi Liu
University of California, Los Angeles
Johnson Liu
University of California, Los Angeles
Maheswari Bajji
University of California, Los Angeles
Xin Xia
University of California, Los Angeles
Zhiyu Huang
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Jiaqi Ma
University of California, Los Angeles
Abstract
Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in both perception and prediction tasks.
When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection
Hongliang Zhou
National University of Defense Technology, China
Yongxiang Liu
National University of Defense Technology, China
Canyu Mo
National University of Defense Technology, China
Weijie Li
National University of Defense Technology, China
Bowen Peng
National University of Defense Technology, China
Li Liu
National University of Defense Technology, China
Abstract
Few-shot object detection aims to detect novel classes with limited samples. Recent methods have leveraged the rich semantic representations of pretrained vision transformers (ViTs) to overcome the limitations of model fine-tuning, thereby improving performance on novel classes. However, existing pretrained ViT schemes only perform transformer encoding in the feature dimension, ignoring pixel-wise differences in low-level features and multiscale variations. The current challenges lie in: (i) extracted features suffer from blurred boundaries and a smooth transition from center to boundary, leading to insufficient distinction between objects and backgrounds, and (ii) balancing the extraction of local details and global contour features in multiscale scenarios. To address these challenges, the Pixel Difference Vision Transformer (PiDiViT) is proposed. Its innovations include: (i) a difference convolution fusion module (DCFM), which enhances feature differences from object centers to boundaries and effectively preserves global information by fusing pixel-wise central difference features with original features through an attention mechanism, and (ii) a multiscale feature fusion module (MFFM), which adaptively fuses features extracted by convolutional kernels of five different scales using a scale attention mechanism to generate attention weights, achieving an optimal balance between local detail and global semantic information extraction. PiDiViT achieves SOTA on the COCO benchmark: surpassing the few-shot detection SOTA by 2.7 nAP50 (10-shot) and 4.0 nAP50 (30-shot) for novel classes, exceeding the one-shot detection SOTA by 4.4 nAP50 and the open-vocabulary detection SOTA by 3.7 nAP50. The code is available at https://github.com/Seaz9/PiDiViT.
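The pixel-wise central difference features used by the DCFM build on the standard central difference convolution; a plain CDC layer is sketched below for reference (the attention-based fusion inside DCFM is not included, and theta is a blending hyperparameter).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Central difference convolution: blends a vanilla 3x3 convolution with a
    term that responds to differences between neighborhood pixels and the
    center pixel, sharpening boundary cues."""
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        # difference term: the kernel's spatial sum applied to the center pixel
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (O, I, 1, 1)
        out_center = F.conv2d(x, kernel_sum, padding=0)
        return out - self.theta * out_center

layer = CentralDifferenceConv2d(64, 64)
y = layer(torch.randn(1, 64, 32, 32))
```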
Where, What, Why: Towards Explainable Driver Attention Prediction
Yuchen Zhou
Sun Yat-sen University
Jiayu Tang
Sun Yat-sen University
Xiaoyan Xiao
Sun Yat-sen University
Yueyao Lin
Sun Yat-sen University
Linkai Liu
Sun Yat-sen University
Zipeng Guo
Sun Yat-sen University
Hao Fei
National University of Singapore
Xiaobo Xia
National University of Singapore
Chao Gou
Sun Yat-sen University
Abstract
Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W³DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
King Abdullah University of Science and Technology
Bing Li
King Abdullah University of Science and Technology
Cheng Zheng
King Abdullah University of Science and Technology
Jinjie Mai
King Abdullah University of Science and Technology
Jun Chen
King Abdullah University of Science and Technology
Letian Jiang
King Abdullah University of Science and Technology
Abdullah Hamdi
University of Oxford
Sara Rojas Martinez
King Abdullah University of Science and Technology
Chia-Wen Lin
National Tsing Hua University
Mohamed Elhoseiny
King Abdullah University of Science and Technology
Bernard Ghanem
King Abdullah University of Science and Technology
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs. Project page: https://4dbench.github.io/
A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
Jie Zhu
Michigan State University
Yiyang Su
Michigan State University
Minchul Kim
Michigan State University
Anil Jain
Michigan State University
Xiaoming Liu
Michigan State University
Abstract
Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed to improve whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality. Code is available at the Project Link.
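To make the score-fusion idea concrete, the sketch below shows one way a quality-guided gate could weight per-modality similarity matrices. The class name QualityGatedScoreFusion, the gate architecture, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of quality-guided score fusion (names and shapes assumed).
import torch
import torch.nn as nn


class QualityGatedScoreFusion(nn.Module):
    """Fuse per-modality similarity matrices with weights predicted
    from per-probe quality scores (a stand-in for the paper's MoE gating)."""

    def __init__(self, num_modalities: int, num_experts: int = 4):
        super().__init__()
        # Gate maps per-probe modality quality scores -> mixture weights over experts.
        self.gate = nn.Sequential(
            nn.Linear(num_modalities, 32), nn.ReLU(),
            nn.Linear(32, num_experts), nn.Softmax(dim=-1),
        )
        # Each expert holds one weight per modality (softmaxed at fusion time).
        self.expert_logits = nn.Parameter(torch.zeros(num_experts, num_modalities))

    def forward(self, sims: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
        # sims:    (M, P, G) similarity matrices, one per modality
        # quality: (P, M) estimated quality of each probe under each modality
        expert_w = torch.softmax(self.expert_logits, dim=-1)   # (E, M)
        mix = self.gate(quality)                                # (P, E)
        w = mix @ expert_w                                      # (P, M) per-probe modality weights
        # Weighted sum over modalities -> fused (P, G) similarity matrix.
        return torch.einsum("pm,mpg->pg", w, sims)


if __name__ == "__main__":
    M, P, G = 3, 8, 100   # modalities (face/gait/body), probes, gallery size
    fusion = QualityGatedScoreFusion(num_modalities=M)
    fused = fusion(torch.randn(M, P, G), torch.rand(P, M))
    print(fused.shape)  # torch.Size([8, 100])
```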
Aether: Geometric-Aware Unified World Modeling
Haoyi Zhu
USTC
Yifan Wang
Shanghai AI Lab
Jianjun Zhou
SII
Wenzheng Chang
SJTU
Yang Zhou
ZJU
Zizun Li
FDU
Junyi Chen
FDU
Chunhua Shen
unknown
Jiangmiao Pang
unknown
Tong He
unknown
Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes AETHER, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, AETHER achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, AETHER employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes
Huachao Zhu
Wuhan University
Zelong Liu
Wuhan University
Zhichao Sun
Wuhan University
Yuda Zou
Wuhan University
Gui-Song Xia
Wuhan University
Yongchao Xu
Wuhan University
Abstract
Recognizing out-of-distribution (OoD) objects on roads is crucial for safe driving. Most existing methods rely on segmentation models' uncertainty as anomaly scores, often resulting in false positives, especially in ambiguous regions such as boundaries, where segmentation models inherently exhibit high uncertainty. Additionally, it is challenging to define a suitable threshold for generating anomaly masks, especially given the inconsistency of predictions across consecutive frames. We propose DetSeg, a novel paradigm that incorporates object-level understanding. DetSeg first detects all objects in the open world and then suppresses in-distribution (ID) bounding boxes, leaving only OoD proposals. These proposals can either help previous methods eliminate false positives (DetSeg-R) or generate binary anomaly masks without a complex threshold search when combined with a box-prompted segmentation module (DetSeg-S). Additionally, we introduce vanishing point guided Hungarian matching (VPHM) to smooth the prediction results within a video clip, mitigating abrupt variations in predictions between consecutive frames. Comprehensive experiments on various benchmarks demonstrate that DetSeg significantly improves performance, reducing the FPR95 of previous methods by up to 37.45%, offering a more robust and practical solution. Code: https://github.com/huachao0124/DetSeg-official.
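As a rough illustration of the proposal-filtering step, the snippet below keeps open-world detections that do not confidently match an in-distribution class and uses them to restrict a pixel anomaly map, in the spirit of DetSeg-R; the ID class list, threshold, and function names are hypothetical.

```python
# Minimal sketch (assumed interfaces): keep only detections that do not match
# any in-distribution (ID) class, then restrict a pixel anomaly map to those boxes.
import numpy as np

ID_CLASSES = {"car", "person", "bicycle", "traffic sign"}  # illustrative ID set


def ood_proposals(boxes, labels, scores, id_score_thresh=0.5):
    """boxes: (N, 4) xyxy; labels: N class names from an open-world detector;
    scores: (N,) confidence of the ID label. Returns boxes kept as OoD proposals."""
    keep = [i for i, (lbl, s) in enumerate(zip(labels, scores))
            if lbl not in ID_CLASSES or s < id_score_thresh]
    return boxes[keep]


def refine_anomaly_map(anomaly, proposals):
    """Zero out anomaly scores outside OoD proposals (DetSeg-R-style refinement)."""
    mask = np.zeros_like(anomaly, dtype=bool)
    for x1, y1, x2, y2 in proposals.astype(int):
        mask[y1:y2, x1:x2] = True
    return np.where(mask, anomaly, 0.0)


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 50], [60, 20, 90, 80]], dtype=float)
    props = ood_proposals(boxes, ["car", "unknown"], np.array([0.9, 0.3]))
    refined = refine_anomaly_map(np.random.rand(100, 100), props)
    print(props.shape, refined.shape)
```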
ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis
Benjin ZHU
CUHK MMLab
Xiaogang WANG
CUHK MMLab
Hongsheng LI
CUHK MMLab
Abstract
Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and closed-loop validation. Current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, our SF-DiT generates temporally consistent 3D occupancy, which provides guidance for controlled image and video diffusion for scene synthesis. To address temporal consistency, SF-DiT enhances standard DiT blocks with temporal semantic modeling through two designs: (1) a Semantic Flow Estimation module capturing scene motions (flow, uncertainty, and classification) from sequential BEV semantic maps, and (2) a Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. This integration of semantic flow modeling into DiT enables a consistent understanding of scene evolution. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance with FID 8.3 and FVD 73.6, and superior temporal occupancy generation results on the nuCraft and OpenOccupancy benchmarks. Code is available.
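A hedged sketch of how flow information might modulate cross-attention is given below: flow-derived features scale and shift the occupancy-token queries before attending to BEV-semantic tokens. The module name, gating form, and dimensions are assumptions; the paper's SF-DiT block is not reproduced here.

```python
# Hedged sketch of flow-modulated cross-attention (dimensions and gating form assumed).
import torch
import torch.nn as nn


class FlowModulatedCrossAttention(nn.Module):
    """Cross-attention whose queries are scaled/shifted by features derived
    from an estimated semantic flow, loosely following the abstract's description."""

    def __init__(self, dim: int, flow_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_mlp = nn.Sequential(nn.Linear(flow_dim, dim), nn.SiLU(),
                                      nn.Linear(dim, 2 * dim))

    def forward(self, x, context, flow_feat):
        # x: (B, N, dim) occupancy tokens; context: (B, M, dim) BEV-semantic tokens
        # flow_feat: (B, N, flow_dim) per-token flow/uncertainty features
        scale, shift = self.flow_mlp(flow_feat).chunk(2, dim=-1)
        q = x * (1 + scale) + shift            # modulate queries by flow features
        out, _ = self.attn(q, context, context)
        return x + out                         # residual connection


if __name__ == "__main__":
    m = FlowModulatedCrossAttention(dim=64, flow_dim=8)
    y = m(torch.randn(2, 16, 64), torch.randn(2, 32, 64), torch.randn(2, 16, 8))
    print(y.shape)  # torch.Size([2, 16, 64])
```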
Depth Any Event Stream: Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation
Jinjing Zhu
HKUST(GZ)
Tianbo Pan
HKUST
Zidong Cao
HKUST
Yexin Liu
HKUST
James T. Kwok
HKUST
Hui Xiong
HKUST
Abstract
With the superior sensitivity of event cameras to high-speed motion and extreme lighting conditions, event-based monocular depth estimation has gained popularity for predicting structural information about surrounding scenes in challenging environments. However, the scarcity of labeled event data constrains prior supervised learning methods. To unleash the promising potential of the existing RGB-based depth foundation model DAM [41], we propose Depth Any Event stream (EventDAM) to achieve high-performance event-based monocular depth estimation in an annotation-free manner. EventDAM effectively combines paired dense RGB images with sparse event data by incorporating three key cross-modality components: Sparsity-aware Feature Mixture (SFM), Sparsity-aware Feature Distillation (SFD), and Sparsity-invariant Consistency Module (SCM). With the proposed sparsity metric, SFM mixes features from RGB images and event data to generate auxiliary depth predictions, while SFD facilitates adaptive feature distillation. Furthermore, SCM ensures output consistency across varying sparsity levels in event data, thereby endowing EventDAM with zero-shot capabilities across diverse scenes. Extensive experiments on a variety of benchmark datasets, in comparison with approaches using diverse input modalities, robustly substantiate the generalization and zero-shot capabilities of EventDAM.
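The snippet below illustrates the general idea of a sparsity metric and a sparsity-weighted feature mixture; the specific metric (fraction of near-zero activations) and the linear blend are placeholders, not the paper's SFM formulation.

```python
# Illustrative sketch only: a sparsity metric and a feature mixture weighted by it
# (module names follow the abstract; the exact formulas are assumptions).
import torch


def sparsity_metric(event_feat: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Fraction of near-zero activations per sample, in [0, 1]."""
    b = event_feat.shape[0]
    return (event_feat.abs() < eps).float().view(b, -1).mean(dim=1)


def sparsity_aware_mixture(rgb_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
    """Blend dense RGB features and sparse event features: the sparser the event
    stream, the more the mixture leans on the RGB branch."""
    s = sparsity_metric(event_feat).view(-1, 1, 1, 1)   # (B, 1, 1, 1)
    return s * rgb_feat + (1.0 - s) * event_feat


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 32, 32)
    evt = torch.randn(2, 64, 32, 32) * (torch.rand(2, 64, 32, 32) > 0.8).float()
    print(sparsity_aware_mixture(rgb, evt).shape)  # torch.Size([2, 64, 32, 32])
```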
EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
Yufei Zhu
ShanghaiTech University
Yiming Zhong
ShanghaiTech University
Zemin Yang
ShanghaiTech University
Peishan Cong
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Xinge Zhu
The Chinese University of Hong Kong
Yuexin Ma
ShanghaiTech University
Abstract
Dexterous robotic hands often struggle to generalize effectively in complex environments due to models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve by learning from both failures and successes. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference fine-tuning, and ensures physical plausibility throughout the process. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulated and real scenarios.
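HPO is described as preference alignment from positive and negative feedback; as a generic stand-in, the sketch below implements a DPO-style pairwise preference loss over preferred and rejected grasps. The exact objective, the use of a frozen reference model, and the value of beta are assumptions, not the paper's HPO.

```python
# A generic DPO-style pairwise preference loss as a stand-in for Handpose-wise
# Preference Optimization (the paper's exact objective may differ).
import torch
import torch.nn.functional as F


def preference_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """logp_*: log-likelihood of preferred / rejected grasps under the current model;
    ref_logp_*: the same quantities under a frozen reference model."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    lp, ln = torch.randn(8), torch.randn(8)   # current-model log-likelihoods
    rp, rn = torch.randn(8), torch.randn(8)   # reference-model log-likelihoods
    print(preference_loss(lp, ln, rp, rn).item())
```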
IRASim: A Fine-Grained World Model for Robot Manipulation
Fangqi Zhu
Hong Kong University of Science and Technology
Hongtao Wu
ByteDance Seed
Song Guo
Hong Kong University of Science and Technology
Yuxiao Liu
ByteDance Seed
Chilam Cheang
ByteDance Seed
Tao Kong
ByteDance Seed
Abstract
World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interactions in the visual space with existing methods, which overlook the precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses all baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller. Video and code are available at https://gen-irasim.github.io/.
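One plausible form of frame-level action conditioning is to modulate each frame's tokens with that frame's action embedding, AdaLN-style, as sketched below; the module name, shapes, and modulation scheme are illustrative assumptions rather than the paper's exact block design.

```python
# Hedged sketch of frame-level action conditioning: each frame's tokens are
# modulated by that frame's action embedding (AdaLN-style; details assumed).
import torch
import torch.nn as nn


class FrameActionConditioning(nn.Module):
    def __init__(self, dim: int, action_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(action_dim, 2 * dim))

    def forward(self, tokens: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, dim) video tokens per frame; actions: (B, T, action_dim)
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)   # (B, T, dim) each
        return self.norm(tokens) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)


if __name__ == "__main__":
    cond = FrameActionConditioning(dim=64, action_dim=7)
    out = cond(torch.randn(2, 16, 49, 64), torch.randn(2, 16, 7))
    print(out.shape)  # torch.Size([2, 16, 49, 64])
```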
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu
The University of Hong Kong
Tai Wang
Shanghai AI Laboratory
Wenwei Zhang
Shanghai AI Laboratory
Jiangmiao Pang
Shanghai AI Laboratory
Xihui Liu
The University of Hong Kong
Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize 3D position embeddings to enhance the 2D CLIP patch features with 3D spatial context and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D visual understanding and vision-language conversation capabilities comparable to LLaVA.
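The sketch below shows the basic idea of turning 2D patch features into 3D patches by adding an MLP encoding of each patch's back-projected 3D position; the module name and dimensions are assumptions, and the actual LLaVA-3D pipeline may differ in detail.

```python
# Minimal sketch (assumed shapes): lift 2D patch features into "3D patches" by
# adding an MLP encoding of each patch's back-projected 3D position.
import torch
import torch.nn as nn


class Patch3DPositionEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, patch_feats: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) 2D CLIP patch features
        # patch_xyz:   (B, N, 3) 3D coordinates of each patch (e.g. from depth + pose)
        return patch_feats + self.pos_mlp(patch_xyz)


if __name__ == "__main__":
    embed = Patch3DPositionEmbedding(dim=1024)
    feats3d = embed(torch.randn(1, 576, 1024), torch.rand(1, 576, 3))
    print(feats3d.shape)  # torch.Size([1, 576, 1024])
```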
LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
Juelin Zhu
National University of Defense Technology
Shuaibang Peng
National University of Defense Technology
Long Wang
Westlake University
Hanlin Tan
National University of Defense Technology
Yu Liu
National University of Defense Technology
Maojun Zhang
National University of Defense Technology
Shen Yan
National University of Defense Technology
Abstract
We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc [99] has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, whereas the majority of available models, and those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address this issue, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to extract building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost in the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with the maximum value as the coarse pose. In the fine pose estimation stage, a particle-filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimate. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors. The project is available at https://github.com/VictorZoo/LoD-Loc-v2.
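For the coarse stage, one can picture each entry of the pose cost volume as an IoU score between the silhouette projected under a sampled pose and the predicted silhouette, as in the toy example below; the rendering of projected silhouettes is assumed to happen elsewhere, and the function names are hypothetical.

```python
# Coarse-stage sketch only (assumed inputs): score sampled pose hypotheses by the
# IoU between each hypothesis' projected building silhouette and the predicted one.
import numpy as np


def silhouette_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0


def select_coarse_pose(pose_hypotheses, projected_masks, predicted_mask):
    """pose_hypotheses: list of 6-DoF poses sampled around the prior;
    projected_masks: binary LoD-model silhouettes rendered under each hypothesis;
    predicted_mask: binary building silhouette from the segmentation network."""
    costs = np.array([silhouette_iou(m, predicted_mask) for m in projected_masks])
    return pose_hypotheses[int(costs.argmax())], costs


if __name__ == "__main__":
    pred = np.zeros((64, 64), bool); pred[16:48, 16:48] = True
    masks = [np.roll(pred, shift, axis=1) for shift in (-8, 0, 8)]
    poses = ["pose_left", "pose_center", "pose_right"]
    best, costs = select_coarse_pose(poses, masks, pred)
    print(best, costs.round(2))  # pose_center is the best-aligned hypothesis
```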
MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
Yaoye Zhu
Institute for AI Industry Research (AIR), Tsinghua University
Zhe Wang
Institute for AI Industry Research (AIR), Tsinghua University
Yan Wang
Institute for AI Industry Research (AIR), Tsinghua University
Abstract
As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera calibration methods designed for a single vehicle, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu
Tsinghua University
Xilin Wang
Beihang University
Yixuan Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Zhuofan Zhang
Tsinghua University
Xiaojian Ma
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Baoxiong Jia
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Wei Liang
Beijing Institute of Technology
Qian Yu
Beihang University
Zhidong Deng
Tsinghua University
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Abstract
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. The deployment on a real robot demonstrates MTU3D's effectiveness in handling real-world data. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
Ruijie Zhu
University of Science and Technology of China
Mulin Yu
Shanghai Artificial Intelligence Laboratory
Linning Xu
The Chinese University of Hong Kong
Lihan Jiang
University of Science and Technology of China
Yixuan Li
The Chinese University of Hong Kong
Tianzhu Zhang
unknown
Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory
Bo Dai
The University of Hong Kong
Abstract
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_
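The classification constraint can be pictured as each Gaussian carrying object-ID logits supervised with cross-entropy against the object it is anchored to, as in the minimal sketch below; the rendering and aggregation steps of the actual method are omitted and the shapes are assumptions.

```python
# Hedged sketch: per-Gaussian object-ID logits supervised with a classification
# loss (the splatting/rendering step of the actual method is not modeled here).
import torch
import torch.nn.functional as F


def object_id_loss(gaussian_id_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """gaussian_id_logits: (G, K) one logit per object ID for each Gaussian;
    target_ids: (G,) object ID each Gaussian is anchored to."""
    return F.cross_entropy(gaussian_id_logits, target_ids)


if __name__ == "__main__":
    G, K = 1000, 12                       # Gaussians, object IDs in the scene
    logits = torch.randn(G, K, requires_grad=True)
    targets = torch.randint(0, K, (G,))
    loss = object_id_loss(logits, targets)
    loss.backward()
    print(loss.item())
```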
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
Zhihao Zhu
Shanghai Jiao Tong University
Yifan Zheng
Shanghai Jiao Tong University
Siyu Pan
Shanghai Jiao Tong University
Yaohui Jin
Shanghai Jiao Tong University
Yao Mu
Shanghai Jiao Tong University
Abstract
The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and the reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these issues, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant descriptions; and (3) a spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding
Yihang Zhu
Xidian University
Jinhao Zhang
Xidian University
Yuxuan Wang
Xidian University
Aming Wu
Xidian University
Cheng Deng
Xidian University
Abstract
As an important direction of embodied intelligence, 3D Visual Grounding has attracted much attention, aiming to identify 3D objects matching a given language description. Most existing methods follow a two-stage process, i.e., first detecting proposal objects and then identifying the right ones based on their relevance to the given query. However, when the query is complex, it is difficult to leverage an abstract language representation to lock onto the corresponding objects accurately, which affects grounding performance. In general, given a specific object, humans usually rely on two clues to ground it, i.e., attribute and location clues. To this end, we explore a new mechanism, attribute-to-location clue reasoning, to conduct accurate grounding. In particular, we propose a VGMamba network that consists of an SVD-based attribute mamba, a location mamba, and a multi-modal fusion mamba. Taking a 3D point-cloud scene and a language query as input, we first exploit SVD to decompose the extracted features. Then, a sliding-window operation is conducted to capture attribute characteristics. Next, a location mamba is presented to obtain the corresponding location information. Finally, by means of multi-modal mamba fusion, the model can effectively localize the object that matches the given query. Our method is evaluated on four datasets. Extensive experimental results demonstrate the superiority of our method.
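As a loose illustration of the SVD-based attribute branch, the sketch below extracts a low-rank component of the features and scans it with overlapping windows; the rank, window size, and function names are assumptions, and the Mamba blocks themselves are not modeled.

```python
# Illustrative sketch (rank and window size are assumptions): an SVD-based
# low-rank "attribute" component followed by a sliding-window scan over tokens.
import torch


def svd_attribute_component(feats: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """feats: (N, D) point/token features; return a rank-r reconstruction."""
    U, S, Vh = torch.linalg.svd(feats, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]


def sliding_windows(feats: torch.Tensor, window: int = 16, stride: int = 8) -> torch.Tensor:
    """Split (N, D) features into overlapping windows of shape (num_windows, window, D)."""
    return feats.unfold(0, window, stride).transpose(1, 2)


if __name__ == "__main__":
    feats = torch.randn(256, 64)
    attr = svd_attribute_component(feats)   # (256, 64) low-rank component
    wins = sliding_windows(attr)            # (31, 16, 64) overlapping windows
    print(attr.shape, wins.shape)
```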
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
Haodong Zhu
Beihang University
Wenhao Dong
Beihang University
Linlin Yang
Communication University of China
Hong Li
Beihang University
Yuguang Yang
Beihang University
Yangyang Ren
Beihang University
Qingcheng Zhu
Beihang University
Zichao Feng
Beihang University
Changbai Li
Beihang University
Shaohui Lin
East China Normal University
Runqi Wang
Beijing Jiaotong University
Xiaoyan Luo
Beihang University
Baochang Zhang
Beihang University
Abstract
Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an 'absolute maximum' fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
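The frequency-domain fusion can be illustrated with a hand-rolled single-level Haar DWT: low-frequency sub-bands are averaged here as a placeholder for the Mamba-based LMFB, while high-frequency sub-bands follow the 'absolute maximum' rule described in the abstract. Everything below is a simplified single-channel sketch.

```python
# Self-contained sketch with a single-level Haar DWT (numpy only): low-frequency
# sub-bands are averaged (placeholder for the Mamba-based LMFB), high-frequency
# sub-bands are fused with the 'absolute maximum' rule.
import numpy as np


def haar_dwt2(x):
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll, lh = (a + b + c + d) / 2, (a - b + c - d) / 2
    hl, hh = (a + b - c - d) / 2, (a - b - c + d) / 2
    return ll, (lh, hl, hh)


def haar_idwt2(ll, highs):
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


def fuse_rgb_ir(rgb, ir):
    ll_r, highs_r = haar_dwt2(rgb)
    ll_i, highs_i = haar_dwt2(ir)
    ll_fused = 0.5 * (ll_r + ll_i)                            # placeholder for LMFB
    highs_fused = tuple(np.where(np.abs(hr) >= np.abs(hi), hr, hi)
                        for hr, hi in zip(highs_r, highs_i))  # absolute-maximum rule
    return haar_idwt2(ll_fused, highs_fused)


if __name__ == "__main__":
    rgb, ir = np.random.rand(64, 64), np.random.rand(64, 64)
    print(fuse_rgb_ir(rgb, ir).shape)  # (64, 64)
```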
OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
Yiming Zuo
Princeton University
Willow Yang
Princeton University
Zeyu Ma
Princeton University
Jia Deng
Princeton University
Abstract
Depth completion (DC) aims to predict a dense depth map from an RGB image and a sparse depth map. Existing DC methods generalize poorly to new datasets or unseen sparse depth patterns, limiting their real-world applications. We propose OMNI-DC, a highly robust DC model that generalizes well zero-shot to various datasets. The key design is a novel Multi-resolution Depth Integrator, allowing our model to deal with very sparse depth inputs. We also introduce a novel Laplacian loss to model the ambiguity in the training process. Moreover, we train OMNI-DC on a mixture of high-quality datasets with a scale normalization technique and synthetic depth patterns. Extensive experiments on 7 datasets show consistent improvements over baselines, reducing errors by as much as 43%. Codes and checkpoints are available at https://github.com/princeton-vl/OMNI-DC.
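The abstract does not spell out the Laplacian loss; one plausible reading is a Laplacian negative log-likelihood with a predicted per-pixel scale, as sketched below. The exact formulation used by OMNI-DC may differ.

```python
# One plausible form of a Laplacian loss for modeling depth ambiguity (assumed):
# a Laplacian negative log-likelihood with a predicted per-pixel scale b.
import torch


def laplacian_nll(pred_depth, pred_log_b, gt_depth, valid_mask):
    """pred_depth, gt_depth: (B, 1, H, W); pred_log_b: predicted log-scale;
    valid_mask: bool mask of pixels that have ground-truth depth."""
    b = pred_log_b.exp()
    nll = (pred_depth - gt_depth).abs() / b + pred_log_b + torch.log(torch.tensor(2.0))
    return nll[valid_mask].mean()


if __name__ == "__main__":
    B, H, W = 2, 32, 32
    pred = torch.rand(B, 1, H, W, requires_grad=True)
    logb = torch.zeros(B, 1, H, W, requires_grad=True)
    gt = torch.rand(B, 1, H, W)
    mask = torch.rand(B, 1, H, W) > 0.5
    loss = laplacian_nll(pred, logb, gt, mask)
    loss.backward()
    print(loss.item())
```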
PanSt3R: Multi-view Consistent Panoptic Segmentation
Lojze Žust
Naver Labs Europe
Yohann Cabon
Naver Labs Europe
Juliette Marrie
Naver Labs Europe
Leonid Antsfeld
Naver Labs Europe
Boris Chidlovskii
Naver Labs Europe
Jérôme Revaud
Naver Labs Europe
Gabriela Csurka
Naver Labs Europe
Abstract
Panoptic segmentation in 3D is a fundamental problem in scene understanding. Existing approaches typically rely on costly test-time optimizations (often based on NeRF) to consolidate 2D predictions of off-the-shelf panoptic segmentation methods into 3D. Instead, in this work, we propose a unified and integrated approach, PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view-consistent panoptic segmentation in a single forward pass. Our approach harnesses the 3D representations of MUSt3R, a recent scalable multi-view version of DUSt3R, and the 2D representations of DINOv2, then performs joint multi-view panoptic prediction via a mask transformer architecture. We additionally revisit the standard post-processing mask-merging procedure and introduce a more principled approach to multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple yet fast and scalable, and achieves state-of-the-art performance on several benchmarks while being orders of magnitude faster. More information and examples are available on our project page.
