Paper title Authors Description
PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem
KAUST
Filippo Aleotti
Niantic Spatial
Jamie Watson
Niantic Spatial
Zawar Qureshi
Niantic Spatial
Abdelrahman Eldesokey
KAUST
Peter Wonka
KAUST
Gabriel Brostow
UCL
Sara Vicente
Niantic Spatial
Guillermo Garcia-Hernando
Niantic Spatial
Abstract
We introduce the task of Language-Guided Object Placement in Real 3D Scenes. Given a 3D reconstructed point-cloud scene, a 3D asset, and a natural-language instruction, the goal is to place the asset so that the instruction is satisfied. The task demands tackling four intertwined challenges: (a) one-to-many ambiguity in valid placements; (b) precise geometric and physical reasoning; (c) joint understanding across the scene, the asset, and language; and (d) robustness to noisy point clouds with no privileged metadata at test time. The first three challenges mirror the complexities of synthetic scene generation, while the metadata-free, noisy-scan scenario is inherited from language-guided 3D visual grounding. We inaugurate this task by introducing a benchmark and evaluation protocol, releasing a dataset for training multi-modal large language models (MLLMs), and establishing a first nontrivial baseline. We believe this challenging setup and benchmark will provide a foundation for evaluating and advancing MLLMs in 3D understanding.
NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals
Jiro Abe
Visual Intelligence Research Laboratories, NEC Corporation
Gaku Nakano
Visual Intelligence Research Laboratories, NEC Corporation
Kazumine Ogura
Visual Intelligence Research Laboratories, NEC Corporation
Abstract
We propose NormalLoc, a novel visual localization method for estimating the 6-DoF pose of a camera using textureless 3D models. Existing methods often rely on color or texture information, limiting their applicability in scenarios where such information is unavailable. NormalLoc addresses this limitation by using rendered normal images, generated from the surface normals of 3D models, to establish a training scheme for both global descriptor computation and matching. This approach enables robust visual localization even when geometric detail is limited. Experimental results demonstrate that NormalLoc achieves state-of-the-art performance for visual localization on textureless 3D models, particularly in such low-detail scenarios.
UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents
Harsh Agrawal
Apple
Eldon Schoop
Apple
Xinlei Pan
Apple
Anuj Mahajan
Apple
Ari Seff
Apple
Di Feng
Apple
Ruijia Cheng
Apple
Andres Romero Mier Y Teran
Apple
Esteban Gomez
Apple
Abhishek Sundararajan
Apple
Forrest Huang
Apple
Amanda Swearngin
Apple
Mohana Prasad Sathya Moorthy
Apple
Jeff Nichols
Apple
Alexander Toshev
Apple
Abstract
We build a comprehensive online evaluation benchmark for language-conditioned multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities, including multi-step planning, visual perception, action grounding, and the use of memory or external knowledge. We also highlight important factors, such as statefulness, safety, and evaluation complexity, that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.
UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
Zixiang Ai
Wangxuan Institute of Computer Technology, Peking University
Zhenyu Cui
Wangxuan Institute of Computer Technology, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Jiahuan Zhou
Wangxuan Institute of Computer Technology, Peking University
Abstract
Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), a common issue in real scenarios caused by casual object occlusions and imperfect data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between point cloud enhancement and downstream tasks, these methods fail to work across diverse real-world domains. In addition, the conflicting objectives of the denoising and completion tasks further limit the ability of such ensemble paradigms to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter that adapts to noisy points through predicted rectification vector prompts, effectively filtering noise while preserving the intricate geometric features essential for accurate analysis. Subsequently, we incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating robustness and adaptability. Finally, a Shape-Aware Unit module is employed to efficiently unify and capture the filtered geometric features for downstream point cloud analysis. Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at https://github.com/zhoujiahuan1991/ICCV2025-UPP.
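As a rough illustration of the rectification-prompt idea described above, the following PyTorch sketch predicts a per-point offset ("rectification vector") with a small MLP and adds it to the noisy coordinates. Module names, layer sizes, and the plain MLP encoder are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a point-level "rectification prompt"
# predicted by a small shared MLP and added to noisy input coordinates.
import torch
import torch.nn as nn

class RectificationPrompter(nn.Module):
    """Predicts a per-point rectification vector from per-point features."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.offset_head = nn.Linear(feat_dim, 3)  # rectification vector per point

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) noisy coordinates
        offsets = self.offset_head(self.encoder(points))   # (B, N, 3)
        return points + offsets                             # rectified points

if __name__ == "__main__":
    noisy = torch.randn(2, 1024, 3)            # two noisy point clouds
    rectified = RectificationPrompter()(noisy)
    print(rectified.shape)                      # torch.Size([2, 1024, 3])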
Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
Hiroyasu Akada
Max Planck Institute for Informatics, SIC
Jian Wang
Max Planck Institute for Informatics, SIC
Vladislav Golyanik
Max Planck Institute for Informatics, SIC
Christian Theobalt
Max Planck Institute for Informatics, SIC
Abstract
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and only option for some tasks, such as hand tracking, it remains unclear whether the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, for example when HMD users tilt their heads upward, a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods, due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. We also introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to frontal-only placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE).
Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Giacomo D'Amicantonio
Eindhoven University of Technology
Snehashis Majhi
INRIA
Quan Kong
Woven by Toyota
Lorenzo Garattoni
Woven by Toyota
Gianpiero Francesca
Woven by Toyota
François Brémond
INRIA
Egor Bondarev
Eindhoven University of Technology
Abstract
Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
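To make the temporal-Gaussian idea concrete, the sketch below builds soft frame-level targets by splatting 1D Gaussians centred on the highest-scoring frames. The top-k selection, the fixed width, and the max-combination are illustrative assumptions rather than the paper's exact loss.

# Illustrative sketch of 1D "temporal Gaussian splatting": frame-level pseudo
# targets are built by splatting Gaussians centred on the top-scoring frames.
import torch

def temporal_gaussian_targets(scores: torch.Tensor, k: int = 3, sigma: float = 4.0):
    # scores: (T,) per-frame anomaly scores from a weakly supervised model
    T = scores.numel()
    centers = scores.topk(k).indices.float()             # (k,) most anomalous frames
    t = torch.arange(T, dtype=torch.float32)             # (T,)
    # Splat one Gaussian per centre and take the per-frame maximum.
    gaussians = torch.exp(-0.5 * ((t[None, :] - centers[:, None]) / sigma) ** 2)
    return gaussians.max(dim=0).values                   # (T,) soft targets in [0, 1]

if __name__ == "__main__":
    raw = torch.rand(64)
    targets = temporal_gaussian_targets(raw)
    print(targets.shape, targets.max().item())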
MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
Pei An
Huazhong University of Science and Technology
Jiaqi Yang
Northwestern Polytechnical University
Muyao Peng
Huazhong University of Science and Technology
You Yang
Huazhong University of Science and Technology
Qiong Liu
Huazhong University of Science and Technology
Xiaolin Wu
Southwest Jiaotong University
Liangliang Nan
Delft University of Technology
Abstract
Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. Recently, the differentiable perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing projective constraints on 2D-3D correspondences. However, differentiable PnP is highly sensitive to noise and outliers in the predicted correspondences, which hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP to noise and outliers in correspondences, we propose an approximate blind PnP-based correspondence learning approach. To mitigate the high computational cost of blind PnP, we reformulate it as a more tractable problem: minimizing the Chamfer distance between learned 2D and 3D keypoints, referred to as MinCD-PnP. To effectively solve MinCD-PnP, we introduce a lightweight multi-task learning module, MinCD-Net, which can be easily integrated into existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio and registration recall in both cross-scene and cross-dataset settings. The source code is available at https://github.com/anpei96/mincdpnp-demo.
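A minimal sketch of the MinCD objective as described above: project learned 3D keypoints with the camera intrinsics and measure a symmetric Chamfer distance to learned 2D keypoints. The tensor shapes, pinhole projection, and random inputs are assumptions for illustration only.

# Hedged sketch of a Chamfer-distance objective between projected 3D keypoints
# and 2D keypoints (not the authors' exact formulation).
import torch

def project(pts3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # pts3d: (N, 3) points in the camera frame; K: (3, 3) intrinsics
    uvw = pts3d @ K.T                       # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]         # (N, 2) pixel coordinates

def chamfer_2d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (N, 2), b: (M, 2); symmetric nearest-neighbour Chamfer distance
    d = torch.cdist(a, b)                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

if __name__ == "__main__":
    K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    kpts3d = torch.rand(128, 3) * torch.tensor([2., 2., 5.]) + torch.tensor([-1., -1., 1.])
    kpts2d = torch.rand(96, 2) * torch.tensor([640., 480.])
    loss = chamfer_2d(project(kpts3d, K), kpts2d)
    print(loss.item())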
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
Dimitrije Antić
University of Amsterdam
Georgios Paschalidis
University of Amsterdam
Shashank Tripathi
Max Planck Institute for Intelligent Systems
Theo Gevers
University of Amsterdam
Sai Kumar Dwivedi
Max Planck Institute for Intelligent Systems
Dimitrios Tzionas
University of Amsterdam
Abstract
Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle to generalize to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. This has three key innovations: First, it uses a learned category-specific and morphable signed-distance-function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search on the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild. Code is available at https://anticdimi.github.io/sdfit.
Zero-shot Inexact CAD Model Alignment from a Single Image
Pattaramanee Arsomngern
VISTEC
Sasikarn Khwanmuang
VISTEC
Matthias Nießner
Technical University of Munich
Supasorn Suwajanakorn
VISTEC
Abstract
One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no scene-level pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensures multi-view consistency and overcomes the symmetry ambiguities inherent in foundation features, using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.
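For intuition, a self-supervised triplet objective of the kind mentioned above could pull together features of the same surface point seen from two views and push apart features of a symmetric counterpart. The sampling scheme, embedding size, and margin below are illustrative assumptions.

# Minimal sketch of a symmetry-aware triplet objective on foundation features.
import torch
import torch.nn.functional as F

def symmetry_triplet_loss(anchor, positive, negative, margin: float = 0.2):
    # anchor/positive: features of the same 3D point from two views
    # negative: features of a symmetric counterpart point; all (B, D)
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

if __name__ == "__main__":
    a, p, n = (torch.randn(32, 384) for _ in range(3))
    print(symmetry_triplet_loss(a, p, n).item())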
Less is More: Improving Motion Diffusion Models with Sparse Keyframes
Jinseok Bae
Dept. of Electrical and Computer Engineering, Seoul National University
Inwoo Hwang
Dept. of Electrical and Computer Engineering, Seoul National University
Young-Yoon Lee
Roblox
Ziyu Guo
CSE, The Chinese University of Hong Kong
Joseph Liu
Roblox
Yizhak Ben-Shabat
Roblox
Young Min Kim
Dept. of Electrical and Computer Engineering, Seoul National University
Mubbasir Kapadia
Roblox
Abstract
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
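As a toy illustration of keyframe masking, the sketch below keeps a sparse set of keyframes and linearly interpolates the missing frames between them; the paper's interpolation and dynamic mask refinement are more involved, so the shapes and the linear blend here are assumptions.

# Illustrative sketch: reconstruct dense motion from sparse keyframes by
# linear interpolation between consecutive keyframes.
import torch

def interpolate_from_keyframes(motion: torch.Tensor, key_idx: torch.Tensor) -> torch.Tensor:
    # motion: (T, D) dense pose features; key_idx: sorted keyframe indices incl. 0 and T-1
    out = torch.empty_like(motion)
    for a, b in zip(key_idx[:-1].tolist(), key_idx[1:].tolist()):
        w = torch.linspace(0.0, 1.0, b - a + 1).unsqueeze(1)   # (span+1, 1) blend weights
        out[a:b + 1] = (1 - w) * motion[a] + w * motion[b]      # linear blend of keyframes
    return out

if __name__ == "__main__":
    dense = torch.randn(60, 135)                    # e.g. 60 frames of pose features
    keys = torch.tensor([0, 14, 29, 44, 59])        # sparse keyframe indices
    recon = interpolate_from_keyframes(dense, keys)
    print(recon.shape)                              # torch.Size([60, 135])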
EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation
Jong-Hyeon Baek
Chungnam National University
Jiwon Oh
Chungnam National University
Yeong Jun Koh
Chungnam National University
Abstract
Video Object Segmentation (VOS) in low-light scenarios remains highly challenging due to significant texture loss and severe noise, which often lead to unreliable image feature generation and degraded segmentation performance. To address this issue, we propose EVOLVE, a novel multi-modal framework that integrates event-guided deformable feature transfer and dual-memory refinement for low-light VOS. EVOLVE addresses spatial misalignment between frames and improves object representation by utilizing event-driven cues. The event-guided deformable feature transfer (EDFT) module enhances feature alignment through event-driven deformable convolutions, where offsets derived from event features enable motion-aware spatial adjustments, leading to more precise propagation of object features in reference frames. Furthermore, the dual-memory object transformer (DMOT) iteratively refines object representations by maintaining and updating both image-based and event-based memory representations. Through its memory refinement module (MRM), DMOT selectively enhances relevant object features while suppressing background noise, resulting in stable and temporally coherent segmentation results. Extensive experiments on low-light VOS benchmarks demonstrate that EVOLVE achieves state-of-the-art segmentation performance, surpassing both event-based and image-based VOS methods in accuracy and computational efficiency. Code is available at https://github.com/whdgusdl48/EVOLVE.
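To illustrate event-guided deformable feature transfer, the sketch below predicts deformable-convolution offsets from event features and applies them to image features via torchvision's deform_conv2d. Channel sizes and the single-layer design are assumptions, not the EDFT module itself.

# Hedged sketch: offsets for a deformable convolution over image features are
# predicted from event features (motion-aware spatial adjustment).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class EventGuidedDeformTransfer(nn.Module):
    def __init__(self, img_ch: int = 64, evt_ch: int = 32, k: int = 3):
        super().__init__()
        self.offset_pred = nn.Conv2d(evt_ch, 2 * k * k, kernel_size=3, padding=1)
        self.weight = nn.Parameter(torch.randn(img_ch, img_ch, k, k) * 0.01)
        self.k = k

    def forward(self, img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(evt_feat)                  # (B, 2*k*k, H, W)
        return deform_conv2d(img_feat, offsets, self.weight, padding=self.k // 2)

if __name__ == "__main__":
    img = torch.randn(1, 64, 48, 64)
    evt = torch.randn(1, 32, 48, 64)
    out = EventGuidedDeformTransfer()(img, evt)
    print(out.shape)                                          # torch.Size([1, 64, 48, 64])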
FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
Yunpeng Bai
The University of Texas at Austin
Qixing Huang
The University of Texas at Austin
Abstract
Monocular Depth Estimation (MDE) is a fundamental 3D vision problem with numerous applications such as 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust and generalizable MDE remains challenging due to limited real-world labeled data and distribution gaps between synthetic datasets and real data. Existing methods often struggle on real-world test data, exhibiting low efficiency, reduced accuracy, and a lack of detail. To address these issues, we propose an efficient MDE approach named FiffDepth. The key feature of FiffDepth is its use of diffusion priors. It transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. FiffDepth preserves key generative features and integrates the strong generalization capabilities of models like DINOv2. Through benchmark evaluations, we demonstrate that FiffDepth achieves exceptional accuracy, stability, and fine-grained detail, offering significant improvements in MDE performance against state-of-the-art MDE approaches. The paper's source code is available at https://yunpeng1998.github.io/FiffDepth/.
RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion
Geonho Bang
Seoul National University
Minjae Seong
Hanyang University
Jisong Kim
Hanyang University
Geunju Baek
Seoul National University
Daye Oh
Hyundai Motor Company
Junhyung Kim
Hyundai Motor Company
Junho Koh
Hyundai Motor Company
Jun Won Choi
Seoul National University
Abstract
Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
Peijun Bao
Nanyang Technological University
Chenqi Kong
Nanyang Technological University
Siyuan Yang
Nanyang Technological University
Zihao Shao
Peking University
Xinghao Jiang
Shanghai Jiaotong University
Boon Poh Ng
Nanyang Technological University
Meng Hwa Er
Nanyang Technological University
Alex Kot
Nanyang Technological University
Abstract
Given a natural language query, temporal video grounding aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected in a scalable manner with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Group.
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
Vladislav Bargatin
Lomonosov Moscow State University
Egor Chistov
Lomonosov Moscow State University
Alexander Yakovenko
MSU Institute for Artificial Intelligence
Dmitriy Vatolin
MSU Institute for Artificial Intelligence
Abstract
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at: https://github.com/msu-video-group/memfof.
Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Luca Bartolomei
University of Bologna
Enrico Mannocci
University of Bologna
Fabio Tosi
University of Bologna
Matteo Poggi
University of Bologna
Stefano Mattoccia
University of Bologna
Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup that is even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either using a vanilla one such as Depth Anything v2 (DAv2) or deriving from it a novel recurrent architecture, to infer depth from monocular event cameras. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
What If: Understanding Motion Through Sparse Interactions
Stefan Andreas Baumann
CompVis @ LMU Munich
Nick Stracke
CompVis @ LMU Munich
Timy Phan
CompVis @ LMU Munich
Björn Ommer
CompVis @ LMU Munich
Abstract
Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed 'pokes'. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned on strongly out-of-distribution tasks, such as synthetic datasets, enabling significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, which further demonstrates the versatility of our FPT. Code and models are publicly available at compvis.github.io/flow-poke-transformer.
ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling
Radu Beche
Technical University of Cluj-Napoca
Sergiu Nedevschi
Technical University of Cluj-Napoca
Abstract
The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The data and code are available on the project page: rdbch.github.com/claravid.
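For context, delentropy is commonly computed as the Shannon entropy of the joint distribution of image gradients. The sketch below follows that common formulation (Larkin, 2016) with NumPy; the bin count, the 1/2 factor, and the grayscale input are assumptions, and the paper's exact DSP may differ.

# Hedged sketch of a delentropy-style complexity measure: entropy of the joint
# gradient histogram ("deldensity") of a grayscale image.
import numpy as np

def delentropy(gray: np.ndarray, bins: int = 256) -> float:
    # gray: 2D float image in [0, 1]
    fx = np.gradient(gray, axis=1)
    fy = np.gradient(gray, axis=0)
    hist, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins)
    p = hist / hist.sum()                         # joint gradient density
    p = p[p > 0]
    return -0.5 * float(np.sum(p * np.log2(p)))    # delentropy in bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.full((256, 256), 0.5)                # trivial scene -> low complexity
    textured = rng.random((256, 256))              # cluttered scene -> higher complexity
    print(delentropy(flat), delentropy(textured))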
FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
Yasser Benigmim
Inria
Mohammad Fahes
Inria
Tuan-Hung Vu
Inria
Andrei Bursuc
Inria
Raoul de Charette
Inria
Abstract
In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., a photo of a <class>, a sketch of a <class>). We investigate the impact of templates for OVSS and find that, for each class, there exist single-template classifiers, which we refer to as class-experts, that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS.
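A minimal sketch of the entropy-based selection idea: for each class, score every single-template classifier by the average prediction entropy of the unlabeled samples it assigns to that class, and keep the lowest-entropy template as the class-expert. The shapes and the per-class averaging below are illustrative assumptions, not FLOSS's exact procedure.

# Hedged sketch: pick one "class-expert" template per class by lowest entropy.
import torch

def select_class_experts(logits_per_template: torch.Tensor) -> torch.Tensor:
    # logits_per_template: (T, N, C) = templates x unlabeled samples x classes
    probs = logits_per_template.softmax(dim=-1)                     # (T, N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)       # (T, N)
    preds = probs.argmax(-1)                                        # (T, N)
    T, N, C = probs.shape
    experts = torch.empty(C, dtype=torch.long)
    for c in range(C):
        per_template = torch.stack([
            entropy[t][preds[t] == c].mean() if (preds[t] == c).any()
            else torch.tensor(float("inf"))
            for t in range(T)])
        experts[c] = per_template.argmin()       # best (lowest-entropy) template for class c
    return experts                               # (C,) template index per class

if __name__ == "__main__":
    logits = torch.randn(8, 500, 19)             # e.g. 8 templates, 500 samples, 19 classes
    print(select_class_experts(logits))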
AstroLoc: Robust Space to Ground Image Localizer
Gabriele Berton
Politecnico di Torino
Alex Stoken
Amazon
Carlo Masone
Politecnico di Torino
Abstract
Thousands of photos of Earth are taken every day by astronauts from the International Space Station. Localizing these photos, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, the goal is to find its most similar match among a large database of geotagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly-labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two objective functions: pairing astronaut photos with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography through unsupervised mining. AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, reaching a recall@100 consistently over 99% for existing datasets. Moreover, without fine-tuning, AstroLoc provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
Scene Coordinate Reconstruction Priors
Wenjing Bian
University of Oxford
Axel Barroso-Laguna
Niantic Spatial
Tommaso Cavallari
Niantic Spatial
Victor Adrian Prisacariu
University of Oxford
Eric Brachmann
Niantic Spatial
Abstract
Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints, SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets, our priors help learn better scene representations, resulting in more coherent scene point clouds, higher registration rates, and better camera poses, with a positive effect on downstream tasks such as novel view synthesis and camera relocalization.
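As a rough illustration of a simple depth prior of the kind mentioned above, the sketch below transforms predicted scene coordinates into the camera frame and penalizes depths outside a plausible range during SCR training. The soft range and the hinge penalty are illustrative choices, not the paper's priors.

# Hedged sketch: a simple depth-range prior on predicted scene coordinates.
import torch

def depth_prior_loss(scene_coords, cam_T_world, d_min=0.1, d_max=10.0):
    # scene_coords: (N, 3) predicted world-space points for one training image
    # cam_T_world: (4, 4) pose transforming world coordinates into the camera frame
    h = torch.cat([scene_coords, torch.ones_like(scene_coords[:, :1])], dim=1)  # (N, 4)
    depth = (h @ cam_T_world.T)[:, 2]                  # z in the camera frame
    too_close = (d_min - depth).clamp_min(0.0)
    too_far = (depth - d_max).clamp_min(0.0)
    return (too_close + too_far).mean()                # zero inside the plausible range

if __name__ == "__main__":
    coords = torch.randn(2048, 3) * 5.0
    pose = torch.eye(4)
    print(depth_prior_loss(coords, pose).item())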
Hyper-Depth: Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
Lin Bie
Tsinghua University
Siqi Li
Tsinghua University
Yifan Feng
Tsinghua University
Yue Gao
Tsinghua University
Abstract
Monocular depth estimation (MDE) is a fundamental problem in computer vision with wide-ranging applications in various downstream tasks. While multi-scale features are perceptually critical for MDE, existing transformer-based approaches have yet to leverage them explicitly. To address this limitation, we propose a hypergraph-based multi-scale representation fusion framework, Hyper-Depth. The proposed Hyper-Depth incorporates two key components: a semantic consistency enhancement (SCE) module and a geometric consistency constraint (GCC) module. The SCE module, designed based on hypergraph convolution, aggregates global information and enhances the representation of multi-scale patch features. Meanwhile, the GCC module provides geometric guidance to reduce over-fitting errors caused by excessive reliance on local features. In addition, we introduce a correlation-based conditional random fields (C-CRFs) module as the decoder to filter correlated patches and compute attention weights more effectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving improvements of 6.21% and 3.32% on the main metric RMSE, respectively. Furthermore, zero-shot evaluations on the nuScenes and SUN-RGBD datasets validate the generalizability of our method.
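For reference, the SCE module builds on hypergraph convolution; the sketch below implements the textbook spectral operator (in the spirit of HGNN) over patch tokens, with a randomly generated incidence matrix for illustration. This is the standard operator, not the paper's exact module.

# Minimal sketch of a standard hypergraph convolution:
#   X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta   (uniform hyperedge weights)
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node (patch) features; H: (N, E) binary node-hyperedge incidence
        Dv = H.sum(dim=1).clamp_min(1.0)              # node degrees (N,)
        De = H.sum(dim=0).clamp_min(1.0)              # hyperedge degrees (E,)
        Dv_inv_sqrt = Dv.pow(-0.5)
        x = Dv_inv_sqrt.unsqueeze(1) * x
        x = H.T @ x                                   # aggregate nodes -> hyperedges
        x = x / De.unsqueeze(1)
        x = H @ x                                     # scatter hyperedges -> nodes
        x = Dv_inv_sqrt.unsqueeze(1) * x
        return self.theta(x)

if __name__ == "__main__":
    feats = torch.randn(196, 64)                      # e.g. 14x14 patch tokens
    H = (torch.rand(196, 32) > 0.9).float()           # random incidence for illustration
    print(HypergraphConv(64, 64)(feats, H).shape)     # torch.Size([196, 64])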
RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis
Hugo Blanc
Mines Paris, PSL University
Jean-Emmanuel Deschaud
Mines Paris, PSL University
Alexis Paljic
Mines Paris, PSL University
Abstract
RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. The associated code is available at: github.com/hugobl1/raygaussx.
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Simon Boeder
Robert Bosch GmbH
Fabian Gigengack
Robert Bosch GmbH
Benjamin Risse
University of Münster
Abstract
Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the need for expensive 3D convolutions used with inefficient voxel-based representations that predominantly represent empty 3D spaces. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than current SOTA.
Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
Sarosij Bose
University of California, Riverside
Arindam Dutta
University of California, Riverside
Sayak Nag
University of California, Riverside
Junge Zhang
University of California, Riverside
Jiachen Li
University of California, Riverside
Konstantinos Karydis
University of California, Riverside
Amit K. Roy-Chowdhury
University of California, Riverside
Abstract
Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior, in the form of a pretrained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTIv2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods. The project page is available at: https://github.com/UCR-Vision-andLearning-Group/UAR-Scenes
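A minimal sketch of the per-pixel uncertainty idea described above: compute the entropy of per-pixel semantic class probabilities for a generated view and keep only the most confident pixels. The quantile threshold and input shapes are illustrative assumptions, not the paper's module.

# Hedged sketch: entropy-based confidence mask over a generated view.
import torch

def confidence_mask(sem_logits: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    # sem_logits: (C, H, W) per-pixel semantic logits for one generated view
    probs = sem_logits.softmax(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=0)    # (H, W)
    thresh = torch.quantile(entropy.flatten(), keep_ratio)
    return entropy <= thresh                                        # True = confident pixel

if __name__ == "__main__":
    logits = torch.randn(19, 120, 160)
    mask = confidence_mask(logits)
    print(mask.float().mean().item())      # roughly 0.7 of pixels kept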
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
Mohamed El Amine Boudjoghra
Technical University of Munich
Ivan Laptev
MBZUAI
Angela Dai
Technical University of Munich
Abstract
With the fast pace of 3D capture technology and the resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage the reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation, ScanEdit outperforms the state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions. Our code is available at aminebdj.github.io/scanedit
Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
Pierre-André Brousseau
Université de Montréal
Sébastien Roy
Université de Montréal
Abstract
Absolute depth estimation from a single-camera sequence of images is a relevant task, given that mobile machines increasingly rely on vision to navigate. Deep learning for stereo matching has been demonstrated to improve performance for stereo-rectified depth estimation, but these methods require straightforward left-right camera setups. This work proposes to introduce deep stereo matching to two views of a monocular image sequence obtained from a camera in motion in a static scene. This paper introduces a novel and principled spherical epipolar rectification model, which handles all camera motions. This rectification model is differentiable and allows self-supervised deep stereo matching algorithms to compute disparity and recover depth, given a known camera pose. This paper also introduces a spherical crop operation which limits the rectified image size and allows for competitive absolute depth estimation performance. This results in a spherical rectification model that is demonstrated to provide metric depth and compete favorably with current state-of-the-art monocular depth estimators. The code is available at https://gitlab.com/labv3d/spherical-stereo.git.
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
Leonard Bruns
KTH Royal Institute of Technology
Axel Barroso-Laguna
Niantic Spatial
Tommaso Cavallari
Niantic Spatial
Áron Monszpart
Third Dimension AI
Sowmya Munukutla
Niantic Spatial
Victor Adrian Prisacariu
University of Oxford
Eric Brachmann
Niantic Spatial
Abstract
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
Elena Bueno-Benito
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Mariella Dimiccoli
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Abstract
Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions about action ordering and can decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, limiting the effectiveness of feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework with a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, by integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
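For readers unfamiliar with the OT machinery involved, the sketch below solves a plain entropic OT problem (Sinkhorn iterations) to obtain a soft coupling between frame embeddings and action-label embeddings, from which pseudo-labels can be read off. The cost matrix, uniform marginals, and temperature are illustrative; CLOT's structured and temporally consistent variants are not reproduced here.

# Minimal entropic-OT (Sinkhorn) sketch of a frame-to-action coupling.
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    # cost: (T, K) frame-to-action-label cost; returns a soft coupling (T, K)
    T, K = cost.shape
    mu = torch.full((T,), 1.0 / T)         # uniform mass over frames
    nu = torch.full((K,), 1.0 / K)         # uniform mass over action labels
    Kmat = torch.exp(-cost / eps)
    u = torch.ones(T)
    for _ in range(iters):
        v = nu / (Kmat.T @ u)
        u = mu / (Kmat @ v)
    return u.unsqueeze(1) * Kmat * v.unsqueeze(0)

if __name__ == "__main__":
    frame_emb = torch.nn.functional.normalize(torch.randn(300, 128), dim=1)
    action_emb = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
    coupling = sinkhorn(1.0 - frame_emb @ action_emb.T)
    pseudo_labels = coupling.argmax(dim=1)   # (300,) one action label per frame
    print(coupling.sum().item(), pseudo_labels.shape)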
Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection
Marvin Burges
TU Wien
Philipe Ambrozio Dias
Oak Ridge National Laboratory
Carson Woody
Oak Ridge National Laboratory
Sarah Walters
Oak Ridge National Laboratory
Dalton Lunga
Oak Ridge National Laboratory
Abstract
Object detection in remote sensing demands extensive, high-quality annotations, a process that is both labor-intensive and time-consuming. In this work, we introduce a real-time active learning and semi-automated labeling framework that leverages foundation models to streamline dataset annotation for object detection in remote sensing imagery. For example, by integrating a Segment Anything Model (SAM), our approach generates mask-based bounding boxes that serve as the basis for dual sampling: (a) uncertainty estimation to pinpoint challenging samples, and (b) diversity assessment to ensure broad data coverage. Furthermore, our Dynamic Box Switching Module (DBS) addresses the well-known cold-start problem for object detection models by replacing their suboptimal initial predictions with SAM-derived masks, thereby enhancing early-stage localization accuracy. Extensive evaluations on multiple remote sensing datasets, along with a real-world user study, demonstrate that our framework not only reduces annotation effort but also significantly boosts detection performance compared to traditional active learning sampling methods. The code for training and the user interface is available at https://github.com/mburgescvl/ICCV_AL4FM.
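As a rough illustration of dual sampling, the sketch below ranks unlabeled tiles by detector uncertainty and enforces diversity by selecting at most one tile per feature cluster. The uncertainty score, k-means clustering, and budget are assumptions for illustration, not the paper's sampling strategy.

# Illustrative sketch: uncertainty + diversity selection of tiles to annotate.
import numpy as np
from sklearn.cluster import KMeans

def dual_sampling(confidences, embeddings, budget: int = 10):
    # confidences: (N,) max detection confidence per unlabeled tile
    # embeddings: (N, D) image-level feature embeddings of the tiles
    uncertainty = 1.0 - np.asarray(confidences)          # higher = harder tile
    clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(embeddings)
    picked = []
    for c in range(budget):                              # one tile per cluster
        members = np.where(clusters == c)[0]
        picked.append(members[np.argmax(uncertainty[members])])
    return sorted(picked)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.random(200)
    feats = rng.normal(size=(200, 64))
    print(dual_sampling(conf, feats, budget=5))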
SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM
Yannick Burkhardt
Technical University of Munich
Simon Schaefer
Technical University of Munich
Stefan Leutenegger
Technical University of Munich
Abstract
Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state of the art in event-based SLAM by a wide margin. Source code is available at ethz-mrl.github.io/SuperEvent.
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Kwon Byung-Ki
POSTECH
Qi Dai
Microsoft Research Asia
Lee Hyoseok
POSTECH
Chong Luo
Microsoft Research Asia
Tae-Hyun Oh
KAIST
Abstract
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, namely, adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, by simply controlling the timesteps of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation.
Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection
Xinhao Cai
Nanjing University of Science and Technology
Qiuxia Lai
Communication University of China
Gensheng Pei
Nanjing University of Science and Technology
Xiangbo Shu
Nanjing University of Science and Technology
Yazhou Yao
State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery
Wenguan Wang
Zhejiang University
Zhinan Yu
Nanjing University of Science and Technology
Bo Du
School of Computer Science, Wuhan University
Abstract
In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. The key of GDCC lies in the inherent duality between the two tasks, where L2I takes all object boxes and labels as input conditions to generate images, and OD maps images back to these layout conditions. Specifically, in GDCC, L2I generation is guided by a layout translation cycle loss, ensuring that the layouts used to generate images align with those predicted from the synthesized images. Similarly, OD benefits from an image translation cycle loss, which enforces consistency between the synthesized images fed into the detector and those generated from predicted layouts. While current L2I and OD tasks benefit from large-scale annotated layout-image pairs, our GDCC enables more efficient use of auto-synthesized data, thereby further enhancing data efficiency. It is worth noting that our GDCC framework is computationally efficient thanks to the perturbative single-step sampling strategy and a priority timestep re-sampling strategy during training. Besides, GDCC preserves the architectures of L2I, OD models, and the generation pipeline within the framework, thus maintaining the original inference speed. Extensive experiments demonstrate that GDCC significantly improves the controllability of diffusion models and the accuracy of object detectors.
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
Yihan Cao
National University of Defense Technology
Zheng Qin
Defense Innovation Institute, Academy of Military Sciences
Jiazhao Zhang
Peking University
Qin Zou
Wuhan University
Zhinan Yu
National University of Defense Technology
Bo Du
Wuhan University
Shuzhen Liu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav, by at least a relative 14% over the state of the art.
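To illustrate the finite-state-machine framing, the sketch below defines a few cognitive states, restricts legal transitions, and lets a (stubbed) LLM propose the next state from a serialized cognitive-map summary. The state names are loosely paraphrased from the abstract's "exploration to identification" range, and ask_llm is a placeholder, not the paper's prompting scheme.

# Hedged sketch: an FSM over cognitive states with LLM-proposed transitions.
from enum import Enum, auto

class CogState(Enum):
    BROAD_EXPLORATION = auto()
    CONTEXTUAL_SEARCH = auto()
    CANDIDATE_VERIFICATION = auto()
    TARGET_IDENTIFICATION = auto()

ALLOWED = {
    CogState.BROAD_EXPLORATION: {CogState.BROAD_EXPLORATION, CogState.CONTEXTUAL_SEARCH},
    CogState.CONTEXTUAL_SEARCH: {CogState.CONTEXTUAL_SEARCH, CogState.CANDIDATE_VERIFICATION},
    CogState.CANDIDATE_VERIFICATION: {CogState.CONTEXTUAL_SEARCH, CogState.TARGET_IDENTIFICATION},
    CogState.TARGET_IDENTIFICATION: {CogState.TARGET_IDENTIFICATION},
}

def ask_llm(state: CogState, cognitive_map_summary: str) -> CogState:
    # Stub: a real system would prompt an LLM with the current state and a
    # serialized cognitive map, then parse the proposed next state.
    return CogState.CONTEXTUAL_SEARCH if "sofa" in cognitive_map_summary else state

def step(state: CogState, cognitive_map_summary: str) -> CogState:
    proposal = ask_llm(state, cognitive_map_summary)
    return proposal if proposal in ALLOWED[state] else state   # reject illegal transitions

if __name__ == "__main__":
    s = CogState.BROAD_EXPLORATION
    s = step(s, "observed: sofa, table; frontier: hallway")
    print(s)   # CogState.CONTEXTUAL_SEARCH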
IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
Zhe Cao
Beijing Institute of Technology
Jin Zhang
Beijing Institute of Technology
Ruiheng Zhang
Beijing Institute of Technology
Abstract
Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, they rely on synthetic infrared images generated through style transfer from visible images, which limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from the visible to the infrared domain by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
MotionCtrl: A Real-time Controllable Vision-Language-Motion Model
Bin Cao
Institute of Automation, Chinese Academy of Sciences
Sipeng Zheng
BeingBeyond
Ye Wang
Renmin University of China
Lujie Xia
Peking University
Qianshan Wei
Southeast University
Qin Jin
Renmin University of China
Jing Liu
Institute of Automation, Chinese Academy of Sciences
Zongqing Lu
Peking University
Abstract
Human motion generation involves synthesizing coherent human motion sequences conditioned on diverse multimodal inputs and holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators.
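A minimal sketch of what a part-aware residual quantizer could look like, assuming pooled per-part motion features; the codebook sizes, depth, and part split are illustrative assumptions, not MotionCtrl's exact configuration.

```python
# Each body part gets its own stack of residual codebooks; residuals are
# quantized depth-by-depth and the chosen indices become motion tokens.
import torch
import torch.nn as nn

class PartAwareResidualQuantizer(nn.Module):
    def __init__(self, num_parts=6, depth=4, codebook_size=256, dim=64):
        super().__init__()
        self.codebooks = nn.ParameterList(
            nn.Parameter(torch.randn(depth, codebook_size, dim))
            for _ in range(num_parts)
        )

    def forward(self, part_feats):
        # part_feats: (batch, num_parts, dim) motion features split by body part
        tokens, quantized = [], torch.zeros_like(part_feats)
        for p, books in enumerate(self.codebooks):
            residual = part_feats[:, p]
            for level in books:                       # level: (codebook_size, dim)
                dists = torch.cdist(residual, level)  # (batch, codebook_size)
                idx = dists.argmin(dim=-1)
                chosen = level[idx]
                quantized[:, p] = quantized[:, p] + chosen
                residual = residual - chosen
                tokens.append(idx)
        # tokens: one index per (part, depth) level
        return quantized, torch.stack(tokens, dim=-1)
```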
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Jin Cao
Zhejiang University
Hongrui Wu
Tongji University
Ziyong Feng
DeepGlint
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. The code will be released for reproducibility.
Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Yihong Cao
Hunan University
Jiaming Zhang
Karlsruhe Institute of Technology
Xu Zheng
HKUST(GZ)
Hao Shi
Zhejiang University
Kunyu Peng
Karlsruhe Institute of Technology
Abstract
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
Visual Relation Diffusion for Human-Object Interaction Detection
Ping Cao
Beijing Jiaotong University
Yepeng Tang
Beijing Jiaotong University
Chunjie Zhang
Beijing Jiaotong University
Xiaolong Zheng
Chinese Academy of Sciences
Chao Liang
Wuhan University
Yunchao Wei
Beijing Jiaotong University
Yao Zhao
Beijing Jiaotong University
Abstract
Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.
Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection
Yujeong Chae
Korea Advanced Institute of Science and Technology
Heejun Park
Korea Advanced Institute of Science and Technology
Hyeonseong Kim
Korea Advanced Institute of Science and Technology
Kuk-Jin Yoon
Korea Advanced Institute of Science and Technology
Abstract
Robust 3D object detection across diverse weather conditions is crucial for safe autonomous driving, and RADAR is increasingly leveraged for its resilience in adverse weather. Recent advancements have explored 4D RADAR and LiDAR-RADAR fusion to enhance 3D perception capabilities, specifically targeting weather robustness. However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLRFusion) framework for robust 3D object detection. We introduce a multi-path iterative interaction module that integrates LiDAR, RADAR power, and Doppler, enabling a structured feature fusion process. Doppler highlights dynamic regions, refining RADAR power and enhancing LiDAR features across multiple stages, improving detection confidence. Extensive experiments on the K-RADAR dataset demonstrate that our approach effectively exploits Doppler information, achieving state-of-the-art performance in both normal and adverse weather conditions.
GaussRender: Learning 3D Occupancy with Gaussian Rendering
Loïck Chambon
ValeoAI, Paris, France
Eloi Zablocki
Sorbonne University, Paris, France
Alexandre Boulch
Sorbonne University, Paris, France
Mickaël Chen
Hcompany.ai, Paris, France
Matthieu Cord
ValeoAI, Paris, France
Abstract
Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from visual inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce visible geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics such as RayIoU. The code is open-sourced at https://github.com/valeoai/GaussRender.
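The projective-consistency idea can be illustrated with a simplified rasterizer that projects voxel centers of predicted and ground-truth occupancy into a camera view and compares the resulting 2D semantic maps; the actual method uses differentiable Gaussian splatting, so this nearest-pixel sketch only shows the loss structure, and all shapes and thresholds are assumptions.

```python
import torch
import torch.nn.functional as F

def splat_semantics(sem, centers, K, T, hw):
    """sem: (N, C) per-voxel class scores (logits or one-hot), centers: (N, 3)
    voxel centers in world coords, K: (3, 3) intrinsics, T: (4, 4) world-to-camera."""
    h, w = hw
    cam = (T[:3, :3] @ centers.T + T[:3, 3:]).T          # world -> camera coords
    valid = cam[:, 2] > 0.1                              # keep voxels in front of the camera
    uv = (K @ cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).long()                 # perspective divide, pixel indices
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    canvas = torch.zeros(h, w, sem.shape[1])
    canvas[uv[inside, 1], uv[inside, 0]] = sem[valid][inside]
    return canvas

def projective_consistency_loss(pred_sem, gt_sem, centers, K, T, hw=(90, 160)):
    # Supervise the predicted occupancy through its 2D projection.
    return F.l1_loss(splat_semantics(pred_sem, centers, K, T, hw),
                     splat_semantics(gt_sem, centers, K, T, hw))
```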
SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
Sixian Chan
Zhejiang University of Technology, China
Zedong Li
Zhejiang University of Technology, China
Wenhao Li
Nanyang Technological University, Singapore
Shijian Lu
Nanyang Technological University, Singapore
Chunhua Shen
Zhejiang University, China
Xiaoqin Zhang
Zhejiang University of Technology, China
Abstract
Multi-modal object tracking has emerged as a significant research focus in computer vision due to its robustness in complex environments, such as exposure variations, blur, and occlusions. Despite existing studies integrating supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, this approach exhibits a critical limitation: it inherently prioritizes RGB information as the dominant modality, thereby underutilizing the complementary information of alternative modalities. To address this fundamental limitation, we present SMSTracker, an innovative tri-path score mask sigma fusion framework for multi-modal tracking, comprising three key modules. Firstly, we design a tri-path Score Mask Fusion (SMF) module to evaluate and quantify the reliability of each modality, allowing optimal exploitation of complementary features between modalities. Secondly, we introduce a pioneering Sigma Interaction (SGI) module to facilitate a sophisticated fusion of modal features across tri-branches. Furthermore, we propose a Drop Key Fine-tuning (DKF) strategy to address the inherent challenge of unequal data contribution in multi-modal learning scenarios, thereby enhancing the model's capacity for comprehensive multi-modal information processing. Finally, extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets demonstrate the significant performance improvements achieved by SMSTracker over existing state-of-the-art methods. Code and model are available at https://github.com/Leezed525/SMSTracker.
Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
Haochen Chang
School of Systems Science and Engineering, Sun Yat-sen University
Pengfei Ren
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Haoyang Zhang
Defense Innovation Institute, Academy of Military Sciences
Liang Xie
Defense Innovation Institute, Academy of Military Sciences
Hongbo Chen
School of Systems Science and Engineering, Sun Yat-sen University
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences
Abstract
In recent years, skeleton-based action recognition has gained significant attention due to its robustness in varying environmental conditions. However, most existing methods struggle to distinguish fine-grained actions due to subtle motion features and minimal inter-class variation, and they often fail to consider the underlying similarity relationships between action classes. To address these limitations, we propose a Hierarchical-aware Orthogonal Disentanglement framework (HiOD). We disentangle coarse-grained and fine-grained features by employing independent spatial-temporal granularity-aware bases, which encode semantic representations at varying levels of granularity. Additionally, we design a cross-granularity feature interaction mechanism that leverages complementary information between coarse-grained and fine-grained features. We further enhance the learning process through hierarchical prototype contrastive learning, which utilizes the parent class hierarchy to guide the learning of coarse-grained features while ensuring the distinguishability of fine-grained features within child classes. Extensive experiments on FineGYM, FSD-10, NTU RGB+D, and NTU RGB+D 120 datasets demonstrate the superiority of our method in fine-grained action recognition tasks.
LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
Wei-Jer Chang
UC Berkeley
Wei Zhan
UC Berkeley
Masayoshi Tomizuka
UC Berkeley
Manmohan Chandraker
NEC Labs America
Francesco Pittaluga
NEC Labs America
Abstract
Evaluating autonomous vehicles with controllability allows for scalable testing in counterfactual or structured settings, improving both efficiency and safety. We introduce LANGTRAJ, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LANGTRAJ enables flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that rely on domain-specific guidance functions, LANGTRAJ incorporates language conditioning during training for more intuitive traffic simulation control. In addition, we propose a novel closed-loop training strategy for diffusion models to enhance realism in closed-loop simulation. To support language-conditioned simulation, we develop a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, which we use to develop INTERDRIVE, a large-scale dataset offering diverse and interactive labels for training language-conditioned diffusion models. Validated on the Waymo Motion Dataset, LANGTRAJ demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing. Project website: https://langtraj.github.io/.
Learning Neural Scene Representation from iToF Imaging
Wenjie Chang
University of Science and Technology of China
Hanzhi Chang
University of Science and Technology of China
Yueyi Zhang
Miromind
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Abstract
Indirect Time-of-Flight (iToF) cameras are popular for 3D perception because they are cost-effective and easy to deploy. They emit modulated infrared signals to illuminate the scene and process the received signals to generate amplitude and phase images. The depth is calculated from the phase using the modulation frequency. However, the obtained depth often suffers from noise caused by multi-path interference, low signal-to-noise ratio (SNR), and depth wrapping. Building on recent advancements in neural scene representations, which have shown great potential in 3D modeling from multi-view RGB images, we propose leveraging this approach to reconstruct 3D representations from noisy iToF data. Our method utilizes the multi-view consistency of amplitude and phase maps, fusing information from all input views to generate an accurate scene representation. Considering the impact of infrared illumination, we propose a new rendering scheme for amplitude maps based on a signed distance function (SDF) and introduce a neural lighting function to model the appearance variations caused by active illumination. We also incorporate a phase-guided sampling strategy and a wrapping-aware phase-to-depth loss to utilize raw phase information and mitigate depth wrapping. Additionally, we add a noise-weight loss to prevent excessive smoothing of information across noisy multi-view measurements. Experiments conducted on synthetic and real-world datasets demonstrate that the proposed method outperforms state-of-the-art techniques.
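For reference, the phase-to-depth relation and the wrapping ambiguity mentioned above follow the standard iToF model; the modulation frequency below is only an illustrative value.

```python
# Worked example of the iToF phase-to-depth relation:
# depth = c * phase / (4 * pi * f_mod), with wrapping every c / (2 * f_mod) metres.
import numpy as np

C = 299_792_458.0          # speed of light, m/s

def phase_to_depth(phase, f_mod):
    """Depth from wrapped phase in [0, 2*pi) at modulation frequency f_mod (Hz)."""
    return C * phase / (4.0 * np.pi * f_mod)

def unambiguous_range(f_mod):
    """Maximum depth before wrapping occurs."""
    return C / (2.0 * f_mod)

f_mod = 20e6                               # 20 MHz modulation (illustrative)
print(unambiguous_range(f_mod))            # ~7.49 m before depth wrapping
print(phase_to_depth(np.pi, f_mod))        # ~3.75 m for a half-cycle phase shift
```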
ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
Dubing Chen
SKL-IOTSC, CIS, University of Macau
Jin Fang
SKL-IOTSC, CIS, University of Macau
Wencheng Han
SKL-IOTSC, CIS, University of Macau
Xinjing Cheng
Junbo Yin
CEMSE Division, King Abdullah University of Science and Technology
Chenzhong Xu
SKL-IOTSC, CIS, University of Macau
Fahad Shahbaz Khan
Mohamed bin Zayed University of Artificial Intelligence
Jianbing Shen
SKL-IOTSC, CIS, University of Macau
Abstract
3D semantic occupancy and flow prediction are fundamental to spatiotemporal scene understanding. This paper proposes a vision-based framework with three targeted improvements. First, we introduce an occlusion-aware adaptive lifting mechanism incorporating depth denoising. This enhances the robustness of 2D-to-3D feature transformation while mitigating reliance on depth priors. Second, we enforce 3D-2D semantic consistency via jointly optimized prototypes, using confidence- and category-aware sampling to address the long-tailed class problem. Third, to streamline joint prediction, we devise a BEV-centric cost volume to explicitly correlate semantic and flow features, supervised by a hybrid classification-regression scheme that handles diverse motion scales. Our purely convolutional architecture establishes new SOTA performance on multiple benchmarks for both semantic occupancy prediction and joint occupancy-flow prediction. We also present a family of models offering a spectrum of efficiency-performance trade-offs. Our real-time version exceeds all existing real-time methods in speed and accuracy, ensuring its practical viability.
AutoScape: Geometry-Consistent Long-Horizon Scene Generation
Jiacheng Chen
Simon Fraser University
Ziyu Jiang
NEC Labs America
Mingfu Liang
Northwestern University
Bingbing Zhuang
NEC Labs America
Jong-Chyi Su
UC San Diego
Sparsh Garg
UC San Diego
Ying Wu
Northwestern University
Manmohan Chandraker
UC San Diego
Abstract
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively. Project page: https://auto-scape.github.io.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Weirong Chen
TU Munich
Ganlin Zhang
TU Munich
Felix Wimbauer
TU Munich
Rui Wang
Microsoft
Nikita Araslanov
TU Munich
Andrea Vedaldi
University of Oxford
Daniel Cremers
TU Munich
Abstract
Traditional SLAM systems, which rely on bundle adjustment, struggle with the highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, while the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements. We further ensure depth consistency across video frames with lightweight postprocessing based on scale maps. Our framework combines the core of traditional SLAM, bundle adjustment, with a robust learning-based 3D tracker. Integrating motion decomposition, bundle adjustment, and depth refinement, our unified framework, BA-Track, accurately tracks camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
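The motion decomposition at the core of this idea can be sketched as follows for a single tracked 3D point; the pose convention and toy numbers are assumptions for illustration, not the BA-Track implementation.

```python
# Split the observed displacement of a tracked point into a camera-induced
# part (what a static point would exhibit under the relative pose) and a
# residual part attributed to object motion.
import numpy as np

def decompose_track_motion(p_cam_t0, p_cam_t1, T_t1_t0):
    """p_cam_t0, p_cam_t1: (3,) point in camera coords at t0 and t1.
    T_t1_t0: (4, 4) rigid transform mapping camera-t0 coords to camera-t1 coords."""
    p_static = T_t1_t0[:3, :3] @ p_cam_t0 + T_t1_t0[:3, 3]   # camera-induced only
    camera_motion = p_static - p_cam_t0
    object_motion = p_cam_t1 - p_static                      # residual dynamic motion
    return camera_motion, object_motion

# A point 5 m ahead; the camera moves 0.1 m forward; the point also moves 0.05 m right.
T = np.eye(4); T[2, 3] = -0.1
cam, obj = decompose_track_motion(np.array([0.0, 0.0, 5.0]),
                                  np.array([0.05, 0.0, 4.9]), T)
print(cam)   # [ 0.   0.  -0.1]  apparent motion from camera movement
print(obj)   # [ 0.05 0.   0. ]  true object motion
```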
CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
Peiqi Chen
Wuhan University
Lei Yu
Ant Group
Yi Wan
Wuhan University
Yingying Pei
Wuhan University
Xinyi Liu
Wuhan University
Yongxiang Yao
Wuhan University
Yingying Zhang
Ant Group
Lixiang Ru
Ant Group
Liheng Zhong
Ant Group
Jingdong Chen
Ant Group
Ming Yang
Ant Group
Yongjun Zhang
Wuhan University
Abstract
Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of ∼2.2x at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
Jie Chen
University of Science and Technology of China
Zhangchi Hu
University of Science and Technology of China
Peixi Wu
University of Science and Technology of China
Huyue Zhu
University of Science and Technology of China
Hebei Li
University of Science and Technology of China
Xiaoyan Sun
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multi-resolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
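A compact sketch of a multi-resolution 4D hash encoder over (x, y, z, t) in the spirit described above; the hash primes, table size, and nearest-corner lookup (no interpolation) are simplifications and assumptions, not DASH's exact design.

```python
import torch
import torch.nn as nn

# Large primes commonly used for spatial hashing; the exact choice is an assumption here.
PRIMES = (1, 2654435761, 805459861, 3674653429)

class Hash4DEncoder(nn.Module):
    def __init__(self, levels=8, table_size=2**17, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.tables = nn.Parameter(torch.randn(levels, table_size, feat_dim) * 1e-2)
        self.res = [int(base_res * growth**l) for l in range(levels)]
        self.table_size = table_size

    def forward(self, xyzt):
        # xyzt: (N, 4) coordinates normalized to [0, 1]; nearest corner only,
        # so multi-linear interpolation is omitted for brevity.
        feats = []
        for l, r in enumerate(self.res):
            idx = (xyzt * r).long()                     # (N, 4) grid corner indices
            h = idx[:, 0] * PRIMES[0]
            for d in range(1, 4):
                h = h ^ (idx[:, d] * PRIMES[d])         # XOR-combine hashed axes
            h = h % self.table_size
            feats.append(self.tables[l][h])             # (N, feat_dim)
        return torch.cat(feats, dim=-1)                 # (N, levels * feat_dim)
```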
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen
HKISI, CAS
Yuqi Wang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Zhaoxiang Zhang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Abstract
World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning in the VQ token space for the first time, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
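The interleaved image/action "language" can be illustrated with a small helper that flattens per-frame VQ tokens and discretized action tokens into one sequence for next-token prediction; the vocabulary sizes and offset scheme are assumptions for illustration.

```python
# Build a single token sequence alternating between frame tokens and action
# tokens so a decoder-only transformer can model both with one objective.
from typing import List

IMG_VOCAB = 8192          # assumed VQ codebook size for image tokens
ACT_VOCAB = 256           # assumed number of discretized action bins

def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[int]:
    """frames[t]: VQ token ids for frame t; actions[t]: discretized action ids.
    Action tokens are shifted past the image vocabulary to share one id space."""
    assert len(frames) == len(actions)
    seq = []
    for img_tokens, act_tokens in zip(frames, actions):
        seq.extend(img_tokens)                          # observe frame t
        seq.extend(a + IMG_VOCAB for a in act_tokens)   # then act at step t
    return seq

# Two timesteps with 4 image tokens and 2 action tokens each (toy numbers).
sequence = interleave([[5, 9, 13, 2], [7, 7, 1, 0]], [[3, 10], [4, 11]])
# The transformer is trained to predict sequence[1:] from sequence[:-1].
```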
EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
Yixiang Chen
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Peiyan Li
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Yan Huang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jiabing Yang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Kehan Chen
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Liang Wang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Abstract
Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) over prior object-centric flow methods. More results can be found on our project website: https://ec-flow1.github.io.
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Xingyu Chen
Zhejiang University
Yue Chen
Zhejiang University
Yuliang Xiu
Westlake University
Andreas Geiger
University of Tübingen, Tübingen AI Center
Anpei Chen
University of Tübingen, Tübingen AI Center
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets.
EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Lu Chen
State Key Lab of CAD&CG, Zhejiang University
Yizhou Wang
The Chinese University of Hong Kong
Shixiang Tang
The Chinese University of Hong Kong
Qianhong Ma
Shanghai Jiao Tong University
Tong He
Shanghai Artificial Intelligence Laboratory
Wanli Ouyang
The Chinese University of Hong Kong
Xiaowei Zhou
State Key Lab of CAD&CG, Zhejiang University
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Sida Peng
State Key Lab of CAD&CG, Zhejiang University
Abstract
Learning an agent model that behaves like humans, capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective, is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
Xuehan Chen
Xi'an Jiaotong-Liverpool University, China
Guangyu Ren
Xi'an Jiaotong-Liverpool University, China
Tianhong Dai
Imperial College London, United Kingdom
Tania Stathaki
Imperial College London, United Kingdom
Hengyan Liu
Xi'an Jiaotong-Liverpool University, China
Abstract
Foundation models, such as the Segment Anything Model (SAM), have exhibited remarkable performance in conventional segmentation tasks, primarily due to their training on large-scale datasets. Nonetheless, challenges remain in specific downstream tasks, such as Camouflaged Object Detection (COD). Existing research primarily aims to enhance performance by integrating additional multimodal information derived from other foundation models. However, directly leveraging the information generated by these models may introduce additional biases due to domain shifts. To address this issue, we propose an Adaptive Refinement Module (ARM), which efficiently processes multimodal information while simultaneously refining the mask prompt. Furthermore, we construct an auxiliary embedding that effectively exploits the intermediate information generated within ARM, providing SAM with richer feature representations. Experimental results indicate that our proposed architecture surpasses most state-of-the-art (SOTA) models in the COD task, particularly excelling in structured target segmentation.
Event-based Tiny Object Detection: A Benchmark Dataset and Baseline
Nuo Chen
National University of Defense Technology, China
Chao Xiao
National University of Defense Technology, China
Yimian Dai
Nankai University
Shiman He
National University of Defense Technology, China
Miao Li
National University of Defense Technology, China
Wei An
National University of Defense Technology, China
Abstract
Small object detection (SOD) in the anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small Object Detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 x 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose the Event-based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.
Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation
Tiankai Chen
Southwest Jiaotong University
Yushu Li
South China University of Technology
Adam Goodge
Institute for Infocomm Research (I2R), A*STAR
Fei Teng
Southwest Jiaotong University
Xulei Yang
Institute for Infocomm Research (I2R), A*STAR
Tianrui Li
Southwest Jiaotong University
Xun Xu
Institute for Infocomm Research (I2R), A*STAR
Abstract
Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhance the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLMs. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods across synthetic and real-world datasets for 3D point cloud OOD detection.
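A hedged sketch of score propagation over a graph built from prototypes and test features; the kNN construction, damping factor, and iteration count are common choices for graph-based propagation rather than the paper's exact recipe.

```python
# Smooth initial VLM-derived OOD scores by iterating over a row-normalized
# kNN affinity matrix built from L2-normalized features.
import numpy as np

def propagate_scores(feats, init_scores, k=10, alpha=0.5, iters=20):
    # feats: (N, D) L2-normalized features (class prototypes + test points)
    # init_scores: (N,) initial OOD scores
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    W = np.zeros_like(sim)
    idx = np.argpartition(-sim, k, axis=1)[:, :k]  # top-k neighbors per node
    rows = np.arange(sim.shape[0])[:, None]
    W[rows, idx] = np.maximum(sim[rows, idx], 0.0)
    W = W / np.clip(W.sum(axis=1, keepdims=True), 1e-8, None)   # row-normalize
    scores = init_scores.copy()
    for _ in range(iters):
        scores = alpha * (W @ scores) + (1 - alpha) * init_scores
    return scores
```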
Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
Chen Chen
National University of Defense Technology, China
Kangcheng Bin
National University of Defense Technology, China
Ting Hu
National University of Defense Technology, China
Jiahao Qi
National University of Defense Technology, China
Xingyue Liu
National University of Defense Technology, China
Tianpeng Liu
National University of Defense Technology, China
Zhen Liu
National University of Defense Technology, China
Yongxiang Liu
National University of Defense Technology, China
Ping Zhong
National University of Defense Technology, China
Abstract
Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity due to limited imaging conditions. To this end, we introduce a high-diversity dataset, ATR-UMOD, covering varying scenarios, spanning altitudes from 80 m to 300 m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures applicability in practice when condition annotations are unavailable. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.
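A minimal sketch of prompt-guided soft gating between RGB and IR features, assuming the condition prompt has already been encoded by a text encoder; the layer sizes and scalar gate are illustrative assumptions, not the exact PCDF design.

```python
# Weight the RGB and IR contributions with a sigmoid gate conditioned on the
# encoded imaging-condition prompt (e.g. "night, fog, 300 m altitude").
import torch
import torch.nn as nn

class ConditionGatedFusion(nn.Module):
    def __init__(self, feat_dim=256, prompt_dim=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(prompt_dim + 2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, ir_feat, prompt_emb):
        # rgb_feat, ir_feat: (B, feat_dim) pooled modality features
        # prompt_emb: (B, prompt_dim) encoded condition prompt
        w = self.gate(torch.cat([rgb_feat, ir_feat, prompt_emb], dim=-1))  # (B, 1)
        return w * rgb_feat + (1 - w) * ir_feat
```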
GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion
Li-Heng Chen
Beijing Normal University
Zi-Xin Zou
VAST
Chang Liu
Beijing Normal University
Tianjiao Jing
Beijing Normal University
Yan-Pei Cao
VAST
Shi-Sheng Huang
Beijing Normal University
Hongbo Fu
Hong Kong University of Science and Technology
Hua Huang
Beijing Normal University
Abstract
Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to get multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous approaches, with geometrically more consistent surface reconstruction results, especially given sparse view inputs. Our source code is available at https://github.com/CountNemoChan/GCRayDiffusion
GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
Xiao Chen
The Chinese University of Hong Kong
Tai Wang
Shanghai AI Laboratory
Quanyi Li
Shanghai AI Laboratory
Tao Huang
Shanghai AI Laboratory
Jiangmiao Pang
Shanghai AI Laboratory
Tianfan Xue
The Chinese University of Hong Kong
Abstract
Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes.
GenHaze: Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing
Sixiang Chen
The Hong Kong University of Science and Technology (Guangzhou)
Tian Ye
The Hong Kong University of Science and Technology (Guangzhou)
Yunlong Lin
Xiamen University
Yeying Jin
Tencent
Yijun Yang
The Hong Kong University of Science and Technology (Guangzhou)
Haoyu Chen
The Hong Kong University of Science and Technology (Guangzhou)
Jianyu Lai
The Hong Kong University of Science and Technology (Guangzhou)
Song Fei
The Hong Kong University of Science and Technology (Guangzhou)
Zhaohu Xing
The Hong Kong University of Science and Technology (Guangzhou)
Fugee Tsung
The Hong Kong University of Science and Technology
Lei Zhu
The Hong Kong University of Science and Technology
Abstract
Real-world image dehazing is crucial for enhancing visual quality in computer vision applications. However, existing physics-based haze generation paradigms struggle to model the complexities of real-world haze and lack controllability, limiting the performance of existing baselines on real-world images. In this paper, we introduce GenHaze, a pioneering haze generation framework that enables the one-step generation of high-quality, reference-controllable hazy images. GenHaze leverages the pre-trained latent diffusion model (LDM) with a carefully designed clean-to-haze generation protocol to produce realistic hazy images. Additionally, by leveraging its fast, controllable generation of paired high-quality hazy images, we illustrate that existing dehazing baselines can be unleashed in a simple and efficient manner. Extensive experiments indicate that GenHaze achieves visually convincing and quantitatively superior hazy images. It also significantly improves multiple existing dehazing models across 7 non-reference metrics with minimal fine-tuning epochs. Our work demonstrates that LDM possesses the potential to generate realistic degradations, providing an effective alternative to prior generation pipelines.
HORT: Monocular Hand-held Objects Reconstruction with Transformers
Zerui Chen
Inria, École normale supérieure, CNRS, PSL Research University
Rolandos Alexandros Potamias
Imperial College London
Shizhe Chen
Inria, École normale supérieure, CNRS, PSL Research University
Cordelia Schmid
Inria, École normale supérieure, CNRS, PSL Research University
Abstract
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming when generating explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
Yiyang Chen
South China University of Technology
Shanshan Zhao
Alibaba International Digital Commerce Group
Lunhao Duan
Alibaba International Digital Commerce Group
Changxing Ding
South China University of Technology
Dacheng Tao
Nanyang Technological University
Abstract
Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noisefree images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.
High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach
Yuchong Chen
Southeast University
Jian Yu
Southeast University
Shaoyan Gai
Southeast University
Zeyu Cai
Southeast University
Feipeng Da
Southeast University
Abstract
In structured light systems, measurement accuracy tends to decline significantly when evaluating complex textured surfaces, particularly at boundaries between different colors. To address this issue, this paper conducts a detailed analysis to develop an error model that illustrates the relationship between phase error and image characteristics, specifically the blur level, grayscale value, and grayscale gradient. Based on this model, a high-precision approach for measuring complex textured targets is introduced, employing a multiple filtering approach. This approach first applies a sequence of filters to vary the blur level of the captured patterns, allowing calculation of phase differences under different blur conditions. Then, these phase differences are used in the constructed error model to identify the critical parameter causing phase errors. Finally, phase recovery is performed using the calibrated parameter, effectively reducing errors caused by complex textures. Experimental comparisons show that this method reduces the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by 40.31% and 40.78%, respectively. In multiple experiments, its performance generally surpassed that of existing methods, demonstrating improved accuracy and robustness.
Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
Jerred Chen
University of Oxford
Ronald Clark
University of Oxford
Abstract
In many robotics and VR/AR applications, fast camera motions lead to a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
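The "IMU-like" step can be written as a small least-squares problem using the classical motion-field equations in normalized coordinates; here the learned flow and depth are replaced by given arrays, so this is only a sketch of the solver under the small-motion assumption, not the paper's pipeline.

```python
# Solve for instantaneous translational and angular camera velocity from
# per-pixel flow and depth via the Longuet-Higgins/Prazdny motion field.
import numpy as np

def solve_velocity(xy, flow, depth):
    """xy: (N, 2) normalized pixel coords, flow: (N, 2) flow in those coords
    per unit time, depth: (N,) metric depth. Returns (T, Omega), each (3,)."""
    x, y, Z = xy[:, 0], xy[:, 1], depth
    zeros = np.zeros_like(x)
    # Rows of the linear system flow = A @ [Tx, Ty, Tz, wx, wy, wz]
    row_u = np.stack([-1 / Z, zeros, x / Z, x * y, -(1 + x**2), y], axis=1)
    row_v = np.stack([zeros, -1 / Z, y / Z, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([row_u, row_v], axis=0)            # (2N, 6)
    b = np.concatenate([flow[:, 0], flow[:, 1]], axis=0)  # (2N,)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]                               # translational, angular velocity
```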
InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling
Xiaoxue Chen
AIR, Tsinghua University
Bhargav Chandaka
University of Illinois Urbana-Champaign
Chih-Hao Lin
University of Illinois Urbana-Champaign
Ya-Qin Zhang
AIR, Tsinghua University
David Forsyth
University of Illinois Urbana-Champaign
Hao Zhao
AIR, Tsinghua University
Shenlong Wang
University of Illinois Urbana-Champaign
Abstract
We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR's intensity values, captured with active illumination in a different spectral range, offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB-LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions, achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
LONG3R: Long Sequence Streaming 3D Reconstruction
Zhuoguang Chen
Shanghai Artificial Intelligence Laboratory
Minghui Qin
IIIS, Tsinghua University
Tianyuan Yuan
IIIS, Tsinghua University
Zhe Liu
IIIS, Tsinghua University
Hang Zhao
IIIS, Tsinghua University
Abstract
Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streamiNG 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen
The University of Hong Kong
Yuying Ge
ARC Lab, Tencent PCG
Weiliang Tang
The Chinese University of Hong Kong
Yizhuo Li
The University of Hong Kong
Yixiao Ge
ARC Lab, Tencent PCG
Mingyu Ding
University of California, Berkeley
Ying Shan
ARC Lab, Tencent PCG
Xihui Liu
The University of Hong Kong
Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich 'corpus', can a similar generative pretraining approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging 'language' of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Yingjie Chen
Tongyi Lab, Alibaba Group
Yifang Men
Tongyi Lab, Alibaba Group
Yuan Yao
Tongyi Lab, Alibaba Group
Miaomiao Cui
Tongyi Lab, Alibaba Group
Liefeng Bo
Tongyi Lab, Alibaba Group
Abstract
Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive and consistent visual changes. Then, our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed approach. For more details and qualitative results, please refer to our anonymous project webpage: Perception-as-Control.
Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner
Zhimin Chen
Clemson University
Xuewei Chen
Clemson University
Xiao Guo
Michigan State University
Yingwei Li
Johns Hopkins University
Longlong Jing
The City University of New York
Liang Yang
The City University of New York
Bing Li
Clemson University
Abstract
Recently, multi-modal masked autoencoders (MAE) have been introduced in 3D self-supervised learning, offering enhanced feature learning by leveraging both 2D and 3D data to capture richer cross-modal representations. However, these approaches have two limitations: (1) they inefficiently require both 2D and 3D modalities as inputs, even though the inherent multi-view properties of 3D point clouds already contain the 2D modality; (2) the input 2D modality causes the reconstruction learning to unnecessarily rely on visible 2D information, hindering 3D geometric representation learning. To address these challenges, we propose a 3D to Multi-View Learner (Multi-View ML) that utilizes only the 3D modality as input and effectively captures rich spatial information in 3D point clouds. Specifically, we first project 3D point clouds to multi-view 2D images at the feature level based on 3D poses. Then, we introduce two components: (1) a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features; (2) a multi-scale multi-head (MSMH) attention mechanism that facilitates local-global information interactions in each decoder transformer block through attention heads at various scales. Additionally, a novel two-stage self-training strategy is proposed to align 2D and 3D representations. Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
Chen Chen
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Zhirui Wang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Taowei Sheng
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yi Jiang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yundu Li
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Peirui Cheng
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Luning Zhang
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Kaiqiang Chen
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Yanfeng Hu
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Xue Yang
Shanghai Jiao Tong University
Xian Sun
Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences
Abstract
Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception such as occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame.
Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
Dubing Chen
SKL-IOTSC, CIS, University of Macau
Huan Zheng
SKL-IOTSC, CIS, University of Macau
Yucheng Zhou
SKL-IOTSC, CIS, University of Macau
Xianfei Li
COWAROBOT Co. Ltd.
Wenlong Liao
COWAROBOT Co. Ltd.
Tao He
COWAROBOT Co. Ltd.
Pai Peng
COWAROBOT Co. Ltd.
Jianbing Shen
SKL-IOTSC, CIS, University of Macau
Abstract
Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, with significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
Siyu Chen
Jimei University
Ting Han
Sun Yat-sen University
Changshe Zhang
Xidian University
Xin Luo
Jimei University
Meiliu Wu
University of Glasgow
Guorong Cai
Jimei University
Jinhe Su
Jimei University
Abstract
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to domain shifts, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/SY-Ch/DepthForge.
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Abstract
The creation of diverse and realistic driving scenarios has become essential to enhance perception and planning capabilities of the autonomous driving system. However, generating long-duration, surround-view consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended street multi-perspective videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates a DiT-based diffusion model equipped with cross-frame and cross-view modules across three stages with multiple training objectives, substantially boosting the diversity and quality of generated visual content. Importantly, we propose an innovative explicit viewpoint modeling approach for multi-view video generation to effectively improve motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), our UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2% in FID and 35.2% in FVD.
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Tsinghua University
Xufang Luo
Microsoft Research Asia
Dongsheng Li
Microsoft Research Asia
Abstract
Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
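As a rough illustration of optimizing an intermediate focus region from reward alone, a REINFORCE-style update can sample a candidate region and reinforce it according to answer correctness. The actual VisRL objective may differ; the policy interface, discrete region parametrization, and reward_fn below are assumptions.

```python
# Hedged sketch of reward-only optimization of an intermediate focus decision.
import torch

def reinforce_step(policy, optimizer, image_feats, question, reward_fn):
    logits = policy(image_feats, question)                # scores over candidate regions
    dist = torch.distributions.Categorical(logits=logits)
    region = dist.sample()                                # internal decision, no bbox label
    reward = reward_fn(region)                            # e.g., was the final answer correct?
    loss = -dist.log_prob(region) * reward                # REINFORCE gradient estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# toy usage with a linear scorer over pooled features (hypothetical setup)
feat_dim, num_regions = 64, 16
scorer = torch.nn.Linear(feat_dim, num_regions)
policy = lambda feats, q: scorer(feats.mean(dim=0))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
r = reinforce_step(policy, opt, torch.randn(10, feat_dim), "where is the cup?",
                   lambda region: float(region.item() == 3))
```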
Constraint-Aware Feature Learning for Parametric Point Cloud
Xi Cheng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Ruiqi Lei
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Di Huang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Zhichao Liao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Fengyuan Piao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Yan Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Pingfa Feng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Long Zeng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen
Abstract
Parametric point clouds are sampled from CAD shapes and are becoming increasingly common in industrial manufacturing. Most CAD-specific deep learning methods focus on geometric features, while overlooking constraints inherent in CAD shapes. This limits their ability to discern CAD shapes with similar appearances but different constraints. To tackle this challenge, we first analyze the constraint importance via simple validation experiments. Then, we introduce a deep learning-friendly constraint representation with three components, and design a constraint-aware feature learning network (CstNet), which includes two stages. Stage 1 extracts constraint representation from BRep data or point cloud based on local features. It enables better generalization ability to unseen datasets after pre-training. Stage 2 employs attention layers to adaptively adjust the weights of the three constraint components. It facilitates the effective utilization of constraints. In addition, we built the first multi-modal parametric-purpose dataset, i.e., Param20K, comprising about 20K CAD instances of 75 classes. On this dataset, CstNet achieved 3.49% (classification) and 26.17% (rotation robustness) accuracy improvements over the state-of-the-art. To the best of our knowledge, CstNet is the first constraint-aware deep learning method tailored for parametric point cloud analysis. Our project page with source code is available at: https://cstnetwork.github.io/.
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
Tongtong Cheng
Department of Computer Science, Chongqing University
Rongzhen Li
National Elite Institute of Engineering, Chongqing University
Yixin Xiong
Department of Computer Science, Chongqing University
Tao Zhang
Department of Computer Science, Chongqing University
Jing Wang
College of Computer Science and Technology, National University of Defense Technology
Kai Liu
Department of Computer Science, Chongqing University
Abstract
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relations, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDDX and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM
Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
Chong Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Sicheng Yu
The Hong Kong University of Science and Technology (Guangzhou)
Zijian Wang
The Hong Kong University of Science and Technology (Guangzhou)
Yifan Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Hao Wang
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.
RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
Chong Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Yu Hu
The Hong Kong University of Science and Technology (Guangzhou)
Sicheng Yu
The Hong Kong University of Science and Technology (Guangzhou)
Beizhen Zhao
The Hong Kong University of Science and Technology (Guangzhou)
Zijian Wang
The Hong Kong University of Science and Technology (Guangzhou)
Hao Wang
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein (MW2) distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in Sim(3) space. Furthermore, we design a joint 3DGS registration module that integrates the MW2 distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: https://3dagentworld.github.io/reggs/.
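The entropy-regularized optimal transport at the heart of the MW2 alignment can be sketched with a standard Sinkhorn iteration. The snippet below assumes a precomputed pairwise cost matrix between Gaussian components (e.g., squared 2-Wasserstein distances); it is a generic sketch, not the RegGS implementation.

```python
# Generic entropy-regularized Sinkhorn sketch for an optimal-transport alignment cost.
import torch

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """cost: (m, n) pairwise costs; a, b: marginal weights summing to 1."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.t() @ u + 1e-9)
    plan = torch.diag(u) @ K @ torch.diag(v)   # transport coupling between components
    return (plan * cost).sum(), plan           # entropic OT cost and coupling

m, n = 64, 80
cost = torch.rand(m, n)                        # stand-in for component-wise W2 costs
a = torch.full((m,), 1.0 / m)
b = torch.full((n,), 1.0 / n)
mw2_cost, plan = sinkhorn(cost, a, b)
```

In a registration setting, such a cost would be minimized over a Sim(3) transform applied to one of the Gaussian mixtures.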
Temporal-aware Query Routing for Real-time Video Instance Segmentation
Zesen Cheng
School of Electronic and Computer Engineering, Peking University, Shenzhen
Kehan Li
Alibaba Group
Yian Zhao
School of Electronic and Computer Engineering, Peking University, Shenzhen
Hang Zhang
Alibaba Group
Chang Liu
Department of Automation and BNRist, Tsinghua University, Beijing
Jie Chen
School of Electronic and Computer Engineering, Peking University, Shenzhen
Abstract
With the rise of applications such as embodied intelligence, developing real-time online video instance segmentation (VIS) has become increasingly important. However, through time profiling of the components in an advanced online VIS architecture (i.e., a transformer-based architecture), we find that the transformer decoder significantly hampers the inference speed. Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce a Temporal-Aware Query Routing (TAR) mechanism. We embed it before each transformer decoder layer. By fusing the optimal queries from the previous frame, the queries output by the preceding decoder layer, and their differential information, TAR predicts a binary classification score and then uses an argmax operation to determine whether the current layer should be skipped. Experimental results demonstrate that integrating TAR into the baselines achieves significant efficiency gains (24.7 → 34.6 FPS for MinVIS, 22.4 → 32.8 FPS for DVIS++) while also improving performance (e.g., on YouTube-VIS 2019, 47.4 → 48.4 AP for MinVIS, 55.5 → 55.7 AP for DVIS++). Furthermore, our analysis of the TAR mechanism shows that the number of skipped layers increases as the differences between adjacent video frames decrease, which suggests that our method effectively utilizes inter-frame differences to reduce redundant computations in the transformer decoder.
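The routing decision described above is conceptually small: fuse the previous frame's optimal queries, the current queries, and their difference, then predict a binary skip score. The sketch below shows this idea with illustrative layer sizes; it is not the paper's exact router.

```python
# Illustrative sketch of a query-based layer-skipping router.
import torch
import torch.nn as nn

class QueryRouter(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)
        self.cls = nn.Linear(dim, 2)           # scores for [run layer, skip layer]

    def forward(self, q_prev_frame, q_current):
        diff = q_current - q_prev_frame        # differential information between frames
        fused = self.fuse(torch.cat([q_prev_frame, q_current, diff], dim=-1))
        score = self.cls(fused.mean(dim=1))    # pool over queries: (B, 2)
        return score.argmax(dim=-1)            # 1 -> skip the next decoder layer

router = QueryRouter()
q_prev, q_cur = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
skip = router(q_prev, q_cur)                   # per-sample skip decision
```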
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Gunjan Chhablani
Georgia Tech
Xiaomeng Ye
Georgia Tech
Muhammad Zubair Irshad
Toyota Research Institute
Zsolt Kira
Georgia Tech
Abstract
The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on the real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87-0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Project page: https://gchhablani.github.io/embodied-splat.
Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
Seunggeun Chi
Purdue University
Enna Sachdeva
Honda Research Institute USA
Pin-Hao Huang
Honda Research Institute USA
Kwonjoon Lee
Honda Research Institute USA
Abstract
Amodal completion, the task of inferring the complete appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, including pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios due to their limited understanding of HOI. To address this challenge, we propose a novel approach that leverages physical prior knowledge alongside a specialized multi-regional inpainting technique tailored for HOI. By incorporating physical constraints derived from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to reside, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method employs customized denoising strategies across these regions within a diffusion model, thereby enhancing the accuracy and realism of generated completions in both shape and visual detail. Experimental results demonstrate that our approach substantially outperforms existing methods in HOI scenarios, advancing machine perception toward a more human-like understanding of dynamic environments. Furthermore, we show that our pipeline remains robust even without ground-truth contact annotations, broadening its applicability to tasks such as 3D reconstruction and novel view/pose synthesis.
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Zhixiang Chi
University of Toronto
Yanan Wu
China Agricultural University
Li Gu
Concordia University
Huan Liu
McMaster University
Ziqiang Wang
Concordia University
Yang Zhang
Beijing Jiaotong University
Yang Wang
Concordia University
Konstantinos N Plataniotis
University of Toronto
Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
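One way to picture the feedback loop is as follows: build a patch-to-patch affinity from the model's final predictions, prune it by confidence, and use it to re-aggregate intermediate features. The snippet is a hedged sketch of that idea; the exact isolation, pruning, and ensembling steps of the paper differ.

```python
# Hedged sketch: reuse output-level patch agreement as a spatial-coherence prior.
import torch

def feedback_attention(output_logits, value_tokens, keep_top=0.2):
    """output_logits: (N_patches, C) final patch-class scores;
       value_tokens:  (N_patches, D) intermediate features to re-aggregate."""
    probs = output_logits.softmax(dim=-1)
    affinity = probs @ probs.t()                         # patches agreeing on classes
    k = max(1, int(keep_top * affinity.size(-1)))
    thresh = affinity.topk(k, dim=-1).values[:, -1:]     # confidence-based pruning
    affinity = torch.where(affinity >= thresh, affinity, torch.zeros_like(affinity))
    attn = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return attn @ value_tokens                           # coherence-refined features

refined = feedback_attention(torch.randn(196, 21), torch.randn(196, 768))
```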
AJAHR: Amputated Joint Aware 3D Human Mesh Recovery
Hyunjin Cho
Chung-Ang University
Giyun Choi
Chung-Ang University
Jongwon Choi
Chung-Ang University
Abstract
Existing human mesh recovery methods assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss. This assumption introduces bias when applied to individuals with amputations, a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. Our model integrates a body-part amputation classifier, jointly trained with the mesh recovery network, to detect potential amputations. We also introduce Amputee 3D (A3D), a synthetic dataset offering a wide range of amputee poses for robust training. While maintaining competitive performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals. Additional materials can be found at: https://chojinie.github.io/project_AJAHR/
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Jungbin Cho
Yonsei University
Junwan Kim
Yonsei University
Jisoo Kim
Yonsei University
Minseo Kim
Yonsei University
Mingu Kang
Sungkyunkwan University
Sungeun Hong
Sungkyunkwan University
Tae-Hyun Oh
Yonsei University
Youngjae Yu
Yonsei University
Abstract
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this 'discord' between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals in diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Project website: https://whwjdqls.github.io/discord-motion/
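Decoding discrete tokens with rectified flow amounts to integrating a learned velocity field from noise to raw motion under token conditioning. The sketch below uses a plain Euler integrator and a throwaway stand-in for the trained velocity network; dimensions and step count are illustrative, not the DisCoRD configuration.

```python
# Hedged sketch of conditional rectified-flow decoding of motion tokens.
import torch

@torch.no_grad()
def rectified_flow_decode(velocity_net, token_emb, motion_dim, steps=16):
    """token_emb: (B, T, C) conditioning; returns (B, T, motion_dim) raw motion."""
    B, T, _ = token_emb.shape
    x = torch.randn(B, T, motion_dim)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt)
        v = velocity_net(x, t, token_emb)           # predicted velocity dx/dt
        x = x + dt * v                              # Euler step toward the data manifold
    return x

# toy usage with a throwaway velocity field standing in for the trained network
net = lambda x, t, c: c[..., : x.size(-1)] - x      # assumption: illustrative only
tokens = torch.randn(2, 60, 512)
motion = rectified_flow_decode(net, tokens, motion_dim=263)
```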
Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion
Hoonhee Cho
KAIST
Yuhwan Jeong
KAIST
Kuk-Jin Yoon
KAIST
Abstract
With advancements in sensor and display technologies, high-resolution imagery is becoming increasingly prevalent in diverse applications. As a result, optical flow estimation needs to adapt to larger image resolutions, where even moderate movements lead to substantial pixel displacements, making long-range motion estimation more critical than ever. However, existing datasets primarily focus on short-range flow in low-resolution settings, limiting the generalization of models to high-resolution scenarios with large displacements. Additionally, there is a lack of suitable datasets for evaluating model capacity in long-range motion estimation, further hindering progress in this area. To address this, we introduce RelayFlow-4K, a high-resolution 4K optical flow dataset designed to capture diverse motion patterns, including long-range intermediate frame flows. While such datasets provide valuable training resources, long-range estimation remains challenging due to increased matching ambiguity. Simply incorporating these datasets does not inherently improve performance. To this end, we propose a novel training framework that integrates matching cost distillation and incremental time-step learning to refine cost volume estimation and stabilize training. Additionally, we leverage the distance map, which measures the distance from unmatched regions to their nearest matched pixels, improving occlusion handling. Our approach significantly enhances long-range optical flow estimation in high-resolution settings. Our datasets and code are available at https://github.com/Chohoonhee/RelayFlow-4K.
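The distance-map cue mentioned above can be computed with an off-the-shelf Euclidean distance transform, assuming a binary mask of matched pixels is available. This is a generic sketch of the idea, not the paper's pipeline.

```python
# Distance from each unmatched pixel to its nearest matched pixel.
import numpy as np
from scipy.ndimage import distance_transform_edt

def unmatched_distance_map(match_mask: np.ndarray) -> np.ndarray:
    """match_mask: (H, W) bool, True where a correspondence exists."""
    # distance_transform_edt measures the distance to the nearest zero entry,
    # so pass the inverted mask: zeros at matched pixels.
    return distance_transform_edt(~match_mask)

mask = np.zeros((240, 320), dtype=bool)
mask[60:180, 80:240] = True                     # toy matched region
dmap = unmatched_distance_map(mask)             # 0 on matched pixels, grows with distance
```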
Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos
Changwoon Choi
Seoul National University
Jeongjun Kim
Seoul National University
Geonho Cha
NAVER Cloud
Minkwan Kim
Seoul National University
Dongyoon Wee
NAVER Cloud
Young Min Kim
Seoul National University
Abstract
Recent works on dynamic 3D neural field reconstruction assume the input from synchronized multi-view videos whose poses are known. The input constraints are often not satisfied in real-world setups, making the approach impractical. We show that unsynchronized videos from unknown poses can generate dynamic neural fields as long as the videos capture human motion. Humans are one of the most common dynamic subjects captured in videos, and their shapes and poses can be estimated using state-of-the-art libraries. While noisy, the estimated human shape and pose parameters provide a decent initialization point to start the highly non-convex and under-constrained problem of training a consistent dynamic neural representation. Given the shape and pose parameters of humans in individual frames, we formulate methods to calculate the time offsets between videos, followed by camera pose estimations that analyze the 3D joint positions. Then, we train the dynamic neural fields employing multiresolution grids while we concurrently refine both time offsets and camera poses. The setup still involves optimizing many parameters; therefore, we introduce a robust progressive learning strategy to stabilize the process. Experiments show that our approach achieves accurate spatio-temporal calibration and high-quality scene reconstruction in challenging conditions.
FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
Gene Chou
Netflix Eyeline Studios
Wenqi Xian
Netflix Eyeline Studios
Guandao Yang
Stanford University
Mohamed Abdelfattah
Cornell University
Bharath Hariharan
Cornell University
Noah Snavely
Cornell University
Ning Yu
Netflix Eyeline Studios
Paul Debevec
Netflix Eyeline Studios
Abstract
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044×1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data. We evaluate our approach across multiple datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth.
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Adrian Chow
University of Waterloo
Evelien Riddell
University of Waterloo
Yimu Wang
University of Waterloo
Sean Sedwards
University of Waterloo
Krzysztof Czarnecki
University of Waterloo
Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance. Our code is available at https://github.com/ahtchow/OV-SCAN.
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
Sanjoy Chowdhury
University of Maryland, College Park
Subrata Biswas
Meta Reality Labs
Sayan Nag
University of Toronto
Tushar Nagarajan
Meta Reality Labs
Calvin Murdock
Meta Reality Labs
Ishwarya Ananthabhotla
Meta Reality Labs
Yijun Qian
Meta Reality Labs
Vamsi Krishna Ithapu
Meta Reality Labs
Dinesh Manocha
University of Maryland, College Park
Ruohan Gao
University of Maryland, College Park
Abstract
Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets, EPIC-Kitchens, EasyCom, and Aria Everyday Activities, demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Guoliang You
University of Science and Technology of China
Wei Liu
University of Science and Technology of China
Xingchen Li
University of Science and Technology of China
Jianmin Ji
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Abstract
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. The code is available at https://github.com/cxmomo/GraspCoT.
ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung
Yale University
Hyoungseob Park
Yale University
Patrick Rim
Yale University
Xiaoran Zhang
Yale University
Jihe He
Yale University
Ziyao Zeng
Yale University
Safa Cicek
UCLA
Byung-Woo Hong
Chung-Ang University
James S. Duncan
Yale University
Alex Wong
Yale University
Abstract
We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some 'source' data, often predict erroneous outputs when transferred to 'target' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method 'Energy-based Test-time Adaptation', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
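A single adaptation step of this kind can be pictured as a gradient update that lowers the energy assigned to the current prediction. The function below is a hedged sketch: the depth completion model, the energy model, and which parameters the optimizer updates are all assumptions, not the exact ETA procedure.

```python
# Hedged sketch of one energy-minimizing test-time adaptation step.
import torch

def tta_step(depth_model, energy_model, optimizer, image, sparse_depth):
    pred = depth_model(image, sparse_depth)   # depth completion prediction
    energy = energy_model(pred).mean()        # high energy ~ far from the source distribution
    optimizer.zero_grad()
    energy.backward()                         # nudge (a subset of) model parameters
    optimizer.step()
    return pred.detach()

# toy usage with stand-in networks (1-channel image, 1-channel sparse depth)
depth_net = torch.nn.Conv2d(2, 1, 3, padding=1)
model = lambda img, sd: depth_net(torch.cat([img, sd], dim=1))
energy = torch.nn.Conv2d(1, 1, 3, padding=1)
opt = torch.optim.SGD(depth_net.parameters(), lr=1e-4)
pred = tta_step(model, energy, opt, torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```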
ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration
Andrea Conti
University of Bologna
Matteo Poggi
University of Bologna
Valerio Cambareri
Sony DepthSensing Solutions
Martin R. Oswald
University of Amsterdam
Stefano Mattoccia
University of Bologna
Abstract
Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to effectively use very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both real and synthetic sparse ToF datasets demonstrate the advantages of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.
SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino
University of Bologna
Pierluigi Zama Ramirez
University of Bologna
Luigi Lella
University of Bologna
Matteo Ragaglia
SACMI Imola
Alessandro Oliva
SACMI Imola
Giuseppe Lisanti
University of Bologna
Luigi Di Stefano
University of Bologna
Abstract
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (∼7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent single-view methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Debiased Teacher for Day-to-Night Domain Adaptive Object Detection
Yiming Cui
Hangzhou Dianzi University
Liang Li
Institute of Computing Technology, Chinese Academy of Sciences
Haibing Yin
Hangzhou Dianzi University
Yuhan Gao
Lishui Institute of Hangzhou Dianzi University
Yaoqi Sun
Lishui University
Chenggang Yan
Hangzhou Dianzi University
Abstract
Day-to-Night Domain Adaptive Object Detection (DNDAOD) is a significant challenge due to the low visibility and signal-to-noise ratio at night. Although recent self-training approaches achieve promising results, they fail to address three critical biases: distribution bias, training bias, and confirmation bias. Therefore, we propose a Debiased Teacher to address the above biases from three aspects: domain transforming, representation compensating, and pseudo label calibrating. Concretely, the day-to-night domain transforming module (DNDT) leverages physical priors to model some key day-night domain differences, thus transforming daytime images into night-like images. Then, the cross-domain representation compensating module (CDRC) selectively mixes objects from nighttime and night-like images to compensate for the model's general representation of nighttime objects. Further, to correct confirmation bias caused by learning from inaccurate pseudo labels, the pseudo label confirmation calibrating module (ConCal) is designed to obtain accurate pseudo labels for better nighttime knowledge learning. Experimental results on three benchmarks demonstrate that our method outperforms current SOTA methods by a large margin.
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Zhewei Dai
Huazhong University of Science and Technology
Shilei Zeng
Huazhong University of Science and Technology
Haotian Liu
Huazhong University of Science and Technology
Xurui Li
Huazhong University of Science and Technology
Feng Xue
University of Trento
Yu Zhou
Huazhong University of Science and Technology
Abstract
We introduce SeaS, a unified industrial generative model for automatically creating diverse anomalies, authentic normal products, and precise anomaly masks. While extensive research exists, most efforts either focus on specific tasks, i.e., anomalies or normal products only, or require separate models for each anomaly type. Consequently, prior methods either offer limited generative capability or depend on a vast array of anomaly-specific models. We demonstrate that U-Net's differentiated learning ability captures the distinct visual traits of slightly-varied normal products and diverse anomalies, enabling us to construct a unified model for all tasks. Specifically, we first introduce an Unbalanced Abnormal (UA) Text Prompt, comprising one normal token and multiple anomaly tokens. More importantly, our Decoupled Anomaly Alignment (DA) loss decouples anomaly attributes and binds them to distinct anomaly tokens of UA, enabling SeaS to create unseen anomalies by recombining these attributes. Furthermore, our Normal-image Alignment (NA) loss aligns the normal token to normal patterns, making generated normal products globally consistent and locally varied. Finally, SeaS produces accurate anomaly masks by fusing discriminative U-Net features with high-resolution VAE features. SeaS sets a new benchmark for industrial generation, significantly enhancing downstream applications, with average improvements of +8.66% pixel-level AP for synthesis-based AD approaches, +1.10% image-level AP for unsupervised AD methods, and +12.79% IoU for supervised segmentation models. Code is available at https://github.com/HUST-SLOW/SeaS.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Apple
Nina Wenzel
Apple
David Griffiths
Apple
Haiming Gang
Apple
Justin Lazarow
Apple
Gefen Kohavi
Apple
Kai Kang
Apple
Marcin Eichner
Apple
Yinfei Yang
Apple
Afshin Dehghan
Apple
Peter Grasch
Apple
Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. https://github.com/apple/ml-cubifyanything
Interpretable point cloud classification using multiple instance learning
Matt De Vries
Sentinal4D
Reed Naidoo
Institute of Cancer Research
Olga Fourkioti
Institute of Cancer Research
Lucas G Dent
University College London
Nathan Curry
Imperial College London
Chris Dunsby
University College London
Chris Bakal
Institute of Cancer Research
Abstract
Understanding 3D cell shape is crucial in biomedical research, where morphology serves as a key indicator of disease, cellular state, and drug response. However, many existing 3D point cloud classification models lack interpretability, limiting their utility for extracting biologically meaningful insights. In this work, we unify standard point cloud backbones and feature aggregation strategies within a Multiple Instance Learning (MIL) framework to enable inherently interpretable classification. Our approach, POINTMIL, improves classification performance while providing fine-grained point-level explanations without relying on post hoc analysis. We demonstrate state-of-the-art mACC (97.3%) and F1 (97.5%) on the IntrA biomedical dataset and evaluate the interpretability using quantitative and qualitative metrics. Additionally, we introduce ATLAS-1, a novel dataset of drug-treated 3D cancer cells, and use it to show how POINTMIL captures fine-grained morphological effects of chemical treatments. Beyond biomedical applications, POINTMIL generalises to standard benchmarks such as ModelNet40 and ScanObjectNN, offering interpretable 3D object recognition across domains.
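A common way to make MIL over points inherently interpretable is attention-based pooling, where per-point attention weights double as point-level explanations. The sketch below illustrates that mechanism with arbitrary layer sizes; it is not the released POINTMIL code.

```python
# Attention-based multiple-instance pooling over per-point features.
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_classes=2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.cls = nn.Linear(feat_dim, num_classes)

    def forward(self, point_feats):                             # (B, N, feat_dim)
        attn = torch.softmax(self.score(point_feats), dim=1)    # (B, N, 1)
        bag = (attn * point_feats).sum(dim=1)                   # bag = whole point cloud
        return self.cls(bag), attn.squeeze(-1)                  # logits, per-point weights

pool = AttentionMILPool()
logits, point_weights = pool(torch.randn(8, 1024, 256))         # weights explain the decision
```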
Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
Junyuan Deng
The Hong Kong University of Science and Technology
Wei Yin
Horizon Robotics
Xiaoyang Guo
Horizon Robotics
Qian Zhang
Horizon Robotics
Xiaotao Hu
The Hong Kong University of Science and Technology
Weiqiang Ren
Horizon Robotics
Xiao-Xiao Long
Nanjing University
Ping Tan
The Hong Kong University of Science and Technology
Abstract
In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach significantly outperforms baselines and provides broad benefits to 3D vision tasks.
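To make the "RANSAC operation" concrete, suppose the generated Camera Image can be converted to normalized ray coordinates per pixel (an assumption; the paper's exact encoding may differ). Then each pixel gives a linear constraint u = fx * x + cx, and a 2-point RANSAC robustly recovers the focal length and principal point per axis, as sketched below.

```python
# Hedged illustration: robust recovery of (fx, cx) from per-pixel ray coordinates.
import numpy as np

def ransac_fx_cx(u, x, iters=500, thresh=1.0, rng=np.random.default_rng(0)):
    """u: pixel columns, x: normalized ray x-coordinates, both shape (N,)."""
    best_inliers, best = -1, (None, None)
    for _ in range(iters):
        i, j = rng.choice(len(u), size=2, replace=False)
        if abs(x[i] - x[j]) < 1e-8:
            continue
        fx = (u[i] - u[j]) / (x[i] - x[j])          # minimal 2-point solution
        cx = u[i] - fx * x[i]
        inliers = np.abs(fx * x + cx - u) < thresh  # reprojection residual test
        if inliers.sum() > best_inliers:
            best_inliers, best = inliers.sum(), (fx, cx)
    return best

# toy data: ground-truth fx=600, cx=320 with a few gross outliers
x = np.random.uniform(-0.5, 0.5, 2000)
u = 600 * x + 320
u[:50] += np.random.uniform(-200, 200, 50)
fx, cx = ransac_fx_cx(u, x)
```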
Open-World Skill Discovery from Unsegmented Demonstration Videos
Jingwen Deng
Peking University
Zihao Wang
Peking University
Shaofei Cai
Peking University
Anji Liu
University of California, Los Angeles
Yitao Liang
Peking University
Abstract
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on random splitting or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semanticaware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. The SBD-generated segments yielded relative performance improvements of 63.7% and 52.1% for conditioned policies on short-term atomic tasks, and 11.3% and 20.8% for their corresponding hierarchical agents on long-horizon tasks, compared to unsegmented baselines. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page is at https://craftjarvis.github.io/SkillDiscovery/.
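The boundary rule itself is simple to express: flag a frame as a skill boundary when the action-prediction error of the pretrained unconditional model jumps sharply relative to the previous frame. The relative-increase criterion, threshold, and minimum gap below are illustrative, not the paper's settings.

```python
# Sketch of error-spike-based skill boundary detection.
import numpy as np

def detect_skill_boundaries(pred_errors, ratio=1.5, min_gap=30):
    """pred_errors: per-frame prediction error of the action model, shape (T,)."""
    boundaries, last = [], -min_gap
    for t in range(1, len(pred_errors)):
        jump = pred_errors[t] / (pred_errors[t - 1] + 1e-8)
        if jump > ratio and t - last >= min_gap:   # sudden error increase => new skill
            boundaries.append(t)
            last = t
    return boundaries

errors = np.abs(np.random.randn(1000)) + 0.1
errors[300:305] += 5.0                             # simulated skill switch
print(detect_skill_boundaries(errors))
```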
Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction
Youming Deng
Cornell University
Wenqi Xian
Netflix Eyeline Studios
Guandao Yang
Stanford University
Leonidas Guibas
Stanford University
Gordon Wetzstein
Stanford University
Steve Marschner
Cornell University
Paul Debevec
Netflix Eyeline Studios
Abstract
Large field-of-view (FOV) cameras can simplify and accelerate scene capture because they provide complete coverage with fewer views. However, existing reconstruction pipelines fail to take full advantage of large-FOV input data because they convert input views to perspective images, resulting in stretching that prevents the use of the full image. Additionally, they calibrate lenses using models that do not accurately fit real fisheye lenses in the periphery. We present a new reconstruction pipeline based on Gaussian Splatting that uses a flexible lens model and supports fields of view approaching 180 degrees. We represent lens distortion with a hybrid neural field based on an Invertible ResNet and use a cubemap to render wide-FOV images while retaining the efficiency of the Gaussian Splatting pipeline. Our system jointly optimizes lens distortion, camera intrinsics, camera poses, and scene representations using a loss measured directly against the original input pixels. We present extensive experiments on both synthetic and real-world scenes, demonstrating that our model accurately fits real-world fisheye lenses and that our end-to-end self-calibration approach provides higher-quality reconstructions than existing methods. More details and videos can be found at the project page: https://denghilbert.github.io/self-cali/.
Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures
Xinlong Ding
University of Science and Technology Beijing
Hongwei Yu
University of Science and Technology Beijing
Jiawei Li
University of Science and Technology Beijing
Feifan Li
University of Science and Technology Beijing
Yu Shang
Tsinghua University
Bochao Zou
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Abstract
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to a significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
RePoseD: Efficient Relative Pose Estimation With Known Depth Information
Yaqing Ding
Czech Technical University in Prague
Viktor Kocur
Comenius University in Bratislava
Václav Vávra
Czech Technical University in Prague
Zuzana Berger Haladová
Comenius University in Bratislava
Jian Yang
Nankai University
Torsten Sattler
Czech Technical University in Prague
Zuzana Kukelova
Czech Technical University in Prague
Abstract
Recent advances in monocular depth estimation methods (MDEs) and their improved accuracy open new possibilities for their applications. In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. In particular, we are interested in answering the question of whether using MDEs improves results over traditional point-based methods. We propose a novel framework for estimating the relative pose of two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale or even both unknown scale and shift parameters, our solvers jointly estimate the scale or both the scale and shift parameters along with the relative pose. We derive efficient solvers considering different types of depths for three camera configurations: (1) two calibrated cameras, (2) two cameras with an unknown shared focal length, and (3) two cameras with unknown different focal lengths. Our new solvers outperform state-of-the-art depth-aware solvers in terms of speed and accuracy. In extensive real experiments on multiple datasets and with various MDEs, we discuss which depth-aware solvers are preferable in which situation. The code is available at https://github.com/kocurvik/mdrp.
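A hedged sketch of the underlying algebra, with notation assumed rather than taken from the paper: for the j-th correspondence with normalized image points x1, x2 and monocular depth predictions d1, d2, the rigid-motion constraint with per-camera affine depth corrections reads

```latex
% s_i, o_i are the unknown per-camera scale and shift corrections; the
% scale-only variant sets o_i = 0, and one scale can be fixed to remove the
% global gauge freedom.
\[
\bigl(s_2\, d_2^{(j)} + o_2\bigr)\,\mathbf{x}_2^{(j)}
  \;=\;
  R\,\bigl(s_1\, d_1^{(j)} + o_1\bigr)\,\mathbf{x}_1^{(j)} + \mathbf{t},
\qquad j = 1,\dots,N .
\]
```

Each correspondence contributes three scalar equations in the pose and the depth-correction unknowns, which is what makes compact depth-aware solvers for the calibrated and unknown-focal configurations possible.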
Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
Jeonghyeok Do
Korea Advanced Institute of Science and Technology
Munchurl Kim
Korea Advanced Institute of Science and Technology
Abstract
In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization learning. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages TDSM to pull correct skeleton-text matches closer while pushing apart those of different action classes. TDSM significantly outperforms very recent state-of-the-art methods, with large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
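One plausible triplet-style instantiation of the described loss (our notation and margin form, not necessarily the paper's exact objective): the denoising error of a noised skeleton latent under the matched text condition should be smaller, by a margin, than under a text condition from a different action class.

```latex
\[
\mathcal{L}_{\mathrm{TD}}
= \mathbb{E}_{t,\,\epsilon}\Bigl[
  \max\Bigl(0,\;
  \bigl\lVert \epsilon - \epsilon_\theta(z_t, t, c^{+}) \bigr\rVert_2^2
  \;-\;
  \bigl\lVert \epsilon - \epsilon_\theta(z_t, t, c^{-}) \bigr\rVert_2^2
  \;+\; m\Bigr)\Bigr],
\]
```

where z_t is the noised skeleton feature, c+ the text features of the ground-truth action, c- those of a different class, and m a margin.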
DAMap: Distance-aware MapNet for High Quality HD Map Construction
Jinpeng Dong
Xi'an Jiaotong University
Chen Li
Xi'an Jiaotong University
Yutong Lin
Xi'an Jiaotong University
Jingwen Fu
Xi'an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Nanning Zheng
Xi'an Jiaotong University
Abstract
High-definition (HD) maps are an important component to support navigation and planning for autonomous driving vehicles. Predicting map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly at high-quality predictions due to inherent task misalignment. Two main factors are responsible for this misalignment: 1) inappropriate task labels, because one-to-many matching queries share the same labels, and 2) sub-optimal task features, due to a task-shared sampling mechanism. In this paper, we reveal these two inherent defects in current methods and develop a novel HD map construction method named DAMap to address them. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels to one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the DAFL. We perform extensive experiments and consistently achieve performance improvements on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules.
DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
Yue-Jiang Dong
Tsinghua University
Wang Zhao
ARC Lab, Tencent PCG
Jiale Xu
ARC Lab, Tencent PCG
Ying Shan
ARC Lab, Tencent PCG
Song-Hai Zhang
Tsinghua University
Abstract
Diffusion-based video depth estimation methods have achieved remarkable success. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
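A minimal sketch of the scale-synchronization idea (assumed interface; the actual method applies such corrections inside the diffusion guidance loop rather than as post-processing): consecutive sliding windows overlap in a few frames, so a per-window scale (and optionally shift) can be fitted on the overlap by least squares and used to keep all windows on a common scale.

```python
# Fit s (and b) so that s * curr + b ~= prev on the overlapping frames of two
# consecutive windows; the correction is then applied to the current window.
import numpy as np

def align_window(prev_overlap, curr_overlap, use_shift=True):
    """prev_overlap, curr_overlap: depth arrays of the shared frames."""
    x = curr_overlap.reshape(-1)
    y = prev_overlap.reshape(-1)
    if use_shift:
        A = np.stack([x, np.ones_like(x)], axis=1)
        (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    else:
        s, b = float(x @ y) / float(x @ x), 0.0
    return s, b

# Usage: depths of window k are rescaled so their overlap agrees with window k-1.
prev = np.random.rand(4, 64, 64) * 2.0 + 0.3       # already-synchronized window
curr = (prev - 0.3) / 2.0 + 0.01 * np.random.rand(4, 64, 64)
s, b = align_window(prev, curr)
print(round(float(s), 2), round(float(b), 2))       # approximately 2.0 and 0.3
```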
From One to More: Contextual Part Latents for 3D Generation
Shaocong Dong
HKUST
Lihe Ding
CUHK
Xiao Chen
CUHK
Yaokun Li
CUHK
Yuxin Wang
HKUST
Yucheng Wang
HKUST
Qi Wang
HKUST
Jaehyeok Kim
HKUST
Chenjian Gao
CUHK
Zhanpeng Huang
SenseTime Research
Zibin Wang
SenseTime Research
Tianfan Xue
CUHK
Dan Xu
HKUST
Abstract
To generate 3D objects, early research focused on multi-view-driven approaches relying solely on 2D renderings. Recently, the 3D native latent diffusion paradigm has demonstrated superior performance in 3D generation, because it fully leverages the geometric information provided in ground-truth 3D data. Despite its fast development, 3D diffusion still faces three challenges. First, the majority of these methods represent a 3D object by one single latent, regardless of its complexity. This may lead to detail loss when generating 3D objects with multiple complicated parts. Second, most 3D assets are designed part by part, yet the current holistic latent representation overlooks the independence of these parts and their interrelationships, limiting the model's generative ability. Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process, lacking detailed controllability. Therefore, motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. This part-based framework has several advantages: i) it reduces the encoding burden of intricate objects by decomposing them into simpler parts, ii) it facilitates part learning and part relationship modeling, and iii) it naturally supports part-level control. Furthermore, to ensure the coherence of part latents and to harness the powerful priors from foundation models, we propose a novel mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising. Benefiting from the part-based representation, we demonstrate that CoPart can support various applications including part editing, articulated object generation, and mini-scene generation. Moreover, we collect a new large-scale 3D part dataset named Partverse from Objaverse through automatic mesh segmentation and subsequent human post-annotation. By training on the proposed dataset, CoPart achieves promising part-based 3D generation with high controllability. Project page: https://copart3d.github.io.
Online Dense Point Tracking with Streaming Memory
Qiaole Dong
Fudan University
Yanwei Fu
Fudan University
Abstract
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. The SPOT framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT, with 10x fewer parameters, operates at least 2x faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and code at: https://dqiaole.github.io/SPOT/.
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh
Weizmann Institute of Science
Nimrod Shabtay
IBM Research
Eli Schwartz
IBM Research
Hilde Kuehne
IBM Research
Raja Giryes
Tel Aviv University
Rogerio Feris
MIT-IBM
Leonid Karlinsky
MIT-IBM
James Glass
MIT CSAIL
Assaf Arbelle
IBM Research
Shimon Ullman
Weizmann Institute of Science
M. Jehanzeb Mirza
MIT CSAIL
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and bounding box, and is tasked with localizing the same object type in a query image. Personalized localization is particularly important when several related objects could match a textual description, or when an object is hard to describe in words. To elicit personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs, exposing critical weaknesses in present-day VLMs and laying a foundation for future research in context-driven vision-language applications.
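A hypothetical sketch of how such an instruction-tuning sample could be assembled from a tracking sequence, with the category label replaced by a pseudo-name (the names, message format, and helper below are assumptions, not the authors' data format):

```python
# Turn frames that track the same object into an in-context localization
# dialogue; the real category label is swapped for a pseudo-name so the model
# must rely on the visual context rather than label priors.
import random

def build_personalized_sample(track_frames, query_frame, pseudo_names):
    """track_frames: list of (image_path, bbox); query_frame: (image_path, bbox)."""
    name = random.choice(pseudo_names)              # e.g. "blicket" instead of "mug"
    messages = []
    for img, box in track_frames:                   # in-context examples
        messages.append({"role": "user", "content": f"<image:{img}> Where is the {name}?"})
        messages.append({"role": "assistant", "content": f"The {name} is at {box}."})
    q_img, q_box = query_frame
    messages.append({"role": "user", "content": f"<image:{q_img}> Where is the {name}?"})
    target = f"The {name} is at {q_box}."           # supervision for fine-tuning
    return messages, target

msgs, tgt = build_personalized_sample(
    [("f001.jpg", (12, 40, 88, 120)), ("f014.jpg", (30, 42, 101, 125))],
    ("f027.jpg", (55, 44, 130, 128)),
    pseudo_names=["blicket", "dax", "wug"])
print(tgt)
```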
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
Xiaobiao Du
University of Technology Sydney
Yida Wang
Li Auto Inc.
Haiyang Sun
Li Auto Inc.
Zhuojie Wu
The University of Queensland
Hongwei Sheng
The University of Queensland
Shuyun Wang
The University of Queensland
Jiaying Ying
The University of Queensland
Ming Lu
City University of Macau
Tianqing Zhu
City University of Macau
Kun Zhan
Li Auto Inc.
Xin Yu
The University of Queensland
Abstract
3D cars are widely used in self-driving systems, virtual and augmented reality, and gaming applications. However, existing 3D car datasets are either synthetic or low-quality, limiting their practical utility and leaving a significant gap in high-quality real-world 3D car data. In this paper, we present the first large-scale 3D real car dataset, termed 3DRealCar, which offers three key features: (1) High-Volume: 2,500 cars meticulously scanned using smartphones to capture RGB images and point clouds with real-world dimensions; (2) High-Quality: each car is represented by an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) High-Diversity: the dataset encompasses a diverse collection of cars from over 100 brands, captured under three distinct lighting conditions (reflective, standard, and dark). We further provide detailed car parsing maps for each instance to facilitate research in automotive segmentation tasks. To focus on vehicles, background point clouds are removed, and all cars are aligned to a unified coordinate system, enabling controlled reconstruction and rendering. We benchmark state-of-the-art 3D reconstruction methods across different lighting conditions using 3DRealCar. Extensive experiments demonstrate that the standard lighting subset can be used to reconstruct high-quality 3D car models that significantly enhance performance on various car-related 2D and 3D tasks. Notably, our dataset reveals critical challenges faced by current 3D reconstruction methods under reflective and dark lighting conditions, providing valuable insights for future research. Our project is hosted at https://xiaobiaodu.github.io/3drealcar/.
Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
Ji Du
Nankai University
Xin Wang
The Hong Kong Polytechnic University
Fangwei Hao
Nankai University
Mingyang Yu
Nankai University
Chunyuan Chen
Nankai University
Jiesheng Wu
Anhui Normal University
Bin Wang
Nankai University
Jing Xu
The Hong Kong Polytechnic University
Ping Li
The Hong Kong Polytechnic University
Abstract
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice either hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which can be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE
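A simplified sketch of the retrieval step (our own minimal version, not the released code): each pixel feature votes between the camouflaged-object and environment prototype libraries via K-nearest-neighbour cosine similarity, and the votes form the pseudo-mask.

```python
import numpy as np

def knn_pseudo_mask(feat, obj_protos, env_protos, k=5):
    """feat: (H, W, C); obj_protos/env_protos: (N, C); returns (H, W) bool mask."""
    H, W, C = feat.shape
    f = feat.reshape(-1, C)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    protos = np.concatenate([obj_protos, env_protos], axis=0)
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    sim = f @ protos.T                                   # cosine similarity to all prototypes
    is_obj = np.arange(protos.shape[0]) < obj_protos.shape[0]
    topk = np.argsort(-sim, axis=1)[:, :k]               # indices of the k nearest prototypes
    votes = is_obj[topk].mean(axis=1)                    # fraction of object neighbours
    return (votes > 0.5).reshape(H, W)

mask = knn_pseudo_mask(np.random.rand(32, 32, 64),
                       np.random.rand(100, 64), np.random.rand(300, 64))
print(mask.shape, mask.dtype)
```

The multi-view variant described above would run this retrieval on features from several augmented views and merge the resulting masks.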
RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
Sicong Du
CaiNiao Inc., Alibaba Group
Jiarun Liu
CaiNiao Inc., Alibaba Group
Qifeng Chen
CaiNiao Inc., Alibaba Group
Hao-Xiang Chen
BNRist, Tsinghua University
Tai-Jiang Mu
BNRist, Tsinghua University
Sheng Yang
CaiNiao Inc., Alibaba Group
Abstract
A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: first, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to the reconstruction phase, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at https://github.com/CN-ADLab/RGE-GS.
RTMap: Real-Time Recursive Mapping with Change Detection and Localization
Yuheng Du
CaiNiao Inc., Alibaba Group
Sheng Yang
CaiNiao Inc., Alibaba Group
Lingxuan Wang
CaiNiao Inc., Alibaba Group
Zhenghua Hou
CaiNiao Inc., Alibaba Group
Chengying Cai
CaiNiao Inc., Alibaba Group
Zhitao Tan
CaiNiao Inc., Alibaba Group
Mingxia Chen
CaiNiao Inc., Alibaba Group
Shi-Sheng Huang
Beijing Normal University
Qiang Li
CaiNiao Inc., Alibaba Group
Abstract
While recent online HD mapping methods relieve the burden on offline pipelines and address map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolving memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling of HD map elements, (2) probabilistic localization w.r.t. the crowdsourced prior map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance on both prior-aided map quality and localization accuracy, showing our effectiveness in robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior map asynchronously. Our source code will be made publicly available at https://github.com/CN-ADLab/RTMap.
RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
Yuwen Du
Shanghai Jiao Tong University
Anning Hu
Shanghai Jiao Tong University
Zichen Chao
Nanjing University of Science and Technology
Yifan Lu
Shanghai Jiao Tong University
Junhao Ge
Shanghai Jiao Tong University
Genjia Liu
Shanghai Jiao Tong University
Weitao Wu
Nanjing University of Science and Technology
Lanjun Wang
Tianjin University
Siheng Chen
Shanghai Jiao Tong University
Abstract
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) a Camera Extrinsic Optimizer that ensures accurate 3D-to-2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS) that determines the placement of diverse digital assets within 3D space; (3) DepthSAM, which innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of the foreground; and (4) a Scalable Post-Processing Toolkit that generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74% on Rcooper-Intersection and 83.12% on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code can be accessed at: https://github.com/duyuwen-duen/RoCo-Sim
Counting Stacked Objects
Corentin Dumery
EPFL
Noa Etté
EPFL
Aoxiang Fan
EPFL
Ren Li
EPFL
Jingyi Xu
Stony Brook University
Hieu Le
EPFL
Pascal Fua
EPFL
Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world datasets with manually verified total counts. Our datasets and code can be found at https://corentindumery.github.io/projects/stacks.html
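Read literally, the two-subproblem decomposition suggests a simple counting rule (notation assumed):

```latex
\[
\hat{N} \;\approx\; \frac{V_{\mathrm{stack}} \cdot \hat{\rho}_{\mathrm{occ}}}{V_{\mathrm{object}}},
\]
```

where V_stack is the volume of the reconstructed stack or container region from multi-view geometry, rho_occ the predicted occupancy ratio, and V_object the volume of a single object instance.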
Is Tracking Really More Challenging in First Person Egocentric Vision?
Matteo Dunnhofer
University of Udine
Zaira Manigrasso
University of Udine
Christian Micheloni
University of Udine
Abstract
Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on this task.
SynCity: Training-Free Generation of 3D Worlds
Paul Engstler
University of Oxford
Aleksandar Shtedritski
University of Oxford
Iro Laina
University of Oxford
Christian Rupprecht
University of Oxford
Andrea Vedaldi
University of Oxford
Abstract
We propose SynCity, a method for generating explorable 3D worlds from textual descriptions. Our approach leverages pre-trained textual, image, and 3D generators without requiring fine-tuning or inference-time optimization. While most 3D generators are object-centric and unable to create large-scale worlds, we demonstrate how 2D and 3D generators can be combined to produce ever-expanding scenes. The world is generated tile by tile, with each new tile created within its context and seamlessly integrated into the scene. SynCity enables fine-grained control over the appearance and layout of the generated worlds, which are both detailed and diverse. Project page: https://research.paulengstler.com/syncity/
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Xiaojian Ma
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Rongpeng Su
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Jun Guo
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Rujie Wu
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Xi Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Abstract
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 6.5% on Ego4D-VQ3D, 2.6% on OpenEQA, and 15.3% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Ke Fan
Shanghai Jiao Tong University
Shunlin Lu
CUHK, Shenzhen
Minyue Dai
Fudan University
Runyi Yu
HKUST
Lixing Xiao
Zhejiang University
Zhiyang Dou
HKU
Junting Dong
Shanghai AI Laboratory
Lizhuang Ma
Shanghai Jiao Tong University, East China Normal University
Jingbo Wang
Shanghai AI Laboratory
Abstract
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, namely zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
Bing Fan
University of North Texas
Yunhe Feng
University of North Texas
Yapeng Tian
University of Texas at Dallas
James Chenhao Liang
U.S. Naval Research Laboratory
Yuewei Lin
Brookhaven National Laboratory
Yan Huang
University of North Texas
Heng Fan
University of North Texas
Abstract
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to a lack of sufficient target cues, leading to degraded performance. To address this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core idea is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features, improving target localization. PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are then used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, PRVQL, besides the given object cues, enjoys additional crucial target information from the video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on the challenging Ego4D benchmark, PRVQL achieves state-of-the-art results and largely surpasses other methods, showing its efficacy. Our code, models and results will be released at https://github.com/fb-reps/PRVQL.
RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
Baojie Fan
Nanjing University of Posts and Telecommunications
Xiaotian Li
Nanjing University of Posts and Telecommunications
Yuhan Zhou
Nanjing University of Posts and Telecommunications
Yuyu Jiang
Nanjing University of Posts and Telecommunications
Jiandong Tian
Shenyang Institute of Automation, Chinese Academy of Sciences
Huijie Fan
Shenyang Institute of Automation, Chinese Academy of Sciences
Abstract
The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly focus on processing large-scale voxels, which brings high computational costs and degrades details. Additionally, they struggle to accurately capture occluded targets and distant information. In this paper, we propose a novel LiDAR-Camera 3D semantic occupancy prediction framework called RIOcc, with collaborative feature refinement and a multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and enhances the efficiency of feature alignment. Then, multi-scale feature processing substantially expands the receptive fields. Meanwhile, in the LiDAR branch, we design Dual-branch Pooling (DBP) to adaptively enhance geometric features across both the channel and grid dimensions. In the camera branch, the Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.
Video Individual Counting for Moving Drones
Yaowu Fan
Sun Yat-sen University
Jia Wan
Harbin Institute of Technology (Shenzhen)
Tao Han
Hong Kong University of Science and Technology
Antoni B. Chan
City University of Hong Kong
Andy J. Ma
Sun Yat-sen University
Abstract
Video Individual Counting (VIC) has received increasing attention for its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation under highly varying viewpoints over time in crowded scenes. Existing methods rely on localization followed by association or classification, which struggle under dense and dynamic conditions due to inaccurate localization of small targets. To address these issues, we introduce the MovingDroneCrowd dataset, featuring videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. We further propose a Shared Density map-guided Network (SDNet) using a Depth-wise Cross-Frame Attention (DCFA) module to directly estimate shared density maps between consecutive frames, from which the inflow and outflow density maps are derived by subtracting the shared density maps from the global density maps. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our dataset and publicly available ones show the superiority of our method over the state of the art in highly dynamic and complex crowded scenes. Our dataset and code have been released publicly.
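In symbols (notation assumed, following the description above):

```latex
\[
D^{\mathrm{in}}_{t} = D_{t} - S_{t-1,t},
\qquad
D^{\mathrm{out}}_{t-1} = D_{t-1} - S_{t-1,t},
\qquad
\hat{N} = \sum_{\mathbf{p}} D_{1}(\mathbf{p})
        + \sum_{t=2}^{T}\sum_{\mathbf{p}} D^{\mathrm{in}}_{t}(\mathbf{p}),
\]
```

where D_t is the global density map of frame t and S_{t-1,t} the shared density map predicted by the DCFA module; summing the first frame's density and all subsequent inflow maps yields the number of unique pedestrians in the video.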
Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Xiao Fang
Carnegie Mellon University
Minhyek Jeon
Carnegie Mellon University
Zheyang Qin
Carnegie Mellon University
Stanislav Panev
Carnegie Mellon University
Celso de Melo
DEVCOM Army Research Laboratory
Shuowen Hu
DEVCOM Army Research Laboratory
Shayok Chakraborty
Florida State University
Fernando De la Torre
Carnegie Mellon University
Abstract
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: https://humansensinglab.github.io/AGenDA
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
Shuangkang Fang
Beihang University
I-Chao Shen
The University of Tokyo
Yufeng Wang
Beihang University
Yi-Hsuan Tsai
Google
Yi Yang
StepFun
Shuchang Zhou
StepFun
Wenrui Ding
Beihang University
Takeo Igarashi
The University of Tokyo
Ming-Hsuan Yang
UC Merced
Abstract
We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs' token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50x larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs' ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.
NeRF Is a Valuable Assistant for 3D Gaussian Splatting
Shuangkang Fang
Beihang University
I-Chao Shen
The University of Tokyo
Takeo Igarashi
The University of Tokyo
Yufeng Wang
Beihang University
ZeSheng Wang
Beihang University
Yi Yang
StepFun
Wenrui Ding
Beihang University
Shuchang Zhou
StepFun
Abstract
We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.
Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction
Yanwen Fang
The University of Hong Kong
Wenqi Jia
University of Illinois Urbana-Champaign
Xu Cao
University of Illinois Urbana-Champaign
Peng-Tao Jiang
vivo Mobile Communication Co., Ltd
Guodong Li
The University of Hong Kong
Jintai Chen
HKUST(GZ)
Abstract
Multi-person motion prediction becomes particularly challenging when handling highly interactive scenarios involving extreme motions. Previous works focused more on the case of 'moderate' motions (e.g., walking together), where predicting each pose in isolation often yields reasonable results. However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. PGformer incorporates a novel cross-query attention module to learn bidirectional dependencies between pose sequences and a proxy unit that subtly controls bidirectional spatial information flow. We evaluated PGformer on the challenging ExPI dataset, which involves large collaborative movements. Both quantitative and qualitative results demonstrate the superiority of PGformer in both short- and long-term predictions. We also test the proposed method on moderate movement datasets CMU-Mocap and MuPoTS-3D, generalizing PGformer to scenarios with more than two individuals with promising results. Code of PGformer is available at https://github.com/joyfang1106/pgformer.
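For intuition, a generic bidirectional cross-attention between two persons' pose feature sequences might look as follows (a hypothetical module; the paper's cross-query attention and proxy unit differ in their details):

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # Each person's features are refined by attending to the partner's sequence.
        a_ctx, _ = self.a_to_b(query=feat_a, key=feat_b, value=feat_b)
        b_ctx, _ = self.b_to_a(query=feat_b, key=feat_a, value=feat_a)
        return feat_a + a_ctx, feat_b + b_ctx

mod = BidirectionalCrossAttention()
a, b = torch.randn(2, 50, 128), torch.randn(2, 50, 128)   # (batch, time, dim)
out_a, out_b = mod(a, b)
print(out_a.shape, out_b.shape)
```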
SuperDec: 3D Scene Decomposition with Superquadrics Primitives
Elisabetta Fedele
ETH Zurich
Boyang Sun
ETH Zurich
Leonidas Guibas
Stanford University
Marc Pollefeys
ETH Zurich
Francis Engelmann
Stanford University
Abstract
We present SUPERDEC, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent methods use geometric primitives to obtain photorealistic 3D reconstructions, we instead leverage them to obtain a compact yet expressive representation. To this end, we design a novel architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our model on ShapeNet and demonstrate its generalization capabilities on object instances from ScanNet++ as well as on full Replica scenes. Finally, we show that our compact superquadric-based representation supports a wide range of downstream applications, including robotic manipulation and controllable visual content generation. Project page: https://super-dec.github.io.
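For reference, each primitive is commonly parameterized by the standard superquadric inside-outside function in its canonical frame (the decomposition network would additionally regress a rigid pose together with the parameters below for every primitive):

```latex
\[
\left(
  \left(\frac{x}{a_x}\right)^{\frac{2}{\varepsilon_2}}
 +\left(\frac{y}{a_y}\right)^{\frac{2}{\varepsilon_2}}
\right)^{\frac{\varepsilon_2}{\varepsilon_1}}
+\left(\frac{z}{a_z}\right)^{\frac{2}{\varepsilon_1}}
= 1,
\]
```

with scales (a_x, a_y, a_z) and shape exponents (epsilon_1, epsilon_2) that interpolate between box-like, ellipsoidal, and cylinder-like shapes, which is what makes the representation so compact.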
ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
Xiaokun Feng
School of Artificial Intelligence, UCAS
Shiyu Hu
School of Physical and Mathematical Sciences, NTU
Xuchen Li
School of Artificial Intelligence, UCAS
Dailing Zhang
School of Artificial Intelligence, UCAS
Meiqi Wu
School of Artificial Intelligence, UCAS
Jing Zhang
School of Artificial Intelligence, UCAS
Xiaotang Chen
School of Artificial Intelligence, UCAS
Kaiqi Huang
School of Artificial Intelligence, UCAS
Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack
Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction
Tuo Feng
ReLER, CCAI, Zhejiang University
Wenguan Wang
ReLER, CCAI, Zhejiang University
Yi Yang
ReLER, CCAI, Zhejiang University
Abstract
In autonomous driving, accurately predicting occupancy and motion is crucial for safe navigation within dynamic environments. However, existing methods often suffer from difficulties in handling complex scenes and uncertainty arising from sensor data. To address these issues, we propose a new Gaussian-based World Model (GWM), seamlessly integrating raw multi-modal sensor inputs. In 1st stage, Gaussian representation learner utilizes self-supervised pretraining to learn robust Gaussian representation. Gaussian representation integrates semantic and geometric information and establishes a robust probabilistic understanding of the environment. In 2nd stage, GWM seamlessly integrates learning, simulation, and planning into a unified framework, empowering the uncertainty-aware simulator & planner to jointly forecast future scene evolutions and vehicle trajectories. Simulator generates future scene predictions by modeling both static and dynamic elements, while planner calculates optimal paths to minimize collision risks, thus enhancing navigation safety. Overall, GWM employs a sensor-to-planning world model that directly processes raw sensor data, setting it apart from previous methods. Experiments show that GWM outperforms state-of-the-art approaches by 1.46% in semantic comprehension and 0.07m in motion prediction. Moreover, we provide an in-depth analysis of Gaussian representations under complex scenarios.
I2VControl: Disentangled and Unified Video Motion Synthesis Control
Wanquan Feng
Intelligent Creation Team, ByteDance
Tianhao Qi
University of Science and Technology of China (USTC)
Jiawei Liu
Intelligent Creation Team, ByteDance
Mingzhen Sun
Institute of Automation, Chinese Academy of Sciences (CASIA)
Pengqi Tu
Intelligent Creation Team, ByteDance
Tianxiang Ma
Intelligent Creation Team, ByteDance
Fei Dai
Intelligent Creation Team, ByteDance
Songtao Zhao
Intelligent Creation Team, ByteDance
Siyu Zhou
Intelligent Creation Team, ByteDance
Qian He
Intelligent Creation Team, ByteDance
Abstract
Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Project page: https://wanquanf.github.io/I2VControl.
Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization
Mingtao Feng
Xidian University
Longlong Mei
Xidian University
Zijie Wu
Xidian University
Jianqiao Luo
Hunan University
Fenghao Tian
Xidian University
Jie Feng
Xidian University
Weisheng Dong
Xidian University
Yaonan Wang
Hunan University
Abstract
Text to point cloud cross-modal localization is a crucial vision-language task for future human-robot collaboration. Existing coarse-to-fine frameworks assume that each query text precisely corresponds to the center area of a submap, limiting their applicability in real-world scenarios. This work redefines the task under a more realistic assumption, relaxing the one-to-one retrieval constraint by allowing partially matching query text and submap pairs. To address this challenge, we augment datasets with partially matching submaps and introduce an uncertainty-aware framework. Specifically, we model cross-modal ambiguity in fine-grained location regression by integrating uncertainty scores, represented as 2D Gaussian distributions, to mitigate the impact of challenging samples. Additionally, we propose an uncertainty-aware similarity metric that enhances similarity assessment between query text and submaps by propagating uncertainty into coarse place recognition, enabling the model to learn discriminative features, effectively handle partially matching samples and improve task synergy. Extensive experiments on KITTI360Pose and CityRefer demonstrate that our method achieves state-of-the-art performance across both stages. Our code is available at https://github.com/Afoolbird/PMSH
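One common way to realize such uncertainty-weighted location regression is a Gaussian negative log-likelihood (a generic aleatoric-uncertainty formulation shown here for intuition; the paper's exact loss may differ):

```latex
\[
\mathcal{L}_{\mathrm{loc}}
= \frac{\lVert \boldsymbol{\mu}_\theta - \mathbf{y} \rVert_2^2}{2\,\sigma_\theta^{2}}
+ \log \sigma_\theta^{2},
\]
```

so that ambiguous, partially matching samples can be assigned a larger predicted variance and contribute less to the regression term, while confident samples are fitted tightly.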
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng
UC Berkeley
Junyi Zhang
UC Berkeley
Qianqian Wang
UC Berkeley
Yufei Ye
Stanford University
Pengcheng Yu
Max Planck Institute for Intelligent Systems
Michael J. Black
Max Planck Institute for Intelligent Systems
Trevor Darrell
UC Berkeley
Angjoo Kanazawa
UC Berkeley
Abstract
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
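A generic form of such a reprojection objective (notation assumed; the exact weighting and robust penalty used in the paper may differ):

```latex
\[
\mathcal{L}_{\mathrm{reproj}}
= \sum_{t}\sum_{\mathbf{p}}
  \rho\!\left(\pi_t\!\bigl(X_t(\mathbf{p})\bigr) - \hat{\mathbf{p}}_t\right),
\]
```

where X_t(p) is the predicted world-frame 3D point for pixel p, pi_t the projection into frame t, p_hat_t the observed pixel location (or 2D track) in that frame, and rho a robust penalty; this supervises geometry and correspondence jointly without requiring 4D ground truth.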
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
School of Computer Science, Peking University
Yijiang Li
University of California, San Diego
Wanpeng Zhang
School of Computer Science, Peking University
Sipeng Zheng
unknown
Hao Luo
School of Computer Science, Peking University
Zihao Yue
Renmin University of China
Zongqing Lu
BeingBeyond
Abstract
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
FlowR: Flowing from Sparse to Dense 3D Reconstructions
Tobias Fischer
ETH Zurich
Samuel Rota Bulò
Meta Reality Labs Zurich
Yung-Hsu Yang
ETH Zurich
Nikhil Keetha
Meta Reality Labs Zurich
Lorenzo Porzi
Meta Reality Labs Zurich
Norman Müller
Meta Reality Labs Zurich
Katja Schwarz
Meta Reality Labs Zurich
Jonathon Luiten
Meta Reality Labs Zurich
Marc Pollefeys
ETH Zurich
Peter Kontschieder
Meta Reality Labs Zurich
Abstract
3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of applications like Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These models typically rely on a noise-to-data generative process conditioned only on a handful of reference input views, leading to hallucinations, inconsistent generation results, and subsequent reconstruction artifacts. Instead, we propose a multi-view, flow matching model that learns a flow to directly connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with consistent, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540×960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.
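The described data-to-data formulation is consistent with a standard (rectified) flow-matching objective, sketched here with assumed notation for the conditioning and schedule:

```latex
\[
x_\tau = (1-\tau)\,x_{\mathrm{sparse}} + \tau\,x_{\mathrm{dense}},
\qquad
\mathcal{L}_{\mathrm{FM}}
= \mathbb{E}_{\tau,\,(x_{\mathrm{sparse}},\,x_{\mathrm{dense}})}
  \bigl\lVert v_\theta(x_\tau, \tau, c)
  - \bigl(x_{\mathrm{dense}} - x_{\mathrm{sparse}}\bigr) \bigr\rVert_2^2,
\]
```

where x_sparse is a novel-view rendering from the sparse reconstruction, x_dense the corresponding rendering expected from a dense capture, and c the reference views; at inference the learned velocity field is integrated starting from the sparse rendering rather than from pure noise.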
Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
Tom Fischer
Saarland University
Xiaojie Zhang
University of Technology Nuremberg
Eddy Ilg
University of Technology Nuremberg
Abstract
Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at github.com/Fischer-Tom/unified-detectionand-pose-estimation.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu
Huazhong University of Science and Technology
Diankun Zhang
Xiaomi EV
Zongchuang Zhao
Huazhong University of Science and Technology
Jianfeng Cui
Xiaomi EV
Dingkang Liang
Huazhong University of Science and Technology
Chong Zhang
Xiaomi EV
Dingyuan Zhang
Huazhong University of Science and Technology
Hongwei Xie
Xiaomi EV
Bing Wang
Xiaomi EV
Xiang Bai
Huazhong University of Science and Technology
Abstract
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, it remains an open problem that few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a hOlistic E2E autonomous dRiving framework by vIsion-language instructed actiON generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives
Yuqian Fu
INSAIT, Sofia University 'St. Kliment Ohridski'
Runze Wang
Fudan University
Bin Ren
University of Trento
Guolei Sun
ETH Zurich
Biao Gong
unknown
Yanwei Fu
Fudan University
Danda Pani Paudel
unknown
Xuanjing Huang
Fudan University
Luc Van Gool
unknown
Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM [75], a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator's effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Code is available at: http://yuqianfu.com/ObjectRelator.
Beyond RGB: Adaptive Parallel Processing for RAW Object Detection
Shani Gamrian
Sony Research
Hila Barel
Sony Research
Feiran Li
Sony Research
Masakazu Yoshimura
Sony Group Corporation
Daisuke Iso
Sony Research
Abstract
Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges. Our code is available at https://github.com/SonyResearch/RawAdaptationModule.
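As a rough illustration of the parallel-ISP-and-fuse idea (not the paper's RAM implementation), the sketch below applies a few differentiable ISP-style functions to a RAW-like input in parallel and fuses them with a learned 1x1 convolution; the specific branches are assumptions.

```python
# Illustrative sketch: several differentiable ISP-style functions applied to a
# RAW-like input in parallel, then fused with learned weights. Branch choices
# are assumptions, not the paper's RAM configuration.
import torch
import torch.nn as nn

class ParallelISPFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.45))   # learnable tone curve exponent
        self.gain = nn.Parameter(torch.ones(1))          # learnable global gain
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)       # learned fusion of 3 branches

    def forward(self, raw):
        # raw: (B, 1, H, W) RAW-like intensities in [0, 1]
        eps = 1e-6
        branch_gamma = raw.clamp(min=eps) ** self.gamma                 # tone-curve branch
        branch_gain = (self.gain * raw).clamp(0.0, 1.0)                  # exposure/gain branch
        branch_log = torch.log1p(raw) / torch.log(torch.tensor(2.0))     # log-compression branch
        stacked = torch.cat([branch_gamma, branch_gain, branch_log], dim=1)
        return self.fuse(stacked)                                         # fused representation

module = ParallelISPFusion()
raw = torch.rand(2, 1, 128, 128)
features = module(raw)   # (2, 1, 128, 128), would feed a downstream detector
```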
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
Wanshui Gan
The University of Tokyo
Fang Liu
The University of Tokyo
Hongbin Xu
South China University of Technology
Ningkai Mo
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Naoto Yokoya
The University of Tokyo
Abstract
We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D ego pose from sensors during training. To address this limitation, we propose the Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps and semantic maps), which is time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth ego pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering). The relevant code is available at https://github.com/GANWANSHUI/GaussianOcc.git.
Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Suchisrit Gangopadhyay
Yale University
Jung-Hee Kim
Michigan State University
Xien Chen
Yale University
Patrick Rim
Yale University
Hyoungseob Park
Yale University
Alex Wong
Yale University
Abstract
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, in both indoor and outdoor settings, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: github.com/JungHeeKim29/calibration-token.
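A hedged sketch of the calibration-token mechanism as described: a small set of learnable tokens is prepended to the token sequence of a frozen encoder, and only those tokens are trained. The toy transformer below stands in for an actual FMDE.

```python
# Hedged sketch of "calibration tokens": learnable tokens prepended to a frozen
# ViT-style encoder's token sequence so fisheye inputs can be nudged toward the
# perspective-image latent distribution. The encoder is a toy stand-in.
import torch
import torch.nn as nn

class FrozenToyEncoder(nn.Module):
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.parameters():
            p.requires_grad_(False)   # the foundational model stays frozen

    def forward(self, tokens):
        return self.blocks(tokens)

class CalibrationTokens(nn.Module):
    def __init__(self, num_tokens=8, dim=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, patch_tokens):
        b = patch_tokens.shape[0]
        cal = self.tokens.expand(b, -1, -1)
        return torch.cat([cal, patch_tokens], dim=1)   # prepend calibration tokens

encoder = FrozenToyEncoder()
cal_tokens = CalibrationTokens()           # the only trainable parameters
fisheye_patches = torch.randn(2, 196, 64)  # toy patch embeddings of a fisheye image
latent = encoder(cal_tokens(fisheye_patches))
```

In a self-supervised setup like the one described, the training signal would come from enforcing consistency between estimates on perspective images and their fisheye-recalibrated counterparts.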
3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
Jianzhe Gao
Zhejiang University
Rui Liu
Zhejiang University
Wenguan Wang
Zhejiang University
Abstract
Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.
3D Mesh Editing using Masked LRMs
Will Gao
University of Chicago
Dilin Wang
Meta Reality Labs
Yuchen Fan
Meta Reality Labs
Aljaz Bozic
Meta Reality Labs
Tuur Stuyck
Meta Reality Labs
Zhengqin Li
Meta Reality Labs
Zhao Dong
Meta Reality Labs
Rakesh Ranjan
Meta Reality Labs
Nikolaos Sarafianos
Meta Reality Labs
Abstract
We present a novel approach to shape editing, building on recent progress in 3D reconstruction from multi-view images. We formulate shape editing as a conditional reconstruction problem, where the model must reconstruct the input shape with the exception of a specified 3D region, in which the geometry should be generated from the conditional signal. To this end, we train a conditional Large Reconstruction Model (LRM) for masked reconstruction, using multi-view consistent masks rendered from a randomly generated 3D occlusion, and using one clean viewpoint as the conditional signal. During inference, we manually define a 3D region to edit and provide an edited image from a canonical viewpoint to fill that region. We demonstrate that, in just a single forward pass, our method not only preserves the input geometry in the unmasked region through reconstruction capabilities on par with SoTA, but is also expressive enough to perform a variety of mesh edits from a single image guidance that past works struggle with, while being 2-10x faster than the top-performing prior work.
Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
Quankai Gao
University of Southern California
Iliyan Georgiev
Adobe Research
Tuanfeng Y. Wang
Adobe Research
Krishna Kumar Singh
Adobe Research
Ulrich Neumann
University of Southern California
Jae Shin Yoon
Adobe Research
Abstract
3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as applications, showing its ability to facilitate downstream generation tasks. Project page: https://github.com/Zerg-Overmind/Can3Tok
CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Yuanyuan Gao
Northwestern Polytechnical University
Hao Li
Northwestern Polytechnical University
Jiaqi Chen
Northwestern Polytechnical University
Zhengyu Zou
Northwestern Polytechnical University
Zhihang Zhong
Shanghai Artificial Intelligence Laboratory
Dingwen Zhang
Northwestern Polytechnical University
Xiao Sun
Shanghai Artificial Intelligence Laboratory
Junwei Han
Northwestern Polytechnical University
Abstract
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. To further enhance both overall quality and geometric accuracy, CityGS-X presents a progressive RGB-Depth-Normal training strategy. This approach enhances 3D consistency by jointly optimizing appearance and geometry representation through multi-view constraints and off-the-shelf depth priors within batch-level training. Through extensive experiments, CityGS-X consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4x4090 GPUs, a task on which alternative methods encounter Out-Of-Memory (OOM) issues and fail completely, placing CityGS-X far beyond the capacity of existing methods. Project Page: https://lifuguan.github.io/CityGS-X/
Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
Zhirui Gao
National University of Defense Technology
Renjiao Yi
National University of Defense Technology
Yaqiao Dai
National University of Defense Technology
Xuening Zhu
National University of Defense Technology
Wei Chen
National University of Defense Technology
Chenyang Zhu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential 'edge point cloud reconstruction and parametric curve fitting' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
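The bi-directional coupling can be illustrated with a toy example in which point-primitive centers are sampled from a differentiable parametric curve, so gradients on the primitives flow back to the curve parameters; the cubic Bezier form and all names below are assumptions for illustration, not the paper's parametrization.

```python
# Toy illustration of coupling a parametric curve with point-like primitives:
# Gaussian centers are sampled along a cubic Bezier curve so gradients on the
# centers flow back to the curve's control points.
import torch

def cubic_bezier(control_points, n_samples=32):
    """control_points: (4, 3) differentiable tensor; returns (n_samples, 3) centers."""
    t = torch.linspace(0.0, 1.0, n_samples).unsqueeze(1)          # (n, 1)
    p0, p1, p2, p3 = control_points
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
        + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

control_points = torch.randn(4, 3, requires_grad=True)
centers = cubic_bezier(control_points)        # primitive centers tied to the curve

# A placeholder loss on the centers stands in for a differentiable-rendering loss;
# its gradients reach the curve parameters, which is the basic mechanism that
# lets multi-view evidence optimize the curve directly.
loss = (centers ** 2).sum()
loss.backward()
print(control_points.grad.shape)              # torch.Size([4, 3])
```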
DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
Ziqi Gao
Shenzhen University
Qiufu Li
Shenzhen University
Linlin Shen
Shenzhen University
Abstract
Compared to 2D data, the scale of point cloud data available for training in different domains is quite limited. Researchers have been trying to combine data from different domains for masked autoencoder (MAE) pre-training to alleviate this data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode during fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus. The code will be released at https://github.com/CVI-SZU/DAP-MAE
Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation
Chen Gao
Beijing Jiaotong University
Shuo Zhang
Beijing Jiaotong University
Youfang Lin
Beijing Jiaotong University
Abstract
Disparity estimation is an essential step in processing and analyzing Light Field (LF) images. Recent methods construct the cost volume to exploit the correspondence of the LFs over the preset maximum disparity, limiting their ability to process large-parallax scenes. Different from constructing cost volume, the self-attention mechanism calculates the parallax attention between epipolar lines to find the matching points. However, for LFs that have different views, the related disparity scales are different in parallax attention since the baselines with the central view are different. Moreover, if the matching information is occluded in one view, the disparity information can be explored through other views. Therefore, mapping these attentions to the same scale and selecting effective matching information are key points for disparity estimation from parallax attention. In this paper, we explore parallax attention for LF and design an unsupervised method, named Epipolar Consistent Attention Aggregation Network (ECAAN). We first introduce an epipolar consistent scale unification block by considering the consistency relationships to standardize disparity scales of the parallax attention maps. Based on the intra-properties and inter-relationships of parallax attention, we further propose a consistent occlusion-free aggregation block to integrate the information from the occlusion-free areas. In addition, we design an improved photometric loss to constrain the model. ECAAN achieves state-of-the-art performance in LF depth estimation. Notably, ECAAN attains a mean square error of 0.2 on large-disparity LF datasets, achieving a 68% error reduction compared to the second-best method.
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao
CUHK
Kai Chen
HKUST
Bo Xiao
Huawei Cloud
Lanqing Hong
Huawei Noah's Ark Lab
Zhenguo Li
Huawei Noah's Ark Lab
Qiang Xu
CUHK
Abstract
The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3x resolution and 4x frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2's ability, unlocking broader applications in autonomous driving. Project page: flymin.github.io/magicdrive-v2/
Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
Zhirui Gao
National University of Defense Technology
Renjiao Yi
National University of Defense Technology
Yuhang Huang
National University of Defense Technology
Wei Chen
National University of Defense Technology
Chenyang Zhu
National University of Defense Technology
Kai Xu
National University of Defense Technology
Abstract
Low-level 3D representations, such as point clouds, meshes, NeRFs and 3D Gaussians, are commonly used for modeling 3D objects and scenes. However, cognitive studies indicate that human perception operates at higher levels and interprets 3D environments by decomposing them into meaningful structural parts, rather than low-level elements like points or voxels. Structured geometric decomposition enhances scene interpretability and facilitates downstream tasks requiring component-level manipulation. In this work, we introduce PartGS, a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition, leveraging multi-view image inputs to uncover 3D structural information. Our method jointly optimizes superquadric meshes and Gaussians by coupling their parameters within a hybrid representation. On one hand, superquadrics enable the representation of a wide range of shape primitives, facilitating flexible and meaningful decompositions. On the other hand, 2D Gaussians capture detailed texture and geometric details, ensuring high-fidelity appearance and geometry reconstruction. Operating in a self-supervised manner, our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.
SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
Zihui Gao
Zhejiang University
Jia-Wang Bian
ByteDance Seed
Guosheng Lin
Nanyang Technological University
Hao Chen
Zhejiang University
Chunhua Shen
Zhejiang University
Abstract
Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at: https://github.com/aim-uofa/SurfaceSplat.
VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
Yash Garg
University of California, Riverside
Saketh Bachu
University of California, Riverside
Arindam Dutta
University of California, Riverside
Rohit Lal
University of California, Riverside
Sarosij Bose
University of California, Riverside
Calvin-Khang Ta
University of California, Riverside
M. Salman Asif
University of California, Riverside
Amit Roy-Chowdhury
University of California, Riverside
Abstract
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/
Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
Junhao Ge
Shanghai Jiao Tong University
Zuhong Liu
Shanghai Jiao Tong University
Longteng Fan
Shanghai Jiao Tong University
Yifan Jiang
Shanghai Jiao Tong University
Jiaqi Su
Shanghai Jiao Tong University
Yiming Li
New York University
Zhejun Zhang
ETH Zurich
Siheng Chen
Shanghai Jiao Tong University
Abstract
End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators have significant limitations for synthetic data generation: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization. Our code will be released at https://github.com/cancaries/SceneCrafter.
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh
Carnegie Mellon University
Shen Zheng
Carnegie Mellon University
Robert Tamburo
Carnegie Mellon University
Khiem Vuong
Carnegie Mellon University
Juan Alvarez-Padilla
Carnegie Mellon University
Hailiang Zhu
Carnegie Mellon University
Michael Cardei
Carnegie Mellon University
Nicholas Dunn
Carnegie Mellon University
Christoph Mertz
Carnegie Mellon University
Srinivasa G. Narasimhan
Carnegie Mellon University
Abstract
Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8x) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. For reading work zone signs, composing a detector and a text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5° (+9.9%) and 75.3% of pathways have AE < 0.5° (+8.1%).
Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping
Emanuele Giacomini
Sapienza University of Rome
Luca Di Giammarino
Sapienza University of Rome
Lorenzo De Rebotti
Sapienza University of Rome
Giorgio Grisetti
Sapienza University of Rome
Martin R. Oswald
University of Amsterdam
Abstract
LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite this success, maintaining an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing times. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that exclusively relies on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives uniquely from LiDAR measurements. Experiments show that our approach matches the current registration performance, while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotics estimation tasks.
Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
Uzay Gökay
University of Bonn
Federico Spurio
University of Bonn
Dominik R. Bach
University of Bonn
Juergen Gall
University of Bonn
Abstract
Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at github.com/bachlab/SMQ.
RoMo: Robust Motion Segmentation Improves Structure from Motion
Lily Goli
Google DeepMind
Sara Sabour
Google DeepMind
Mark Matthews
Google DeepMind
Marcus A. Brubaker
Google DeepMind
Dmitry Lagun
Google DeepMind
Alec Jacobson
Adobe Research
David J. Fleet
Google DeepMind
Saurabh Saxena
Google DeepMind
Andrea Tagliasacchi
Google DeepMind
Abstract
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually captured video. Estimating accurate camera poses from videos through structure-from-motion (SfM) relies on robustly separating static and dynamic parts of a video. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
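A minimal sketch of the epipolar cue used by such methods: fit a fundamental matrix robustly to flow correspondences and flag points with large Sampson error as likely dynamic. The synthetic data and threshold below are illustrative; in practice the correspondences would come from an optical-flow network and the mask would be combined with a video segmentation model.

```python
# Minimal sketch of an epipolar cue for motion segmentation: fit a fundamental
# matrix robustly to correspondences, then flag points with large Sampson error
# as likely dynamic. Data and threshold are illustrative.
import numpy as np
import cv2

def sampson_error(F, pts1, pts2):
    """Sampson distance of (N, 2) correspondences w.r.t. fundamental matrix F."""
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T      # row-wise F @ x1
    Ftx2 = x2 @ F       # row-wise F^T @ x2
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / (den + 1e-12)

# Synthetic two-view data: static 3D points seen by a translating camera, plus a
# few independently perturbed points that violate the epipolar constraint.
rng = np.random.default_rng(0)
X = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 8.0], size=(400, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
proj = lambda P: (P @ K.T)[:, :2] / (P @ K.T)[:, 2:]
pts1 = proj(X)
pts2 = proj(X - np.array([0.2, 0.0, 0.0]))           # camera translated along x
pts2[:40] += rng.normal(0.0, 5.0, size=(40, 2))       # simulate moving points

F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
dynamic_mask = sampson_error(F, pts1, pts2) > 2.0      # flags most perturbed points
```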
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
Zhefei Gong
Westlake University
Pengxiang Ding
Zhejiang University
Shangke Lyu
Westlake University
Siteng Huang
Zhejiang University
Mingyang Sun
Zhejiang University
Wei Zhao
Westlake University
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Donglin Wang
Westlake University
Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
Bingchen Gong
École Polytechnique
Diego Gomez
École Polytechnique
Abdullah Hamdi
Visual Geometry Group, University of Oxford
Abdelrahman Eldesokey
King Abdullah University of Science and Technology (KAUST)
Ahmed Abdelreheem
King Abdullah University of Science and Technology (KAUST)
Peter Wonka
King Abdullah University of Science and Technology (KAUST)
Abstract
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
Referring Expression Comprehension for Small Objects
Kanoko Goto
Institute of Science Tokyo
Takumi Hirose
Institute of Science Tokyo
Mahiro Ukai
Institute of Science Tokyo
Shuhei Kurita
National Institute of Informatics
Nakamasa Inoue
Institute of Science Tokyo
Abstract
Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are publicly available on the project page.
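A hedged sketch of the progressive zoom-in idea: detect, crop a context window around the current best box, re-run the detector on the crop, and map the result back to the original image. The detector below is a placeholder callable, not GroundingDINO or the paper's PIZA adapter.

```python
# Illustrative progressive zoom-in loop for grounding small objects. The
# `detector` argument is a placeholder for any text-conditioned detector that
# returns a single (x0, y0, x1, y1) box in the coordinates of the image it sees.
from PIL import Image

def progressive_zoom(image, expression, detector, steps=3, context=2.0):
    offset_x, offset_y = 0, 0
    crop = image
    x0, y0, x1, y1 = detector(crop, expression)
    for _ in range(steps):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        half = max(x1 - x0, y1 - y0) * context / 2
        left, top = int(max(0, cx - half)), int(max(0, cy - half))
        right = int(min(crop.width, cx + half))
        bottom = int(min(crop.height, cy + half))
        crop = crop.crop((left, top, right, bottom))
        offset_x, offset_y = offset_x + left, offset_y + top
        # The object now occupies a larger fraction of the detector's input,
        # which is the point of zooming progressively for small objects.
        x0, y0, x1, y1 = detector(crop, expression)
    return (offset_x + x0, offset_y + y0, offset_x + x1, offset_y + y1)

def dummy_detector(img, text):
    # Placeholder detector that always returns a centered box.
    return (img.width * 0.45, img.height * 0.45, img.width * 0.55, img.height * 0.55)

image = Image.new("RGB", (1280, 960))
print(progressive_zoom(image, "the small traffic sign on the right", dummy_detector))
```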
Knowledge-Guided Part Segmentation
Xuejian Gou
Xidian University
Fang Liu
Xidian University
Licheng Jiao
Xidian University
Shuo Li
Xidian University
Lingling Li
Xidian University
Hao Wang
Xidian University
Xu Liu
Xidian University
Puhua Chen
Xidian University
Wenping Ma
Xidian University
Abstract
In real-world scenarios, objects and their parts inherently possess both coarse-grained differences and intricate fine-grained structural relationships. These characteristics can be formalized as knowledge, leveraged for fine-grained part comprehension. However, existing part segmentation models consistently fail to capture these complex inter-part relationships, treating parts as independent entities and disregarding object-level distinctions. To address these limitations, we propose a novel Knowledge-Guided Part Segmentation (KPS) framework. Our approach automatically extracts structural relationships between parts using a large language model (LLM) and integrates them into a knowledge graph. Subsequently, a structural knowledge guidance module employs a graph convolutional network (GCN) to model these relationships. Furthermore, a coarse-grained object guidance module captures object-specific distinctions and integrates them as visual guidance. The integrated insights from the part structure and object differentiation guide the fine-grained part segmentation. Our KPS achieves notable improvements in segmentation performance, with a 4.96% mIoU gain on PartImageNet and a 3.73% gain on Pascal-Part. Moreover, in the open-vocabulary setting on Pascal-Part-116, it improves hIoU by 3.25%, highlighting the effectiveness of knowledge guidance in enhancing fine-grained part segmentation.
GEOPARD: Geometric Pretraining for Articulation Prediction in 3D Shapes
Pradyumn Goyal
UMass Amherst
Dmitry Petrov
UMass Amherst
Sheldon Andrews
Ecole de technologie superieure
Yizhak Ben-Shabat
Roblox
Hsueh-Ti Derek Liu
Roblox
Evangelos Kalogerakis
UMass Amherst, TU Crete
Abstract
We present GEOPARD, a transformer-based architecture for predicting articulation from a single static snapshot of a 3D shape. The key idea of our method is a pretraining strategy that allows our transformer to learn plausible candidate articulations for 3D shapes based on a geometric-driven search without manual articulation annotation. The search automatically discovers physically valid part motions that do not cause detachments or collisions with other shape parts. Our experiments indicate that this geometric pretraining strategy, along with carefully designed choices in our transformer architecture, yields state-of-the-art results in articulation inference in the PartNet-Mobility dataset.
Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
Bhavya Goyal
University of Wisconsin-Madison
Felipe Gutierrez-Barragan
Ubicept
Wei Lin
University of Wisconsin-Madison
Andreas Velten
University of Wisconsin-Madison, Ubicept
Yin Li
University of Wisconsin-Madison
Mohit Gupta
University of Wisconsin-Madison, Ubicept
Abstract
LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at https://bhavyagoyal.github.io/ppc.
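One way to picture a probabilistic point cloud is as an (N, 3) array of coordinates paired with an (N,) array of per-point confidences that downstream modules can consume, for example through confidence-weighted voxel pooling rather than hard filtering. The pooling below is an illustrative consumer, not the paper's detector integration.

```python
# Minimal sketch of the "probabilistic point cloud" idea: each point carries a
# confidence derived from the raw measurement, used here via confidence-weighted
# voxel pooling (an illustrative consumer, not the paper's pipeline).
import numpy as np

def weighted_voxel_centroids(points, confidence, voxel_size=1.0):
    """Confidence-weighted centroid and total confidence per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    centroids = np.zeros((len(uniq), 3))
    weights = np.zeros(len(uniq))
    np.add.at(weights, inverse, confidence)
    for d in range(3):
        np.add.at(centroids[:, d], inverse, confidence * points[:, d])
    return centroids / np.maximum(weights[:, None], 1e-8), weights

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(1000, 3))       # (N, 3) xyz
confidence = rng.uniform(0.0, 1.0, size=(1000,))    # (N,) per-point probability
centroids, voxel_conf = weighted_voxel_centroids(points, confidence)
```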
Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Jiasheng Guo
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Xin Gao
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Yuxiang Yan
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Guanghao Li
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Jian Pu
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Abstract
Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
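A toy decomposition in the spirit of the linear/nonlinear split described above: a learnable black-level-and-gain step followed by a learnable tone curve, both differentiable so a downstream detection loss could update them end-to-end. Parameter choices are assumptions, not the paper's Dark-ISP configuration.

```python
# Illustrative differentiable ISP split into a linear step (black level + gain)
# and a nonlinear step (gamma tone mapping), trainable with a detection loss.
import torch
import torch.nn as nn

class LinearCalibration(nn.Module):
    def __init__(self):
        super().__init__()
        self.black_level = nn.Parameter(torch.tensor(0.02))
        self.gain = nn.Parameter(torch.tensor(4.0))    # strong gain for low light

    def forward(self, raw):
        return ((raw - self.black_level) * self.gain).clamp(0.0, 1.0)

class ToneMapping(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.tensor(0.4545))

    def forward(self, x):
        return x.clamp(min=1e-6) ** self.gamma

isp = nn.Sequential(LinearCalibration(), ToneMapping())
raw = torch.rand(2, 1, 64, 64) * 0.1                    # dark RAW-plane-like input
out = isp(raw)
# In end-to-end training, `out` would feed a detector and gradients from the
# detection loss would update the ISP parameters.
```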
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo
Tsinghua University
Yikang Ding
MEGVII
Xiwu Chen
Mach Drive
Shuo Chen
Tsinghua University
Bohan Li
Shanghai Jiao Tong University
Yingshuang Zou
Tsinghua University
Xiaoyang Lyu
University of Hong Kong
Feiyang Tan
Mach Drive
Xiaojuan Qi
University of Hong Kong
Zhiheng Li
Tsinghua University
Hao Zhao
Tsinghua University
Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable temporal forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations. The project is available at https://royalmelon0505.github.io/DiST-4D
IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
Wenxuan Guo
Tsinghua University
Xiuwei Xu
Tsinghua University
Hang Yin
Tsinghua University
Ziwei Wang
Nanyang Technological University
Jianjiang Feng
Tsinghua University
Jie Zhou
Tsinghua University
Jiwen Lu
Tsinghua University
Abstract
Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL or on a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian Splatting (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during the agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations
Ruoxi Guo
Zhejiang University
Huaijin Pi
The University of Hong Kong
Zehong Shen
Zhejiang University
Qing Shuai
Zhejiang University
Zechen Hu
Deep Glint
Zhumei Wang
Deep Glint
Yajiao Dong
Deep Glint
Ruizhen Hu
Shenzhen University
Taku Komura
The University of Hong Kong
Sida Peng
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Ziyan Guo
Singapore University of Technology and Design
Zeyu Hu
LIGHTSPEED
De Wen Soh
Singapore University of Technology and Design
Na Zhao
Singapore University of Technology and Design
Abstract
Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: https://diouo.github.io/motionlab.github.io/.
Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Shuang Guo
TU Berlin and Robotics Institute
Friedhelm Hamann
TU Berlin and Robotics Institute, Science of Intelligence Excellence Cluster, Einstein Center for Digital Future
Guillermo Gallego
TU Berlin and Robotics Institute, Science of Intelligence Excellence Cluster, Einstein Center for Digital Future
Abstract
Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, even though those methods output only one quantity. Project page: https://github.com/tub-rip/E2FAI
WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
Yansong Guo
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Jie Hu
National University of Singapore
Yansong Qu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Recent advances in intuitive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time intuitive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a 40x speedup compared to existing SOTA models. Code will be released at https://github.com/Ethan16162/WildSeg3D.
HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network
Juhyung Ha
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Vibhas Kumar Vats
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Soon-heung Jung
Electronics and Telecommunications Research Institute, Daejeon
Alimoor Reza
Department of Mathematics and Computer Science, Drake University, Des Moines, IA
David J. Crandall
Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN
Abstract
Point-cloud upsampling aims to generate dense point sets from sparse or incomplete 3D data. Most existing work uses a point-to-point framework. While this method achieves high geometric precision, it is slow because of irregular memory accesses to process unstructured point data. Alternatively, voxel-based methods offer computational efficiency by using regular grids, but struggle to preserve precise point locations due to discretization. To resolve this efficiency-precision trade-off, we introduce Hybrid Voxels, a representation that combines both voxel occupancy and a continuous point offset. We then present the Hybrid-Voxel Point-cloud Upsampling Network (HVPUNet), an efficient framework built upon this representation. HVPUNet integrates two key modules: (1) Shape Completion to restore missing geometry by filling empty voxels, and (2) Super-Resolution to enhance spatial resolution and capture finer surface details. We also use progressive refinement, operational voxel expansion, and implicit geometric learning. Experimental results demonstrate that HVPUNet can upsample point clouds at significantly lower computational cost than the state-of-the-art, but with comparable model accuracy.
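The hybrid-voxel representation can be illustrated with a toy encode/decode: each point is stored as an integer voxel index plus a continuous offset from the voxel center, so decoding recovers the point positions without discretization loss. Array layouts below are illustrative.

```python
# Toy encode/decode of the hybrid-voxel idea: voxel occupancy (integer index)
# plus a continuous offset from the voxel center, so decoding is lossless.
import numpy as np

def encode_hybrid_voxels(points, voxel_size=0.5):
    keys = np.floor(points / voxel_size).astype(np.int64)   # integer voxel index
    centers = (keys + 0.5) * voxel_size                      # voxel centers
    offsets = points - centers                                # continuous residual
    return keys, offsets

def decode_hybrid_voxels(keys, offsets, voxel_size=0.5):
    centers = (keys + 0.5) * voxel_size
    return centers + offsets                                  # recovers the points

rng = np.random.default_rng(0)
pts = rng.uniform(-5, 5, size=(2000, 3))
keys, offsets = encode_hybrid_voxels(pts)
recon = decode_hybrid_voxels(keys, offsets)
assert np.allclose(recon, pts)
```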
Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
Ruiyang Ha
ShanghaiTech University
Songyi Jiang
ShanghaiTech University
Bin Li
ShanghaiTech University
Bikang Pan
ShanghaiTech University
Yihang Zhu
ShanghaiTech University
Junjie Zhang
Xi'an Jiaotong-Liverpool University
Xiatian Zhu
University of Surrey
Shaogang Gong
Queen Mary University London
Jingya Wang
ShanghaiTech University
Abstract
Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce UniPrompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset and code are available at: https://mp-reid.github.io/.
CarGait: Cross-Attention based Re-ranking for Gait recognition
Gavriel Habib
OriginAI
Noa Barzilay
OriginAI
Or Shimshi
OriginAI
Rami Ben-Ari
OriginAI
Nir Darshan
OriginAI
Abstract
Gait recognition is a computer vision task that identifies individuals based on their walking patterns. Its performance is commonly evaluated by ranking a gallery of candidates and measuring the identification accuracy at Rank-K. Existing models are typically single-staged, searching for the probe's nearest neighbors in a gallery, using a global feature representation. While these models can excel at retrieving the correct identity within the top-K predictions, they often struggle when hard negatives are among the top shortlist, leading to relatively low performance at the highest ranks (e.g., Rank-1). In this paper, we introduce CarGait, a Re-ranking (re-ordering the top-K list) method for gait recognition, leveraging the fine-grained correlations between pairs of gait sequences, through cross-attention between gait strips. This re-ranking scheme can be adapted to existing single-stage models to enhance their final results. We demonstrate the capabilities of CarGait by extensive experiments on three common gait datasets, Gait3D, GREW, and OU-MVLP, and seven different gait models, showing consistent gains in Rank-1 and Rank-5 accuracy, while outperforming existing re-ranking approaches and a strong baseline.
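As a rough illustration of the two-stage retrieval scheme described above, the sketch below first ranks a gallery by cosine similarity on global features and then re-orders only the top-K shortlist with a finer pairwise scorer. The scorer is a toy placeholder standing in for a learned cross-attention matcher, and all names are assumptions, not CarGait's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rerank_top_k(probe_feat, gallery_feats, pair_scorer, k=10):
    """Two-stage retrieval: global ranking, then re-order only the top-K.

    pair_scorer(probe_feat, gallery_feat) stands in for a fine-grained
    pairwise matcher; higher score means a better match.
    """
    sims = l2_normalize(gallery_feats) @ l2_normalize(probe_feat)
    order = np.argsort(-sims)                 # stage 1: global ranking
    top_k = order[:k]
    fine = np.array([pair_scorer(probe_feat, gallery_feats[i]) for i in top_k])
    reranked = top_k[np.argsort(-fine)]       # stage 2: re-order the shortlist
    return np.concatenate([reranked, order[k:]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probe = rng.normal(size=256)
    gallery = rng.normal(size=(100, 256))
    # toy pairwise scorer: negative Euclidean distance (placeholder)
    scorer = lambda p, g: -np.linalg.norm(p - g)
    print(rerank_top_k(probe, gallery, scorer, k=10)[:10])
```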
DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
Yuval Haitman
General Motors, Technical Center Israel
Oded Bialer
General Motors, Technical Center Israel
Abstract
Radar-based object detection is essential for autonomous driving due to radar's long detection range. However, the sparsity of radar point clouds, especially at long range, poses challenges for accurate detection. Existing methods increase point density through temporal aggregation with ego-motion compensation, but this approach introduces scatter from dynamic objects, degrading detection performance. We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. Points from previous frames are shifted radially according to their dynamic Doppler component to eliminate radial scatter, with each point assigned a unique aggregation duration based on its Doppler and angle to minimize tangential scatter. DoppDrive is a point cloud density enhancement step applied before detection, compatible with any detector, and we demonstrate that it significantly improves object detection performance across various detectors and datasets. Our project page: https://yuvalhg.github.io/DoppDrive/
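A minimal sketch of the radial shift described above, assuming points are already ego-motion compensated into the current sensor frame and that positive Doppler means motion away from the sensor; the per-point aggregation-duration logic is not reproduced here.

```python
import numpy as np

def doppler_radial_shift(points_xyz, radial_velocity, dt):
    """Shift past-frame radar points along their radial direction.

    radial_velocity is the measured Doppler (m/s, positive = moving away
    from the sensor) and dt is the age of the frame (s). Shifting each point
    by v_r * dt removes the radial scatter a dynamic object would otherwise
    leave behind when several frames are aggregated.
    """
    ranges = np.linalg.norm(points_xyz, axis=1, keepdims=True)
    radial_dirs = points_xyz / np.clip(ranges, 1e-6, None)
    return points_xyz + radial_dirs * radial_velocity[:, None] * dt

if __name__ == "__main__":
    pts = np.array([[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]])
    v_r = np.array([5.0, -2.0])      # m/s away from / toward the sensor
    print(doppler_radial_shift(pts, v_r, dt=0.1))
```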
Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description
Anna-Maria Halacheva
INSAIT, Sofia University 'St. Kliment Ohridski'
Yang Miao
INSAIT, Sofia University 'St. Kliment Ohridski'
Jan-Nico Zaech
INSAIT, Sofia University 'St. Kliment Ohridski'
Xi Wang
INSAIT, Sofia University 'St. Kliment Ohridski', ETH Zurich, TU Munich
Luc Van Gool
INSAIT, Sofia University 'St. Kliment Ohridski'
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets and algorithms approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered in the research field. In this work, we address this shortcoming by introducing: (1) Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. Articulate3D provides 8 types of annotations for articulated objects, covering parts and detailed motion information, all stored in a standardized scene representation format designed for scalable 3D content creation, exchange and seamless integration into simulation environments. (2) USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects. We evaluate USDNet on Articulate3D as well as two existing datasets, demonstrating the advantage of our unified dense prediction approach. Furthermore, we highlight the value of Articulate3D through cross-dataset and cross-domain evaluations and showcase its applicability in downstream tasks such as scene editing through LLM prompting and robotic policy training for articulated object manipulation. We provide open access to our dataset, benchmark, and method's source code.
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Shadi Hamdan
Koc University, KUIS AI Center
Chonghao Sima
The University of Hong Kong
Zetong Yang
OpenDriveLab
Hongyang Li
The University of Hong Kong
Fatma Güney
Koc University, KUIS AI Center
Abstract
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models to respond in a timely manner to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms. Code and checkpoints can be found here.
All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Zongyan Han
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Mohamed El Amine Boudjoghra
Technical University of Munich
Jiahua Dong
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Jinhong Wang
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Rao Muhammad Anwer
Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi
Abstract
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
Shengdong Han
School of Computer Science, Nanjing University of Posts and Telecommunications
Shangdong Yang
School of Computer Science, Nanjing University of Posts and Telecommunications
Yuxuan Li
VCIP, CS, Nankai University
Xin Zhang
VCIP, CS, Nankai University
Xiang Li
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Jian Yang
VCIP, CS, Nankai University
Ming-Ming Cheng
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Yimian Dai
VCIP, CS, Nankai University, NKIARI, Futian, Shenzhen
Abstract
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other state-of-the-art models, available at https://github.com/GrokCV/GrokCSO.
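DISTA-Net builds on iterative shrinkage-thresholding. The static, textbook ISTA iteration it generalizes looks like the following; the learned, dynamically predicted step sizes and thresholds are what the paper adds and are not shown here.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, lam=0.1, n_iters=200):
    """Classic ISTA for 0.5 * ||Ax - y||^2 + lam * ||x||_1.

    Uses a fixed step size (1 / Lipschitz constant of the gradient) and a
    fixed threshold; a dynamic variant would predict both per iteration.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, lam * step)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(64, 256))
    x_true = np.zeros(256)
    x_true[rng.choice(256, 5, replace=False)] = rng.normal(size=5)
    y = A @ x_true
    x_hat = ista(A, y, lam=0.01)
    print("largest recovered entries:", np.argsort(-np.abs(x_hat))[:5])
```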
Extrapolated Urban View Synthesis Benchmark
Xiangyu Han
NYU
Zhen Jia
NYU
Boyi Li
NVIDIA
Yan Wang
NVIDIA
Boris Ivanovic
NVIDIA
Yurong You
NVIDIA
Lingjie Liu
UPenn
Yue Wang
NVIDIA
Marco Pavone
Stanford
Chen Feng
NYU
Yiming Li
NVIDIA
Abstract
Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct both quantitative and qualitative evaluations of state-of-the-art NVS methods across different evaluation settings. Our results show that current NVS methods are prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released the data to help advance self-driving and urban robotics simulation technology.
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
Han Han
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Wei Zhai
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Yang Cao
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Bin Li
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Zheng-jun Zha
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Abstract
Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, two event-based datasets for tracking any point are constructed by simulation. The method improves the Survival50 metric by 17.9% over the event-only tracking-any-point baseline. Moreover, on standard feature tracking benchmarks, it outperforms all existing methods, even those that combine events and video frames.
PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
Yufei Han
Beijing University of Posts and Telecommunications
Bowen Tie
Beijing University of Posts and Telecommunications
Heng Guo
Beijing University of Posts and Telecommunications, Xiong'an Aerospace Information Research Institute
Youwei Lyu
Beijing University of Posts and Telecommunications
Si Li
Beijing University of Posts and Telecommunications
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yunpeng Jia
Beijing University of Posts and Telecommunications
Zhanyu Ma
Beijing University of Posts and Telecommunications
Abstract
Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly in the case of recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method. Project page: https://yu-fei-han.github.io/polgs.
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
Haonan Han
Tsinghua University
Rui Yang
The University of Hong Kong
Huan Liao
Tsinghua University
Jiankai Xing
Tsinghua University
Zunnan Xu
Tsinghua University
Xiaoming Yu
Tencent
Junwei Zha
Tencent
Xiu Li
Tsinghua University
Wanhua Li
Harvard University
Abstract
Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that our REPARO offers a comprehensive approach to address the complexities of multi-object 3D scene generation from single images. The demo is available at https://reparo-3d.github.io/
SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
Liang Han
School of Software, Tsinghua University
Xu Zhang
China Telecom
Haichuan Song
Computer Science and Technology, East China Normal University
Kanle Shi
Kuaishou Technology
Yu-Shen Liu
School of Software, Tsinghua University
Zhizhong Han
Department of Computer Science, Wayne State University
Abstract
Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by the limited geometric cues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, which can produce high-quality geometry with sparse-view input, especially in scenarios with small overlap between views. Project page: https://hanl2010.github.io/SparseRecon/.
PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
Xiaoyang Hao
Southern University of Science and Technology
Han Li
Southern University of Science and Technology
Abstract
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
University of Edinburgh
Gen Li
University of Edinburgh
Shreyank N Gowda
University of Nottingham
Robert B. Fisher
University of Edinburgh
Jonathan Huang
Scaled Foundations
Anurag Arnab
Google DeepMind
Laura Sevilla-Lara
University of Edinburgh
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become particularly relevant. Some creative solutions include token selection and merging. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper we take a closer look at this phenomenon and observe 5 principles of the nature of visual tokens. For example, we observe that the value of tokens follows a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. We build on these and further insights to propose a lightweight video model, LITE, that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics-400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs. accuracy. Experiments also show that LITE generalizes across datasets and even other tasks without the need for retraining. The code is released at https://github.com/maggieHao/Efficient-LITE.
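The token-selection principle, keeping the few high-value tokens from a heavy-tailed (Pareto-like) value distribution, can be sketched as follows. The importance scores are a toy input here; in LITE they come from a learned lightweight scorer, and the function name is an assumption.

```python
import numpy as np

def select_tokens(tokens, values, keep_ratio=0.1):
    """Keep only the highest-value visual tokens.

    values is a per-token importance score. With a Pareto-like value
    distribution, a small keep_ratio preserves most of the signal.
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    keep_idx = np.sort(np.argsort(-values)[:n_keep])  # preserve original order
    return tokens[keep_idx], keep_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(1568, 768))   # e.g. 8 frames x 14x14 patches
    values = rng.pareto(a=3.0, size=1568)   # heavy-tailed toy importance scores
    kept, idx = select_tokens(tokens, values, keep_ratio=0.1)
    print(kept.shape[0], "tokens kept out of", tokens.shape[0])
```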
AllTracker: Efficient Dense Point Tracking at High Resolution
Adam W. Harley
Stanford University
Yang You
Stanford University
Xinglong Sun
Stanford University
Yang Zheng
Stanford University
Nikhil Raghuraman
Stanford University
Yunqi Gu
Stanford University
Sheldon Liang
Carnegie Mellon University
Wen-Hsuan Chu
Carnegie Mellon University
Achal Dave
Toyota Research Institute
Suya You
Army Research Laboratory
Rares Ambrus
Toyota Research Institute
Katerina Fragkiadaki
Carnegie Mellon University
Leonidas Guibas
Stanford University
Abstract
We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768 × 1024 pixels on a 40G GPU). A benefit of our design is that we can train jointly on optical flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available: https://alltracker.github.io
TorchAdapt: Towards Light-Agnostic Real-Time Visual Perception
Khurram Azeem Hashmi
DFKI
Karthik Palyakere Suresh
DFKI
Didier Stricker
DFKI
Muhammad Zeshan Afzal
DFKI
Abstract
Low-light conditions significantly degrade the performance of high-level vision tasks. Existing approaches either enhance low-light images without considering normal illumination scenarios, leading to poor generalization, or are tailored to specific tasks. We propose TorchAdapt, a real-time adaptive feature enhancement framework that generalizes robustly across varying illumination conditions without degrading performance in well-lit scenarios. TorchAdapt consists of two complementary modules: the Torch module enhances semantic features beneficial for downstream tasks, while the Adapt module dynamically modulates these enhancements based on input content. Leveraging a novel light-agnostic learning strategy, TorchAdapt aligns feature representations of enhanced and well-lit images to produce powerful illumination-invariant features. Extensive experiments on multiple high-level vision tasks, including object detection, face detection, instance segmentation, semantic segmentation, and video object detection, demonstrate that TorchAdapt consistently outperforms state-of-the-art low-light enhancement and task-specific methods in both low-light and light-agnostic settings. TorchAdapt thus provides a unified, flexible solution for robust visual perception across diverse lighting conditions.
Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
Boyong He
Institute of Artificial Intelligence, Xiamen University
Yuxiang Ji
Institute of Artificial Intelligence, Xiamen University
Zhuoyue Tan
Institute of Artificial Intelligence, Xiamen University
Liaoni Wu
Institute of Artificial Intelligence, Xiamen University
Abstract
Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply a consistency loss to align the auxiliary and ordinary branches, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at Fitness-Generalization-Transferability.
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He
Beijing Academy of Artificial Intelligence
Danshi Li
Galbot
Xinqiang Yu
Galbot
Zekun Qi
Galbot
Wenyao Zhang
Galbot
Jiayi Chen
Galbot
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Zhizheng Zhang
Galbot
Li Yi
Tsinghua University
He Wang
Beijing Academy of Artificial Intelligence
Abstract
As large models gain traction, vision-language models are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM with a flow-matching-based pose head producing instruction-aligned grasp poses for tabletop objects. To evaluate DexVLG's performance, we create benchmarks in simulations and conduct real-world experiments. Extensive experiments demonstrate DexVLG's strong zero-shot generalization capabilities, achieving an over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.
Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
Pei He
Xidian University
Lingling Li
Xidian University
Licheng Jiao
Xidian University
Ronghua Shang
Xidian University
Fang Liu
Xidian University
Shuang Wang
Xidian University
Xu Liu
Xidian University
Wenping Ma
Xidian University
Abstract
Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods. The code will be available at https://github.com/ChicalH/DCGL.
Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection
Qi He
Southwest Jiaotong University, Chengdu
Xiao Wu
Southwest Jiaotong University, Chengdu
Jun-Yan He
Meituan Inc.
Shuai Li
The Hong Kong Polytechnic University
Abstract
Source-Free Domain Adaptive Object Detection transfers knowledge from a labeled source domain to an unlabeled target domain while preserving data privacy by restricting access to source data during adaptation. Existing approaches predominantly leverage the Mean Teacher framework for self-training in the target domain. The exponential moving average (EMA) mechanism in the Mean Teacher stabilizes the training by averaging the student weights over training steps. However, in domain adaptation, its inherent lag in responding to emerging knowledge can hinder the rapid adaptation of the student to target-domain shifts. To address this challenge, Dual-rate Dynamic Teacher (DDT) with Asynchronous EMA (AEMA) is proposed, which implements group-wise parameter updates. In contrast to traditional EMA, which simultaneously updates all parameters, AEMA dynamically decomposes teacher parameters into two functional groups based on their contributions to capture the domain shift. By applying distinct smoothing coefficients to the two groups, AEMA simultaneously enables fast adaptation and historical knowledge retention. Comprehensive experiments carried out on three widely used traffic benchmarks have demonstrated that the proposed DDT achieves superior performance, outperforming SOTA methods by a clear margin. The codes are available at https://github.com/qih96/DDT.
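A minimal sketch of a group-wise (asynchronous) EMA update, assuming the teacher and student are given as dictionaries of parameter arrays and that the fast/slow split is provided externally; how DDT actually chooses that split is not reproduced here.

```python
import numpy as np

def asynchronous_ema_update(teacher, student, fast_keys, m_fast=0.9, m_slow=0.999):
    """Group-wise EMA: the fast group adapts quickly, the slow group retains history.

    teacher/student are dicts of parameter arrays with matching keys;
    fast_keys marks the parameters assigned the smaller smoothing coefficient.
    Standard EMA is recovered when m_fast == m_slow.
    """
    for name, w_student in student.items():
        m = m_fast if name in fast_keys else m_slow
        teacher[name] = m * teacher[name] + (1.0 - m) * w_student
    return teacher

if __name__ == "__main__":
    student = {"backbone.w": np.ones(4), "head.w": np.ones(4)}
    teacher = {"backbone.w": np.zeros(4), "head.w": np.zeros(4)}
    teacher = asynchronous_ema_update(teacher, student, fast_keys={"head.w"})
    print(teacher)  # head.w moves toward the student faster than backbone.w
```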
ERNet: Efficient Non-Rigid Registration Network for Point Sequences
Guangzhao He
Zhejiang University
Yuxi Xiao
ATE3D
Zhen Xu
ATE3D
Xiaowei Zhou
ATE3D
Sida Peng
Zhejiang University
Abstract
Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms the previous state of the art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than a 4x speedup compared to the previous best, offering a significant efficiency improvement.
RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection
Jianfang He
Institute of Automation, Chinese Academy of Sciences
Min Cao
Soochow University
Silong Peng
Institute of Automation, Chinese Academy of Sciences
Qiong Xie
Institute of Automation, Chinese Academy of Sciences
Abstract
Large vision-language models such as CLIP have made significant strides in zero-shot anomaly detection through prompt engineering. However, most existing methods typically process each test image individually, ignoring the practical rarity of abnormal patches in real-world scenarios. Although some batch-based approaches exploit the rarity by processing multiple samples concurrently, they generally introduce unacceptable latency for real-time applications. To mitigate these limitations, we propose RareCLIP, a novel online zero-shot anomaly detection framework that enables sequential image processing in real-time without requiring prior knowledge of the target domain. RareCLIP capitalizes on the zero-shot capabilities of CLIP and integrates a dynamic test-time rarity estimation mechanism. A key innovation of our framework is the introduction of a prototype patch feature memory bank, which aggregates representative features from historical observations and continuously updates their corresponding rarity measures. For each incoming image patch, RareCLIP computes a rarity score by aggregating the rarity measures of its nearest neighbors within the memory bank. Moreover, we introduce a prototype sampling strategy based on dissimilarity to enhance computational efficiency, as well as a similarity calibration strategy to enhance the robustness of rarity estimation. Extensive experiments demonstrate that RareCLIP attains state-of-the-art performance with 98.2% image-level AUROC on MVTec AD and 94.4% on VisA, while achieving a latency of 59.4 ms. Code is available at https://github.com/hjf02/RareCLIP.
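The rarity scoring described above can be sketched as a nearest-neighbor lookup in a prototype memory bank. The sketch below assumes L2-normalized features and a similarity-weighted aggregation, which is an illustrative choice rather than the paper's exact rule; the online bank updates, prototype sampling, and similarity calibration are omitted.

```python
import numpy as np

def patch_rarity(patch_feat, bank_feats, bank_rarity, k=5):
    """Score a patch by the rarity of its nearest prototypes in a memory bank.

    bank_feats holds prototype patch features from past observations and
    bank_rarity their running rarity measures; both would be maintained
    online in a streaming system but are plain arrays here.
    """
    sims = bank_feats @ patch_feat                 # cosine similarity (unit features)
    nn = np.argsort(-sims)[:k]
    weights = np.maximum(sims[nn], 0.0)
    weights = weights / (weights.sum() + 1e-8)
    return float(weights @ bank_rarity[nn])        # similarity-weighted rarity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(512, 640))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    rarity = rng.uniform(size=512)                 # high value = rarely seen so far
    patch = bank[3] + 0.05 * rng.normal(size=640)
    patch /= np.linalg.norm(patch)
    print(patch_rarity(patch, bank, rarity))
```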
Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
Fengchen He
Huazhong University of Science and Technology
Dayang Zhao
Huazhong University of Science and Technology
Hao Xu
Huazhong University of Science and Technology
Tingwei Quan
Huazhong University of Science and Technology
Shaoqun Zeng
Huazhong University of Science and Technology
Abstract
Many studies utilize dual-pixel (DP) sensor phase information for various applications, such as depth estimation and deblurring. However, since DP image features are entirely determined by the camera hardware, DP-depth paired datasets are very scarce, especially when performing depth estimation on customized cameras. To overcome this, studies simulate DP images using ideal optical models. However, these simulations often violate real optical propagation laws, leading to poor generalization to real DP data. To address this, we investigate the domain gap between simulated and real DP data, and propose solutions using the Simulating DP Images from Ray Tracing (Sdirt) scheme. Sdirt generates realistic DP images via ray tracing and integrates them into the depth estimation training pipeline. Experimental results show that models trained with Sdirt-simulated images generalize better to real DP data. The code and collected datasets will be available at https://github.com/LinYark/Sdirt.
SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
Xianglong He
Tsinghua University
Zi-Xin Zou
VAST
Chia-Hao Chen
Tsinghua University
Yuan-Chen Guo
VAST
Ding Liang
VAST
Chun Yuan
Tsinghua University
Wanli Ouyang
The Chinese University of Hong Kong
Yan-Pei Cao
VAST
Yangguang Li
VAST
Abstract
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024³ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ∼82% reduction in Chamfer Distance and a ∼88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
Wenkun He
Tsinghua University
Yun Liu
Tsinghua University
Ruitao Liu
Tsinghua University
Li Yi
Tsinghua University
Abstract
Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. The high correlations and mutual influences among bodies lead to two major challenges, for which we propose solutions. First, to satisfy the high demands for synchronization of different body motions, we mathematically derive a new set of alignment scores during the training process, and use maximum likelihood sampling on a dynamic graphical model for explicit synchronization during inference. Second, the high-frequency interactions between objects are often overshadowed by the large-scale low-frequency movements. To address this, we introduce frequency decomposition and explicitly represent high-frequency components in the frequency domain. Extensive experiments across five datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
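Frequency decomposition of a motion sequence can be illustrated with a plain FFT split; the cutoff frequency, frame rate, and the (T, D) signal layout below are assumptions for the sketch, not the paper's configuration.

```python
import numpy as np

def frequency_decompose(motion, cutoff=5.0, fps=30):
    """Split a motion sequence into low- and high-frequency parts via rFFT.

    motion is (T, D): T frames of a D-dimensional pose/translation signal.
    Frequencies above cutoff (Hz) form the high-frequency component (e.g.,
    fine contact dynamics); the rest is the low-frequency bulk motion.
    """
    spectrum = np.fft.rfft(motion, axis=0)
    freqs = np.fft.rfftfreq(motion.shape[0], d=1.0 / fps)
    low_spec = spectrum.copy()
    low_spec[freqs > cutoff] = 0.0
    high_spec = spectrum - low_spec
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    high = np.fft.irfft(high_spec, n=motion.shape[0], axis=0)
    return low, high

if __name__ == "__main__":
    t = np.linspace(0, 2, 60)[:, None]                        # 2 s at 30 fps
    motion = np.sin(2 * np.pi * 1.0 * t) + 0.1 * np.sin(2 * np.pi * 12.0 * t)
    low, high = frequency_decompose(motion, cutoff=5.0, fps=30)
    print(np.allclose(low + high, motion, atol=1e-8))         # exact decomposition
```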
2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
Marvin Heidinger
Computer Science Department, Technische Universität Darmstadt
Snehal Jauhri
Computer Science Department, Technische Universität Darmstadt
Vignesh Prasad
Computer Science Department, Technische Universität Darmstadt
Georgia Chalvatzaki
Computer Science Department, Technische Universität Darmstadt
Abstract
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios. Project-website: sites.google.com/view/2handedafforder
Kaputt: A Large-Scale Dataset for Visual Defect Detection
Sebastian Höfer
Amazon, Fulfillment Technologies & Robotics
Dorian F. Henning
Amazon, Fulfillment Technologies & Robotics
Artemij Amiranashvili
Amazon, Fulfillment Technologies & Robotics
Douglas Morrison
Amazon, Fulfillment Technologies & Robotics
Mariliza Tzes
Amazon, Fulfillment Technologies & Robotics
Ingmar Posner
Amazon, Fulfillment Technologies & Robotics, University of Oxford
Marc Matvienko
Amazon, Fulfillment Technologies & Robotics
Alessandro Rennola
Amazon, Fulfillment Technologies & Robotics
Anton Milan
Amazon, Fulfillment Technologies & Robotics
Abstract
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Lukas Höllein
Technical University of Munich
Aljaž Božič
Meta
Michael Zollhöfer
Meta
Matthias Nießner
Technical University of Munich
Abstract
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit Gaussian parameters of a scene in thousands of iterations, which can take up to an hour. To this end, we change the optimizer to LM that runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallelization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 20% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that accelerate 3DGS, thus enabling even faster speedups compared to vanilla 3DGS.
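For reference, a single dense Levenberg-Marquardt step on a toy least-squares problem is sketched below. 3DGS-LM never forms J^T J explicitly and instead evaluates Jacobian-vector products in custom CUDA kernels over image subsets, so this only illustrates the update rule being swapped in for ADAM.

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, params, damping=1e-2):
    """One Levenberg-Marquardt update for a least-squares objective.

    Solves (J^T J + damping * I) delta = -J^T r and returns params + delta.
    """
    r = residual_fn(params)
    J = jacobian_fn(params)
    H = J.T @ J + damping * np.eye(len(params))
    delta = np.linalg.solve(H, -J.T @ r)
    return params + delta

if __name__ == "__main__":
    # toy curve fit: y = a * exp(b * x), ground truth a=2.0, b=1.5
    x = np.linspace(0, 1, 50)
    y = 2.0 * np.exp(1.5 * x)
    res = lambda p: p[0] * np.exp(p[1] * x) - y
    jac = lambda p: np.stack([np.exp(p[1] * x),
                              p[0] * x * np.exp(p[1] * x)], axis=1)
    p = np.array([1.0, 1.0])
    for _ in range(20):
        p = lm_step(res, jac, p)
    print(p)   # approaches [2.0, 1.5]
```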
Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
Tianyu Hong
Tianjin University
Xiaobo Zhou
Tianjin University
Wenkai Hu
Tianjin University
Qi Xie
Tianjin University
Zhihui Ke
Tianjin University
Tie Qiu
Qinghai Minzu University
Abstract
Collaborative perception is considered a promising approach to address the inherent limitations of single-vehicle systems by sharing data among vehicles, thereby enhancing performance in perception tasks such as bird's-eye view (BEV) semantic segmentation. However, existing methods share the entire dense, scene-level BEV feature, which contains significant redundancy and lacks height information, ultimately leading to unavoidable bandwidth waste and performance degradation. To address these challenges, we present GSCOOP, the first collaborative semantic segmentation framework that leverages sparse, object-centric 3D Gaussians to fundamentally overcome communication bottlenecks. By representing scenes with compact Gaussians that preserve complete spatial information, GSCOOP achieves both high perception accuracy and communication efficiency. To further optimize transmission, we introduce the Priority-Based Gaussian Selection (PGS) module to adaptively select critical Gaussians and a Semantic Gaussian Compression (SGC) module to compress Gaussian attributes with minimal overhead. Extensive experiments on OPV2V and V2X-Seq demonstrate that GSCOOP achieves state-of-the-art performance, even with more than 500x lower communication volume. The code link is https://github.com/SHEVIP/GSCOOP.
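A rough sketch of priority-based selection followed by simple attribute compression: the priority scores and the uniform quantizer below are placeholders for illustration, not the PGS/SGC modules themselves.

```python
import numpy as np

def select_and_compress(means, features, priority, k=256, n_bits=8):
    """Pick the k highest-priority Gaussians and quantize their features.

    priority is a per-Gaussian importance score (assumed given here).
    Features are uniformly quantized to n_bits per value as a simple
    stand-in for attribute compression before transmission.
    """
    keep = np.argsort(-priority)[:k]
    f = features[keep]
    f_min, f_max = f.min(), f.max()
    levels = 2 ** n_bits - 1
    codes = np.round((f - f_min) / (f_max - f_min + 1e-8) * levels).astype(np.uint8)
    return {"means": means[keep].astype(np.float16),
            "codes": codes,
            "range": (float(f_min), float(f_max))}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    payload = select_and_compress(rng.normal(size=(10000, 3)),
                                  rng.normal(size=(10000, 32)),
                                  rng.uniform(size=10000))
    print(payload["codes"].shape, payload["means"].dtype)
```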
General Compression Framework for Efficient Transformer Object Tracking
Lingyi Hong
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Jinglun Li
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Xinyu Zhou
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Shilin Yan
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Pinxue Guo
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Kaixun Jiang
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Zhaoyu Chen
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Shuyong Gao
Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University
Runze Li
Lenovo Research
Xingdong Sheng
Lenovo Research
Abstract
Previous works have attempted to improve tracking efficiency through lightweight architecture design or knowledge distillation from teacher models to compact student trackers. However, these solutions often sacrifice accuracy for speed to a great extent, and also have the problems of complex training process and structural limitations. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages to break the limitation of model structure. Additionally, we also design a unique replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior and simplifies the training process. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the compression process. CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of our CompressTracker. Our CompressTracker-SUTrack, compressed from SUTrack, retains about 99% performance on LaSOT (72.2% AUC) while achieving a 2.42x speed-up. Code is available at here.
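Replacement training can be sketched as a forward pass that randomly routes each stage through either the student or the teacher; the callable-stage interface and the replacement probability below are assumptions for illustration.

```python
import random

def replacement_forward(x, student_stages, teacher_stages, p_replace=0.5):
    """Forward pass where each student stage may be swapped for the teacher's.

    student_stages and teacher_stages are aligned lists of callables (each
    mapping features to features). Randomly routing through teacher stages
    forces every student stage to stay compatible with the teacher's
    intermediate representations.
    """
    for s_stage, t_stage in zip(student_stages, teacher_stages):
        stage = t_stage if random.random() < p_replace else s_stage
        x = stage(x)
    return x

if __name__ == "__main__":
    random.seed(0)
    teacher = [lambda v, k=k: v + k for k in range(4)]          # toy 4-stage teacher
    student = [lambda v, k=k: v + k + 0.01 for k in range(4)]   # slightly-off student
    print(replacement_forward(0.0, student, teacher))
```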
4D Visual Pre-training for Robot Learning
Chengkai Hou
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yanjie Ze
Shanghai Qizhi Institute
Yankai Fu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zeyu Gao
CASIA
Songbo Hu
Tsinghua University
Yue Yu
Tsinghua University
Shanghang Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Huazhe Xu
Shanghai Qizhi Institute
Abstract
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on the larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visualpretraining.github.io/.
FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
Hao-Yu Hou
National Tsing Hua University
Chun-Yi Lee
National Taiwan University
Motoharu Sonogashira
RIKEN
Yasutomo Kawanishi
RIKEN
Abstract
The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.
Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
Petr Hruby
ETH Zürich
Marc Pollefeys
ETH Zürich / Microsoft Spatial AI Lab
Abstract
We propose a novel approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. This allows pose estimation without explicitly modeling camera motion. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras. Our approach is designed as a foundational building block for rolling shutter structure-from-motion (SfM), where no motion model is required, and each scanline's pose can be computed independently. We classify minimal solvers for this problem in both generic and specialized settings, including cases with parallel lines and known gravity direction, assuming known intrinsics and no lens distortion. Furthermore, we develop minimal solvers for the parallel-lines scenario, both with and without gravity priors, by leveraging connections between this problem and the estimation of 2D structure from 1D cameras. Experiments on rolling shutter images from the Fastec dataset demonstrate the feasibility of our approach for initializing rolling shutter SfM, highlighting its potential for further development. The code will be made publicly available.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
Peng-Hao Hsu
National Tsing Hua University
Ke Zhang
Amazon
Fu-En Wang
Amazon
Tao Tu
Cornell University
Ming-Feng Li
Carnegie Mellon University
Yu-Lun Liu
National Yang Ming Chiao Tung University
Albert Y. C. Chen
Amazon
Min Sun
National Tsing Hua University
Cheng-Hao Kuo
Amazon
Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET, where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from the 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. The key to training a highly accurate single-stage detector is that both losses are learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images as input and demonstrates superior accuracy and speed (0.3 sec. per scene) on the ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform both a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.
Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
Zixuan Hu
School of Computer Science, Peking University
Dongxiao Li
School of Computer Science, Peking University
Xinzhu Ma
The Chinese University of Hong Kong
Shixiang Tang
The Chinese University of Hong Kong
Xiaotong Li
School of Computer Science, Peking University
Wenhan Yang
Peng Cheng Laboratory, Shenzhen, China
Ling-Yu Duan
School of Computer Science, Peking University
Abstract
Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types. The source code is available at https://github.com/hzcar/DUO.
DyGS-SLAM: Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes
Xinggang Hu
Dalian University of Technology
Chenyangguang Zhang
Tsinghua University
Mingyuan Zhao
University of Chinese Academy of Sciences
Yuanze Gui
Beijing University of Technology
Xiangkui Zhang
Dalian University of Technology
Xiangyang Ji
Tsinghua University
Abstract
In dynamic scenes, achieving accurate camera localization and reconstructing a long-term consistent map containing only the static background are two major challenges faced by Visual Simultaneous Localization and Mapping (VSLAM). In current traditional dynamic VSLAM systems, the methods used to handle dynamic objects are primarily designed for localization; if applied to reconstruction, they are prone to introducing motion artifacts. Meanwhile, mask compensation strategies in NeRF- or 3DGS-based dynamic VSLAM systems also face challenges, such as the inability to completely eliminate dynamic object artifacts and low real-time performance. To address these issues, we leverage object detection to extract semantic information and propose a dynamic feature detection algorithm based on both geometry and appearance. This algorithm accurately identifies known and unknown moving objects and determines their actual motion states. To mitigate the issue of insufficient detection box coverage, we design a dynamic object box correction algorithm based on clustering and Gaussian mixture models to comprehensively identify moving object regions. Furthermore, to overcome the limitations of sparse features in texture-scarce environments, we introduce a feature densification strategy based on image texture complexity, enhancing reconstruction quality while maintaining real-time performance. Extensive experimental evaluations demonstrate that our system achieves state-of-the-art localization and reconstruction performance in dynamic scenes and can run in real time on resource-constrained devices.
Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
Tongyan Hua
HKUST(GZ)
Lutao Jiang
HKUST(GZ)
Ying-Cong Chen
HKUST
Wufan Zhao
HKUST(GZ)
Abstract
Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
Yexin Huang
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Yongbin Lin
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Lishengsa Yue
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Zhihong Yao
School of Transportation and Logistics, Southwest Jiaotong University
Jie Wang
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University
Abstract
Human-machine interaction technology requires not only the distribution of human visual attention but also the prediction of the gaze-point trajectory. We introduce PILOT, a programmatic imitation learning approach that predicts a driver's eye movements based on a set of rule-based conditions. These conditions, derived from driving operations and traffic flow characteristics, define how gaze shifts occur. They are initially identified through incremental synthesis, a heuristic search method, and then refined via L-BFGS, a numerical optimization technique. These human-readable rules enable us to understand drivers' eye movement patterns and make efficient and explainable predictions. We also propose DATAD, a dataset that covers 12 types of autonomous driving takeover scenarios, collected from 60 participants and comprising approximately 600,000 frames of gaze-point data. Compared to existing eye-tracking datasets, DATAD includes additional driving metrics and surrounding traffic flow characteristics, providing richer contextual information for modeling gaze behavior. Experimental evaluations of PILOT on DATAD demonstrate superior accuracy and faster prediction speeds compared to four baseline models. Specifically, PILOT reduces the MSE of predicted trajectories by 38.59% to 88.02% and improves the accuracy of gaze-object predictions by 6.90% to 55.06%. Moreover, PILOT achieves these gains with approximately 30% lower prediction time, offering both more accurate and more efficient eye movement prediction.
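The L-BFGS refinement step can be illustrated with a toy rule whose threshold and gains are fitted numerically; the rule form, features, and synthetic data below are invented placeholders that only demonstrate the optimization pattern, not PILOT's actual rules.

```python
# Hedged sketch: refine the parameters of a hand-written gaze rule with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
features = rng.uniform(0, 3, size=(500, 2))       # e.g. [time headway, steering rate] (assumed)
gaze_x = 0.8 * (features[:, 0] < 1.5) + 0.1 * features[:, 1] + rng.normal(0, 0.05, 500)

def predict(params, feats):
    thr, gain, bias = params
    # Soft version of "if headway < thr then gaze shifts ahead", smooth enough for L-BFGS.
    trigger = 1.0 / (1.0 + np.exp((feats[:, 0] - thr) / 0.1))
    return gain * trigger + bias * feats[:, 1]

def mse(params):
    return np.mean((predict(params, features) - gaze_x) ** 2)

res = minimize(mse, x0=np.array([1.0, 0.5, 0.0]), method="L-BFGS-B")
print("refined rule parameters:", res.x, "MSE:", res.fun)
```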
Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang
Michigan State University
Xiaoming Liu
Michigan State University
Abstract
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
HarmonySeg: Tubular Structure Segmentation with Deep-Shallow Feature Fusion and Growth-Suppression Balanced Loss
Yi Huang
DAMO Academy, Alibaba Group
Ke Zhang
Department of Electrical and Computer Engineering, Johns Hopkins University
Wei Liu
DAMO Academy, Alibaba Group
Yuanyuan Wang
Department of Biomedical Engineering, Fudan University
Vishal M. Patel
Department of Electrical and Computer Engineering, Johns Hopkins University
Le Lu
DAMO Academy, Alibaba Group
Xu Han
Department of Hepatobiliary and Pancreatic Surgery, The First Affiliated Hospital of College of Medicine, Zhejiang University
Dakai Jin
DAMO Academy, Alibaba Group
Ke Yan
DAMO Academy, Alibaba Group
Abstract
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability. Code will be released at this link.
Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
You Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Lichao Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Jiayi Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Shengchuan Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N^2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
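A minimal sketch of the mask-driven routing idea behind DHA is given below: boundary tokens receive full softmax attention while the rest use a cheap kernelized linear attention. Both branches and the boundary test are simplified stand-ins for the paper's design.

```python
# Hedged sketch: route tokens to O(N^2) or O(N) attention based on a boundary mask.
import torch
import torch.nn.functional as F

def route_attention(tokens, boundary_mask):
    """tokens: (N, C); boundary_mask: (N,) bool marking boundary-region tokens."""
    out = tokens.clone()
    q = k = v = tokens

    # Full softmax attention restricted to boundary tokens.
    b = boundary_mask
    if b.any():
        attn = F.softmax(q[b] @ k[b].T / q.shape[-1] ** 0.5, dim=-1)
        out[b] = attn @ v[b]

    # Kernelized linear attention (O(N)) for the remaining tokens.
    nb = ~boundary_mask
    if nb.any():
        phi_q, phi_k = F.elu(q[nb]) + 1, F.elu(k[nb]) + 1
        kv = phi_k.T @ v[nb]                              # (C, C)
        z = phi_q @ phi_k.sum(dim=0, keepdim=True).T      # (M, 1) normalizer
        out[nb] = (phi_q @ kv) / (z + 1e-6)
    return out

tokens = torch.randn(1024, 64)
boundary = torch.zeros(1024, dtype=torch.bool); boundary[:128] = True
print(route_attention(tokens, boundary).shape)            # torch.Size([1024, 64])
```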
LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression
Wenjie Huang
Shanghai Jiao Tong University
Qi Yang
University of Missouri-Kansas City
Shuting Xia
Shanghai Jiao Tong University
He Huang
Shanghai Jiao Tong University
Yiling Xu
Shanghai Jiao Tong University
Zhu Li
University of Missouri-Kansas City
Abstract
Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitations of encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first INR-based lossless point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a coding framework that operates at the level of a group of point clouds, together with an effective network initialization strategy, which reduces encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to achieve fast inference and a compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, at convergence on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project page is at https://huangwenjie2023.github.io/LINR-PCGC/.
Learning A Unified Template for Gait Recognition
Panjian Huang
School of Artificial Intelligence, Beijing Normal University
Saihui Hou
School of Artificial Intelligence, Beijing Normal University
Junzhou Huang
Department of Computer Science and Engineering, The University of Texas at Arlington
Yongzhen Huang
School of Artificial Intelligence, Beijing Normal University
Abstract
'What I cannot create, I do not understand.' Human wisdom reveals that creation is one of the highest forms of learning. For example, Diffusion Models have demonstrated remarkable semantic structure and memory in image generation, understanding, and restoration, which intuitively benefits representation learning. However, current gait networks rarely embrace this perspective, relying primarily on learning by contrasting gait samples under varying complex conditions, leading to semantic inconsistency and uniformity issues. To address these issues, we propose Origins with generative capabilities whose underlying philosophy is that different entities are generated from a unified template, inherently regularizing gait representations within a consistent and diverse semantic space to capture accurate gait differences. Admittedly, learning this unified template is exceedingly challenging, as it requires the comprehensiveness of the template to encompass gait representations with various conditions. Inspired by Diffusion Models, Origins diffuses the unified template into timestep templates for gait generative learning, and meanwhile transfers the unified template for gait representation learning. Especially, gait generative and representation learning serve as a unified framework for end-to-end joint training. Extensive experiments on CASIA-B, CCPG, SUSTech1K, Gait3D, GREW and CCGR-MINI demonstrate that Origins performs unified generative and representation learning, achieving superior performance.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang
Beihang University
Yuan-Chen Guo
VAST
Haoran Wang
Shanghai Jiao Tong University
Ran Yi
Shanghai Jiao Tong University
Lizhuang Ma
Shanghai Jiao Tong University
Yan-Pei Cao
VAST
Lu Sheng
Beihang University
Abstract
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to high computational costs and degradation in image quality due to scarce high-quality 3D data. This paper introduces MV-Adapter, an efficient and versatile adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and a parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation and opens up new possibilities due to its efficiency, adaptability and versatility.
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang
Imperial College London
Krystian Mikolajczyk
Imperial College London
Abstract
We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feedforward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Ziyue Huang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Yongchao Feng
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Ziqi Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Shuai Yang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Qingjie Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Yunhong Wang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Abstract
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of the model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Codes and models are available at: https://github.com/floatingstarZ/OpenRSDt.
RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Junwen Huang
Technical University of Munich
Shishir Reddy Vutukur
Technical University of Munich
Peter KT Yu
XYZ Robotics
Nassir Navab
Technical University of Munich
Slobodan Ilic
Technical University of Munich
Benjamin Busam
Technical University of Munich
Abstract
Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang
Shenzhen Campus of Sun Yat-sen University
Chengjian Feng
Meituan
Feng Yan
Meituan
Baihui Xiao
Meituan
Zequn Jie
Meituan
Yujie Zhong
Meituan
Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University
Lin Ma
Meituan
Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive serves as a promising solution for AD in the real world.
Towards Foundational Models for Single-Chip Radar
Tianshu Huang
Carnegie Mellon University
Akarsh Prabhakara
University of Wisconsin-Madison
Chuhan Chen
Carnegie Mellon University
Jay Karhade
Carnegie Mellon University
Deva Ramanan
Carnegie Mellon University
Matthew O'Toole
Carnegie Mellon University
Anthony Rowe
Carnegie Mellon University
Abstract
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10x data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10x increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
Ronggang Huang
South China University of Technology
Haoxin Yang
South China University of Technology
Yan Cai
South China University of Technology
Xuemiao Xu
South China University of Technology
Huaidong Zhang
South China University of Technology
Shengfeng He
Singapore Management University
Abstract
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
Jiaxin Huang
Zhejiang University
Sheng Miao
Zhejiang University
Bangbang Yang
ByteDance
Yuewen Ma
ByteDance
Yiyi Liao
Zhejiang University
Abstract
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection
Bo-Lun Huang
National Yang Ming Chiao Tung University
Zi-Xiang Ni
National Yang Ming Chiao Tung University
Feng-Kai Huang
National Taiwan University
Hong-Han Shuai
National Yang Ming Chiao Tung University
Wen-Huang Cheng
National Taiwan University
Abstract
Accurate and stable lane detection is crucial for the reliability of autonomous driving systems. A core challenge lies in predicting lane positions in complex scenarios, such as curved roads or when markings are ambiguous or absent. Conventional approaches leverage deep learning techniques to extract both high-level and low-level visual features, aiming to achieve a comprehensive understanding of the driving environment. However, these methods often rely on predefined anchors within a single-pass model, limiting their adaptability. The one-shot prediction paradigm struggles with precise lane estimation in challenging scenarios, such as curved roads or adverse conditions like low visibility at night. To address these limitations, we propose a novel cold diffusion-based framework that initializes lane predictions with predefined anchors and iteratively refines them. This approach retains the flexibility and progressive refinement capabilities of diffusion models while overcoming the constraints of traditional hot diffusion techniques. To further enhance the model's coarse-to-fine refinement capabilities, we introduce a multi-resolution image processing strategy, where images are analyzed at different timesteps to capture both global and local lane structure details. Besides, we incorporate a learnable noise variance schedule, enabling the model to dynamically adjust its learning process based on multi-resolution inputs. Experimental results demonstrate that our method significantly improves detection accuracy across a variety of challenging scenarios, outperforming state-of-the-art lane detection methods. Codes and trained weights are available at https://github.com/ntudr/CDiffLane
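The anchor-initialized iterative refinement can be sketched as follows, assuming a toy refiner network and omitting the multi-resolution schedule and learnable noise variance described above; this is an illustrative loop, not the released implementation.

```python
# Hedged sketch: cold-diffusion-style refinement that starts from anchor lanes
# and repeatedly predicts a residual toward the clean lane estimate.
import torch
import torch.nn as nn

class LaneRefiner(nn.Module):
    def __init__(self, num_points=72, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points + feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, num_points))
    def forward(self, lanes, img_feat, t):          # lanes: (B, L, P) x-offsets per lane
        B, L, P = lanes.shape
        cond = torch.cat([img_feat, t.expand(B, 1)], dim=-1)[:, None, :].expand(B, L, -1)
        return self.net(torch.cat([lanes, cond], dim=-1))

def refine_from_anchors(refiner, anchors, img_feat, steps=4):
    lanes = anchors.clone()
    for s in reversed(range(steps)):                # coarse-to-fine refinement steps
        t = torch.tensor([[s / steps]])
        lanes = lanes + refiner(lanes, img_feat, t) # predicted residual toward clean lanes
    return lanes

refiner = LaneRefiner()
anchors = torch.zeros(2, 10, 72)                    # 10 predefined anchor lanes per image
print(refine_from_anchors(refiner, anchors, torch.randn(2, 64)).shape)
```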
Everything is a Video: Unifying Modalities through Next-Frame Prediction
G. Thomas Hudson
Durham University
Dean Slack
Durham University
Thomas Winterbottom
Durham University
Jamie Sterling
Durham University
Chenghao Xiao
Durham University
Junjie Shentu
Durham University
Noura Al Moubayed
Durham University
Abstract
MBTI: Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation
Jungwoo Huh
Yonsei University
Yeseung Park
Yonsei University
Seongjean Kim
Yonsei University
Jungsu Kim
Yonsei University
Sanghoon Lee
Yonsei University
Abstract
Human motion estimation models typically assume a fixed number of input frames, making them sensitive to variations in frame rate and leading to inconsistent motion predictions across different temporal resolutions. This limitation arises because input frame rates inherently determine the temporal granularity of motion capture, causing discrepancies when models trained on a specific frame rate encounter different sampling frequencies. To address this challenge, we propose MBTI (Masked Blending Transformers with Implicit Positional Encoding), a frame rate-agnostic human motion estimation framework designed to maintain temporal consistency across varying input frame rates. Our approach leverages a masked autoencoder (MAE) architecture with masked token blending, which aligns input tokens with a predefined high-reference frame rate, ensuring a standardized temporal representation. Additionally, we introduce implicit positional encoding, which encodes absolute time information using neural implicit functions, enabling more natural motion reconstruction beyond discrete sequence indexing. By reconstructing motion at a high reference frame rate and optional downsampling, MBTI ensures both frame rate generalization and temporal consistency. To comprehensively evaluate MBTI, we introduce EMDB-FPS, an augmented benchmark designed to assess motion estimation robustness across multiple frame rates in both local and global motion estimation tasks. To further assess MBTI's robustness, we introduce the Motion Consistency across Frame rates (MCF), a novel metric to quantify the deviation of motion predictions across different input frame rates. Our results demonstrate that MBTI outperforms state-of-the-art methods in both motion accuracy and temporal consistency, achieving the most stable and consistent motion predictions across varying frame rates.
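The implicit positional encoding of absolute time can be illustrated with Fourier features of continuous timestamps fed to a small MLP, so that 30 fps and 60 fps samplings of the same interval map into one code space; the feature sizes and frequencies below are assumptions, not the paper's configuration.

```python
# Hedged sketch: continuous-time positional encoding via Fourier features + MLP.
import torch
import torch.nn as nn

class ImplicitTimeEncoding(nn.Module):
    def __init__(self, num_freqs=8, dim=128):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
        self.mlp = nn.Sequential(nn.Linear(2 * num_freqs, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, t_sec):                                    # t_sec: (B, T) absolute seconds
        ang = t_sec[..., None] * self.freqs.to(t_sec.device)     # (B, T, F)
        feats = torch.cat([ang.sin(), ang.cos()], dim=-1)
        return self.mlp(feats)                                   # (B, T, dim)

enc = ImplicitTimeEncoding()
t30 = torch.arange(0, 1, 1 / 30)[None]              # 30 fps timestamps over one second
t60 = torch.arange(0, 1, 1 / 60)[None]              # 60 fps timestamps over the same second
print(enc(t30).shape, enc(t60).shape)               # both sequences share one code space
```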
Motion Synthesis with Sparse and Flexible Keyjoint Control
Inwoo Hwang
Seoul National University
Jinseok Bae
Seoul National University
Donggeun Lim
Seoul National University
Young Min Kim
Seoul National University
Abstract
Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector positions for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. The shared second stage then synthesizes a natural whole-body motion that precisely satisfies the task requirement from the dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios. Project page: http://inwoohwang.me/SFControl
SceneMI: Motion In-betweening for Modeling Human-Scene Interaction
Inwoo Hwang
Seoul National University
Bing Zhou
Snap Inc.
Young Min Kim
Seoul National University
Jian Wang
Snap Inc.
Chuan Guo
Snap Inc.
Abstract
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos. Project page: http://inwoohwang.me/SceneMI
Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping
Alberto Jaenal
Ericsson Research
Paula Carbó Cubero
Ericsson Research
José Araujo
Ericsson Research
André Mateus
Ericsson Research
Abstract
The growing presence of vision-based systems in the physical world comes with a major requirement: highly accurate estimation of the pose, a task typically addressed through methods based on local features. All of the available feature-based localization solutions are designed under the assumption that the same feature is used for mapping and localization. However, as the implementation provided by each vendor is based on heterogeneous feature extraction algorithms, collaboration between different devices is not straightforward or even impossible. Although there are some alternatives, such as re-extracting the features or reconstructing the image from them, these are impractical or costly to implement in a real pipeline. To overcome this, and inspired by the seminal Cross-Descriptor work [13], we propose Cross-Feature, a method that applies a patch-based training strategy to a simple MLP that projects features into a common embedding space. As a consequence, our proposal allows suitable correspondences to be established between features computed by heterogeneous algorithms, e.g., SIFT [25] and SuperPoint [10]. We experimentally demonstrate the validity of Cross-Feature by evaluating it on tasks such as Image Matching, Visual Localization, and a new Collaborative Visual Localization and Mapping scenario. We believe this is the first step towards full Visual Localization interoperability. Code is available at https://github.com/EricssonResearch/crossfeat.
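A minimal sketch of projecting heterogeneous descriptors into a common embedding space is shown below; the per-extractor MLPs, dimensions, and nearest-neighbour matching are illustrative assumptions rather than the released Cross-Feature recipe (which trains patch-based).

```python
# Hedged sketch: map SIFT (128-D) and SuperPoint (256-D) descriptors into one
# shared space so cross-extractor matches can be established.
import torch
import torch.nn as nn

class ToCommonSpace(nn.Module):
    def __init__(self, in_dims={"sift": 128, "superpoint": 256}, out_dim=128):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                                    nn.Linear(256, out_dim))
                                   for k, d in in_dims.items()})
    def forward(self, desc, kind):
        return nn.functional.normalize(self.proj[kind](desc), dim=-1)

model = ToCommonSpace()
sift = torch.randn(500, 128)          # descriptors from device A (placeholder data)
sp = torch.randn(480, 256)            # descriptors from device B (placeholder data)
sim = model(sift, "sift") @ model(sp, "superpoint").T
matches = sim.argmax(dim=1)           # nearest neighbours in the common space
print(matches.shape)
```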
Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
SungMin Jang
Konkuk University
Wonjun Kim
Konkuk University
Abstract
Open-vocabulary 3D semantic segmentation has been actively studied by incorporating language features into 3D scene representations. Even though many methods have shown notable improvement on this task, they still have difficulty making language embeddings consistent across different views. This inconsistency often results in mis-labeling, where different language embeddings are assigned to the same part of an object. To address this issue, we propose a simple yet powerful method that aligns language embeddings via identity information. The key idea is to locate language embeddings for the same identity closely in the latent space while pushing them apart otherwise. This approach allows the same object to have identical language embeddings in novel views with accurate semantic masks, which are well aligned with the input text. Furthermore, we propose a progressive mask expanding scheme that enables more accurate extraction of semantic mask boundaries. This scheme is very effective in preserving the boundary shape of the target region by allowing the model to consider the local relationship between segments. Experimental results on benchmark datasets demonstrate that our method delivers state-of-the-art performance in open-vocabulary 3D semantic segmentation. https://github.com/DCVL-3D/ILGS release
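The identity-based alignment can be illustrated with a supervised-contrastive loss that pulls language embeddings of the same instance together and pushes different instances apart; this generic form and its tensor names are stand-ins for the paper's actual objective.

```python
# Hedged sketch: identity-aware alignment of per-primitive language embeddings.
import torch
import torch.nn.functional as F

def identity_alignment_loss(lang_emb, ids, temperature=0.1):
    """lang_emb: (N, D) language features; ids: (N,) integer instance identities."""
    z = F.normalize(lang_emb, dim=-1)
    sim = z @ z.T / temperature                              # (N, N) scaled cosine similarities
    same = (ids[:, None] == ids[None, :]).float()
    same.fill_diagonal_(0)                                   # exclude self-pairs as positives
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(torch.eye(len(ids), dtype=torch.bool), float("-inf")),
        dim=1, keepdim=True)
    denom = same.sum(1).clamp(min=1)
    return -(same * log_prob).sum(1).div(denom).mean()

emb = torch.randn(16, 512, requires_grad=True)
ids = torch.randint(0, 4, (16,))
print(identity_alignment_loss(emb, ids))
```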
Splat-based 3D Scene Reconstruction with Extreme Motion-blur
Hyeonjoong Jang
KAIST
Dongyoung Choi
KAIST
Donggun Kim
KAIST
Woohyun Kang
KAIST
Min H. Kim
KAIST
Abstract
We propose a splat-based 3D scene reconstruction method from RGB-D input that effectively handles extreme motion blur, a frequent challenge in low-light environments. Under dim illumination, RGB frames often suffer from severe motion blur due to extended exposure times, causing traditional camera pose estimation methods, such as COLMAP, to fail. This results in inaccurate camera pose and blurry color input, compromising the quality of 3D reconstructions. Although recent 3D reconstruction techniques like Neural Radiance Fields and Gaussian Splatting have demonstrated impressive results, they rely on accurate camera trajectory estimation, which becomes challenging under fast motion or poor lighting conditions. Furthermore, rapid camera movement and the limited field of view of depth sensors reduce point cloud overlap, limiting the effectiveness of pose estimation with the ICP algorithm. To address these issues, we introduce a method that combines camera pose estimation and image deblurring using a Gaussian Splatting framework, leveraging both 3D Gaussian splats and depth inputs for enhanced scene representation. Our method first aligns consecutive RGB-D frames through optical flow and ICP, then refines camera poses and 3D geometry by adjusting Gaussian positions for optimal depth alignment. To handle motion blur, we model camera movement during exposure and deblur images by comparing the input with a series of sharp, rendered frames. Experiments on a new RGB-D dataset with extreme motion blur show that our method outperforms existing approaches, enabling high-quality reconstructions even in challenging conditions. This approach has broad implications for 3D mapping applications in robotics, autonomous navigation, and augmented reality. Both code and dataset are publicly available on https://github.com/KAISTVCLAB/gs-extreme-motion-blur.
Sparfels: Fast Reconstruction from Sparse Unposed Imagery
Shubhendu Jena
Inria, Univ. Rennes, CNRS, IRISA
Amine Ouasfi
Inria, Univ. Rennes, CNRS, IRISA
Mae Younes
Inria, Univ. Rennes, CNRS, IRISA
Adnane Boukhayma
Inria, Univ. Rennes, CNRS, IRISA
Abstract
We present a method for Sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations, to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting on reconstruction and novel view benchmarks based on established multi-view datasets. Code will be made available at https://shubhendujena.github.io/Sparfels-web/
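The splatted color variance along a ray can be written directly from compositing weights and splat colors, as in the sketch below; the toy inputs and normalization are assumptions, but the quantity computed is the standard second moment minus squared mean under the per-ray weights.

```python
# Hedged sketch: per-ray weighted color variance from splat weights and colors.
import torch

def splatted_color_variance(weights, colors, eps=1e-8):
    """weights: (R, S) compositing weights per ray; colors: (R, S, 3) splat colors."""
    w = weights / (weights.sum(dim=1, keepdim=True) + eps)   # normalize along the ray
    mean = (w[..., None] * colors).sum(dim=1)                # (R, 3) expected color E[c]
    second = (w[..., None] * colors ** 2).sum(dim=1)         # (R, 3) second moment E[c^2]
    return (second - mean ** 2).clamp(min=0).sum(dim=-1)     # (R,) per-ray variance

weights = torch.rand(4096, 16)          # 16 overlapping splats per ray (placeholder)
colors = torch.rand(4096, 16, 3)
var = splatted_color_variance(weights, colors)
print(var.mean())                        # reducing this moment sharpens recovered shape
```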
Robust Adverse Weather Removal via Spectral-based Spatial Grouping
Yuhwan Jeong
KAIST
Yunseo Yang
KAIST
Youngho Yoon
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address these issues, we propose the Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations.
Test-Time Prompt Tuning for Zero-Shot Depth Completion
Chanhwi Jeong
GIST
Inhwan Bae
GIST
Jin-Hwi Park
Chung-Ang University
Hae-Gon Jeon
Yonsei University
Abstract
Zero-shot depth completion using metric scales remains challenging, primarily due to performance limitations such as domain specificity and sensor characteristics. One recent emerging solution is to integrate monocular depth foundation models into depth completion frameworks, yet such efforts still face issues with suboptimal performance and often require further adaptation to the target task. Surprisingly, we find that a simple test-time training, which finetunes monocular depth foundation models on sparse depth measurements from sensors just as it is, yields reasonable results. However, this test-time training obviously incurs high computational costs and introduces biases towards specific conditions, making it impractical for real-world scenarios. In this paper, we introduce a new approach toward parameter-efficient zero-shot depth completion. Our key idea in this work is to leverage visual prompt tuning, achieving sensor-specific depth scale adaptation without forgetting foundational knowledge. Experimental results on diverse datasets demonstrate that our approach outperforms relevant state-of-the-art methods, showing superior generalization and efficiency. Code is publicly available at https://github.com/ch5374/TestPromptDC
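A hedged sketch of the test-time adaptation idea follows: freeze the depth model and optimize only a small prompt against the sparse sensor depth. For simplicity the prompt is injected additively at the input, which is a simplification of visual prompt tuning inside the network; the toy model, function names, and shapes are assumptions.

```python
# Hedged sketch: tune only a prompt tensor at test time against sparse depth.
import torch

def tune_prompt(depth_model, image, sparse_depth, valid_mask, steps=50, lr=1e-2):
    """image: (1, 3, H, W); sparse_depth, valid_mask: (1, 1, H, W)."""
    for p in depth_model.parameters():
        p.requires_grad_(False)                              # keep foundational knowledge frozen
    prompt = torch.zeros(1, 3, *image.shape[-2:], requires_grad=True)
    opt = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        pred = depth_model(image + prompt)                   # prompt injected at the input here
        loss = ((pred - sparse_depth).abs() * valid_mask).sum() / valid_mask.sum().clamp(min=1)
        opt.zero_grad(); loss.backward(); opt.step()
    return prompt.detach()

# Stand-in "depth model" and data, only to show the optimization loop runs.
toy_model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
                                torch.nn.Conv2d(16, 1, 3, padding=1))
img = torch.rand(1, 3, 64, 64)
sparse = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) < 0.02).float()             # ~2% of pixels carry sensor depth
print(tune_prompt(toy_model, img, sparse, mask).shape)
```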
MMGeo: Multimodal Compositional Geo-Localization for UAVs
Yuxiang Ji
Institute of Artificial Intelligence, Xiamen University
Boyong He
Institute of Artificial Intelligence, Xiamen University
Zhuoyue Tan
Institute of Artificial Intelligence, Xiamen University
Liaoni Wu
Institute of Artificial Intelligence, Xiamen University
Abstract
Multimodal geo-localization methods can inherently overcome the limitations of unimodal sensor systems by leveraging complementary information from different modalities. However, existing retrieval-based methods rely on a comprehensive multimodal database, which is often challenging to fulfill in practice. In this paper, we introduce a more practical problem for localizing drone-view images by collaborating multimodal data within a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. We present MMGeo that learns to push the composition of multimodal representations to the target reference map through a unified framework. By utilizing a comprehensive multimodal query (image, point cloud/depth/text), we can achieve more robust and accurate geo-localization, especially in unknown and complex environments. Additionally, we extend two visual geo-localization datasets, GTA-UAV and UAV-VisLoc, to multi-modality, establishing the first UAV geo-localization datasets that combine image, point cloud, depth and text data. Experiments demonstrate the effectiveness of MMGeo for UAV multimodal compositional geo-localization, as well as the generalization capabilities to real-world scenarios. The code and dataset are at https://github.com/Yux1angJi/MMGeo.
OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving
Mingqian Ji
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Shanshan Zhang
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Jian Yang
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Abstract
Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noises. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, the side-product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2% mAP and 64.8% NDS on the nuScenes test benchmark. Code is available at https://github.com/Mingqj/OcRFDet.
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji
ShanghaiTech University
Ye Shi
ShanghaiTech University
Zichen Jin
ShanghaiTech University
Kangyi Chen
ShanghaiTech University
Lan Xu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Jingya Wang
ShanghaiTech University
Abstract
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration. Project page: https://humanx-interaction.github.io/
H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
Heng Jia
Zhejiang University
Linchao Zhu
Zhejiang University
Na Zhao
Singapore University of Technology and Design
Abstract
Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
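The Plücker-coordinate ray encoding used for camera-aware attention can be computed per pixel as a direction plus its moment, as sketched below under a simple pinhole assumption; the shapes and example intrinsics are illustrative.

```python
# Hedged sketch: 6D Plücker ray codes (direction d, moment o x d) for every pixel.
import torch

def plucker_rays(K, c2w, H, W):
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world; returns (H, W, 6)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs_cam = torch.stack([(u - K[0, 2]) / K[0, 0],
                            (v - K[1, 2]) / K[1, 1],
                            torch.ones_like(u)], dim=-1)          # (H, W, 3) camera-frame rays
    R, o = c2w[:3, :3], c2w[:3, 3]
    d = torch.nn.functional.normalize(dirs_cam @ R.T, dim=-1)     # world-space directions
    m = torch.cross(o.expand_as(d), d, dim=-1)                    # moment o x d
    return torch.cat([d, m], dim=-1)                              # independent of the point on the ray

K = torch.tensor([[300., 0, 128], [0, 300., 128], [0, 0, 1]])
rays = plucker_rays(K, torch.eye(4), 256, 256)
print(rays.shape)                                                  # torch.Size([256, 256, 6])
```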
PrimHOI: Compositional Human-Object Interaction via Reusable Primitives
Kai Jia
Beijing Institute of Technology
Tengyu Liu
National Key Laboratory of General Artificial Intelligence, BIGAI
Yixin Zhu
Peking University
Mingtao Pei
Beijing Institute of Technology
Siyuan Huang
National Key Laboratory of General Artificial Intelligence, BIGAI
Abstract
Synthesizing realistic Human-Object Interaction (HOI) motions is essential for creating believable digital characters and intelligent robots. Existing approaches rely on data-intensive learning models that struggle with the compositional structure of daily HOI motions, particularly for complex multi-object manipulation tasks. The exponential growth of possible interaction scenarios makes comprehensive data collection prohibitively expensive. The fundamental challenge is synthesizing unseen, complex HOI sequences without extensive task-specific training data. Here we show that PrimHOI generates complex HOI motions through spatial and temporal composition of generalizable interaction primitives defined by relative geometry. Our approach demonstrates that repetitive local contact patterns (grasping, clamping, and supporting) serve as reusable building blocks for diverse interaction sequences. Unlike previous data-driven methods requiring end-to-end training for each task variant, PrimHOI achieves zero-shot transfer to unseen scenarios through hierarchical primitive planning. Experimental validation demonstrates substantial improvements in adaptability, diversity, and motion quality compared to existing approaches.
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
Juntao Jian
Shenzhen University
Xiuping Liu
Dalian University of Technology
Zixuan Chen
Dalian University of Technology
Manyi Li
Shandong University
Jian Liu
Shenyang University of Technology
Ruizhen Hu
Shenzhen University
Abstract
Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. However, it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution serves as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate remarkable performance compared with existing approaches. Project page: https://g-dexgrasp.github.io/
Diffusion-based Source-biased Model for Single Domain Generalized Object Detection
Han Jiang
University of Science and Technology of China
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Yongdong Zhang
University of Science and Technology of China
Abstract
Single domain generalized object detection aims to train an object detector on a single source domain and generalize it to any unseen domain. Although existing approaches based on data augmentation exhibit promising results, they overlook domain discrepancies across multiple augmented domains, which limits the performance of object detectors. To tackle these problems, we propose a novel diffusion-based framework, termed SDG-DiffDet, to mitigate the impact of domain gaps on object detectors. The proposed SDG-DiffDet consists of a memory-guided diffusion module and a source-guided denoising module. Specifically, in the memory-guided diffusion module, we design feature statistics memories that mine diverse style information from local parts to augment source features. The augmented features further serve as noise in the diffusion process, enabling the model to capture differences between practical domain distributions. In the source-guided denoising module, we design a text-guided condition to facilitate distribution transfer from any unseen distribution to the source distribution in the denoising process. By combining these two designs, our proposed SDG-DiffDet effectively models feature augmentation and target-to-source distribution transfer within a unified diffusion framework, thereby enhancing the detection performance on unseen domains. Extensive experiments demonstrate that the proposed SDG-DiffDet achieves state-of-the-art performance across two challenging scenarios.
Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Zeren Jiang
University of Oxford
Chuanxia Zheng
University of Oxford
Iro Laina
University of Oxford
Diane Larlus
Naver Labs Europe
Andrea Vedaldi
University of Oxford
Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
Jianfei Jiang
University of Science and Technology Beijing
Qiankun Liu
University of Science and Technology Beijing
Haochen Yu
University of Science and Technology Beijing
Hongyuan Liu
University of Science and Technology Beijing
Liyong Wang
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Abstract
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
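For intuition on how a monocular prior can supervise only the relative structure of the predicted depth, here is a minimal sketch of a relative-consistency style loss, assuming a median/MAD normalization of both depth maps before an L1 comparison. The function name, normalization, and masking are illustrative assumptions; MonoMVSNet's actual loss may be formulated differently.

```python
import torch

def relative_consistency_loss(pred_depth, mono_depth, mask, eps=1e-6):
    """Scale/shift-invariant L1 consistency between MVS and monocular depth.

    pred_depth, mono_depth, mask: (B, H, W) tensors; mask marks valid pixels.
    Illustrative sketch only; statistics are computed over all pixels for brevity.
    """
    def normalize(d):
        # Median/MAD normalization removes the unknown scale and shift of
        # monocular depth so that only relative structure is compared.
        med = d.flatten(1).median(dim=1).values.view(-1, 1, 1)
        mad = (d - med).abs().flatten(1).mean(dim=1).view(-1, 1, 1)
        return (d - med) / (mad + eps)

    diff = (normalize(pred_depth) - normalize(mono_depth)).abs()
    return (diff * mask).sum() / (mask.sum() + eps)
```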
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Wen Jiang
University of Pennsylvania
Boshu Lei
University of Pennsylvania
Katrina Ashton
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
We present an active mapping system which plans for both long-horizon exploration goals and short-term actions using a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLM) or do not consider challenges in localization uncertainty, which is critical in embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based objective. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Hanxiao Jiang
Columbia University
Hao-Yu Hsu
University of Illinois Urbana-Champaign
Kaifeng Zhang
Columbia University
Hsin-Ni Yu
University of Illinois Urbana-Champaign
Shenlong Wang
University of Illinois Urbana-Champaign
Yunzhu Li
Columbia University
Abstract
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning. Project Page: https://jianghanxiao.github.io/phystwin-web/
Real3D: Towards Scaling Large Reconstruction Models with Real Images
Hanwen Jiang
The University of Texas at Austin
Qixing Huang
The University of Texas at Austin
Georgios Pavlakos
The University of Texas at Austin
Abstract
Training single-view Large Reconstruction Models (LRMs) follows the fully supervised route, requiring multi-view supervision. However, the multi-view data typically comes from synthetic 3D assets, which are hard to scale further and are not representative of the distribution of real-world object shapes. To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. Simultaneously, to deal with the noise of real data, Real3D also presents an automatic data curation approach to gather high-quality examples that have a positive impact on training. Our experiments show that Real3D consistently outperforms prior work in diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes.
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Jian-Jian Jiang
Sun Yat-sen University
Xiao-Ming Wu
Sun Yat-sen University
Yi-Xiang He
Sun Yat-sen University
Ling-An Zeng
Sun Yat-sen University
Yi-Lin Wei
Sun Yat-sen University
Dandan Zhang
Imperial College London
Wei-Shi Zheng
Sun Yat-sen University
Abstract
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to account for due to the cooperation they enforce on the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
Dadong Jiang
Tianjin University
Zhi Hou
Shanghai Artificial Intelligence Laboratory
Zhihui Ke
Tianjin University
Xianghui Yang
Tencent
Xiaobo Zhou
Tianjin University
Tie Qiu
Tianjin University
Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a simple yet effective plug-and-play module called TimeFormer to enable existing deformable 3D Gaussian reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments on multi-view and monocular dynamic scenes validate the qualitative and quantitative improvements brought by TimeFormer. Project page: https://patrickddj.github.io/TimeFormer
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
Minchao Jiang
School of Computer Science and Technology, Xidian University
Shunyu Jia
School of Computer Science and Technology, Xidian University
Jiaming Gu
Algorithm R&D Center, Qing Yi (Shanghai)
Xiaoyuan Lu
Shanghai Pudong Cryptography Research Institute
Guangming Zhu
School of Computer Science and Technology, Xidian University
Anqi Dong
Division of Decision and Control Systems and Department of Mathematics, KTH Royal Institute of Technology
Liang Zhang
School of Computer Science and Technology, Xidian University
Abstract
3D Gaussian Splatting (3DGS) has become a driving force in high-quality, real-time rendering for novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, the Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while avoiding semantic ambiguity. Extensive experiments, including ablation studies, demonstrate VoteSplat's effectiveness in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, and hierarchical segmentation. Our code is available at VoteSplat.
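To make the voting step concrete, the toy sketch below lets each Gaussian cast a 3D vote (its position plus a learned offset), accumulates the votes in a voxel grid, and returns the densest cell as an object center. The grid-based accumulation, names, and cell size are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def hough_vote_center(gauss_xyz, offsets, cell=0.05):
    """Accumulate 3D votes and return the center of the densest vote cell.

    gauss_xyz: (N, 3) Gaussian positions; offsets: (N, 3) offsets toward the
    instance center. Illustrative sketch only.
    """
    votes = gauss_xyz + offsets                       # each Gaussian casts one vote
    keys = np.floor(votes / cell).astype(np.int64)    # quantize votes into voxels
    uniq, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inv = inv.reshape(-1)
    best = np.argmax(counts)                          # densest voxel = putative center
    center = votes[inv == best].mean(axis=0)
    return center, int(counts[best])
```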
GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
Yifan Jiao
Institute of Software Chinese Academy of Sciences
Yunhao Li
Institute of Software Chinese Academy of Sciences
Junhua Ding
University of North Texas
Qing Yang
University of North Texas
Song Fu
University of North Texas
Heng Fan
University of North Texas
Libo Zhang
Institute of Software Chinese Academy of Sciences
Abstract
In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results exhibit that these models heavily degrade on GSOT3D, and more efforts are required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to further advance 3D tracking in future research and applications. Our benchmark and model as well as the evaluation toolkit and results are publicly available at https://github.com/ailovejinx/GSOT3D.
6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
Yufeng Jin
Computer Science Department, Technische Universität Darmstadt
Vignesh Prasad
Computer Science Department, Technische Universität Darmstadt
Snehal Jauhri
Computer Science Department, Technische Universität Darmstadt
Mathias Franzius
Honda Research Institute Europe GmbH
Georgia Chalvatzaki
Hessian.AI, Darmstadt
Abstract
Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation and tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Yudong Jin
Zhejiang University
Sida Peng
Zhejiang University
Xuan Wang
Ant Research
Tao Xie
Zhejiang University
Zhen Xu
Zhejiang University
Yifan Yang
Zhejiang University
Yujun Shen
Ant Research
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/.
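A rough sketch of the alternating spatio-temporal sliding schedule is shown below: windows first slide over viewpoints, then over timestamps, and each window of latents would be denoised jointly in turn. Window size, stride, and the loop structure are assumptions for illustration; the actual latent-grid denoiser is not reproduced.

```python
def sliding_windows(num_views, num_frames, win=4, stride=2):
    """Yield (axis, indices) pairs for alternating spatial/temporal denoising passes."""
    for start in range(0, max(num_views - win, 0) + 1, stride):
        yield "space", list(range(start, start + win))   # denoise across viewpoints
    for start in range(0, max(num_frames - win, 0) + 1, stride):
        yield "time", list(range(start, start + win))    # denoise across timestamps

# Usage sketch (denoise_step is a placeholder for one diffusion update):
# for axis, idx in sliding_windows(num_views=16, num_frames=48):
#     latent_grid = denoise_step(latent_grid, axis, idx)
```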
Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation
Shuo Jin
Xi'an Jiaotong-Liverpool University
Siyue Yu
Xi'an Jiaotong-Liverpool University
Bingfeng Zhang
China University of Petroleum (East China)
Mingjie Sun
Soochow University
Yi Dong
University of Liverpool
Jimin Xiao
Xi'an Jiaotong-Liverpool University
Abstract
Training-free open-vocabulary semantic segmentation has advanced with vision-language models like CLIP, which exhibit strong zero-shot abilities. However, CLIP's attention mechanism often wrongly emphasises specific image tokens, namely outliers, which results in irrelevant over-activation. Existing approaches struggle with these outliers that arise in intermediate layers and propagate through the model, ultimately degrading spatial perception. In this paper, we propose a Self-adaptive Feature Purifier framework (SFP) to suppress propagated outliers and enhance semantic representations for open-vocabulary semantic segmentation. Specifically, based on an in-depth analysis of attention responses between image and class tokens, we design a self-adaptive outlier mitigator to detect and mitigate outliers at each layer for propagated feature purification. In addition, we introduce a semantic-aware attention enhancer to augment attention intensity in semantically relevant regions, which strengthens the purified feature to focus on objects. Further, we introduce a hierarchical attention integrator to aggregate multi-layer attention maps to refine spatially coherent feature representations for final segmentation. Our proposed SFP enables robust outlier suppression and object-centric feature representation, leading to more precise segmentation. Extensive experiments show that our method achieves state-of-the-art performance and surpasses existing methods by an average of 4.6% mIoU on eight segmentation benchmarks. The code is released at: https://github.com/Kimsure/SFP.
GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer
Xin Jin
Chang'an University
Haisheng Su
Shanghai Jiao Tong University
Cong Ma
SenseAuto Research
Kai Liu
SenseAuto Research
Wei Wu
SenseAuto Research
Fei Hui
Chang'an University
Junchi Yan
Shanghai Jiao Tong University
Abstract
Lidar-based 3D detection is one of the most popular research fields in autonomous driving. 3D detectors typically detect specific targets in a scene according to the pattern formed by the spatial distribution of point clouds. However, existing voxel-based methods usually adopt MLP and global pooling (e.g., PointNet, CenterPoint) as the voxel feature encoder, which makes it less effective to extract detailed spatial structure information from raw points, leading to information loss and inferior performance. In this paper, we propose a novel graph-based transformer to encode voxel features by condensing the full and detailed geometry of points, termed GeoFormer. We first represent points within a voxel as a graph, based on relative distances, to capture its spatial geometry. Then, we introduce a geometry-guided transformer architecture to encode voxel features, where the adjacent geometric clues are used to re-weight point feature similarities, enabling more effective extraction of geometric relationships between point pairs at varying distances. We highlight that GeoFormer is a plug-and-play module which can be seamlessly integrated to enhance the performance of existing voxel-based detectors. Extensive experiments conducted on three popular outdoor datasets demonstrate that our GeoFormer achieves state-of-the-art performance in both effectiveness and robustness comparisons.
Stereo Any Video: Temporally Consistent Stereo Matching
Junpeng Jing
Imperial College London
Weixun Luo
Imperial College London
Ye Mao
Imperial College London
Krystian Mikolajczyk
Imperial College London
Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively enhance robustness, accuracy, and temporal consistency, establishing a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
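For readers unfamiliar with correlation volumes, the sketch below builds the standard per-row all-pairs cost volume between left and right stereo features; the paper's all-to-all-pairs variant additionally correlates across time, which is not shown. Shapes and the scaling factor are illustrative assumptions.

```python
import torch

def all_pairs_correlation(feat_left, feat_right):
    """Per-row all-pairs correlation between stereo feature maps.

    feat_left, feat_right: (B, C, H, W). Returns a (B, H, W, W) volume of
    scaled dot-product similarities between every left/right pixel pair on
    the same image row. Illustrative sketch only.
    """
    B, C, H, W = feat_left.shape
    fl = feat_left.permute(0, 2, 3, 1)   # (B, H, W, C)
    fr = feat_right.permute(0, 2, 1, 3)  # (B, H, C, W)
    return torch.matmul(fl, fr) / C ** 0.5
```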
Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
Hao Ju
University of Macau
Shaofei Huang
University of Macau
Si Liu
Beihang University
Zhedong Zheng
University of Macau
Abstract
Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent inter-platform matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from the existing transform methods, e.g., polar transform, our BEVs preserve more fine-grained details without significant distortion. To facilitate discriminative intra-platform representation learning, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at 30° and 45° elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other competitive methods, our proposed approach exhibits robustness at lower elevations with more occlusions. The code is available at: https://github.com/HaoDot/Video2BEV-Open.
Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
Sanghun Jung
University of Washington
Jingjing Zheng
Amazon Lab126
Ke Zhang
Amazon Lab126
Nan Qiao
Amazon Lab126
Albert Y. C. Chen
Amazon Lab126
Lu Xia
Amazon Lab126
Chi Liu
Amazon Lab126
Yuyin Sun
Amazon Lab126
Xiao Zeng
Amazon Lab126
Hsiang-Wei Huang
University of Washington
Byron Boots
University of Washington
Min Sun
National Tsing Hua University
Cheng-Hao Kuo
Amazon Lab126
Abstract
Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed in existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows a two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapping or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain an object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
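One plausible reading of a "standardized maximum similarity" score is sketched below: each proposal's best class similarity is standardized against that class's similarity statistics over all proposals before thresholding. This is an assumption-laden illustration, not the paper's exact formula.

```python
import numpy as np

def standardized_max_similarity(sim):
    """sim: (P, C) cosine similarities between P proposals and C class prompts.

    Returns, per proposal, its best class and a z-scored version of the best
    similarity; low scores can be thresholded away as false positives.
    Illustrative sketch only.
    """
    mu = sim.mean(axis=0, keepdims=True)             # per-class mean over proposals
    sigma = sim.std(axis=0, keepdims=True) + 1e-6    # per-class spread
    z = (sim - mu) / sigma
    best_cls = sim.argmax(axis=1)
    sms = z[np.arange(sim.shape[0]), best_cls]
    return best_cls, sms
```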
IM360: Large-scale Indoor Mapping with 360 Cameras
Dongki Jung
University of Maryland, College Park
Jaehoon Choi
University of Maryland, College Park
Yonghan Lee
University of Maryland, College Park
Dinesh Manocha
University of Maryland, College Park
Abstract
We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360° images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D. Project page: https://jdk9405.github.io/IM360/
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
Changwon Kang
Hanyang University
Jisong Kim
Hanyang University
Hongjae Shin
Seoul National University
Junseo Park
Seoul National University
Jun Won Choi
Seoul National University
Abstract
The goal of multi-task learning is to learn to conduct multiple tasks simultaneously based on a shared data representation. While this approach can improve learning efficiency, it may also cause performance degradation due to task conflicts that arise when optimizing the model for different objectives. To address this challenge, we introduce MAESTRO, a structured framework designed to generate task-specific features and mitigate feature interference in multi-task 3D perception, including 3D object detection, bird's-eye view (BEV) map segmentation, and 3D occupancy prediction. MAESTRO comprises three components: the Class-wise Prototype Generator (CPG), the Task-Specific Feature Generator (TSFG), and the Scene Prototype Aggregator (SPA). CPG groups class categories into foreground and background groups and generates group-wise prototypes. The foreground and background prototypes are assigned to the 3D object detection task and the map segmentation task, respectively, while both are assigned to the 3D occupancy prediction task. TSFG leverages these prototype groups to retain task-relevant features while suppressing irrelevant features, thereby enhancing the performance of each task. SPA enhances the prototype groups assigned to 3D occupancy prediction by utilizing the information produced by the 3D object detection head and the map segmentation head. Extensive experiments on the nuScenes and Occ3D benchmarks demonstrate that MAESTRO consistently outperforms existing methods across 3D object detection, BEV map segmentation, and 3D occupancy prediction tasks.
Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
Jae-Young Kang
KAIST
Hoonhee Cho
KAIST
Kuk-Jin Yoon
KAIST
Abstract
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev
Meta AI
Yuri Makarov
Meta AI
Jianyuan Wang
Meta AI
Natalia Neverova
Meta AI
Andrea Vedaldi
Meta AI
Christian Rupprecht
Visual Geometry Group, University of Oxford
Abstract
We introduce CoTracker3, a new state-of-the-art point tracker. With CoTracker3, we revisit the design of recent trackers, removing components and reducing the number of parameters while also improving performance. We also explore the interplay of synthetic and real data. Recent trackers are trained on synthetic videos due to the difficulty of collecting tracking annotations for real data. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. We thus suggest using off-the-shelf trackers as teachers to annotate real videos with pseudo-labels. Compared to other recent attempts at using real data for learning trackers, this scheme is much simpler and achieves better results using 1,000 times less data. CoTracker3 is available here in online (causal) and offline variants.
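The sketch below illustrates one simple way to turn several off-the-shelf teacher trackers into pseudo-labels: keep the consensus track wherever the teachers agree within a pixel tolerance. The agreement rule and array shapes are our assumptions for illustration, not CoTracker3's actual labelling pipeline.

```python
import numpy as np

def pseudo_label_tracks(teacher_preds, agree_px=2.0):
    """Fuse teacher predictions into pseudo-labels with an agreement mask.

    teacher_preds: list of (T, N, 2) arrays, one per teacher tracker, giving
    tracked positions over T frames for N query points. Illustrative sketch only.
    """
    preds = np.stack(teacher_preds)                                  # (K, T, N, 2)
    consensus = preds.mean(axis=0)                                   # (T, N, 2)
    spread = np.linalg.norm(preds - consensus, axis=-1).max(axis=0)  # (T, N)
    valid = spread < agree_px                                        # keep agreeing points only
    return consensus, valid
```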
Towards Safer and Understandable Driver Intention Prediction
Mukilan Karuppasamy
IIIT Hyderabad
Shankar Gangisetty
IIIT Hyderabad
Shyam Nandan Rai
Politecnico di Torino
Carlo Masone
Politecnico di Torino
C V Jawahar
IIIT Hyderabad
Abstract
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before the maneuver occurs, which is crucial for driver safety, i.e., driver intent prediction (DIP), a task that plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatiotemporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multi-label t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
Princeton365: A Diverse Dataset with Accurate Camera Pose
Karhan Kayan
Princeton University
Stamatis Alexandropoulos
Princeton University
Rishabh Jain
Princeton University
Yiming Zuo
Princeton University
Erich Liang
Princeton University
Jia Deng
Princeton University
Abstract
We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360° camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our metric allows comparison of SLAM performance across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360° camera trajectories. Please visit princeton365.cs.princeton.edu for the dataset, code, videos, and submission.
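To give a feel for a flow-induced pose-error measure, the sketch below projects a set of world points with both the ground-truth and the estimated camera pose and reports the mean pixel displacement between the two projections, i.e., the optical flow the pose error would induce. Point sampling and aggregation here are assumptions; the benchmark defines the actual metric.

```python
import numpy as np

def mean_induced_flow(points_w, K, T_gt, T_est):
    """Mean pixel displacement induced by a pose error.

    points_w: (N, 3) world points; K: (3, 3) intrinsics;
    T_gt, T_est: (4, 4) world-to-camera poses. Illustrative sketch only.
    """
    def project(T):
        pc = T[:3, :3] @ points_w.T + T[:3, 3:4]   # camera-space points, (3, N)
        uv = (K @ pc)[:2] / pc[2:3]                # perspective projection, (2, N)
        return uv.T
    flow = project(T_est) - project(T_gt)
    return float(np.linalg.norm(flow, axis=1).mean())
```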
Bridging the Sky and Ground: Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
Wajahat Khalid
School of Cyber Science and Technology, University of Science and Technology of China
Bin Liu
School of Cyber Science and Technology, University of Science and Technology of China
Xulin Li
School of Cyber Science and Technology, University of Science and Technology of China
Muhammad Waqas
School of Cyber Science and Technology, University of Science and Technology of China
Muhammad Sher Afgan
School of Cyber Science and Technology, University of Science and Technology of China
Abstract
Aerial-Ground Person Re-Identification (AG-ReID) is a practical yet challenging task that involves cross-platform matching between aerial and ground cameras. Existing person Re-Identification (Re-ID) methods are primarily designed for homogeneous camera settings, such as ground-to-ground or aerial-to-aerial matching. Therefore, these conventional Re-ID approaches underperform due to the significant viewpoint discrepancies introduced by cross-platform cameras in the AG-ReID task. To address this limitation, we propose a novel and efficient approach, termed View-Invariant Feature Learning for Aerial-Ground Person Re-Identification (VIF-AGReID), which explores view-invariant features without leveraging any auxiliary information. Our approach introduces two key components: (1) Patch-Level RotateMix (PLRM), an augmentation strategy that enhances rotational diversity within local regions of training samples, enabling the model to capture fine-grained view-invariant features, and (2) View-Invariant Angular Loss (VIAL), which mitigates the impact of perspective variations by imposing angular constraints that exponentially penalize large angular deviations, optimizing the similarity of positive pairs while enhancing dissimilarity for hard negatives. These components interact synergistically to drive view-invariant feature learning, enhancing robustness across diverse viewpoints. Extensive experiments on the CARGO, AG-ReIDv1, and AG-ReIDv2 benchmarks demonstrate the effectiveness of our method in addressing the AG-ReID task.
CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching
Minjoo Ki
Yonsei University
Daejung Kim
Naver Labs
Kisung Kim
Naver Labs
Seon Joo Kim
Yonsei University
Jinhan Lee
Naver Labs
Abstract
Text-to-video retrieval is a powerful tool for navigating vast video databases. This is especially useful in autonomous driving to retrieve scenes from a text query to simulate and evaluate a driving system in desired scenarios. However, traditional ranking-based retrieval methods often return partial matches that fail to satisfy all query conditions. To address this, we introduce Inclusive Text-to-Video Retrieval, which retrieves only videos that meet all specified conditions, regardless of additional irrelevant elements. We propose CARIM, a driving scene retrieval framework that employs inclusive text matching. By utilizing a Vision-Language Model and a Large Language Model to generate compressed captions for driving scenes, we reformulate text-to-video retrieval as a more efficient text-to-text retrieval problem, eliminating modality mismatch and heavy annotation cost. We present a novel positive and negative data curation strategy and an attention-based scoring mechanism tailored for driving scene retrieval. Experiments show that CARIM outperforms state-of-the-art retrieval methods, excelling in edge cases where traditional models fail.
Removing Cost Volumes from Optical Flow Estimators
Simon Kiefhaber
Department of Computer Science, Technical University of Darmstadt
Stefan Roth
Department of Computer Science, Technical University of Darmstadt
Simone Schaub-Meyer
Department of Computer Science, Technical University of Darmstadt
Abstract
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being 1.2x faster and having a 6x lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at 20 FPS using only 500 MB of GPU memory.
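One way to realize removing the cost volume over the course of training is to scale the cost-volume branch by a weight that decays to zero partway through training, after which the branch can be dropped entirely at inference. The toy schedule below is only an assumption used to illustrate the idea, not the paper's training strategy.

```python
def cost_volume_weight(step, total_steps, drop_start=0.6):
    """Multiplier for the cost-volume branch: 1.0 early, linearly decayed to 0.0.

    drop_start is the fraction of training after which the decay begins.
    Illustrative sketch only.
    """
    frac = step / float(total_steps)
    if frac < drop_start:
        return 1.0
    return max(0.0, 1.0 - (frac - drop_start) / (1.0 - drop_start))
```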
2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
Jeongyun Kim
Seoul National University
Seunghoon Jeong
Seoul National University
Giseop Kim
DGIST
Myung-Hwan Jeon
Kumoh National Institute of Technology
Eunji Jun
Hyundai Motor Group
Ayoung Kim
Seoul National University
Abstract
Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain-reaction movement of remaining objects without the need for rescanning. TRAN-D is evaluated on both synthetic and real-world sequences, and it consistently demonstrates robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, TRAN-D reduces the mean absolute error by over 39% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, TRAN-D reaches a δ < 2.5 cm accuracy of 48.46%, over 1.5 times that of baselines, which use six images. Code and more results are available at https://jeongyun0609.github.io/TRAN-D/.
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
Junho Kim
EverEx
Hyungjin Chung
EverEx
Byung-Hoon Kim
EverEx
Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pretrained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. Our method encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints, while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art on the MP-100 benchmark in the 1-shot and even 5-shot settings, marking a significant advancement in the field of category-agnostic pose estimation. Code is available here.
DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models
Hyeonwoo Kim
Seoul National University
Sangwon Baik
Seoul National University
Hanbyul Joo
Seoul National University
Abstract
Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such an ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with a Low-Rank Adaptation (LoRA) module, which fine-tunes a pre-trained MDM to learn HOI motion concepts from limited HOI motion samples, and (2) a motion diffusion model for 4D object poses conditioned on the produced human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create novel HOI motions, even for multiple HOI motion concepts, demonstrating the advantage of our pipeline with LoRA in integrating dynamic HOI concepts. Through extensive experiments, we demonstrate that DAViD outperforms baselines in synthesizing HOI motion.
From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
Youngho Kim
KAIST
Hoonhee Cho
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
Gwanghyun Kim
NVIDIA
Xueting Li
NVIDIA
Ye Yuan
NVIDIA
Koki Nagano
NVIDIA
Tianye Li
NVIDIA
Jan Kautz
NVIDIA
Se Young Chun
Seoul National University
Umar Iqbal
NVIDIA
Abstract
Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
Learning 3D Scene Analogies with Neural Contextual Scene Maps
Junho Kim
Seoul National University
Gwangtak Bae
Seoul National University
Eun Sun Lee
Seoul National University
Young Min Kim
Seoul National University
Abstract
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. Project page including the code is available at: https://82magnolia.github.io/3d_scene_analogies/.
Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
Wontae Kim
IPAI, Seoul National University
Keuntek Lee
Department of ECE, INMC, Seoul National University
Nam Ik Cho
IPAI, Seoul National University
Abstract
The image enhancement methods based on 3D lookup tables (3D LUTs) efficiently reduce both model size and runtime by interpolating pre-calculated values at the vertices. However, the 3D LUT methods have a limitation due to their lack of spatial information, as they convert color values on a point-by-point basis. Although spatial-aware 3D LUT methods address this limitation, they introduce additional modules that require a substantial number of parameters, leading to increased runtime as image resolution increases. To address this issue, we propose a method for generating image-adaptive LUTs by focusing on the redundant parts of the tables. Our efficient framework decomposes a 3D LUT into a linear sum of low-dimensional LUTs and employs singular value decomposition (SVD). Furthermore, we enhance the modules for spatial feature fusion to be more cache-efficient. Extensive experimental results demonstrate that our model effectively decreases both the number of parameters and runtime while maintaining spatial awareness and performance. The code is available at https://github.com/WontaeaeKim/SVDLUT.
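As a small illustration of the redundancy argument, the sketch below flattens a 3D LUT, keeps only its top singular components, and measures the error of the resulting low-rank approximation. The particular flattening and rank are assumptions; the paper's decomposition into low-dimensional LUTs may be structured differently.

```python
import numpy as np

def low_rank_lut(lut, rank=8):
    """Rank-r SVD approximation of a (D, D, D, 3) color lookup table.

    Returns the approximated LUT and its mean absolute error. Illustrative sketch only.
    """
    D = lut.shape[0]
    mat = lut.reshape(D, -1)                        # flatten one color axis vs. the rest
    U, S, Vt = np.linalg.svd(mat, full_matrices=False)
    approx = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # keep the top-r singular components
    approx = approx.reshape(lut.shape)
    return approx, float(np.abs(approx - lut).mean())
```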
PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation
Jun-Hee Kim
Korea University
Jumin Han
Korea University
Seong-Whan Lee
Korea University
Abstract
Standard 3D human pose estimation (HPE) benchmarks employ root-centering, which normalizes poses relative to the pelvis but discards absolute root position information. While effective for evaluation, this approach limits real-world applications such as motion tracking, AR/VR, and human-computer interaction, where absolute root position is essential. Moreover, incorporating root position into these models often leads to performance degradation. To address these limitations, we introduce PoseAnchor, a unified framework that seamlessly integrates root position estimation while improving overall pose accuracy. PoseAnchor leverages Iterative Hard Thresholding Robust Least Squares Regression (ITRR), a novel robust regression approach introduced to 3D HPE for the first time. ITRR effectively mitigates the impact of noisy 2D detections, enabling more accurate root position estimation. With ITRR, PoseAnchor enables zero-shot root localization, allowing existing models to estimate absolute root positions without retraining or architectural modifications. ITRR identifies a support set of reliable joints based on their spatial relationships to achieve robust root estimation, effectively filtering out unreliable joints. Beyond zero-shot localization, PoseAnchor incorporates ITRR into a Data-Driven Training framework that selectively utilizes the support set to optimize pose learning. By dynamically filtering high-confidence joint data, PoseAnchor mitigates noise while improving robustness.
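The support-set idea described above, alternating a least-squares fit with hard thresholding of residuals to keep only reliable observations, can be sketched generically as follows. This is a textbook-style iterative-hard-thresholding robust regression loop, not the exact ITRR of PoseAnchor; the line-fitting toy data are purely illustrative.

```python
import numpy as np

def robust_lstsq_hard_threshold(A, b, support_size, n_iters=10):
    """Generic iterative-hard-thresholding robust least squares (illustrative sketch).

    Alternates: (1) solve least squares on the current support set of rows,
    (2) re-select the `support_size` rows with the smallest residuals.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # start from all observations
    support = np.arange(A.shape[0])
    for _ in range(n_iters):
        residuals = np.abs(A @ x - b)
        support = np.argsort(residuals)[:support_size]
        x = np.linalg.lstsq(A[support], b[support], rcond=None)[0]
    return x, support

# toy example: recover y = 2x + 1 despite a few gross outliers
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 30)
ys = 2 * xs + 1 + 0.01 * rng.normal(size=30)
ys[:5] += 5.0                                          # corrupt 5 observations
A = np.stack([xs, np.ones_like(xs)], axis=1)
x_hat, inliers = robust_lstsq_hard_threshold(A, ys, support_size=22)
print(x_hat)                                           # close to [2, 1]
```

In the pose setting, the rows of A would come from per-joint geometric constraints on the root, and the support set plays the role of the reliable joints the abstract refers to.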
Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
Min Kim
KAIST
Younho Jeon
KAIST
Sungho Jo
KAIST
Abstract
Wearable Inertial Measurement Units (IMUs) allow non-intrusive motion tracking, but limited sensor placements can introduce uncertainty in capturing detailed full-body movements. Existing methods mitigate this issue by selecting more physically plausible motion patterns but do not directly address inherent uncertainties in the data. We introduce the Probabilistic Inertial Poser (ProbIP), a novel probabilistic model that transforms sparse IMU data into human motion predictions without physical constraints. ProbIP utilizes RU-Mamba blocks to predict a matrix Fisher distribution over rotations, effectively estimating both rotation matrices and associated uncertainties. To refine motion distribution through layers, our Progressive Distribution Narrowing (PDN) technique enables stable learning across a diverse range of motions. Experimental results demonstrate that ProbIP achieves state-of-the-art performance on multiple public datasets with six or fewer IMU sensors. Our contributions include the development of ProbIP with RU-Mamba blocks for probabilistic motion estimation, applying Progressive Distribution Narrowing (PDN) for uncertainty reduction, and evidence of superior results with six-sensor and reduced-sensor configurations. The code will be available at https://github.com/MinKim14/ProbIP-ICCV2025.
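The matrix Fisher distribution over rotations mentioned above has a standard closed-form mode: projecting its parameter matrix onto SO(3) via an SVD, with the singular values indicating concentration (i.e. how certain the rotation is). The sketch below shows only this standard result, not ProbIP's RU-Mamba parameterization.

```python
import numpy as np

def matrix_fisher_mode(F):
    """Mode of a matrix Fisher distribution MF(R; F) over SO(3).

    Standard result: with F = U diag(s) V^T, the mode is U diag(1, 1, det(U V^T)) V^T,
    i.e. the projection of F onto SO(3). Larger singular values mean higher
    concentration (lower rotational uncertainty).
    """
    U, s, Vt = np.linalg.svd(F)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt, s

# sanity check: a parameter matrix proportional to a rotation recovers that rotation
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
mode, concentration = matrix_fisher_mode(10.0 * Rz)    # scale acts as concentration
print(np.allclose(mode, Rz), concentration)
```

Predicting F directly therefore gives both a rotation estimate (the mode) and an uncertainty measure (the singular values) from a single network output.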
SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
Jongsuk Kim
KAIST
Jaeyoung Lee
KAIST
Gyojin Han
KAIST
Dong-Jae Lee
KAIST
Minki Jeong
AI Center, Samsung Electronics
Junmo Kim
KAIST
Abstract
Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving (E2E AD). Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird's-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.
Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging
Ruangrawee Kitichotkul
Boston University
Shashwath Bharadwaj
Boston University
Joshua Rapp
Mitsubishi Electric Research Laboratories
Yanting Ma
Mitsubishi Electric Research Laboratories
Alexander Mehta
University of California, Berkeley
Vivek K Goyal
Boston University
Abstract
Conventional wisdom suggests that single-photon lidar (SPL) should operate in low-light conditions (< 0.05 photons per laser pulse repetition) to minimize dead-time effects. Many methods have been developed to mitigate these effects in synchronous SPL systems. However, solutions for free-running SPL remain limited despite the advantage of reduced histogram distortion from dead times. To improve the accuracy of free-running SPL, we propose a computationally efficient joint maximum likelihood estimator of the signal flux, the background flux, and the depth using only histograms, along with a complementary regularization framework that incorporates a learned point cloud score model as a prior. Simulations and experiments demonstrate that free-running SPL yields lower estimation errors than its synchronous counterpart under identical conditions, with our regularization further improving accuracy.
DONUT: A Decoder-Only Model for Trajectory Prediction
Markus Knoche
RWTH Aachen University
Daan de Geus
RWTH Aachen University
Bastian Leibe
RWTH Aachen University
Abstract
Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an ‘overprediction' strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Through experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
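The decoder-only unrolling described above amounts to feeding the ever-growing sequence of past and already-predicted waypoints back into a single model at every step. The sketch below shows that rollout loop with a hypothetical model interface and a dummy constant-velocity "model"; DONUT's architecture and overprediction head are not reproduced.

```python
import torch

def autoregressive_rollout(model, history, n_future_steps):
    """Decoder-only unrolling of a trajectory (illustrative sketch).

    model:   callable mapping a (B, T, 2) sequence of xy waypoints to the next
             waypoint of shape (B, 2) -- hypothetical interface.
    history: (B, T_obs, 2) observed past trajectory.
    """
    seq, preds = history, []
    for _ in range(n_future_steps):
        next_wp = model(seq)                        # always sees up-to-date context
        preds.append(next_wp)
        seq = torch.cat([seq, next_wp[:, None]], dim=1)
    return torch.stack(preds, dim=1)                # (B, n_future_steps, 2)

# dummy "model": constant-velocity extrapolation, just to exercise the loop
def const_velocity(seq):
    return seq[:, -1] + (seq[:, -1] - seq[:, -2])

hist = torch.tensor([[[0.0, 0.0], [1.0, 0.0]]])     # moving +1 in x per step
print(autoregressive_rollout(const_velocity, hist, 3))   # [[2,0],[3,0],[4,0]]
```

An overprediction variant would additionally ask the model for waypoints several steps ahead at each iteration as an auxiliary target, while only the next waypoint is appended to the sequence.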
GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
Karlo Koledić
University of Zagreb Faculty of Electrical Engineering and Computing
Luka Petrović
University of Zagreb Faculty of Electrical Engineering and Computing
Ivan Marković
University of Zagreb Faculty of Electrical Engineering and Computing
Ivan Petrović
University of Zagreb Faculty of Electrical Engineering and Computing
Abstract
Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup. Project website: https://unizgferlamor.github.io/gvdepth/
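The vertical-image-position cue referred to above follows from simple pinhole geometry: for a point on a flat ground plane seen by a level camera at height h, depth is inversely proportional to how far below the horizon the point projects. The sketch below shows only this classical cue, not GVDepth's canonical representation or probabilistic fusion.

```python
import numpy as np

def ground_plane_depth(v, fy, cy, cam_height):
    """Depth of a ground-plane pixel under a pinhole model with a level camera.

    v:          vertical pixel coordinate(s) of a point on the ground (below the horizon).
    fy, cy:     vertical focal length and principal point, in pixels.
    cam_height: camera height above the ground plane, in meters.

    For a level camera the horizon maps to v = cy, and projecting (X, h, Z) gives
    v - cy = fy * h / Z, hence Z = fy * h / (v - cy).
    """
    return fy * cam_height / (np.asarray(v, dtype=float) - cy)

# a ground pixel 100 px below the principal point, fy = 1000 px, camera 1.5 m high
print(ground_plane_depth(v=600, fy=1000.0, cy=500.0, cam_height=1.5))   # 15.0 m
```

Because the formula depends on fy and the camera mounting, a naive network can overfit to one setup; a canonical representation that factors these parameters out is what enables the cross-dataset generalization claimed above.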
Embodied Navigation with Auxiliary Task of Action Description Prediction
Haru Kondoh
Institute of Science Tokyo
Asako Kanezaki
Institute of Science Tokyo
Abstract
The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction
Chamin Hewa Koneputugodage
The Australian National University
Dylan Campbell
The Australian National University
Stephen Gould
The Australian National University
Abstract
Recent methods for point cloud surface normal estimation predominantly use the generalized winding number field induced by the normals. Optimizing the field towards satisfying desired properties, such as the input points being on the surface defined by the field, provides a principled way to obtain globally consistent surface normals. However, we show that the existing winding number formulation for point clouds is a poor approximation near the input surface points, diverging as the query point approaches a surface point. This is problematic for methods that rely on the accuracy and stability of this approximation, requiring heuristics to compensate. Instead, we derive a more accurate approximation that is properly bounded and converges to the correct value. We then examine two distinct approaches that optimize for globally consistent normals using point cloud winding numbers. We show how the original unbounded formulation influences key design choices in both methods and demonstrate that substituting our formulation yields substantive improvements with respect to normal estimation and surface reconstruction accuracy.
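For reference, the existing point-cloud winding number formulation that the abstract critiques is a dipole sum over oriented, area-weighted surface samples. The sketch below implements that standard formulation (which indeed blows up as the query approaches a sample, since the denominator vanishes); the paper's bounded alternative is not reproduced here.

```python
import numpy as np

def winding_number(query, points, normals, areas):
    """Standard generalized winding number for an oriented point cloud.

    query:   (3,) query point q
    points:  (N, 3) surface samples p_i
    normals: (N, 3) unit outward normals n_i
    areas:   (N,) per-point area weights a_i

    w(q) = sum_i a_i * (p_i - q) . n_i / (4 * pi * ||p_i - q||^3)
    """
    d = points - query
    r = np.linalg.norm(d, axis=1)
    return np.sum(areas * np.einsum("ij,ij->i", d, normals) / (4.0 * np.pi * r**3))

# sanity check on a unit sphere: w ~ 1 inside, ~ 0 outside
rng = np.random.default_rng(0)
n = rng.normal(size=(20000, 3))
p = n / np.linalg.norm(n, axis=1, keepdims=True)     # points on the sphere, normals = p
areas = np.full(len(p), 4.0 * np.pi / len(p))        # equal area weights
print(winding_number(np.zeros(3), p, p, areas))              # ~ 1.0 (inside)
print(winding_number(np.array([0.0, 0.0, 3.0]), p, p, areas))  # ~ 0.0 (outside)
```

The divergence near surface points is visible directly in the r**3 denominator, which is the behavior the improved bounded formulation is designed to remove.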
EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
Athinoulla Konstantinou
University of Aberdeen
Georgios Leontidis
University of Aberdeen
Mamatha Thota
University of Lincoln
Aiden Durrant
University of Aberdeen
Abstract
Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on geometric tasks, including rotation and translation, achieving a supervised-level R2 of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 R2, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures. Code, dataset, and weights are released at http://github.com/AberdeenML/EquiCaps.
RoboAnnotatorX: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration
Longxin Kou
Tianjin University
Fei Ni
Tianjin University
Yan Zheng
Tianjin University
Peilong Han
Tianjin University
Jinyi Liu
Tianjin University
Haiqin Cui
Tianjin University
Rui Liu
Tianjin University
Jianye Hao
Tianjin University
Abstract
Recent advances in robotics have produced numerous valuable large-scale demonstration datasets, yet their potential remains underutilized due to annotation limitations. Current datasets often suffer from sparse temporal annotations and inconsistent labeling granularity, particularly for complex long-horizon demonstrations. Traditional manual annotation methods are expensive and poorly scalable, while existing automated methods struggle with temporal coherence and semantic richness across extended demonstrations. To this end, we propose RoboAnnotatorX, a reliable annotation tool that enhances a multimodal large language model to generate high-quality, context-rich annotations for complex long-horizon demonstrations. Specifically, we introduce a multi-scale token-efficient encoder that maintains computational efficiency while simultaneously capturing fine-grained visual details and preserving temporal information by jointly integrating scene-level anchoring, clip-level temporal dynamics, and video-level global modeling. We further construct a comprehensive dataset, RoboXVQA, that synthesizes diverse QA pairs from both real-world and simulated data, bridging the significant domain gap in robotics demonstrations. Moreover, we leverage a curriculum-inspired three-stage training scheme to progressively develop capabilities from basic visual perception to sophisticated temporal reasoning. Extensive experiments demonstrate that RoboAnnotatorX significantly outperforms existing approaches in annotation quality and exhibits strong generalization across diverse robotic environments, helping unlock the full potential of existing robotic datasets. The details and visualizations are available at the project website.
Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
Jens U. Kreber
University of Augsburg
Joerg Stueckler
University of Augsburg
Abstract
Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP can improve constraint consistency and provides a tradeoff with generative ability.
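The two guidance signals described above can be illustrated directly from per-part SDFs: an alignment term pulling observed scan points onto the object's zero level set, and a non-penetration term penalizing regions that lie inside two parts at once. The sketch below uses a hypothetical callable-SDF interface and analytic sphere SDFs as stand-ins for PhysNAP's learned part SDFs; it is not the paper's guidance implementation.

```python
import torch

def alignment_loss(part_sdfs, observed_points):
    """Pull the partial point cloud onto the surface of the generated object.

    part_sdfs:       list of callables, each mapping (M, 3) points to (M,) SDF values
                     (hypothetical interface).
    observed_points: (M, 3) partial scan points.
    The object surface is the min over part SDFs; scan points should lie on its zero set.
    """
    sdf = torch.stack([f(observed_points) for f in part_sdfs], dim=0).min(dim=0).values
    return (sdf ** 2).mean()

def non_penetration_loss(part_sdfs, sample_points):
    """Penalize volume shared by two different parts (both SDFs negative)."""
    sdf = torch.stack([f(sample_points) for f in part_sdfs], dim=0)   # (P, M)
    penalty = 0.0
    for i in range(len(part_sdfs)):
        for j in range(i + 1, len(part_sdfs)):
            overlap = torch.relu(-sdf[i]) * torch.relu(-sdf[j])       # > 0 only if inside both
            penalty = penalty + overlap.mean()
    return penalty

# toy parts: two unit spheres whose centers are 1.0 apart, so they interpenetrate
sphere = lambda c: (lambda x: torch.linalg.norm(x - c, dim=-1) - 1.0)
parts = [sphere(torch.tensor([0.0, 0.0, 0.0])), sphere(torch.tensor([1.0, 0.0, 0.0]))]
pts = torch.rand(1024, 3) * 4 - 2
print(alignment_loss(parts, pts[:64]))
print(non_penetration_loss(parts, pts) > 0)   # True: the spheres overlap
```

In a guided reverse-diffusion step, gradients of such losses with respect to the generated shape parameters would nudge samples toward alignment and physical plausibility.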
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz
Telekooperation Lab, Technical University Darmstadt
Max Mühlhäuser
Telekooperation Lab, Technical University Darmstadt
Alejandro Sanchez Guinea
Telekooperation Lab, Technical University Darmstadt
Abstract
Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding tasks, such as human activity recognition (HAR), retrieval, or person re-identification (RE-ID). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton↔Pointcloud↔IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments on MSR-Action3D and HMPEAR. Code and models are publicly available at https://github.com/thkreutz/despite.
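A common way to learn such a joint space across more than two modalities is to sum a CLIP-style symmetric InfoNCE loss over every pair of modalities on batch-aligned windows. The sketch below shows that generic construction; DeSPITE's exact pairing, encoders, and weighting may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_infonce(za, zb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE between two batches of embeddings (B, D)."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multimodal_contrastive_loss(embeddings):
    """Sum pairwise InfoNCE over all modality pairs.

    embeddings: dict of modality name -> (B, D) tensor, batch-aligned so that
    row i of every modality comes from the same time window.
    """
    names = list(embeddings)
    loss = 0.0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            loss = loss + pairwise_infonce(embeddings[names[i]], embeddings[names[j]])
    return loss

# toy batch of 8 synchronized samples with 32-dim embeddings per modality
B, D = 8, 32
feats = {m: torch.randn(B, D) for m in ["skeleton", "pointcloud", "imu", "text"]}
print(multimodal_contrastive_loss(feats))
```

Once trained, nearest-neighbor search in the shared space directly supports the matching and retrieval tasks listed in the abstract.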
Benchmarking Egocentric Visual-Inertial SLAM at City Scale
Anusha Krishnan
ETH Zürich
Shaohui Liu
ETH Zürich
Paul-Edouard Sarlin
Google
Oscar Gentilhomme
ETH Zürich
David Caruso
Meta Reality Labs Research
Maurizio Monge
Meta Reality Labs Research
Richard Newcombe
Meta Reality Labs Research
Jakob Engel
Meta Reality Labs Research
Marc Pollefeys
ETH Zürich, Microsoft Spatial AI Lab
Abstract
Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visualinertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at lamaria.ethz.ch.
CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
Abhinav Kumar
Michigan State University
Yuliang Guo
Bosch Research North America, Bosch Center for AI
Zhihao Zhang
Michigan State University
Xinyu Huang
Bosch Research North America, Bosch Center for AI
Liu Ren
Bosch Research North America, Bosch Center for AI
Xiaoming Liu
Michigan State University
Abstract
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than 45%, achieving SoTA performance on the CARLA dataset.
Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
Pulkit Kumar
University of Maryland, College Park
Shuaiyi Huang
University of Maryland, College Park
Matthew Walmer
University of Maryland, College Park
Sai Saketh Rambhatla
University of Maryland, College Park, GenAI, Meta
Abhinav Shrivastava
University of Maryland, College Park
Abstract
Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. Our project page is available here.
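A Histogram of Oriented Displacements of the kind named above can be built by binning frame-to-frame displacement directions of a tracked point, weighted by displacement magnitude. The sketch below is a common construction of such a descriptor, not necessarily Trokens' exact variant (which may use different binning or normalization).

```python
import numpy as np

def histogram_of_oriented_displacements(traj, n_bins=8):
    """Histogram of Oriented Displacements for one 2D track (illustrative sketch).

    traj: (T, 2) pixel positions of a tracked point over T frames.
    Each frame-to-frame displacement votes into an orientation bin,
    weighted by its magnitude, then the histogram is L1-normalized.
    """
    disp = np.diff(traj, axis=0)                          # (T-1, 2)
    angles = np.arctan2(disp[:, 1], disp[:, 0])           # in (-pi, pi]
    mags = np.linalg.norm(disp, axis=1)
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins, mags)
    return hist / (hist.sum() + 1e-8)

# a point moving mostly to the right, then briefly upward
traj = np.array([[0, 0], [2, 0], [4, 0], [6, 0], [6, 2]], dtype=float)
print(histogram_of_oriented_displacements(traj))          # mass split ~0.75 / ~0.25
```

Such per-trajectory histograms summarize intra-trajectory dynamics compactly; relationships between tokens from different trajectories would then capture the inter-trajectory structure the abstract describes.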
ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
Sanjoy Kundu
Auburn University
Shanmukha Vellamcheti
Auburn University
Sathyanarayanan N. Aakur
Auburn University
Abstract
Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while efficiently minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0-L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.
RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
Pou-Chun Kung
University of Michigan
Skanda Harisha
University of Michigan
Ram Vasudevan
University of Michigan
Aline Eid
University of Michigan
Katherine A. Skinner
University of Michigan
Abstract
High-fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6× SSIM) and improved geometric reconstruction (−40% RMSE / 1.5× accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at https://umautobots.github.io/radarsplat.
RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
Johannes Künzel
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Anna Hilsmann
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Peter Eisert
Fraunhofer Heinrich-Hertz-Institut, HHI, Germany
Abstract
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
Thermal Polarimetric Multi-view Stereo
Takahiro Kushida
Ritsumeikan University
Kenichiro Tanaka
Ritsumeikan University
Abstract
This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination and material properties. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using multi-view thermal polarimetric images. Experimental results demonstrate that our approach effectively reconstructs fine details in transparent, translucent, and heterogeneous objects, outperforming existing techniques.
MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection
Donghyeon Kwon
POSTECH
Youngseok Yoon
POSTECH
Hyeongseok Son
Samsung Electronics
Suha Kwak
POSTECH
Abstract
Camera-based 3D object detection has gained attention for its cost-effectiveness, but it generally lags behind LiDAR-based approaches due to its lack of explicit 3D spatial cues. To take the best of both camera- and LiDAR-based detectors, we propose MemDistill, a novel cross-modal knowledge distillation framework for 3D object detection. MemDistill transfers rich 3D knowledge from a LiDAR-based teacher model to a camera-based student model through a dedicated memory unit and a scene-dependent memory retrieval module. To be specific, our framework distills the teacher's 3D knowledge, optimizes the memory to store that knowledge compactly, and learns the retriever that searches the memory to produce 3D features relevant to the input scene, compensating for the missing LiDAR modality. Experiments on the nuScenes dataset demonstrate that MemDistill significantly improves performance of its camera-only baseline, achieving the state of the art in camera-based 3D object detection.
One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images
Byeongjun Kwon
KAIST
Munchurl Kim
KAIST
Abstract
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased depth estimation accuracy and tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches, resulting in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the sparsity of real-world ground-truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be seamlessly integrated into existing depth estimation models. It preserves the performance of original depth estimation models even under grid-based inference on high-resolution images, exhibiting minimal depth discontinuities along patch boundaries. Moreover, our PRO achieves significantly faster inference speed compared to prior patch-based methods.
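The overlap-consistency idea in component (i) above can be illustrated with a simple loss that accumulates per-patch depth predictions into the full frame and penalizes disagreement wherever patches overlap. This is a generic sketch with hypothetical shapes and offsets, not PRO's exact grouped training procedure.

```python
import torch

def overlap_consistency_loss(patch_preds, patch_offsets, full_hw):
    """Penalize disagreement between depth patches where they overlap (illustrative).

    patch_preds:   list of (h, w) depth predictions.
    patch_offsets: list of (top, left) offsets of each patch in the full image.
    full_hw:       (H, W) of the full image.
    Returns the mean per-pixel variance over regions covered by 2+ patches.
    """
    H, W = full_hw
    sum_map, sq_map, cnt = torch.zeros(H, W), torch.zeros(H, W), torch.zeros(H, W)
    for pred, (top, left) in zip(patch_preds, patch_offsets):
        h, w = pred.shape
        sum_map[top:top + h, left:left + w] += pred
        sq_map[top:top + h, left:left + w] += pred ** 2
        cnt[top:top + h, left:left + w] += 1
    overlap = cnt > 1
    if not overlap.any():
        return torch.tensor(0.0)
    mean = sum_map[overlap] / cnt[overlap]
    var = sq_map[overlap] / cnt[overlap] - mean ** 2
    return var.mean()

# two 4x4 patches overlapping by 2 columns in a 4x6 frame
p1, p2 = torch.ones(4, 4), torch.full((4, 4), 2.0)
print(overlap_consistency_loss([p1, p2], [(0, 0), (0, 2)], (4, 6)))   # 0.25 > 0
```

Minimizing such a term during training encourages patch predictions to agree at their seams, which is what suppresses the boundary discontinuities at test time.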
ViLU: Learning Vision-Language Uncertainties for Failure Prediction
Marc Lafon
Conservatoire National des Arts et Métiers, CEDRIC, Paris, France
Yannis Karmim
Conservatoire National des Arts et Métiers, CEDRIC, Paris, France
Julio Silva-Rodríguez
ETS Montreal
Paul Couairon
Sorbonne Université, CNRS, Paris, France
Clément Rambour
Sorbonne Université, CNRS, Paris, France
Raphaël Fournier-Sniehotta
Sorbonne Université, CNRS, Paris, France
Ismail Ben Ayed
ETS Montreal
Jose Dolz
ETS Montreal
Nicolas Thome
Sorbonne Université, CNRS, Paris, France
Abstract
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: ViLU Repository.
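The failure-prediction setup above, training a binary correct-vs-incorrect classifier on frozen embeddings with a weighted BCE loss, can be sketched as follows. The sketch collapses ViLU's cross-attention and image-conditioned text representation into a simple concatenation + MLP, and the class weight is an assumed value; it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class FailurePredictor(nn.Module):
    """Binary failure predictor over frozen VLM embeddings (simplified sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_emb, pred_txt_emb, ctx_txt_emb):
        # concatenation stands in for the cross-attention fusion described above
        return self.mlp(torch.cat([img_emb, pred_txt_emb, ctx_txt_emb], dim=-1)).squeeze(-1)

dim, B = 64, 16
head = FailurePredictor(dim)
img, txt, ctx = (torch.randn(B, dim) for _ in range(3))
is_failure = torch.randint(0, 2, (B,)).float()          # 1 = the VLM prediction was wrong

# weighted BCE: up-weight the (typically rarer) failure class; the weight is an assumption
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(3.0))
loss = loss_fn(head(img, txt, ctx), is_failure)
loss.backward()
print(float(loss))
```

Because only embeddings are consumed, such a head can be trained post hoc without touching the underlying VLM, which is the setting the abstract highlights.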
CAVIS: Context-Aware Video Instance Segmentation
Seunghun Lee
DGIST, Daegu, Korea
Jiwan Seo
DGIST, Daegu, Korea
Kiljoon Han
DGIST, Daegu, Korea
Minwoo Choi
DGIST, Daegu, Korea
Sunghoon Im
DGIST, Daegu, Korea
Abstract
In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we design the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, known for its particularly challenging videos. Project page: this https URL
CF3: Compact and Fast 3D Feature Fields
Abstract
3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Jungdae Lee
Institute of Science Tokyo
Taiki Miyanishi
The University of Tokyo
Shuhei Kurita
National Institute of Informatics
Koya Sakamoto
The University of Tokyo
Daichi Azuma
The University of Tokyo
Yutaka Matsuo
The University of Tokyo
Nakamasa Inoue
Institute of Science Tokyo
Abstract
Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km² across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (Seq2seq, CMA and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Jungho Lee
Yonsei University
Donghyeong Kim
Yonsei University
Dogyoon Lee
Yonsei University
Suhwan Cho
Yonsei University
Minhyeok Lee
Yonsei University
Wonjoon Lee
Yonsei University
Taeoh Kim
NAVER Cloud
Dongyoon Wee
NAVER Cloud
Sangyoun Lee
Yonsei University
Abstract
3D Gaussian Splatting (3DGS) has gained significant attention due to its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, which preserve the shape and size of the object but rely on discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur. Project page is available at https://JhoYonsei.github.io/CoMoGaussian.
Combinative Matching for Geometric Shape Assembly
Nahyuk Lee
POSTECH
Juhong Min
POSTECH
Junhong Lee
POSTECH
Chunghyun Park
POSTECH
Minsu Cho
POSTECH
Abstract
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. Specifically, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape' and ‘opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art.
EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
Yongjin Lee
ThorDrive Co., Ltd
Hyeon-Mun Jeong
ThorDrive Co., Ltd
Yurim Jeon
Seoul National University
Sanghyun Kim
ThorDrive Co., Ltd, Seoul National University
Abstract
Multi-modal sensor fusion in Bird's Eye View (BEV) representation has become the leading approach for 3D object detection. However, existing methods often rely on depth estimators or transformer encoders to transform image features into BEV space, which reduces robustness or introduces significant computational overhead. Moreover, the insufficient geometric guidance in view transformation results in ray-directional misalignments, limiting the effectiveness of BEV representations. To address these challenges, we propose Efficient View Transformation (EVT), a novel 3D object detection framework that constructs a well-structured BEV representation, improving both accuracy and efficiency. Our approach focuses on two key aspects. First, Adaptive Sampling and Adaptive Projection (ASAP), which utilizes LiDAR guidance to generate 3D sampling points and adaptive kernels, enables more effective transformation of image features into BEV space and a refined BEV representation. Second, an improved query-based detection framework, incorporating group-wise mixed query selection and geometry-aware cross-attention, effectively captures both the common properties and the geometric structure of objects in the transformer decoder. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3% NDS with real-time inference speed.
FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
Donghyun Lee
Seoul National University
Dawoon Jeong
Seoul National University
Jae W. Lee
Seoul National University
Hongil Yoon
Google
Abstract
Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
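For context, the "distance trend" exploited above is the sequence of farthest-point distances produced by standard farthest point sampling (FPS), which decreases smoothly as more samples are drawn. The sketch below implements exact FPS and exposes that curve; FastPoint's curve-prediction shortcut itself is not reproduced.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Exact farthest point sampling, also returning the per-iteration sample distances.

    The returned `curve` (max distance to the current sample set at each step) is
    the quantity whose predictable, smoothly decreasing trend FastPoint exploits
    to avoid exhaustive distance updates; this sketch only shows the exact baseline.
    """
    selected = [0]
    dist = np.linalg.norm(points - points[0], axis=1)   # distance to nearest sample so far
    curve = []
    for _ in range(1, n_samples):
        idx = int(np.argmax(dist))
        curve.append(float(dist[idx]))
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected), np.array(curve)

pts = np.random.default_rng(0).random((2048, 3))
idx, curve = farthest_point_sampling(pts, 64)
print(curve[:5], curve[-5:])     # roughly monotone decreasing
```

Predicting the tail of this curve instead of computing it lets a sampler accept points whose distance exceeds the predicted threshold without evaluating all pairwise distances, which is the source of the reported speedup.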
InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
Jungmin Lee
Chung-Ang University
Seonghyuk Hong
National Research Institute of Cultural Heritage
Juyong Lee
Chung-Ang University
Jaeyoon Lee
Chung-Ang University
Jongwon Choi
Chung-Ang University
Abstract
We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and nondestructive testing capabilities across various domains.
Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
Giwon Lee
KAIST
Wooseong Jeong
KAIST
Daehee Park
DGIST
Jaewoo Jeong
KAIST
Kuk-Jin Yoon
KAIST
Abstract
Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation
Seunghyun Lee
KAIST
Tae-Kyun Kim
KAIST
Abstract
Latest diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. The existing methods, however, suffer from slow convergence during training, learning the encoder with the diffusion denoising network in an end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations through two key components. First, the proposed method pretrains the encoder with the direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed so that the exploration-exploitation trade-off is effectively balanced, eliminating the need for the additional evaluation network. The sampling guidance maintains multimodal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at final steps. Extensive experiments on multiple benchmarks including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.
LOMM: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
Seunghun Lee
DGIST
Jiwan Seo
DGIST
Minwoo Choi
DGIST
Kiljoon Han
DGIST
Jahoon Jeong
DGIST
Zane Durante
Stanford University
Ehsan Adeli
Stanford University
Sang Hyun Park
DGIST
Sunghoon Im
DGIST
Abstract
In this paper, we introduce Latest Object Memory (LOM), a system for robustly tracking and continuously updating the latest states of objects by explicitly modeling their presence across video frames. LOM enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the video segmentation process. Building upon LOM, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation, significantly improving long-term instance tracking. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new state-of-the-art in video instance segmentation. Notably, our LOMM achieves an AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: this https URL
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
Han-Hung Lee
Simon Fraser University
Qinghong Han
Simon Fraser University
Angel X. Chang
Simon Fraser University
Abstract
In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.
PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
Seunggwan Lee
Korea University
Hwanhee Jung
Korea University
Byoungsoo Koh
KOCCA
Qixing Huang
The University of Texas at Austin
Sang Ho Yoon
KAIST
Sangpil Kim
Korea University
Abstract
A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, a flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.
Power of Cooperative Supervision: Multiple Teachers Framework for Advanced 3D Semi-Supervised Object Detection
Jin-Hee Lee
DGIST
Jae-Keun Lee
DGIST
Jeseok Kim
DGIST
Kwon Soon
DGIST
Abstract
To ensure safe autonomous driving in complex urban environments, it is essential not only to develop high-performance object detection models but also to establish a diverse and representative dataset that captures a wide range of urban scenarios and object characteristics. To address these challenges, we introduce a new multi-class 3D LiDAR dataset that comprehensively reflects various urban environments and object types, along with a robust 3D semi-supervised object detection (SSOD) framework. Our SSOD framework leverages a novel multiple teachers model, where similar object classes are grouped and supervised by category-specialized teacher networks. This category-specific collaborative guidance enables the student network to learn more effectively, leading to improved object detection performance. Additionally, we propose the Pseudo-points Generator (PointGen), a simple yet effective technique designed to enhance the generation of high-quality pseudo-labels for the teacher network, mitigating the impact of sparse LiDAR point clouds. Extensive experiments on the Waymo Open Dataset (WOD), KITTI, and our newly introduced dataset validate the effectiveness of both our dataset and SSOD framework. Experimental results demonstrate that our approach consistently outperforms state-of-the-art 3D SSOD methods across all evaluated datasets. To encourage further research in this domain, we will publicly release our multi-class LiDAR dataset and source code on our GitHub repository.
HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
Qinqian Lei
National University of Singapore
Bo Wang
University of Mississippi
Robby T. Tan
National University of Singapore
Abstract
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
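The low-rank factorization described above, splitting per-class VLM text features into class-shared basis features and per-class weights, can be illustrated with a truncated SVD of the class-by-dimension text feature matrix. This is a generic sketch; HOLa additionally adapts the weights per class and regularizes them with LLM-derived action priors, which are not reproduced here.

```python
import numpy as np

def low_rank_text_features(text_feats, rank):
    """Factor per-class text features into shared basis vectors and per-class weights.

    text_feats: (C, D) matrix of VLM text embeddings, one row per HOI class.
    Returns weights (C, rank) and basis (rank, D) such that weights @ basis
    approximates text_feats.
    """
    U, S, Vt = np.linalg.svd(text_feats, full_matrices=False)
    weights = U[:, :rank] * S[:rank]
    basis = Vt[:rank]
    return weights, basis

C, D, rank = 100, 512, 16
feats = np.random.default_rng(0).normal(size=(C, D))
w, b = low_rank_text_features(feats, rank)
recon = w @ b
print(np.linalg.norm(recon - feats) / np.linalg.norm(feats))  # relative error of the rank-16 fit
```

Because the basis is shared across classes, information learned from seen classes is carried over to unseen ones through the same basis, while the per-class weights remain a small, adaptable set of parameters.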
MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps
Jiahui Lei
University of Pennsylvania
Kyle Genova
Google DeepMind
George Kopanas
Google
Noah Snavely
Google
Leonidas Guibas
Google
Abstract
This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.
Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
Ting Lei
Peking University
Shaofeng Yin
Peking University
Qingchao Chen
Peking University
Yuxin Peng
Peking University
Yang Liu
Peking University
Abstract
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model's ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model's attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.
Occupancy Learning with Spatiotemporal Memory
Ziyang Leng
University of California, Los Angeles
Jiawei Yang
University of Southern California
Wenlong Yi
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Abstract
3D occupancy has become a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%. The code and model are available at https://github.com/matthew-leng/ST-Occ.
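A minimal sketch of such memory conditioning, under our own assumptions about shapes and layer choices (the released ST-Occ module is more elaborate), could use cross-attention from current voxel features to memory tokens:

```python
# Illustrative memory-attention block: current occupancy features attend to a
# scene-level spatiotemporal memory. Dimensions and the residual design are assumptions.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (B, N_voxels, C) features of the current frame
        # memory:  (B, N_mem, C) scene-level spatiotemporal memory tokens
        fused, _ = self.attn(query=current, key=memory, value=memory)
        return self.norm(current + fused)   # residual update of the current representation

B, N, M, C = 2, 1024, 256, 128
out = MemoryAttention(C)(torch.randn(B, N, C), torch.randn(B, M, C))
print(out.shape)   # torch.Size([2, 1024, 128])
```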
4D Gaussian Splatting SLAM
Yanyan Li
Hangzhou Dianzi University
Youxu Fang
Hangzhou Dianzi University
Zunjie Zhu
Hangzhou Dianzi University
Kunyi Li
Technical University of Munich
Yong Ding
Zhejiang University
Federico Tombari
Google
Abstract
Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while the sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighbor images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.
AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
Peizheng Li
Mercedes-Benz AG
Shuxiao Ding
Mercedes-Benz AG
You Zhou
Mercedes-Benz AG
Qingwen Zhang
KTH Royal Institute of Technology
Onat Inak
Mercedes-Benz AG
Larissa Triess
Mercedes-Benz AG
Niklas Hanselmann
Mercedes-Benz AG
Marius Cordts
Mercedes-Benz AG
Andreas Zell
University of Tübingen
Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. Methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, often fails to achieve reliable performance because of inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. Code is available at: https://github.com/EdwardLeeLPZ/AGO.
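A hedged sketch of similarity-based grounding, based only on our reading of the abstract (feature dimensions, temperature, and the pseudo-label source are assumptions, not the AGO release):

```python
# Illustrative grounding loss: 3D voxel embeddings are scored against class-prompt
# text embeddings and supervised with 3D pseudo-labels via cross-entropy.
import torch
import torch.nn.functional as F

def grounding_loss(voxel_emb, text_emb, pseudo_labels, temperature=0.07):
    # voxel_emb: (N, D) 3D embeddings; text_emb: (K, D) class-prompt embeddings
    # pseudo_labels: (N,) int64 class indices derived from VLM 2D predictions
    voxel_emb = F.normalize(voxel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = voxel_emb @ text_emb.t() / temperature   # cosine-similarity logits
    return F.cross_entropy(logits, pseudo_labels)

loss = grounding_loss(torch.randn(4096, 256), torch.randn(17, 256),
                      torch.randint(0, 17, (4096,)))
print(loss.item())
```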
Adversarial Exploitation of Data Diversity Improves Visual Localization
Sihang Li
New York University
Siqi Tan
New York University
Bowen Chang
New York University
Jing Zhang
New York University
Chen Feng
New York University
Yiming Liu
New York University
Abstract
Visual localization, which estimates a camera's pose within a known scene, is a fundamental capability for autonomous systems. While absolute pose regression (APR) methods have shown promise for efficient inference, they often struggle with generalization. Recent approaches attempt to address this through data augmentation with varied viewpoints, yet they overlook a critical factor: appearance diversity. In this work, we identify appearance variation as the key to robust localization. Specifically, we first lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring ability, enabling the synthesis of diverse training data that varies not just in poses but also in environmental conditions such as lighting and weather. To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 33% on indoor datasets, and 38% and 44% on outdoor datasets. Most notably, our method shows remarkable robustness in dynamic driving scenarios under varying weather conditions and in day-to-night scenarios, where previous APR methods fail.
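As an illustration of the adversarial component only (a generic sketch; the paper's discriminator architecture and feature choice are not specified here), a domain discriminator can be trained to separate real from synthetic features while the regression branch is trained to fool it:

```python
# Illustrative syn-to-real adversarial objective with a feature-level discriminator.
# Layer sizes and the choice of features are assumptions for demonstration.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_feat, syn_feat):
    real_logits = disc(real_feat)
    syn_logits = disc(syn_feat.detach())        # do not backprop into the feature branch
    return bce(real_logits, torch.ones_like(real_logits)) + \
           bce(syn_logits, torch.zeros_like(syn_logits))

def generator_adv_loss(syn_feat):
    # encourages synthetic-branch features to look "real" to the discriminator
    syn_logits = disc(syn_feat)
    return bce(syn_logits, torch.ones_like(syn_logits))

print(discriminator_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
print(generator_adv_loss(torch.randn(8, 256)).item())
```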
Amodal Depth Anything: Amodal Depth Estimation in the Wild
Zhenyu Li
KAUST
Mykola Lavreniuk
Space Research Institute NASU-SSAU
Jian Shi
KAUST
Shariq Farooq Bhat
KAUST
Peter Wonka
KAUST
Abstract
Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions (Fig. 1). Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 50.7% improvement in RMSE over the previous SoTA on the ADIW dataset.
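The scale-and-shift idea can be illustrated with a standard least-squares fit (a generic sketch; the paper's exact alignment and blending procedure may differ):

```python
# Illustrative scale-and-shift alignment: solve for s, t minimizing
# || s * pred + t - target ||^2 over valid pixels, then apply them to pred.
import numpy as np

def align_scale_shift(pred: np.ndarray, target: np.ndarray, mask: np.ndarray):
    p, t = pred[mask], target[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    return scale * pred + shift

pred = np.random.rand(64, 64)
target = 2.5 * pred + 0.3
aligned = align_scale_shift(pred, target, np.ones_like(pred, dtype=bool))
print(np.abs(aligned - target).max())                 # ~0 for this synthetic example
```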
Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
Yunhao Li
Institute of Software Chinese Academy of Sciences
Yifan Jiao
Institute of Software Chinese Academy of Sciences
Dan Meng
OPPO Research Institute
Heng Fan
University of North Texas
Libo Zhang
Institute of Software Chinese Academy of Sciences
Abstract
Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential for object tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a Trajectory Consistency Reinforcement (TCR) strategy that benefits tracking performance by improving target identity and category consistency. In addition, we present TraCLIP, a plug-and-play trajectory classification module. It integrates Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE) strategies to fully leverage trajectory information from visual and language perspectives for enhancing the classification results. Extensive experiments on OV-TAO show that our TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. We will release TRACT at https://github.com/Nathan-Li123/TRACT.
Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge
Yanqi Li
Beihang University
Jianwei Niu
Beihang University
Tao Ren
Institute of Software Chinese Academy of Sciences
Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence -- a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
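As a toy illustration of the visual co-occurrence statistics such a framework relies on (not the CODet implementation; category names are made up), one can count how often category pairs appear together across images:

```python
# Illustrative co-occurrence counting from per-image category sets; these counts
# could then be compared against LLM-validated textual dependencies before use.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(image_categories):
    """image_categories: list of sets of category names, one set per image."""
    counts = Counter()
    for cats in image_categories:
        for a, b in combinations(sorted(cats), 2):
            counts[(a, b)] += 1
    return counts

images = [{"person", "surfboard", "sea"}, {"person", "dog"}, {"person", "surfboard"}]
print(cooccurrence_counts(images)[("person", "surfboard")])   # 2
```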
Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li
Beijing Normal University
Xiaoxiao Wang
University of Chinese Academy of Sciences
Meiling Li
Fudan University
Boming Miao
Beijing Normal University
Peng Sun
Central University of Finance and Economics
Yunjian Zhang
Tsinghua University
Xiangyang Ji
Tsinghua University
Yao Zhu
Tsinghua University
Abstract
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization - RRDataset encompasses high-quality images from seven major scenarios (War & Conflict, Disasters & Accidents, Political & Social Events, Medical & Public Health, Culture & Religion, Labor & Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness - examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness - assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms. Our dataset is publicly available at: https://zenodo.org/records/14963880.
Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
Lei-Lei Li
Xi'an Jiaotong University
Jianwu Fang
National University of Singapore
Junbin Xiao
National University of Singapore
Shanmin Pang
Xi'an Jiaotong University
Hongkai Yu
Cleveland State University
Chen Lv
Nanyang Technological University
Jianru Xue
Xi'an Jiaotong University
Tat-Seng Chua
National University of Singapore
Abstract
Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate capability testing for responding to accidents that are too costly to reproduce in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model Causal-VidSyn for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.
CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
Jinming Li
Shanghai University
Yichen Zhu
Midea Group
Zhibin Tang
unknown
Junjie Wen
East China Normal University
Minjie Zhu
East China Normal University
Xiaoyu Liu
Shanghai University
Chengmeng Li
Shanghai University
Ran Cheng
Midea Group
Yaxin Peng
Shanghai University
Yan Peng
Shanghai University
Feifei Feng
Midea Group
Abstract
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI's recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) object affordance - what object to manipulate and where it is; (2) grasp affordance - the specific object part to grasp; (3) spatial affordance - the optimal space to place the object; and (4) movement affordance - the collision-free path for movement. We further transform each affordance into two prompting formats: visual affordance and textual affordance. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision and robustness. Our experiments demonstrate that CoA-VLA outperforms state-of-the-art robot foundation models, including OpenVLA and Octo, on a variety of tasks. Furthermore, CoA-VLA exhibits strong generalization capabilities, including recognizing unseen object poses, identifying free space, and avoiding obstacles in novel environments.
Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios
Deng Li
Tianjin University
Aming Wu
Hefei University of Technology
Yang Li
Tianjin University
Yaowei Wang
Peng Cheng Laboratory
Yahong Han
Tianjin University
Abstract
In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process into specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness of our approach. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.
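For context, a minimal LoRA-style layer is sketched below (an assumption for illustration: the paper's dual-path domain-aware adapter and its diffusion-based parameter generator are considerably more involved; this only shows the kind of low-rank parameters such a generator would have to synthesize):

```python
# Illustrative LoRA-style linear layer: frozen base weight plus a trainable
# low-rank update, y = Wx + (B A x) * scaling.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                              # pretrained layer, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(4, 256)).shape)               # torch.Size([4, 256])
```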
DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
Ruining Li
University of Oxford
Chuanxia Zheng
University of Oxford
Christian Rupprecht
University of Oxford
Andrea Vedaldi
University of Oxford
Abstract
Most 3D object generators prioritize aesthetic quality, often neglecting the physical constraints necessary for practical applications. One such constraint is that a 3D object should be self-supporting, i.e., remain balanced under gravity. Previous approaches to generating stable 3D objects relied on differentiable physics simulators to optimize geometry at test time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models with external feedback, we propose Direct Simulation Optimization (DSO). This framework leverages feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator directly outputs stable 3D objects. We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective we introduce to align diffusion models without requiring pairwise preferences. Our experiments demonstrate that the fine-tuned feed-forward generator, using either the DPO or DRO objective, is significantly faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework functions even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
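A hedged sketch of the DPO branch of this idea (illustrative only; the log-probability terms are placeholders, and DRO, the paper's pairwise-free objective, is not reproduced here):

```python
# Illustrative DPO-style loss driven by simulator stability scores: the "win"
# sample is the generation the physics simulator rated as more stable.
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # logp_*: summed log-probabilities of the winning/losing samples under the
    # fine-tuned model; ref_logp_*: the same under the frozen reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.randn(16), torch.randn(16), torch.randn(16), torch.randn(16))
print(loss.item())
```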
EDM: Efficient Deep Feature Matching
Xi Li
Realsee
Tong Rao
Realsee
Cihui Pan
Realsee
Abstract
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
ETH Zürich
Yutong Chen
ETH Zürich
Yiqian Wu
ETH Zürich
Kaifeng Zhao
ETH Zürich
Marc Pollefeys
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Yingyan Li
Chinese Academy of Sciences
Yuqi Wang
unknown
Yang Liu
unknown
Jiawei He
unknown
Lue Fan
unknown
Zhaoxiang Zhang
unknown
Abstract
End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.
Estimating 2D Camera Motion with Hybrid Motion Basis
Haipeng Li
University of Electronic Science and Technology of China
Tianhao Zhou
University of Electronic Science and Technology of China
Zhanglei Yang
University of Electronic Science and Technology of China
Yi Wu
Xiaomi Corporation
Yan Chen
Xiaomi Corporation
Zijing Mao
Xiaomi Corporation
Shen Cheng
Dexmal
Bing Zeng
University of Electronic Science and Technology of China
Shuaicheng Liu
University of Electronic Science and Technology of China
Abstract
Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
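One plausible instantiation of a Laplace-based probabilistic loss is the Laplace negative log-likelihood sketched below (our assumption for illustration; the released CamFlow loss may be parameterized differently):

```python
# Illustrative Laplace negative log-likelihood for motion regression:
# NLL = |pred - target| / b + log(2b), with a predicted log-scale per pixel.
import torch

def laplace_nll(pred: torch.Tensor, log_b: torch.Tensor, target: torch.Tensor):
    # pred, target: predicted / ground-truth motion fields; log_b: predicted log-scale
    b = log_b.exp()
    return (torch.abs(pred - target) / b + log_b + torch.log(torch.tensor(2.0))).mean()

pred = torch.randn(2, 2, 64, 64)
log_b = torch.zeros(2, 2, 64, 64)
target = torch.randn(2, 2, 64, 64)
print(laplace_nll(pred, log_b, target).item())
```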
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li
University of Science and Technology of China
Meng Tian
Huawei Noah's Ark Lab
Zhenyu Lin
Huawei Noah's Ark Lab
Jiangtong Zhu
Huawei Noah's Ark Lab
Dechang Zhu
Huawei Noah's Ark Lab
Haiqiang Liu
Huawei Noah's Ark Lab
Yueyi Zhang
University of Science and Technology of China
Zhiwei Xiong
University of Science and Technology of China
Xinhai Zhao
Huawei Noah's Ark Lab
Abstract
Existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained benchmark featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems. The benchmark is available at https://github.com/Depth2World/VLADBench.
Future-Aware Interaction Network For Motion Forecasting
Shijie Li
I2R, A*STAR
Chunyu Liu
CEPRI
Xun Xu
I2R, A*STAR
Si Yong Yeo
LKCMedicine, NTU
Xulei Yang
I2R, A*STAR
Abstract
Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code is available here.
GARF: Learning Generalizable 3D Reassembly for Real-World Fractures
Sihang Li
New York University
Zeyu Jiang
New York University
Grace Chen
New York University
Chenyang Xu
New York University
Siqi Tan
New York University
Xue Wang
New York University
Irving Fang
New York University
Kristof Zyskowski
Yale University
Shannon P. McPherron
Max Planck Institute
Radu Iovita
New York University
Chen Feng
New York University
Jing Zhang
New York University
Abstract
3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce two-session flow matching, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate FRACTURA, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy on the Breaking Bad Everyday dataset. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types. GARF's code, data and demo are available at https://ai4ce.github.io/GARF/.
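To illustrate the flow-matching ingredient in isolation (a simplified sketch under our own assumptions: poses are treated as flat 9-D vectors, and GARF's fracture-aware encoder, SE(3) handling, and two-session inference are omitted):

```python
# Illustrative conditional flow-matching training step: regress the constant
# velocity of a straight path from a noise sample toward the target pose vector.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(9 + 1, 128), nn.ReLU(), nn.Linear(128, 9))

def flow_matching_step(x1: torch.Tensor) -> torch.Tensor:
    """x1: (B, 9) target pose parameters (e.g., 6D rotation + translation)."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1)                  # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                    # point on the interpolation path
    v_target = x1 - x0                              # constant target velocity
    v_pred = velocity_net(torch.cat([xt, t], dim=1))
    return ((v_pred - v_target) ** 2).mean()

print(flow_matching_step(torch.randn(32, 9)).item())
```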
GENMO: A GENeralist Model for Human MOtion
Jiefeng Li
NVIDIA
Jinkun Cao
NVIDIA
Haotian Zhang
NVIDIA
Davis Rempe
NVIDIA
Jan Kautz
NVIDIA
Umar Iqbal
NVIDIA
Ye Yuan
NVIDIA
Abstract
Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.
GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences
Hanlin Li
University of Science and Technology of China
Wenming Weng
University of Science and Technology of China
Yueyi Zhang
MiroMind
Zhiwei Xiong
University of Science and Technology of China
Abstract
Scene flow provides the fundamental information of the scene dynamics. Existing scene flow estimation methods typically rely on the correlation between only a consecutive point cloud pair, which makes them limited to the instantaneous state of the scene and face challenges in real-world scenarios with factors like occlusion, noise, and diverse motion of background and foreground. In this paper, we study the joint sequential scene flow estimation and future scene flow prediction on point cloud sequences. The expanded sequential input introduces long-term and high-order motion information. We propose GenFlow3D, a recurrent neural network model which integrates diffusion in the decoder to better incorporate the two tasks and enhance the ability to extract general motion patterns. A transformer-based denoising network is adopted to help capture useful information. Depending on the input point clouds, discriminative condition signals are generated to guide the diffusion decoder to switch among different modes specific for scene flow estimation and prediction in a multi-scale manner. GenFlow3D is evaluated on the real-world datasets nuScenes and Argoverse 2, and demonstrates superior performance compared with the existing methods. Our code is available at https://github.com/ustc-hlli/GenFlow3D.
Generalized Few-Shot Point Cloud Segmentation via LLM-Assisted Hyper-Relation Matching
Zhaoyang Li
University of Science and Technology of China
Yuan Wang
University of Science and Technology of China
Guoxin Xiong
University of Science and Technology of China
Wangkai Li
University of Science and Technology of China
Yuwen Pan
University of Science and Technology of China
Tianzhu Zhang
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Abstract
Generalized few-shot point cloud segmentation (GFS-3DSeg) aims to segment objects of both base and novel classes using abundant base class samples and limited novel class samples. Existing GFS-3DSeg methods encounter bottlenecks due to the scarcity of novel class data and inter-class confusion. In this paper, we propose the LLM-Assisted Hyper-Relation Matching (LARM) framework, which leverages the wealth of prior knowledge in Large Language Models (LLM) to enrich novel category prototypes and introduces a hyper-relation matching strategy to mitigate false matches between point features and category prototypes caused by inter-class confusion. The proposed LARM enjoys several merits. First, the vast knowledge embedded in LLM can be an effective complement to vanilla category prototypes, enabling them to exhibit greater robustness. Second, the hyper-relation matching strategy harnesses the structural information implicit in the inter-class relationships, making it more robust than individual feature comparisons. Extensive experiments on two benchmarks demonstrate that LARM outperforms previous state-of-the-art methods by large margins.
Global-Aware Monocular Semantic Scene Completion with State Space Models
Shijie Li
I2R, A*STAR
Zhongyao Cheng
I2R, A*STAR
Rong Li
HKUST(GZ)
Shuai Li
University of Bonn
Juergen Gall
University of Bonn
Xun Xu
I2R, A*STAR
Xulei Yang
I2R, A*STAR
Abstract
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code is available here.
Global Regulation and Excitation via Attention Tuning for Stereo Matching
Jiahao LI
City University of Hong Kong
Xinhong Chen
City University of Hong Kong
Zhengmin JIANG
City University of Hong Kong
Qian Zhou
City University of Hong Kong
Yung-Hui Li
Hon Hai Research Institute
Jianping Wang
City University of Hong Kong
Abstract
Stereo matching has achieved significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and ranks second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.
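A minimal sketch of attention restricted to epipolar lines for rectified stereo, reflecting one reading of the Matching Attention idea (shapes, scaling, and the absence of positional encoding are assumptions; the released GREAT modules differ):

```python
# Illustrative attention along epipolar lines: for rectified pairs, each left-image
# pixel attends only to right-image positions in the same row.
import torch
import torch.nn.functional as F

def matching_attention(feat_left, feat_right):
    # feat_left, feat_right: (B, C, H, W) feature maps of the left/right images
    B, C, H, W = feat_left.shape
    q = feat_left.permute(0, 2, 3, 1)                  # (B, H, W, C)
    k = v = feat_right.permute(0, 2, 3, 1)             # (B, H, W, C)
    scores = torch.einsum("bhwc,bhvc->bhwv", q, k) / C ** 0.5
    attn = F.softmax(scores, dim=-1)                   # softmax over right-image columns
    out = torch.einsum("bhwv,bhvc->bhwc", attn, v)
    return out.permute(0, 3, 1, 2)                     # (B, C, H, W) context features

print(matching_attention(torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)).shape)
```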
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Zhenxin Li
Fudan University
Shihao Wang
The Hong Kong Polytechnic University
Shiyi Lan
NVIDIA
Zhiding Yu
NVIDIA
Zuxuan Wu
Fudan University
Jose M. Alvarez
NVIDIA
Abstract
End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment; they struggle to react quickly to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving. Code will be available at https://github.com/woxihuanjiangguo/Hydra-NeXt.
IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Chen Li
Institute of High-Performance Computing, Agency for Science, Technology and Research
Chinthani Sugandhika
Nanyang Technological University
Yeo Keat Ee
Institute of High-Performance Computing, Agency for Science, Technology and Research
Eric Peh
Institute of High-Performance Computing, Agency for Science, Technology and Research
Hao Zhang
Institute of High-Performance Computing, Agency for Science, Technology and Research
Hong Yang
Institute of High-Performance Computing, Agency for Science, Technology and Research
Deepu Rajan
Nanyang Technological University
Basura Fernando
Institute of High-Performance Computing, Agency for Science, Technology and Research
Abstract
Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at https://github.com/LUNAProject22/IMoRe.
Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories
Yicong Li
National University of Singapore
Yiyang Chen
National University of Singapore
Zhenyuan Ma
National University of Singapore
Junbin Xiao
National University of Singapore
Xiang Wang
University of Science and Technology of China
Angela Yao
University of Science and Technology of China
Abstract
Language-guided Affordance Segmentation (LASO) aims to identify actionable object regions based on text instructions. At the core of its practicality is learning generalizable affordance knowledge that captures functional regions across diverse objects. However, current LASO solutions struggle to extend learned affordances to object categories that are not encountered during training. Scrutinizing these designs, we identify limited generalizability on unseen categories, stemming from (1) underutilized generalizable patterns in the intermediate layers of both 3D and text backbones, which impedes the formation of robust affordance knowledge, and (2) the inability to handle substantial variability in affordance regions across object categories due to a lack of structural knowledge of the target region. Towards this, we introduce a GeneraLized frAmework on uNseen CategoriEs (GLANCE), incorporating two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. Extensive experiments on two benchmark datasets demonstrate that GLANCE outperforms state-of-the-art methods (SoTAs), with notable improvements in generalization to unseen categories. Our code is available at https://github.com/Monoxide-Chen/Affordance.
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li
AI Research
Chunyu Xie
Beihang University
Ji Ao
AI Research
Dawei Leng
AI Research
Yuhui Yin
AI Research
Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.
Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking
Guangyao Li
Xiamen University
Siping Zhuang
Xiamen University
Yajun Jian
Xiamen University
Yan Yan
Xiamen University
Hanzi Wang
Xiamen University
Abstract
Referring multi-object tracking (RMOT) aims to detect and track specific objects based on natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, often failing to exploit fine-grained linguistic cues that are crucial for distinguishing objects with similar characteristics. Notably, these cues play distinct roles at different tracking stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose DKGTrack, a novel RMOT method that enhances language comprehension for precise object tracking by decoupling language expressions into localized descriptions and motion states. To improve the accuracy of language-guided object identification, we introduce a Static Semantic Enhancement (SSE) module, which enhances region-level vision-language alignment through hierarchical cross-modal feature interaction, providing more discriminative object representations for tracking. Furthermore, we propose a Motion Perception Alignment (MPA) module that explicitly aligns object queries with motion descriptions, enabling accurate object trajectory prediction across frames. Experimental results on multiple RMOT benchmarks demonstrate the effectiveness of our method, which achieves competitive performance in challenging tracking scenarios. The code is available at https://github.com/acyddl/DKGTrack.
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Gen Li
University of Edinburgh
Nikolaos Tsagkas
University of Edinburgh
Jifei Song
Huawei Noah's Ark Lab
Ruaridh Mon-Williams
University of Edinburgh
Sethu Vijayakumar
University of Edinburgh
Kun Shao
Huawei Noah's Ark Lab
Laura Sevilla-Lara
University of Edinburgh
Abstract
Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves 77.1% successful grasping among 179 trials, including evaluations on seen, unseen classes, and cluttered scenes. Project page: https://reagan1311.github.io/affgrasp.
M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
Yan Li
School of Systems Science and Engineering, Sun Yat-sen University
Yang Xu
Tianjin University
Changhao Chen
The Hong Kong University of Science and Technology (Guangzhou)
Zhongchen Shi
Defense Innovation Institute, Academy of Military Sciences (AMS)
Wei Chen
Defense Innovation Institute, Academy of Military Sciences (AMS)
Liang Xie
Defense Innovation Institute, Academy of Military Sciences (AMS)
Hongbo Chen
School of Systems Science and Engineering, Sun Yat-sen University
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences (AMS)
Abstract
Inertial tracking (IT), independent of the environment and external infrastructure, has long been the ideal solution for providing location services to humans. Despite significant strides in inertial tracking empowered by deep learning, prevailing neural inertial tracking predominantly utilizes conventional spatial-temporal features from inertial measurements. Unfortunately, the frequency domain dimension is usually overlooked in the current literature. To this end, in this paper, we propose a Multi-Domain Mixture of Experts model for Neural Inertial Tracking, named M2EIT. Specifically, M2EIT first leverages ResNet as a spatial decomposition expert to capture spatial relationships between multivariate time series, and State Space Model (SSM)-based Bi-Mamba as the other expert, which focuses on learning temporal correlations. In the frequency domain, we then introduce a Wavelet-based frequency decomposition expert, which decomposes IMU samples into low-frequency and high-frequency bands using the Haar wavelet transform to simulate motion patterns at different temporal scales. To bridge the semantic gap across multiple domains and integrate them adaptively, we design the Multi-Representation Alignment Router (MAR), which consists of a dual cross-domain translation layer, followed by a dynamic router, to achieve multi-domain semantic alignment and optimize expert contributions. Extensive experiments conducted on three real-world datasets demonstrate that the proposed M2EIT can achieve SOTA results in neural inertial tracking.
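The Haar split underlying the frequency expert can be illustrated in a few lines (a generic single-level decomposition; M2EIT's wavelet expert and its routing mechanism are not reproduced):

```python
# Illustrative single-level Haar wavelet split of an IMU channel into
# low-frequency (approximation) and high-frequency (detail) bands.
import numpy as np

def haar_decompose(x: np.ndarray):
    """x: (T,) signal with even length T; returns (low, high), each of length T/2."""
    x = x.reshape(-1, 2)
    low = (x[:, 0] + x[:, 1]) / np.sqrt(2.0)    # approximation coefficients
    high = (x[:, 0] - x[:, 1]) / np.sqrt(2.0)   # detail coefficients
    return low, high

t = np.linspace(0, 1, 200)
accel = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(200)   # toy IMU channel
low, high = haar_decompose(accel)
print(low.shape, high.shape)                    # (100,) (100,)
```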
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Zhen Xing
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Rui Wang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Hui Zhang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Qi Dai
Microsoft Research Asia
Zuxuan Wu
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
Abstract
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics.
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
Zhuo Li
WeChat, Tencent Inc
Mingshuang Luo
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Ruibing Hou
Peng Cheng Laboratory
Xin Zhao
MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Hao Liu
WeChat, Tencent Inc
Hong Chang
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Zimo Liu
Peng Cheng Laboratory
Chen Li
WeChat, Tencent Inc
Abstract
Human motion generation has been widely studied due to its crucial role in areas such as digital humans and humanoid robot control. However, many current motion generation approaches disregard physics constraints, frequently resulting in physically implausible motions with pronounced artifacts such as floating and foot sliding. Meanwhile, training an effective motion physics optimizer with noisy motion data remains largely unexplored. In this paper, we propose Morph, a Motion-Free physics optimization framework, consisting of a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on expensive real-world motion data. Specifically, the motion generator is responsible for providing large-scale synthetic, noisy motion data, while the motion physics refinement module utilizes these synthetic data to learn a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. Additionally, we introduce a prior reward module to enhance the stability of the physics optimization process and generate smoother and more stable motions. These physically refined motions are then used to fine-tune the motion generator, further enhancing its capability. This collaborative training paradigm enables mutual enhancement between the motion generator and the motion physics refinement module, significantly improving practicality and robustness in real-world applications. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion quality while improving physical plausibility drastically. Figure 1 shows examples of physical inconsistencies in generated motions: ground penetration, leaning backward, interpenetration, foot sliding, floating, and unnatural rotation.
MultiModal Action Conditioned Video Simulation
Yichen Li
MIT CSAIL
Antonio Torralba
MIT CSAIL
Abstract
Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations
Rongqing Li
Beijing Institute of Technology
Changsheng Li
Beijing Institute of Technology
Ruilin Lv
Beijing Institute of Technology
Yuhang Li
Beijing Institute of Technology
Yang Gao
Meituan
Xiaolu Zhang
Ant Group
JUN ZHOU
Ant Group
Abstract
Trajectory prediction aims to forecast an agent's future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations, resulting in the collapse of existing approaches. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose NATRA, a Noise-Agnostic framework capable of tackling the problem of TRAjectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. It optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Considering that the trajectory structure may be destroyed by optimizing mutual information alone, we introduce an additional reconstruction loss to preserve the structural information of the produced observed trajectories. Moreover, we propose a ranking loss to further enhance performance. Because NATRA does not rely on any specific module tailored to particular noise distributions, it can handle arbitrary types of noise in principle. Additionally, our proposed NATRA can be easily integrated into existing trajectory prediction models. Extensive experiments on both synthetic and real-world noisy datasets demonstrate the effectiveness of our method.
PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection
Xiao Li
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Yiming Zhu
University of Science and Technology Beijing
Yifan Huang
University of Science and Technology Beijing
Wei Zhang
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Yingzhe He
Huawei Technologies
Jie Shi
Huawei Technologies
Xiaolin Hu
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, THBI, Tsinghua University
Abstract
Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in ℓ∞ attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. Notably, it improved the detection accuracy by 29.7% over previous defense methods under one recent adversarial texture attack. Code is available at https://github.com/LixiaoTHU/oddefense-PatchAT
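The key idea is a composite perturbation: a small, gradient-guided adversarial patch combined with an imperceptible image-wide perturbation. Below is a rough PyTorch sketch of one such composite step; the patch-placement rule (highest gradient energy), the step sizes, and the toy stand-in "detector" are illustrative assumptions rather than the authors' exact recipe.

import torch
import torch.nn.functional as F

def composite_adv_step(model, images, targets, loss_fn,
                       eps_global=2/255, alpha=1/255, patch_size=32):
    """One composite perturbation step: an imperceptible global signed-gradient
    update plus a large-magnitude update restricted to a small patch region."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), targets)
    grad = torch.autograd.grad(loss, images)[0]

    # (a) global perturbation, kept within a small budget eps_global
    global_delta = (alpha * grad.sign()).clamp(-eps_global, eps_global)

    # (b) patch location: block with the largest accumulated gradient magnitude
    saliency = grad.abs().sum(dim=1, keepdim=True)                # B x 1 x H x W
    pooled = F.avg_pool2d(saliency, patch_size, stride=patch_size)
    B, _, Hp, Wp = pooled.shape
    idx = pooled.view(B, -1).argmax(dim=1)
    ys, xs = (idx // Wp) * patch_size, (idx % Wp) * patch_size

    adv = (images + global_delta).detach()
    for b in range(B):
        y, x = ys[b].item(), xs[b].item()
        # unconstrained (large-step) update inside the patch only
        adv[b, :, y:y+patch_size, x:x+patch_size] += \
            8/255 * grad[b, :, y:y+patch_size, x:x+patch_size].sign()
    return adv.clamp(0, 1)

# toy usage with a stand-in "detector" (a single conv) and MSE loss
model = torch.nn.Conv2d(3, 1, 3, padding=1)
imgs = torch.rand(2, 3, 128, 128)
tgts = torch.rand(2, 1, 128, 128)
adv = composite_adv_step(model, imgs, tgts, F.mse_loss)
print(adv.shape)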
PointGAC: Geometric-Aware Codebook for Masked Point Modeling
Abiao Li
Jiangxi University of Finance and Economics
Chenlei Lv
Shenzhen University
Yuming Fang
Jiangxi University of Finance and Economics
Yifan Zuo
Jiangxi University of Finance and Economics
Jian Zhang
University of Technology Sydney
Guofeng Mei
Fondazione Bruno Kessler
Abstract
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinates or features of masked regions. However, they tend to overconstrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC.
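The online k-means codebook is the piece that turns regression into cluster assignment. A minimal sketch of such an EMA-style online codebook is given below, assuming a hypothetical OnlineCodebook class; the momentum update and codebook size are generic choices, not the paper's exact maintenance mechanism.

import torch

class OnlineCodebook:
    """EMA-style online k-means codebook: teacher features pull their nearest
    codebook vectors toward the batch mean of the assigned cluster."""
    def __init__(self, num_codes=512, dim=256, momentum=0.99):
        self.codes = torch.randn(num_codes, dim)
        self.m = momentum

    @torch.no_grad()
    def assign(self, feats):                      # feats: N x dim
        d = torch.cdist(feats, self.codes)        # N x K distances
        return d.argmin(dim=1)                    # hard cluster ids

    @torch.no_grad()
    def update(self, feats):
        ids = self.assign(feats)
        for k in ids.unique():
            mean_k = feats[ids == k].mean(dim=0)
            self.codes[k] = self.m * self.codes[k] + (1 - self.m) * mean_k
        return ids

codebook = OnlineCodebook()
teacher_feats = torch.randn(1024, 256)      # features of unmasked patches
targets = codebook.update(teacher_feats)    # cluster ids the student should match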
Proactive Scene Decomposition and Reconstruction
Baicheng Li
School of Intelligence Science and Technology, Peking University
Zike Yan
AIR, Tsinghua University
Dong Wu
School of Intelligence Science and Technology, Peking University
Hongbin Zha
School of Intelligence Science and Technology, Peking University
Abstract
Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
Teng Li
College of Computer Science & Technology, Zhejiang University
Guangcong Zheng
College of Computer Science & Technology, Zhejiang University
Rui Jiang
College of Computer Science & Technology, Zhejiang University
Shuigen Zhan
College of Computer Science & Technology, Zhejiang University
Tao Wu
College of Computer Science & Technology, Zhejiang University
Yehao Lu
College of Computer Science & Technology, Zhejiang University
Yining Lin
Supremind
Chuanyun Deng
Central Media Technology Institute, 2012 Lab, Huawei
Yepan Xiong
Central Media Technology Institute, 2012 Lab, Huawei
Min Chen
Central Media Technology Institute, 2012 Lab, Huawei
Lin Cheng
Central Media Technology Institute, 2012 Lab, Huawei
Xi Li
College of Computer Science & Technology, Zhejiang University
Abstract
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scales, ensuring compatibility and scale consistency across diverse real-world images. In inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and also allows the framework to maintain dynamic and coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and out-of-domain images. We further enable applications like camera-controlled looping video generation and generative frame interpolation. Project page: zgctroy.github.io/RealCam-I2V.
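One way to ground the relative-to-metric scaling step is to align depths of the same pixels under both scales and take a robust ratio. The sketch below is only an illustration under that assumption (median-ratio alignment); the paper's actual alignment procedure may differ.

import numpy as np

def metric_scale_factor(relative_depths, metric_depths, eps=1e-6):
    """Estimate a single global scale mapping relative-scale geometry to metric
    scale via the median ratio between a monocular metric depth map and depths
    of the same pixels in the relative reconstruction (illustrative only)."""
    ratios = metric_depths / np.maximum(relative_depths, eps)
    return float(np.median(ratios))

# toy usage: 500 sampled pixels with both a relative and a metric depth value
rel = np.random.uniform(0.5, 2.0, 500)                 # unit-less relative depths
met = 3.2 * rel + np.random.normal(0, 0.05, 500)       # "metric" depths in meters
s = metric_scale_factor(rel, met)
relative_translations = np.random.randn(10, 3)         # stand-in relative camera poses
metric_translations = s * relative_translations        # rescaled to metric units
print(round(s, 2))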
Robust Low-light Scene Restoration via Illumination Transition
Ze Li
The Hong Kong University of Science and Technology, Hong Kong SAR
Feng Zhang
Nanjing University of Posts and Telecommunications, Nanjing, China
Xiatian Zhu
University of Surrey, Guildford, United Kingdom
Meng Zhang
The Hong Kong University of Science and Technology, Hong Kong SAR
Yanghong Zhou
The Hong Kong Polytechnic University, Hong Kong SAR
P. Y. Mok
The Hong Kong University of Science and Technology, Hong Kong SAR
Abstract
Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illumination transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illumination transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at https://pegasus2004.github.io/RoSe.
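A common way to impose a low-rank constraint on a learned field is to penalize its nuclear norm (the sum of singular values). The PyTorch sketch below shows this generic surrogate; the tensor shape and weight are assumptions, and the paper's low-rank denoising module is presumably more specific.

import torch

def low_rank_penalty(transition, rank_weight=1e-3):
    """Encourage a low-rank illumination transition representation by
    penalizing its nuclear norm (generic low-rank surrogate, illustrative)."""
    mat = transition.flatten(1)            # e.g. (num_rays, feature_dim)
    singular_values = torch.linalg.svdvals(mat)
    return rank_weight * singular_values.sum()

# toy usage: a batch of per-ray transition features
transition = torch.randn(2048, 16, requires_grad=True)
loss = low_rank_penalty(transition)
loss.backward()   # gradients flow back into the transition representation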
SAS: Segment Any 3D Scene with Integrated 2D Priors
Zhuoyuan Li
University of Science and Technology of China
Jiahao Lu
University of Science and Technology of China
Jiacheng Deng
University of Science and Technology of China
Hanzhi Chang
University of Science and Technology of China
Lifan Wu
University of Science and Technology of China
Yanzhe Liang
University of Science and Technology of China
Tianzhu Zhang
Deep Space Exploration Laboratory
Abstract
The open-vocabulary capability of 3D models is increasingly valued, as traditional methods trained on fixed categories fail to recognize unseen objects in complex, dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open-vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify each 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open-vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
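The fusion step can be pictured as weighting each 2D model's per-point features by its per-category capability. The sketch below is a simplified, hypothetical fusion rule built on that idea; the function name, shapes, and softmax weighting are assumptions, not the paper's formulation.

import torch

def capability_weighted_fusion(point_feats, capabilities, class_probs):
    """Fuse per-point features from several 2D models, weighting each model by
    its capability for the classes each point is likely to belong to.
    point_feats: M x N x D, capabilities: M x C, class_probs: N x C."""
    weights = class_probs @ capabilities.t()          # N x M raw model scores
    weights = torch.softmax(weights, dim=1).t()       # M x N normalized weights
    return (weights.unsqueeze(-1) * point_feats).sum(dim=0)   # N x D fused features

feats = torch.randn(2, 10000, 512)      # two 2D models, 10k projected points
caps = torch.rand(2, 20)                # per-model capability over 20 classes
probs = torch.softmax(torch.randn(10000, 20), dim=1)
fused = capability_weighted_fusion(feats, caps, probs)
print(fused.shape)                      # torch.Size([10000, 512])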
SD2Actor: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation
Jiayi Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Language-conditioned robot manipulation in the continuous spectrum presents a persistent challenge due to the difficulty of mapping states to target actions. Previous methods face limitations in effectively modeling object states, primarily due to their reliance on executing ambiguous instructions devoid of explicit state information. In response, we present SD2Actor, a zero-shot robotic manipulation framework that possesses the capability to generate precise actions in continuous states. Specifically, given novel instructions, we aim to generate instruction-following and accurate robot manipulation actions. Instead of time-consuming optimization and finetuning, our zero-shot method generalizes to any object state with a wide range of translations and versatile rotations. At its core, we quantify multiple base states in the training set and utilize their combination to refine the target action generated by the diffusion model. To obtain novel state representations, we initially employ LLMs to extract the novel state from the instruction and decompose it into multiple learned base states. We then employ the linear combination of base state embeddings to produce novel state features. Moreover, we introduce an orthogonalization loss to constrain the state embedding space, which ensures the validity of linear interpolation. Experiments demonstrate that SD2Actor outperforms state-of-the-art methods across a diverse range of manipulation tasks in the ARNOLD Benchmark. Moreover, SD2Actor can effectively learn generalizable policies from a limited number of human demonstrations, achieving promising accuracy in a variety of real-world manipulation tasks.
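Two ingredients of this pipeline lend themselves to a short sketch: composing a novel state as a linear combination of base state embeddings, and an orthogonalization penalty that keeps those bases well separated. The code below is a generic illustration under those assumptions; the number of base states, dimensionality, and exact loss form are not taken from the paper.

import torch

def orthogonality_loss(base_embeddings):
    """Penalize off-diagonal correlations between base state embeddings so that
    linear combinations of them remain well behaved (a common recipe; the
    paper's exact orthogonalization loss may differ)."""
    E = torch.nn.functional.normalize(base_embeddings, dim=1)   # K x D
    gram = E @ E.t()                                            # K x K
    off_diag = gram - torch.eye(E.size(0))
    return (off_diag ** 2).mean()

# novel state as a weighted combination of learned base states
base = torch.randn(8, 64, requires_grad=True)     # 8 base states, 64-d each
weights = torch.softmax(torch.randn(8), dim=0)    # e.g. produced from an LLM parse
novel_state = weights @ base                      # 64-d novel state embedding
loss = orthogonality_loss(base)
loss.backward()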
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li
University of Amsterdam
Qi Ma
ETH Zürich
Runyi Yang
INSAIT, Sofia University 'St. Kliment Ohridski'
Huapeng Li
ETH Zürich
Mengjiao Ma
Nanjing University of Aeronautics and Astronautics
Bin Ren
INSAIT, Sofia University 'St. Kliment Ohridski'
Nikola Popovic
INSAIT, Sofia University 'St. Kliment Ohridski'
Nicu Sebe
University of Trento
Ender Konukoglu
ETH Zürich
Theo Gevers
University of Amsterdam
Martin R. Oswald
University of Amsterdam
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines. Our code, model, and datasets will be released at SceneSplat.
ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
Abstract
Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI's superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
Jinxi Li
LAR Group, The Hong Kong Polytechnic University
Ziyang Song
LAR Group, The Hong Kong Polytechnic University
Bo Yang
LAR Group, The Hong Kong Polytechnic University
Abstract
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters. Our datasets and code are available at https://github.com/vLAR-group/TRACE.
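The segmentation-by-clustering property mentioned at the end is easy to picture: particles moving under the same rigid motion share similar learned parameters, so off-the-shelf clustering separates objects or parts. A minimal sketch, assuming a hypothetical 7-dimensional per-particle parameter vector and a known number of clusters:

import numpy as np
from sklearn.cluster import KMeans

# hypothetical per-particle physical parameters learned by the model,
# e.g. translational velocity (3), angular velocity (3), and a damping term (1)
num_particles = 5000
phys_params = np.random.randn(num_particles, 7)

# particles governed by the same rigid motion end up with similar parameters,
# so simple k-means clustering yields an object / part segmentation
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(phys_params)
print(np.bincount(labels))   # particle count per discovered object or part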
Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Changhao Li
School of Computational Science and Engineering, Georgia Institute of Technology
Xinrui Chen
Shenzhen International Graduate School, Tsinghua University
Ji Wang
School of Software, Tsinghua University
Kang Zhao
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua University
Jianfei Chen
Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua University
Abstract
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our project is publicly accessible at https://dfq-dojo.github.io/dfq-toolkit-web.
Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory
Daixun Li
Xidian University
Yusi Zhang
Xidian University
Mingxiang Cao
Xidian University
Donglai Liu
Xidian University
Weiying Xie
Xidian University
Tianlin Hui
Xidian University
Lunkai Lin
AgileX Robotics
Zhiqiang Xie
AgileX Robotics
Yunsong Li
Xidian University
Abstract
Vision-Language-Action (VLA) is crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we propose MindExplore, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution; this task-oriented acting enables outstanding generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed for planning long-horizon task sequences and providing meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, guided by these signals and multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. It also integrates a lightweight Multimodal Diffusion Policy (MMDP) to enhance spatial perception by fusing multi-visual-modality features. Besides, a pioneering memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create SandGo-1k and SandThink-21k, the first expert-level multimodal embodied dataset and CoT dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01× more successful than existing methods in unstructured and dynamic environments.
Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process
Yuanze Li
Harbin Institute of Technology
Shihao Yuan
Harbin Institute of Technology
Haolin Wang
Harbin Institute of Technology
Qizhang Li
Harbin Institute of Technology
Ming Liu
Harbin Institute of Technology
Chen Xu
Pengcheng Lab, Guangzhou
Guangming Shi
Pengcheng Lab, Guangzhou
Wangmeng Zuo
Harbin Institute of Technology
Abstract
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at https://github.com/tzjtatata/Triad.
U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
Xiaofan Li
Baidu Inc.
Zhihao Xu
Baidu Inc.
Chenming Wu
Baidu Inc.
Zhao Yang
Baidu Inc.
Yumeng Zhang
Baidu Inc.
Jiang-Jiang Liu
Baidu Inc.
Haibao Yu
Baidu Inc.
Xiaoqing Ye
Baidu Inc.
Yuan Wang
Baidu Inc.
Shirui Li
Baidu Inc.
Xun Sun
Baidu Inc.
Ji Wan
Baidu Inc.
Jun Wang
Baidu Inc.
Abstract
Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.
UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Peiming Li
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Ziyi Wang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Yulin Yuan
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Xiangming Meng
The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University
Junsong Yuan
State University of New York at Buffalo
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
Zhixuan Li
College of Computing and Data Science, Nanyang Technological University
Hyunse Yoon
Department of Electrical and Electronic Engineering, Yonsei University
Sanghoon Lee
Department of Electrical and Electronic Engineering, Yonsei University
Weisi Lin
College of Computing and Data Science, Nanyang Technological University
Abstract
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset are released on this page.
Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation
Chen Liang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Zhicheng Shi
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Wenguan Wang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Yi Yang
State Key Lab of Brain-Machine Intelligence, Zhejiang University
Abstract
Language-based human motion understanding focuses on describing human motions using natural language descriptions. Conversely, human motion generation aims to generate human motions from textual inputs. Despite significant progress in both fields, further advancements are hindered by two primary challenges: i) Both tasks rely heavily on vast amounts of paired motion-language data for model training. However, human labeling is costly, making it increasingly unsustainable as model scales increase. ii) Existing models often learn the two tasks in parallel. The strong reciprocity between them has not been fully explored. In response, this work proposes Dual Reciprocal Learning (DRL) for language-based human motion understanding and generation. DRL establishes a symmetric learning framework where both tasks collaboratively evolve in a closed-loop, bootstrapping manner, effectively leveraging the reciprocity between them. In DRL, the tasks serve as evaluators for each other, enabling the generation of informative feedback signals even with easily acquired unpaired, unidirectional motion or language data. Furthermore, to mitigate dataset-specific bias in existing evaluations, we propose a generalized protocol that extends evaluation to a general-domain cross-modal feature space. Experimental results on standard benchmarks demonstrate that DRL achieves remarkable performance boosts over representative baselines in both tasks across evaluation protocols.
Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion
Quanmin Liang
School of Computer Science and Engineering, Sun Yat-Sen University
Qiang Li
Xpeng Motors Technology Co Ltd
Shuai Liu
School of Computer Science and Engineering, Sun Yat-Sen University
Xinzi Cao
School of Computer Science and Engineering, Sun Yat-Sen University
Jinyi Lu
School of Computer Science and Engineering, Sun Yat-Sen University
Feidiao Yang
Department of Intelligent Computing, Pengcheng Laboratory
Wei Zhang
Department of Intelligent Computing, Pengcheng Laboratory
Kai Huang
School of Computer Science and Engineering, Sun Yat-Sen University
Yonghong Tian
Department of Intelligent Computing, Pengcheng Laboratory
Abstract
Applying the pretraining-finetuning paradigm to event cameras presents significant challenges due to the scarcity of large-scale event datasets and the inherently sparse nature of event data, which increases the risk of overfitting during extensive pretraining. In this paper, we explore the transfer of pretrained image knowledge to the domain of event cameras to address this challenge. The key to our approach lies in adapting event data representations to align with image pretrained models while simultaneously integrating spatiotemporal information and mitigating data sparsity. To achieve this, we propose a lightweight SpatioTemporal information fusion Prompting (STP) method, which progressively fuses the spatiotemporal characteristics of event data through a dynamic perception module with multi-scale spatiotemporal receptive fields, enabling compatibility with image pretrained models. STP enhances event data representation by capturing local information within a large receptive field and performing global information exchange along the temporal dimension. This strategy effectively reduces sparse regions in event data while refining fine-grained details, all while preserving its inherent spatiotemporal structure. Our method significantly outperforms previous state-of-the-art approaches across classification, semantic segmentation, and optical flow estimation tasks. For instance, it achieves a top-1 accuracy of 68.87% (+4.04%) on N-ImageNet with only 1/10 of the pretraining parameters and 1/3 of the training epochs. Our code is available at https://github.com/Lqm26/STP.
EventUPS: Uncalibrated Photometric Stereo Using an Event Camera
Jinxiu Liang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Bohan Yu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Siqi Yang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Haotian Zhuang
Tsinghua University
Jieji Ren
Shanghai Jiaotong University
Peiqi Duan
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
We present EventUPS, the first uncalibrated photometric stereo (UPS) method using an event camera, a neuromorphic sensor that asynchronously detects brightness changes with microsecond resolution. Traditional frame-based UPS methods are hindered by high bandwidth demands and limited use in dynamic scenes. These methods require dense image correspondence under varying illumination and are incompatible with the fundamentally different sensing paradigm of event data. Our approach introduces three key innovations: an augmented null space formulation that directly relates each event to joint constraints on surface normals and lighting, naturally handling ambient illumination; a continuous parameterization of time-varying illumination that connects asynchronous events to synchronized lighting estimation; and a lighting fixture with known relative geometry that reduces ambiguity to a convex-concave uncertainty. We validate EventUPS using a custom-built LED lighting system. Experimental results show that our method achieves accuracy surpassing its frame-based counterpart while requiring only 5% of the data bandwidth.
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo Liang
The Chinese University of Hong Kong
Yiwu Zhong
The Chinese University of Hong Kong
Zi-Yuan Hu
The Chinese University of Hong Kong
Yeyao Tao
The Chinese University of Hong Kong
Liwei Wang
The Chinese University of Hong Kong
Abstract
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask.
Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion
Jiawei Liang
Shenzhen Campus of Sun Yat-sen University
Siyuan Liang
Nanyang Technological University
Tianrui Lou
Shenzhen Campus of Sun Yat-sen University
Ming Zhang
National Key Laboratory of Science and Technology on Information System Security
Wenjin Li
Nsfocus
Dunqiu Fan
Nsfocus
Xiaochun Cao
Shenzhen Campus of Sun Yat-sen University
Abstract
Object detection is widely used in real-world applications such as autonomous driving, yet adversarial camouflage poses a significant threat by deceiving detectors from multiple viewpoints. Existing techniques struggle to maintain consistent attack efficacy across different viewpoints. To address this, we propose GRAC, an adversarial camouflage framework that enhances attack effectiveness across viewpoints and distances. First, we identify conflicts in gradient updates across angles and introduce gradient reweighting to resolve them, enabling coordinated optimization. Second, we model light interactions to simulate illumination changes, improving robustness under varying lighting conditions. Additionally, we address non-uniform texture updates arising from inconsistent sampling density during rendering by applying pooling-based texture regularization to improve smoothness. Extensive experiments in both simulated and physical environments demonstrate that GRAC outperforms existing methods across diverse conditions.
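Gradient conflicts across viewpoints can be illustrated with a projection-style combination rule, where any per-view gradient component that points against another view's gradient is removed before averaging. The PCGrad-style sketch below shows this generic idea only; it is not claimed to be GRAC's exact reweighting rule, and the texture shape is hypothetical.

import torch

def reweight_conflicting_grads(grads):
    """Combine per-viewpoint gradients while removing pairwise conflicts by
    projecting out components that oppose another view's gradient."""
    out = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g.flatten(), h.flatten())
            if dot < 0:  # conflicting directions: remove the opposing component
                g -= dot / (h.norm() ** 2 + 1e-12) * h
        out.append(g)
    return torch.stack(out).mean(dim=0)

# toy usage: attack-loss gradients of a 3 x 64 x 64 texture from three viewpoints
per_view = [torch.randn(3, 64, 64) for _ in range(3)]
texture_update = reweight_conflicting_grads(per_view)
print(texture_update.shape)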
Instance-Level Video Depth in Groups Beyond Occlusions
Yuan Liang
South China University of Technology
Yang Zhou
South China University of Technology
Ziming Sun
South China University of Technology
Tianyi Xiang
South China University of Technology
Guiqing Li
South China University of Technology
Shengfeng He
Singapore Management University
Abstract
Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem from two key aspects: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which assigns a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state-of-the-art on the GID dataset and multiple benchmarks. Our code and dataset can be found at https://github.com/ViktorLiang/GID.
Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
Yingping Liang
Beijing Institute of Technology
Yutao Hu
School of Computer Science and Engineering, Southeast University
Wenqi Shao
Shanghai AI Laboratory
Ying Fu
Beijing Institute of Technology
Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching. Code is available at https://github.com/Sharpiless/L2M.
Perspective-Invariant 3D Object Detection
Ao Liang
National University of Singapore
Lingdong Kong
National University of Singapore
Dongyue Lu
National University of Singapore
Youquan Liu
Fudan University
Jian Fang
Shenyang Institute of Automation, Chinese Academy of Sciences
Huaici Zhao
Shenyang Institute of Automation, Chinese Academy of Sciences
Wei Tsang Ooi
National University of Singapore
Abstract
With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available. (Ao, Lingdong, and Dongyue contributed equally to this work.)
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Tianming Liang
Sun Yat-sen University
Kun-Yu Lin
Sun Yat-sen University
Chaolei Tan
Sun Yat-sen University
Jianguo Zhang
Southern University of Science and Technology
Wei-Shi Zheng
Sun Yat-sen University
Jian-Fang Hu
Sun Yat-sen University
Abstract
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% J&F on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
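Confidence-aware query pruning can be pictured as keeping only the top-scoring decoder queries before the expensive mask-decoding step. The PyTorch sketch below illustrates that generic idea; the keep ratio, query count, and scoring are assumptions rather than ReferDINO's exact schedule.

import torch

def prune_queries(query_feats, confidences, keep_ratio=0.25):
    """Keep only the most confident object queries before mask decoding."""
    num_keep = max(1, int(keep_ratio * query_feats.size(1)))
    topk = confidences.topk(num_keep, dim=1).indices               # B x num_keep
    idx = topk.unsqueeze(-1).expand(-1, -1, query_feats.size(-1))
    return query_feats.gather(1, idx), topk

queries = torch.randn(2, 300, 256)     # B x N x D decoder queries
scores = torch.rand(2, 300)            # per-query confidence scores
kept, kept_ids = prune_queries(queries, scores)
print(kept.shape)                      # torch.Size([2, 75, 256])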
Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement
Qian Liang
University of Science and Technology of China
Ruixu Geng
University of Science and Technology of China
Jinbo Chen
Nanyang Technological University
Haoyu Wang
University of Science and Technology of China
Yan Chen
University of Science and Technology of China
Yang Hu
University of Science and Technology of China
Abstract
Remote physiological measurement (RPM) based on video and radar has made significant progress in recent years. However, unimodal methods based solely on video or radar sensors have notable limitations due to their measurement principles, and multimodal RPM that combines these modalities has emerged as a promising direction. Despite its potential, the lack of large-scale multimodal data and the significant modality gap between video and radar pose substantial challenges in building robust video-radar RPM models. To handle these problems, we suggest leveraging unimodal pre-training and present the Spatial alignment and Temporal Matching (SATM) Adapter to effectively fine-tune pre-trained unimodal backbones into a multimodal RPM model. Given the distinct measurement principles of video- and radar-based methods, we propose Spatial Alignment to align the spatial distribution of their features. Furthermore, Temporal Matching is applied to mitigate waveform discrepancies between video and radar signals. By integrating these two modules into adapters, the unimodal backbones could retain their modality-specific knowledge while effectively extracting complementary features from each other. Extensive experiments across various challenging scenarios, including low light conditions and head motions, demonstrate that our approach significantly surpasses the state-of-the-art methods.
UniDxMD: Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
Zhengyin Liang
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University
Hui Yin
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University
Min Liang
Beijing University of Technology
Qianqian Du
Beijing Jiaotong University
Ying Yang
Beijing Jiaotong University
Hua Huang
Beijing Jiaotong University
Abstract
Modality or domain distribution shifts pose formidable challenges in 3D semantic segmentation. Existing methods predominantly address either cross-modal or cross-domain adaptation in isolation, leading to insufficient exploration of semantic associations and complementary features in heterogeneous data. To bridge this gap, we present UniDxMD, a unified representation method for cross-modal unsupervised domain adaptation (UDA) in 3D semantic segmentation that simultaneously tackles both cross-modal and cross-domain adaptation objectives. Our core insight is deriving a unified discrete representation from heterogeneous data to mitigate distribution shifts, inspired by vector quantization. Specifically, we propose a differentiable, cluster-based soft quantization mechanism (CSQM) that maps heterogeneous data (spanning modalities and domains) into a shared discrete latent space. Then, we introduce latent space regularization (LSR), leveraging joint prototypes that satisfy semantic relation consistency as learnable anchors to enhance the compactness and semantic discriminability of the discrete latent space. Our method paves the way for advancing cross-modal UDA in 3D semantic segmentation towards the unified representation. Extensive results across four challenging cross-modal UDA scenarios demonstrate the superiority of our method. Code is available here.
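A cluster-based soft quantization can be written as a softmax-weighted mixture over a shared codebook, which keeps the mapping differentiable. The sketch below is a generic soft-VQ formulation used to illustrate CSQM; the codebook size, temperature, and feature dimensions are assumptions.

import torch

def soft_quantize(features, codebook, temperature=0.1):
    """Differentiable cluster-based soft quantization: each feature becomes a
    softmax-weighted mixture of codebook vectors (shared discrete latent space)."""
    d = torch.cdist(features, codebook)                 # N x K distances
    assign = torch.softmax(-d / temperature, dim=1)     # soft cluster assignment
    quantized = assign @ codebook                       # N x D quantized features
    return quantized, assign

feats_2d = torch.randn(4096, 96)     # e.g. image-branch features
feats_3d = torch.randn(20000, 96)    # e.g. point-branch features
codebook = torch.randn(256, 96, requires_grad=True)
q2d, a2d = soft_quantize(feats_2d, codebook)
q3d, a3d = soft_quantize(feats_3d, codebook)   # both modalities share one latent space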
I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Zhimin Liao
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ping Wei
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ruijie Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Shuaijia Chen
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Haoxuan Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ziyang Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offer substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this issue, we propose I2-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, I2-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on the transformation matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that I2-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires only 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available at https://github.com/lzzzzzm/II-World.
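Multi-scale residual quantization means each stage quantizes whatever the previous stages failed to explain, so coarse structure and fine detail land in different codebooks. A generic residual-VQ sketch is shown below to illustrate the intra-scene tokenizer idea; the number of stages, codebook sizes, and feature shapes are assumptions, not the architecture's actual settings.

import torch

def residual_quantize(x, codebooks):
    """Multi-stage residual quantization: each stage quantizes the residual left
    by the previous stages and emits one set of token ids per stage."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:                                  # one codebook per stage/scale
        ids = torch.cdist(residual, cb).argmin(dim=1)     # nearest code per feature
        q = cb[ids]
        recon = recon + q
        residual = residual - q
        codes.append(ids)
    return recon, codes

scene_feats = torch.randn(1024, 128)                      # flattened 3D scene features
codebooks = [torch.randn(512, 128) for _ in range(3)]     # three quantization stages
recon, codes = residual_quantize(scene_feats, codebooks)
print(len(codes), recon.shape)                            # 3 torch.Size([1024, 128])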
LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
Wei Liao
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Chunyan Xu
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Chenxu Wang
Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Zhen Cui
Beijing Normal University, Beijing, China
Abstract
Sparse annotation in remote sensing object detection poses significant challenges due to dense object distributions and category imbalances. Although existing Dense Pseudo-Label methods have demonstrated substantial potential in pseudo-labeling tasks, they remain constrained by selection ambiguities and inconsistencies in confidence estimation. In this paper, we introduce an LLM-assisted semantic guidance framework tailored for sparsely annotated remote sensing object detection, exploiting the advanced semantic reasoning capabilities of large language models (LLMs) to distill high-confidence pseudo-labels. By integrating LLM-generated semantic priors, we propose a Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust supervision across varying data distributions. Additionally, we develop an Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning branch by mitigating the influence of confounding background information. Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method outperforms existing single-stage detector-based frameworks, significantly improving detection performance under sparse annotations. Our source code is available at https://github.com/wuxiuzhilianni/RSST.
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
Xinyao Liao
Nanyang Technological University
Xianfang Zeng
StepFun
Liao Wang
StepFun
Gang Yu
StepFun
Guosheng Lin
Nanyang Technological University
Chi Zhang
Westlake University
Abstract
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text, and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. After that, an optional rethinking step can be adopted to ensure the generated video aligns well with the motion information in the prompt. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We further construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
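The analytical optical flow composition is described only at a high level; as a generic illustration of its camera-motion component, the sketch below computes the flow induced by a relative camera pose given per-pixel depth under a pinhole model. The function name, the pinhole assumption, and the toy inputs are all assumptions; object-trajectory flow would be composed on top in a full system.

```python
# Generic sketch: optical flow induced by camera motion (R, t) given depth and
# intrinsics K, via back-projection and reprojection. Illustrative only.
import numpy as np

def camera_flow(depth: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, HW)
    # Back-project to 3D in the first camera, move to the second camera, reproject.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # (3, HW)
    pts2 = R @ pts + t.reshape(3, 1)
    proj = K @ pts2
    uv2 = proj[:2] / proj[2:3]
    return (uv2 - pix[:2]).T.reshape(h, w, 2)                           # (H, W, 2)

K = np.array([[500.0, 0, 64], [0, 500.0, 64], [0, 0, 1]])
flow = camera_flow(np.full((128, 128), 2.0), K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(flow.mean(axis=(0, 1)))  # constant horizontal flow for a pure x-translation
```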
CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
Xiao Lin
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Yun Peng
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Liuyi Wang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Xianyou Zhong
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Minghao Zhu
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Yi Feng
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Jingwei Yang
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Chengju Liu
College of Electronic and Information Engineering, Tongji University, Shanghai, China
Qijun Chen
State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai, China
Abstract
In the effort to achieve robust and generalizable category-level object pose estimation, recent methods primarily focus on learning fundamental representations from data. However, the inherent biases within the data are often overlooked: the repeated training samples and similar environments may mislead the models to over-rely on specific patterns, hindering models' performance on novel instances. In this paper, we present CleanPose, a novel method that mitigates the data biases to enhance category-level pose estimation by integrating causal learning and knowledge distillation. By incorporating key causal variables (structural information and hidden confounders) into causal modeling, we propose the causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further confront the data bias at the feature level, we devise a residual-based knowledge distillation approach to transfer unbiased semantic knowledge from a 3D foundation model, providing comprehensive causal supervision. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be available at https://github.com/chrislin0621/CleanPose.
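For reference, front-door adjustment as invoked above is the standard identity below, where Z mediates the effect of the input X on the prediction Y; mapping these symbols onto CleanPose's structural information and confounders is an illustrative reading, not the paper's exact formulation.

```latex
% Textbook front-door adjustment identity; the correspondence of X, Z, Y to
% CleanPose's variables is an illustrative assumption.
P\big(Y \mid \mathrm{do}(X = x)\big)
  = \sum_{z} P(z \mid x) \sum_{x'} P\big(Y \mid x', z\big)\, P(x')
```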
ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring
Xiaopeng Lin
The Hong Kong University of Science and Technology (Guangzhou)
Yulong Huang
The Hong Kong University of Science and Technology (Guangzhou)
Hongwei Ren
The Hong Kong University of Science and Technology (Guangzhou)
Zunchang Liu
The Hong Kong University of Science and Technology (Guangzhou)
Hongxiang Huang
The Hong Kong University of Science and Technology (Guangzhou)
Yue Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Haotian Fu
The Hong Kong University of Science and Technology (Guangzhou)
Bojun Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjust neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios dynamically. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
Chen Lin
CCB & CCM, Flatiron Institute
Weizhi Du
University of Michigan, Ann Arbor
Zhixiang Min
Stevens Institute of Technology
Baochen She
Stanford University
Enrique Dunn
Stevens Institute of Technology
Sonya M. Hanson
CCB & CCM, Flatiron Institute
Abstract
We explore a quaternion adjugate matrix-based representation for rotational motion in the Perspective-n-Point (PnP) problem. Leveraging quadratic quaternion terms within a Determinant Ratio Matrix (DRaM) estimation framework, we extend its application to perspective scenarios, providing a robust and efficient initialization for iterative PnP pose estimation. Notably, by solving the orthographic projection least-squares problem, DRaM provides a reliable initialization that enhances the accuracy and stability of iterative PnP solvers. Experiments on synthetic and real data demonstrate its efficiency, accuracy, and robustness, particularly under high noise conditions. Furthermore, our nonminimal formulation ensures numerical stability, making it effective for real-world applications.
Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
Junru Lin
University of Toronto
Chirag Vashist
Stanford University
Mikaela Angelina Uy
Nvidia
Colton Stearns
Nvidia
Xuan Luo
Google
Leonidas Guibas
Stanford University
Ke Li
Simon Fraser University
Abstract
Existing dynamic scene interpolation methods typically assume that the motion between consecutive timesteps is small enough so that displacements can be locally approximated by linear models. In practice, even slight deviations from this small-motion assumption can cause conventional techniques to fail. In this paper, we introduce Global Motion Corresponder (GMC), a novel approach that robustly handles large motion and achieves smooth transitions. GMC learns unary potential fields that predict SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity. We demonstrate that our method significantly outperforms existing baselines on 3D scene interpolation when the two states undergo large global motions. Furthermore, our method enables extrapolation capabilities where other baseline methods cannot.
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Zijun Lin
Nanyang Technological University
Shuting He
Shanghai University of Finance and Economics
Cheston Tan
Centre for Frontier AI Research, A*STAR
Bihan Wen
Nanyang Technological University
Abstract
Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as 'it', 'here' and 'the same' to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow, a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.
MCOP: Multi-UAV Collaborative Occupancy Prediction
Zefu Lin
University of Chinese Academy of Sciences (UCAS)
Wenbo Chen
Institute of Automation, Chinese Academy of Sciences (CASIA)
Xiaojuan Jin
Institute of Automation, Chinese Academy of Sciences (CASIA)
Yuran Yang
Beijing University of Posts and Telecommunications (BUPT)
Lue Fan
Institute of Automation, Chinese Academy of Sciences (CASIA)
Yixin Zhang
Tencent
Yufeng Zhang
University of Chinese Academy of Sciences (UCAS)
Zhaoxiang Zhang
University of Chinese Academy of Sciences (UCAS)
Abstract
Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experimental results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.
Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception
Hongwei Lin
Xiamen University
Dongyu Pan
Xiamen University
Qiming Xia
Xiamen University
Hai Wu
Xiamen University
Cheng Wang
Xiamen University
Siqi Shen
Xiamen University
Chenglu Wen
Xiamen University
Abstract
Recently, learning-based multi-agent cooperative perception has garnered widespread attention. However, the inherent vulnerabilities of neural networks, combined with the risks posed by cooperative communication as a wide-open backdoor, render these systems highly susceptible to adversarial attacks. Existing attack methods lack stealth as they perturb transmitted information indiscriminately, producing numerous false positives that are readily detected by consensus-based defenses. This paper proposes Pretend Benign (PB), a novel stealthy adversarial attack method that exploits vulnerabilities in cooperative perception to enable the attacker to disguise itself as a benign cooperator. To achieve this, we first introduce the Attack Region Selection (ARS) module, which divides the perception area into subregions based on confidence levels to pinpoint optimal attack locations. Then, we propose Multi-target Adversarial Perturbation Generation (MAPG), which maintains consensus, gains the victim's trust, and thereby reverses the normal cooperative role of perception. To mitigate the latency in adversarial signal generation and communication, we further propose a real-time attack by predicting future information through historical feature flow. Extensive experiments on the OPV2V and V2XSet datasets demonstrate that PB effectively bypasses state-of-the-art defense methods, underscoring its stealth and efficacy.
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin
University of Science and Technology of China
Mengqi Huang
University of Science and Technology of China
Shuhan Zhuang
University of Science and Technology of China
Zhendong Mao
University of Science and Technology of China
Abstract
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task. Project Page: realgeneral web; GitHub Link: https://github.com/Lyne1/RealGeneral
SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
Shengjie Lin
Toyota Technological Institute at Chicago
Jiading Fang
Toyota Technological Institute at Chicago
Muhammad Zubair Irshad
Toyota Research Institute
Vitor Campagnolo Guizilini
Toyota Research Institute
Rares Andrei Ambrus
Toyota Research Institute
Greg Shakhnarovich
Toyota Technological Institute at Chicago
Matthew R. Walter
Toyota Technological Institute at Chicago
Abstract
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SPLART, a self-supervised, category-agnostic framework that uses 3D Gaussian Splatting (3DGS) to reconstruct and infer the kinematics of articulated objects from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SPLART augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SPLART exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SPLART's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Xinran Ling
AMAP, Alibaba Group
Chen Zhu
AMAP, Alibaba Group
Meiqi Wu
AMAP, Alibaba Group
Hangyu Li
AMAP, Alibaba Group
Xiaokun Feng
CRISE, Institute of Automation, Chinese Academy of Sciences
Cundian Yang
AMAP, Alibaba Group
Aiming Hao
AMAP, Alibaba Group
Jiashu Zhu
AMAP, Alibaba Group
Jiahong Wu
AMAP, Alibaba Group
Xiangxiang Chu
AMAP, Alibaba Group
Abstract
Video generation has advanced rapidly, along with its evaluation methods, yet assessing the motion of generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics, we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation, a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism, we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we release VMBench at https://github.com/AMAP-ML/VMBench, setting a new standard for evaluating and advancing motion generation models.
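The benchmark's headline number is a Spearman rank correlation between automatic motion metrics and human preference annotations; the snippet below shows how such a correlation is typically computed with SciPy, using toy scores.

```python
# Minimal sketch of validating a motion metric against human annotations via
# Spearman's rank correlation. The score arrays are toy data.
from scipy.stats import spearmanr

metric_scores = [0.62, 0.35, 0.80, 0.51, 0.44]   # automatic motion-quality scores
human_scores  = [4, 2, 5, 3, 3]                  # human preference ratings

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```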
4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
Ling Liu
IIIS, Tsinghua University
Jun Tian
IIIS, Tsinghua University
Li Yi
IIIS, Tsinghua University
Abstract
4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu
Northwestern Polytechnical University
Qizhi Chen
Shanghai AI Laboratory
Zhigang Wang
Shanghai AI Laboratory
Yiwen Tang
Northwestern Polytechnical University
Yiting Zhang
Shanghai AI Laboratory
Chi Yan
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Xuelong Li
TeleAI
Bin Zhao
Northwestern Polytechnical University
Abstract
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code will be released at https://github.com/Ideal-ljl/AerialVG.
CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
Changxing Liu
Shanghai Jiao Tong University
Genjia Liu
Shanghai Jiao Tong University
Zijun Wang
Shanghai Jiao Tong University
Jinchang Yang
Shanghai Jiao Tong University
Siheng Chen
Shanghai Jiao Tong University
Abstract
Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under a critic-feedback paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. Code will be released on https://github.com/cxliu0314/CoLMDriver.
Controllable 3D Outdoor Scene Generation via Scene Graphs
Yuheng Liu
Texas A&M University
Xinke Li
City University of Hong Kong
Yuning Zhang
Southwest Jiaotong University
Lu Qi
UC Merced
Xin Li
Texas A&M University
Wenping Wang
Texas A&M University
Chongshou Li
Southwest Jiaotong University
Xueting Li
NVIDIA
Ming-Hsuan Yang
UC Merced
Abstract
Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving and gaming. However, current methods offer limited or non-intuitive user control. In this work, we propose a method that uses a scene graph as a user-friendly control format to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense Bird's Eye View (BEV) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. Users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs. Code is available at https://github.com/yuhengliu02/control-3d-scene.
CountSE: Soft Exemplar Open-set Object Counting
Shuai Liu
School of Software Engineering, Xi'an Jiaotong University
Peng Zhang
School of Software Engineering, Xi'an Jiaotong University
Shiwei Zhang
School of Software Engineering, Xi'an Jiaotong University
Wei Ke
School of Software Engineering, Xi'an Jiaotong University
Abstract
Open-set counting is garnering increasing attention due to its capability to enumerate objects of arbitrary category. It can be generally categorized into two methodologies: text-guided zero-shot counting methods and exemplar-guided few-shot counting methods. Previous text-guided zero-shot methods only provide limited object information through text, resulting in poor performance. Besides, though exemplar-guided few-shot approaches gain better results, they rely heavily on manually annotated visual exemplars, resulting in low efficiency and high labor intensity. Therefore, we propose CountSE, which simultaneously achieves high efficiency and high performance. CountSE is a new text-guided zero-shot object counting algorithm that generates multiple precise soft exemplars at different scales to enhance counting models driven solely by semantics. Specifically, to obtain richer object information and address the diversity in object scales, we introduce Semantic-guided Exemplar Selection, a module that generates candidate soft exemplars at various scales and selects those with high similarity scores. Then, to ensure accuracy and representativeness, Clustering-based Exemplar Filtering is introduced to refine the candidate exemplars by effectively eliminating inaccurate exemplars through clustering analysis. In the text-guided zero-shot setting, CountSE outperforms all state-of-the-art methods on the FSC-147 benchmark by at least 15%. Additionally, experiments on two other widely used datasets demonstrate that CountSE significantly outperforms all previous text-guided zero-shot counting methods and is competitive with the most advanced exemplar-guided few-shot methods. Code is available at https://github.com/pppppz22/CountSE.
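Semantic-guided Exemplar Selection is described as scoring scale-varied candidates against the text semantics and keeping the high-similarity ones; the hedged sketch below shows one plausible form of that ranking step using cosine similarity. The shapes, the top-k rule, and all names are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch: rank candidate exemplar features against a class text embedding
# by cosine similarity and keep the top-k. Illustrative assumptions throughout.
import torch
import torch.nn.functional as F

def select_soft_exemplars(cand_feats: torch.Tensor, text_emb: torch.Tensor, k: int = 3):
    """cand_feats: (N, D) candidate exemplar features; text_emb: (D,) class embedding."""
    sims = F.cosine_similarity(cand_feats, text_emb.unsqueeze(0), dim=-1)  # (N,)
    topk = sims.topk(k)
    return topk.indices, topk.values

cands = torch.randn(20, 256)   # candidate exemplar features at mixed scales (toy)
text = torch.randn(256)        # class text embedding (toy)
idx, scores = select_soft_exemplars(cands, text)
print(idx.tolist(), [round(s, 3) for s in scores.tolist()])
```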
Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
Enyu Liu
Huazhong University of Science and Technology
En Yu
Huazhong University of Science and Technology
Sijia Chen
Huazhong University of Science and Technology
Wenbing Tao
Huazhong University of Science and Technology
Abstract
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
E-NeMF: Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes
Yan Liu
College of Computer Science and Technology, Zhejiang University
Zehao Chen
College of Computer Science and Technology, Zhejiang University
Haojie Yan
College of Computer Science and Technology, Zhejiang University
De Ma
College of Computer Science and Technology, Zhejiang University
Huajin Tang
College of Computer Science and Technology, Zhejiang University
Qian Zheng
College of Computer Science and Technology, Zhejiang University
Gang Pan
College of Computer Science and Technology, Zhejiang University
Abstract
Synthesizing novel space-time views from a monocular video is a highly ill-posed problem, and its effectiveness relies on accurately reconstructing motion and appearance of the dynamic scene. Frame-based methods for novel space-time view synthesis in dynamic scenes rely on simplistic motion assumptions due to the absence of inter-frame cues, which makes them fail under complex motion. Event cameras capture inter-frame cues with high temporal resolution, giving them promising potential to handle complex motion. However, this remains difficult due to event noise and sparsity. To mitigate the impact caused by event noise and sparsity, we propose E-NeMF, which alleviates the impact of event noise with Parametric Motion Representation and mitigates the event sparsity with Flow Prediction Module. Experiments on multiple real-world datasets demonstrate our superior performance in handling complex motion. Codes will be released at https://github.com/zjubmi-lab/E-NeMF.
Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
Ruyang Liu
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Shangkun Sun
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Haoran Tang
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Wei Gao
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Ge Li
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Abstract
Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the 'key' is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, which first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
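As a rough illustration of motion-based token pruning of this kind, the sketch below drops patch tokens whose average optical-flow magnitude is low, keeping only motion-rich regions. The patch size, threshold, and row-major token layout are assumptions, not the paper's exact rule.

```python
# Hedged sketch: prune visual tokens by patch-level optical-flow magnitude.
import torch

def prune_tokens_by_flow(tokens: torch.Tensor, flow: torch.Tensor,
                         patch: int = 14, keep_thresh: float = 1.0):
    """tokens: (N, D), one token per patch; flow: (H, W, 2) optical flow for the frame."""
    mag = flow.norm(dim=-1)                                              # (H, W)
    # Average flow magnitude per patch, flattened in row-major token order.
    patch_mag = mag.unfold(0, patch, patch).unfold(1, patch, patch).mean(dim=(-1, -2))
    keep = patch_mag.reshape(-1) > keep_thresh                           # (N,)
    return tokens[keep], keep

tokens = torch.randn(64, 768)            # 8x8 grid of patch tokens (toy)
flow = torch.randn(112, 112, 2)          # toy flow for a 112x112 frame, patch = 14
kept, mask = prune_tokens_by_flow(tokens, flow)
print(kept.shape, int(mask.sum()))
```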
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Tianqi Liu
School of AIA, Huazhong University of Science and Technology
Zihao Huang
School of AIA, Huazhong University of Science and Technology
Zhaoxi Chen
S-Lab, Nanyang Technological University
Guangcong Wang
Great Bay University
Shoukang Hu
School of AIA, Huazhong University of Science and Technology
Liao Shen
School of AIA, Huazhong University of Science and Technology
Huiqiang Sun
School of AIA, Huazhong University of Science and Technology
Zhiguo Cao
School of AIA, Huazhong University of Science and Technology
Wei Li
S-Lab, Nanyang Technological University
Ziwei Liu
S-Lab, Nanyang Technological University
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
Jiaming Liu
School of Computer Science, Shanghai Jiao Tong University
Linghe Kong
School of Computer Science, Shanghai Jiao Tong University
Guihai Chen
School of Computer Science, Shanghai Jiao Tong University
Abstract
Segment anything model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-DSA, which performs COD for RGB-D inputs via Dual Stream Adapters. While keeping the SAM architecture intact, dual stream adapters are expanded on the image encoder to learn potential complementary information from RGB images and depth images, and fine-tune the mask decoder and its depth-aware replica to perform dual-stream mask prediction. In practice, the dual stream adapters are embedded into the attention block of the image encoder in a parallel manner to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual stream embeddings that do not directly interact with each other, we augment the association of dual stream embeddings using bidirectional knowledge distillation including a model distiller and a modal distiller. In addition, to predict the masks for RGB and depth attention maps, we integrate the two types of image embeddings which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to synchronize the consistency of image embeddings and prompt embeddings. Experimental results on four COD benchmarks show that our SAM-DSA achieves excellent detection performance gains over SAM and achieves state-of-the-art results with a given fine-tuning paradigm.
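The adapters are attached to the frozen image encoder in a parallel manner; the generic sketch below shows the usual pattern of a bottleneck adapter running in parallel with a frozen transformer block. The bottleneck width, scaling factor, and wrapper class are illustrative assumptions rather than the SAM-DSA design.

```python
# Generic sketch of a bottleneck adapter attached in parallel to a frozen block.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.5):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.scale = scale

    def forward(self, x):
        return self.scale * self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Frozen encoder block plus a trainable adapter applied in parallel."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)          # keep the backbone weights intact
        self.adapter = ParallelAdapter(dim)

    def forward(self, x):
        return self.block(x) + self.adapter(x)

block = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
adapted = AdaptedBlock(block, dim=256)
tokens = torch.randn(2, 196, 256)            # (batch, tokens, dim) toy input
print(adapted(tokens).shape)
```

Only the adapter parameters receive gradients, which is what keeps the pre-trained encoder intact while still letting depth-specific cues be injected.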
Learning Efficient and Generalizable Human Representation with Human Gaussian Model
Yifan Liu
Tsinghua University
Shengjun Zhang
Tsinghua University
Chensheng Dai
Tsinghua University
Yang Chen
Nanyang Technological University
Hao Liu
WeChat Vision, Tencent Inc.
Chen Li
WeChat Vision, Tencent Inc.
Yueqi Duan
Tsinghua University
Abstract
Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict Gaussians for each frame independently, without fully capturing the relations of Gaussians from different timestamps. To address this, we propose Human Gaussian Graph to model the connection between predicted Gaussians and human SMPL mesh, so that we can leverage information from all frames to recover an animatable human representation. Specifically, the Human Gaussian Graph contains dual layers where Gaussians are the first layer nodes and mesh vertices serve as the second layer nodes. Based on this structure, we further propose the intra-node operation to aggregate various Gaussians connected to one mesh vertex, and inter-node operation to support message passing among mesh node neighbors. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.
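The intra-node operation aggregates the Gaussians connected to a mesh vertex; a minimal sketch of that style of aggregation, using a nearest-vertex assignment and a scatter-mean, is shown below. The nearest-neighbour connectivity, feature sizes, and toy data are assumptions, not the paper's graph construction.

```python
# Minimal sketch: attach each Gaussian to its nearest mesh vertex and average
# the attached Gaussian features per vertex (scatter-mean). Illustrative only.
import torch

def aggregate_to_vertices(gauss_xyz, gauss_feat, vert_xyz):
    """gauss_xyz: (G, 3), gauss_feat: (G, D), vert_xyz: (V, 3) -> (V, D)."""
    nearest = torch.cdist(gauss_xyz, vert_xyz).argmin(dim=-1)        # (G,)
    V, D = vert_xyz.shape[0], gauss_feat.shape[1]
    summed = torch.zeros(V, D).index_add_(0, nearest, gauss_feat)
    counts = torch.zeros(V).index_add_(0, nearest, torch.ones(len(nearest)))
    return summed / counts.clamp(min=1).unsqueeze(-1)

gauss_xyz, gauss_feat = torch.rand(5000, 3), torch.randn(5000, 32)   # toy Gaussians
vert_xyz = torch.rand(6890, 3)                                       # toy vertex set
vert_feat = aggregate_to_vertices(gauss_xyz, gauss_feat, vert_xyz)
print(vert_feat.shape)
```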
MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
Zhixuan Liu
Carnegie Mellon University
Haokun Zhu
Carnegie Mellon University
Rui Chen
Carnegie Mellon University
Jonathan Francis
Carnegie Mellon University
Soonmin Hwang
Hanyang University
Ji Zhang
Carnegie Mellon University
Jean Oh
Carnegie Mellon University
Abstract
We introduce a diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a multi-channel inference-time optimization that avoids error accumulation common in sequential or single-room constraints in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during the denoising process when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Resources and code are at https://mosaic-cmubig.github.io.
MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
Xinhang Liu
School of Electronics and Information, Northwestern Polytechnical University
Jiawei Shi
School of Electronics and Information, Northwestern Polytechnical University
Zheng Dang
CVLab, EPFL, Switzerland
Yuchao Dai
School of Electronics and Information, Northwestern Polytechnical University
Abstract
We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have large network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thus decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Despite using fewer reference images, experiments on seven core datasets in the BOP challenge show that our method achieves comparable results with other methods that require more reference images and larger network parameters.
Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
Jingyu Liu
Renmin University of China
Zijie Xin
Renmin University of China
Yuhan Fu
Renmin University of China
Ruixiang Zhao
Renmin University of China
Bangxiang Lan
Renmin University of China
Xirong Li
Renmin University of China
Abstract
Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current methods for sketch animation perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we identify two major challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch, which is based on iterative optimization through Score Distillation Sampling (SDS) and thus animates a multi-object sketch in a training-data-free manner. To tackle the two challenges with a divide-and-conquer strategy, MoSketch has four novel modules, i.e., LLM-based scene decomposition, LLM-based motion planning, multi-grained motion refinement, and compositional SDS. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications.
OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
Shiyong Liu
Huawei Noah's Ark Lab
Xiao Tang
Huawei Noah's Ark Lab
Zhihao Li
Huawei Noah's Ark Lab
Yingfan He
The Chinese University of Hong Kong (Shenzhen)
Chongjie Ye
The Chinese University of Hong Kong (Shenzhen)
Jianzhuang Liu
Shenzhen Institutes of Advanced Technology
Binxiao Huang
The University of Hong Kong
Shunbo Zhou
Huawei Embodied Intelligence Lab
Xiaofei Wu
Huawei Noah's Ark Lab
Abstract
In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speed compared to existing state-of-the-art approaches. Project page: https://occlugaussian.github.io.
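The region-based rendering culls Gaussians that are invisible to the region containing the current viewpoint; the coarse sketch below precomputes a per-region visibility mask from the training cameras. For simplicity the cameras are clustered by position only and the visibility matrix is toy data, both simplifications of the position-plus-co-visibility criterion described above.

```python
# Coarse sketch of region-based culling: cluster training cameras into regions,
# then keep, per region, only Gaussians seen by at least one of its cameras.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cam_pos = rng.uniform(size=(120, 3))               # training camera positions (toy)
visible = rng.random((120, 50000)) < 0.05          # (cameras, Gaussians) visibility (toy)

regions = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(cam_pos)
region_masks = {r: visible[regions == r].any(axis=0) for r in range(4)}
centers = np.stack([cam_pos[regions == r].mean(axis=0) for r in range(4)])

def gaussians_for_view(view_pos: np.ndarray) -> np.ndarray:
    """Return indices of Gaussians to render for the region containing view_pos."""
    r = int(np.argmin(np.linalg.norm(centers - view_pos, axis=1)))
    return np.flatnonzero(region_masks[r])

print(len(gaussians_for_view(np.array([0.5, 0.5, 0.5]))))
```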
Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization
Wang Liu
Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University
Wei Gao
Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Peng Cheng Laboratory
Abstract
Information quantization has been widely adopted in multimedia content, such as images, videos, and point clouds. The goal of information quantization is to achieve efficient storage and transmission by reducing data precision or redundancy. However, the information distortion caused by quantization will lead to the degradation of signal fidelity and the performance of downstream tasks. This paper focuses on the geometry quantization distortion of point clouds and proposes a unified learning-based quality enhancement framework for omni-scene point clouds. Based on the characteristics of geometry quantization distortion, we analyze and find that existing upsampling methods are not competitive in dealing with point reduction and geometry displacement simultaneously caused by coordinate quantization. Therefore, we design a general rooting-growing-pruning paradigm to efficiently perceive the geometry feature of quantized point clouds and improve the quality significantly. In addition, a novel loss constraint term related to the quantization step parameter is proposed to further improve quality and accelerate model convergence. To the best of our knowledge, this is the first unified quality enhancement framework for object and scene point clouds with coordinate quantization. Extensive experiments verify the superiority of the proposed method on multi-scale point clouds with different levels of quantization distortion, including object (ModelNet40, 8iVFB) and scene (KITTI). In particular, the enhanced point clouds improve the performance of downstream analysis tasks, including classification and 3D object detection.
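The coordinate quantization studied above has the two effects the abstract names: point reduction (nearby points collapse onto the same grid cell) and geometry displacement (surviving points are shifted by up to half a step per axis). The small sketch below reproduces both effects on a toy cloud; the step values are arbitrary.

```python
# Small sketch of coordinate quantization: snapping point coordinates to a grid
# of step `q` both removes duplicate points and displaces the remaining ones.
import numpy as np

def quantize_coords(points: np.ndarray, step: float) -> np.ndarray:
    """points: (N, 3) float coordinates -> deduplicated quantized cloud."""
    snapped = np.round(points / step) * step
    return np.unique(snapped, axis=0)

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(10000, 3))
for q in (0.01, 0.05, 0.1):
    quant = quantize_coords(cloud, q)
    disp = np.abs(np.round(cloud / q) * q - cloud).max()
    print(f"step={q}: {len(quant)} points kept, max displacement {disp:.4f}")
```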
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu
NVIDIA
Mikaela Angelina Uy
NVIDIA
Donglai Xiang
NVIDIA
Hao Su
UCSD
Sanja Fidler
NVIDIA; University of Toronto; Vector Institute
Nicholas Sharp
NVIDIA
Jun Gao
NVIDIA; University of Toronto; Vector Institute
Abstract
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! https://research.nvidia.com/labs/toronto-ai/partfield-release/
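The abstract notes that the learned feature field can be clustered into a hierarchical part decomposition; the sketch below shows that post-processing step on toy per-point features, cutting an agglomerative hierarchy at several granularities. The clustering algorithm, feature shapes, and cluster counts are assumptions, not PartField's exact procedure.

```python
# Hedged sketch: cluster a per-point feature field into parts at several
# granularities to obtain a coarse-to-fine decomposition. Toy features only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(4096, 128))     # per-point features from the field (toy)

for n_parts in (2, 4, 8):                      # cut the hierarchy at different levels
    labels = AgglomerativeClustering(n_clusters=n_parts).fit_predict(point_feats)
    print(n_parts, np.bincount(labels).tolist())
```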
PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
Longliang Liu
Huazhong University of Science and Technology
Miaojie Feng
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Jijun Xiang
Huazhong University of Science and Technology
Xuan Zhu
Huazhong University of Science and Technology
Xin Yang
Optics Valley Laboratory
Abstract
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features of the primitive branch, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: https://github.com/longliangLiu/PriOr-Flow.
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Yueh-Cheng Liu
Technical University of Munich
Lukas Höllein
Technical University of Munich
Matthias Nießner
Technical University of Munich
Angela Dai
Technical University of Munich
Abstract
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for heuristics in densification. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by 48% in comparison to state-of-the-art methods.
SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
Xiangzeng Liu
Xidian University
Chi Wang
Xidian University
Guanglu Shi
Xidian University
Xiaodong Zhang
Xidian University
Qiguang Miao
Xidian University
Miao Fan
Navinfo Europe B.V
Abstract
Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5◦ in indoor pose estimation, establishing a new state-of-the-art.
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Zhenyang Liu
Fudan University
Yikai Wang
Nanyang Technological University
Kuanning Wang
Fudan University
Longfei Liang
NeuHelium Co., Ltd
Xiangyang Xue
Fudan University
Yanwei Fu
Fudan University
Abstract
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real-world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion-based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single-view RGB-D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real-world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real-world robotic task success rate by 8.6%.
TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset
Chang Liu
ADLab, Tencent
Mingxu Zhu
ADLab, Tencent
Zheyuan Zhang
ADLab, Tencent
Linna Song
ADLab, Tencent
Xiao Zhao
ADLab, Tencent
Qingliang Luo
ADLab, Tencent
Qi Wang
ADLab, Tencent
Chufan Guo
ADLab, Tencent
Kuifeng Su
ADLab, Tencent
Abstract
End-to-end autonomous driving technology has recently become a focal point of research and application in autonomous driving. State-of-the-art (SOTA) methods are often trained and evaluated on the nuScenes dataset. However, the nuScenes dataset, introduced in 2019 for 3D perception tasks, faces several limitations, such as insufficient scale, simple scenes, and homogeneous driving behaviors, that restrict the upper-bound development of end-to-end autonomous driving algorithms. In light of these issues, we propose a novel, large-scale real-world dataset specifically designed for end-to-end autonomous driving tasks, named TAD-E2E, which is 25x larger than nuScenes, has 1.7x its scene complexity, and features a highly diverse range of driving behaviors. We replicated SOTA methods on the TAD-E2E dataset and observed that these methods no longer performed well, as expected. Additionally, in response to the challenging scenarios presented in the TAD-E2E dataset, we devised a multimodal sparse end-to-end method that significantly outperforms SOTA methods. Ablation studies demonstrate the effectiveness of our method, and we analyze the contributions of each module. The dataset will be released in the near future.
Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers
An-Lun Liu
National Yang Ming Chiao Tung University
Yu-Wei Chao
NVIDIA
Yi-Ting Chen
National Yang Ming Chiao Tung University
Abstract
In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/
Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge
Linshen Liu
Johns Hopkins University
Boyan Su
Johns Hopkins University
Junyue Jiang
Johns Hopkins University
Guanlin Wu
Johns Hopkins University
Cong Guo
Duke University
Ceyu Xu
HKUST
Hao Frank Yang
Johns Hopkins University
Abstract
This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike existing works, EMC2 introduces a novel scenario-aware MoE architecture optimized for fusing complementary sparse 3D point clouds and dense 2D images to achieve robust multimodal representations for detection. Furthermore, EMC2 integrates an adaptive multimodal data bridge with multi-scale region proposing and scenario-aware routing, dynamically dispatching features to complementary experts based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.
Underwater Visual SLAM with Depth Uncertainty and Medium Modeling
Rui Liu
ReLER, CCAI, Zhejiang University
Sheng Fan
ReLER, CCAI, Zhejiang University
Wenguan Wang
ReLER, CCAI, Zhejiang University
Yi Yang
ReLER, CCAI, Zhejiang University
Abstract
Underwater visual simultaneous localization and mapping (SLAM) faces critical challenges in light attenuation and degraded geometric consistency. Despite recent advances of visual SLAM in indoor and urban scenes, these approaches typically assume a clear medium and neglect medium-light interactions, leading to performance degradation in underwater environments. To overcome these limitations, we propose DUV-SLAM, a dense underwater visual SLAM framework that integrates uncertainty-aware geometry estimation with physics-inspired neural scattering modeling. Our method introduces two core innovations: i) depth uncertainty quantification derived from differentiable bundle adjustment, which propagates geometric confidence to guide mapping optimization; and ii) a neural-Gaussian hybrid representation that combines adaptive 3D Gaussians for underwater reconstruction with a neural field capturing wavelength-dependent medium properties, optimized using a combination of photometric, geometric, and distribution losses. Experiments on synthetic and real-world datasets demonstrate that DUV-SLAM achieves high-quality monocular reconstruction while maintaining real-time efficiency and robust tracking accuracy.
Unified Open-World Segmentation with Multi-Modal Prompts
Yang Liu
Zhejiang University
Yufei Yin
Hangzhou Dianzi University
Chenchen Jing
Zhejiang University of Technology
Muzhi Zhu
Zhejiang University
Hao Chen
Zhejiang University
Yuling Xi
Zhejiang University
Bo Feng
Apple
Hao Wang
Apple
Shiyu Li
Apple
Chunhua Shen
Zhejiang University
Abstract
In this work, we present COSINE, a unified open-world segmentation model that Consolidates Open-vocabulary Segmentation and IN-context sEgmentation with multimodal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches. Our code is released at https://github.com/aim-uofa/COSINE.
Video Motion Graphs
Haiyang Liu
The University of Tokyo
Zhan Xu
Adobe Research
Fa-Ting Hong
Adobe Research
Hsin-Ping Huang
Adobe Research
Yi Zhou
Adobe Research
Yang Zhou
Adobe Research
Abstract
We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectories. Results show that Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found here.
When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
Pan Liu
Central South University
Jinshi Liu
Central South University
Abstract
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with the network's tendency toward overconfidence, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, directly discarding low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state-of-the-art methods. Code and model weights are available at: https://github.com/PanLiuCSU/CSL.
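As a rough, hypothetical illustration of sample-specific pseudo-label selection in the spirit described above (the paper formulates this as a convex optimization, which is not reproduced here), the sketch below splits each image's pixel-confidence distribution with a tiny 1-D two-cluster k-means and then randomly masks part of the reliable set; the clustering choice and mask ratio are assumptions.

```python
import numpy as np

def select_pseudo_labels(conf: np.ndarray, mask_ratio: float = 0.3, iters: int = 20, seed: int = 0):
    """conf: (H, W) per-pixel confidence of the predicted class.
    Returns boolean maps of reliable pixels and of reliable pixels that are
    kept after random masking (masked pixels are left for context learning)."""
    c = conf.ravel()
    lo, hi = c.min(), c.max()                      # initialise two 1-D cluster centres
    for _ in range(iters):                         # tiny k-means on confidences
        assign = np.abs(c - lo) < np.abs(c - hi)
        if not assign.any() or assign.all():       # degenerate split, stop early
            break
        lo, hi = c[assign].mean(), c[~assign].mean()
    boundary = (lo + hi) / 2.0                     # per-image decision boundary
    reliable = conf >= boundary
    rng = np.random.default_rng(seed)
    keep = reliable & (rng.uniform(size=conf.shape) > mask_ratio)
    return reliable, keep

conf = np.random.default_rng(1).beta(5, 2, size=(64, 64))
reliable, kept = select_pseudo_labels(conf)
print(reliable.mean(), kept.mean())
```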
mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
Bingyi Liu
Wuhan University Of Technology
Jian Teng
Wuhan University Of Technology
Hongfei Xue
University of North Carolina at Charlotte
Enshu Wang
Wuhan University
Chuanhui Zhu
Wuhan University Of Technology
Pu Wang
University of North Carolina at Charlotte
Libing Wu
Wuhan University
Abstract
Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from being transmitted and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.
PseudoMapTrainer: Learning Online Mapping without HD Maps
Christian Löwens
Bosch Research
Thorben Funke
Bosch Research
Jingchao Xie
Bosch Research; Technical University of Munich
Alexandru Paul Condurache
Automated Driving, Bosch; University of Lübeck
Abstract
Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pretrain an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.
HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Jiaxin Lu
University of Texas at Austin
Chun-Hao Paul Huang
Adobe Research
Uttaran Bhattacharya
Adobe Research
Qixing Huang
University of Texas at Austin
Yi Zhou
Adobe Research
Abstract
Figure 1. Overview of the HUMOTO dataset: mocap 4D human-object interaction animations with multiple objects, featuring detailed and accurate interaction modeling (in particular hand poses), objects precisely modeled by artists, and human text annotations at different levels of abstraction.
We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 735 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains, with practical applications in animation, robotics, and embodied AI systems. (The work was mainly conducted at Adobe Research.) Project Page: https://jiaxin-lu.github.io/humoto/.
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Yifan Lu
NVIDIA
Xuanchi Ren
NVIDIA
Jiawei Yang
University of Southern California
Tianchang Shen
NVIDIA
Zhangjie Wu
NVIDIA
Jun Gao
NVIDIA
Yue Wang
University of Southern California
Siheng Chen
Shanghai Jiao Tong University
Mike Chen
NVIDIA
Sanja Fidler
NVIDIA
Jiahui Huang
NVIDIA
Abstract
We present InfiniCube, a scalable and controllable method to generate unbounded and dynamic 3D driving scenes with high fidelity. Previous methods for scene generation are constrained either by their applicability to indoor scenes or by their lack of controllability. In contrast, we take advantage of recent advances in 3D and video generative models to achieve large dynamic scene generation with flexible controls like HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned 3D voxel generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of pixel-aligned guidance buffers, synthesizing a consistent appearance on long-video generation for large-scale scenes. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift videos to dynamic 3D Gaussians with controllable objects. Our method generates realistic and dynamic 3D driving scenes, and extensive experiments validate the effectiveness of our model design.
Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
Jiaxin Lu
University of Texas at Austin
Gang Hua
Amazon
Qixing Huang
University of Texas at Austin
Abstract
The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstructing complete shapes for the reassembly problem. Existing approaches focus primarily on piecewise information for both part and fracture assembly, often overlooking the integration of a complete-object prior. Jigsaw++ distinguishes itself by learning a shape prior of complete objects. It employs the proposed 'retargeting' strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to current methods. Through extensive evaluations on the Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.
ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving
Yuhang Lu
ShanghaiTech University
Jiadong Tu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Xinge Zhu
The Chinese University of Hong Kong
Abstract
End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: 4dvlab.github.io/project page/realad
Serialization based Point Cloud Oversegmentation
Chenghui Lu
Huaqiao University
Jianlong Kwan
Huaqiao University
Dilong Li
Huaqiao University
Ziyi Chen
Huaqiao University
Haiyan Guan
Nanjing University of Information Science and Technology
Abstract
Point cloud oversegmentation, as a fundamental preprocessing step for 3D understanding, is a challenging task due to its spatial proximity and semantic similarity requirements. Most existing works struggle to efficiently group semantically consistent points into superpoints while maintaining spatial proximity. In this paper, we propose a novel serialization-based point cloud oversegmentation method, which leverages serialization to avoid complex spatial queries, directly accessing neighboring points through sequence locality for similarity matching and superpoint clustering. Specifically, we first serialize point clouds onto a Hilbert curve and partition them into spatially continuous initial segments. Then, to guarantee the internal semantic consistency of superpoints, we design an adaptive update algorithm that clusters superpoints by matching feature similarities between neighboring segments and refines segment features via Cross-Attention. Experiments on large-scale indoor and outdoor datasets demonstrate state-of-the-art performance in point cloud oversegmentation. Moreover, the method is also adaptable to semantic segmentation and achieves promising performance. The code is available at https://github.com/CHL-glitch/SPCNet.
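The serialization idea can be sketched as follows, using a Morton (Z-order) code as a simpler stand-in for the Hilbert curve used in the paper; the quantization resolution and segment size are illustrative.

```python
import numpy as np

def morton_code(xyz_q: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of quantized x, y, z coordinates (Z-order curve)."""
    codes = np.zeros(len(xyz_q), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((xyz_q[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def serialize_and_segment(points: np.ndarray, bits: int = 10, seg_size: int = 64):
    """Order points along a space-filling curve and cut the sequence into
    spatially continuous initial segments of roughly `seg_size` points."""
    mins, maxs = points.min(0), points.max(0)
    q = ((points - mins) / (maxs - mins + 1e-9) * (2 ** bits - 1)).astype(np.int64)
    order = np.argsort(morton_code(q, bits))            # serialization step
    segments = [order[i:i + seg_size] for i in range(0, len(order), seg_size)]
    return order, segments

pts = np.random.default_rng(0).uniform(size=(1000, 3))
order, segs = serialize_and_segment(pts)
print(len(segs), "initial segments")
```

Neighbouring segments in this sequence are also spatial neighbours, which is what allows similarity matching and clustering to proceed without explicit spatial queries.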
VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
Haoang Lu
Xi'an Jiaotong University
Yuanqi Su
Xi'an Jiaotong University
Xiaoning Zhang
unknown
Longjun Gao
unknown
Yu Xue
unknown
Le Wang
unknown
Abstract
This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.
monoVLN: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation
Renjie Lu
Sun Yat-sen University
Yu Zhou
Sun Yat-sen University
Hao Cheng
Hunan University
Jingke Meng
Sun Yat-sen University
Wei-Shi Zheng
Sun Yat-sen University
Abstract
Vision and Language Navigation (VLN) requires agents to navigate 3D environments by following natural language instructions. While existing methods predominantly assume access to panoramic observations, many practical robots are equipped with monocular RGB-D cameras, creating a significant configuration disparity. In this work, we address this critical gap by developing a novel 3DGS-based framework for monocular VLN agents, focusing on the intrinsic information-incompleteness challenge. Our approach incorporates two key innovations: (1) an implicit partial completion module for inferring representations of missing regions in incompletely rendered panoramic feature maps, and (2) an uncertainty-aware active perception strategy that enables the agent to actively acquire visual observations when uncertain about its decision. Extensive experiments on the R2R-CE and RxR-CE datasets demonstrate that our monoVLN outperforms all existing monocular methods, improving the success rate on R2R-CE by 8% over previous monocular methods. We also validate monoVLN in real-world environments, providing a practical solution for real-world VLN.
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
Rundong Luo
Cornell University
Matthew Wallingford
University of Washington
Ali Fahardi
University of Washington
Noah Snavely
Cornell University
Wei-Chiu Ma
Cornell University
Abstract
360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the 'tunnel vision' of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from in-the-wild perspective videos. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.
Gradient Decomposition and Alignment for Incremental Object Detection
Wenlong Luo
Northwestern Polytechnical University
Shizhou Zhang
Northwestern Polytechnical University
De Cheng
Xidian University
Yinghui Xing
Northwestern Polytechnical University
Guoqiang Liang
Northwestern Polytechnical University
Peng Wang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Incremental object detection (IOD) is crucial for enabling AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories, allowing the model to adapt to dynamic environments without forgetting prior information. Existing IOD methods primarily employ knowledge distillation to mitigate catastrophic forgetting, yet these approaches overlook class overlap issues, often resulting in suboptimal performance. In this paper, we propose a novel framework for IOD that leverages a decoupled gradient alignment technique on top of the specially proposed pseudo-labeling strategy. Our method employs a Gaussian Mixture Model to accurately estimate pseudo-labels of previously learned objects in current training images, effectively functioning as a knowledge-replay mechanism. This strategy reinforces prior knowledge retention and prevents the misclassification of unannotated foreground objects from earlier classes as background. Furthermore, we introduce an adaptive gradient decomposition and alignment method to maintain model stability while facilitating positive knowledge transfer. By aligning gradients from both old and new classes, our approach preserves previously learned knowledge while enhancing plasticity for new tasks. Extensive experiments on two IOD benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance compared to state-of-the-art methods. The code and datasets are available at https://github.com/FHR-L/GDA-IOD.
MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling
Guan Luo
Tsinghua University
Jianfeng Zhang
ByteDance Seed
Abstract
High-quality textured mesh reconstruction from sparse-view images remains a fundamental challenge in computer graphics and computer vision. Traditional large reconstruction models operate in a single-scale manner, forcing the models to simultaneously capture global structure and local details, often resulting in compromised reconstructed shapes. In this work, we propose MS3D, a novel multi-scale 3D reconstruction framework. At its core, our method introduces a hierarchical structured latent representation for multi-scale modeling, coupled with a multi-scale feature extraction and integration mechanism. This enables progressive reconstruction, effectively decomposing the complex task of detailed geometry reconstruction into a sequence of easier steps. This coarse-to-fine approach effectively captures multi-frequency details, learns complex geometric patterns, and generalizes well across diverse objects while preserving fine-grained details. Extensive experiments demonstrate MS3D outperforms state-of-the-art methods and is broadly applicable to both image- and text-to-3D generation. The entire pipeline reconstructs high-quality textured meshes in under five seconds.
Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Katie Z Luo
Cornell University
Minh-Quan Dao
Inria
Zhenzhen Liu
Cornell University
Mark Campbell
Cornell University
Wei-Lun Chao
The Ohio State University
Kilian Q Weinberger
Cornell University
Ezio Malis
Inria
Vincent Frémont
École Centrale de Nantes
Bharath Hariharan
Cornell University
Mao Shan
University of Sydney
Stewart Worrall
University of Sydney
Julie Stephany Berrio Perez
University of Sydney
Abstract
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. Mixed Signals is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. We hope our work advances research in the emerging, impactful field of V2X perception. Dataset details at https://mixedsignalsdataset.cs.cornell.edu/.
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Jiangran Lyu
Peking University
Ziming Li
Peking University
Xuesong Shi
unknown
Chaoyi Xu
unknown
Yizhou Wang
Peking University
He Wang
Peking University
Abstract
Non-prehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they heavily rely on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% using only single-view point cloud observations in simulation. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.
ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
Yanzhe Lyu
University of Science and Technology of China
Kai Cheng
unknown
Xin Kang
unknown
Xuejin Chen
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Abstract
Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view synthesis, achieving high fidelity and efficiency. However, it often struggles to capture rich details and complete geometry. Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. Our approach is capable of adaptively retrieving details and complementing missing geometry. To further support this method, we propose a pipeline named ResGS. Specifically, we integrate a Gaussian image pyramid for progressive supervision and implement a selection scheme that prioritizes the densification of coarse Gaussians over time. Extensive experiments demonstrate that our method achieves SOTA rendering quality. Consistent performance improvements can be achieved by applying our residual split to various 3D-GS variants, underscoring its versatility and potential for broader adoption in 3D-GS-based applications. Project page: https://yanzhelyu.github.io/resgs.github.io/.
BezierGS: Dynamic Urban Scene Reconstruction with Bezier Curve Gaussian Splatting
Zipei Ma
Fudan University
Junzhe Jiang
Fudan University
Yurui Chen
Fudan University
Li Zhang
Fudan University
Abstract
The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene component reconstruction and novel view synthesis.
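The underlying trajectory representation, a Bézier curve evaluated at a normalized timestamp, can be written down directly; the cubic degree and control points below are toy values, not learned parameters from the paper.

```python
import numpy as np
from math import comb

def bezier_point(control_points: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a Bézier curve of arbitrary degree at normalized time t in [0, 1].
    control_points: (n+1, 3) array; in a BézierGS-style model these would be learnable."""
    n = len(control_points) - 1
    basis = np.array([comb(n, i) * (1 - t) ** (n - i) * t ** i for i in range(n + 1)])
    return basis @ control_points   # (3,) position at time t

# A toy cubic trajectory for one dynamic object, sampled at five timestamps.
ctrl = np.array([[0.0, 0.0, 0.0],
                 [2.0, 1.0, 0.0],
                 [4.0, 1.0, 0.5],
                 [6.0, 0.0, 0.5]])
trajectory = np.stack([bezier_point(ctrl, t) for t in np.linspace(0.0, 1.0, 5)])
print(trajectory)
```

Because the curve is differentiable in its control points, pose errors can in principle be corrected by gradient descent on a rendering loss, which is the property the abstract relies on.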
DCHM: Depth-Consistent Human Modeling for Multiview Detection
Jiahao Ma
Australian National University
Tianyu Wang
unknown
Miaomiao Liu
unknown
David Ahmedt-Aristizabal
unknown
Chuong Nguyen
unknown
Abstract
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the project page.
Find Any Part in 3D
Ziqi Ma
California Institute of Technology
Yisong Yue
California Institute of Technology
Georgia Gkioxari
California Institute of Technology
Abstract
Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1,755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Shijie Ma
ARC Lab, Tencent PCG
Yuying Ge
ARC Lab, Tencent PCG
Teng Wang
ARC Lab, Tencent PCG
Yuxin Guo
ARC Lab, Tencent PCG
Yixiao Ge
ARC Lab, Tencent PCG
Ying Shan
ARC Lab, Tencent PCG
Abstract
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., 6.0% on OpenAI CLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.
InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
Yiyi Ma
Shenzhen International Graduate School, Tsinghua University
Yuanzhi Liang
Institute of Artificial Intelligence, China Telecom
Xiu Li
Shenzhen International Graduate School, Tsinghua University
Chi Zhang
Institute of Artificial Intelligence, China Telecom
Xuelong Li
Institute of Artificial Intelligence, China Telecom
Abstract
We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area. Project website: https://myy888.github.io/InterSyn/
MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting
Shaojie Ma
Zhejiang University
Yawei Luo
Zhejiang University
Wei Yang
Huazhong University of Science and Technology
Yi Yang
Zhejiang University
Abstract
3D reconstruction and simulation, although interrelated, have distinct objectives: reconstruction requires a flexible 3D representation that can adapt to diverse scenes, while simulation needs a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to address this challenge. MaGS constrains 3D Gaussians to roam near the mesh, creating a mutually adsorbed mesh-Gaussian 3D representation. Such a representation harnesses both the rendering flexibility of 3D Gaussians and the structured property of meshes. To achieve this, we introduce RMD-Net, a network that learns motion priors from video data to refine mesh deformations, alongside RGD-Net, which models the relative displacement between the mesh and Gaussians to enhance rendering fidelity under mesh constraints. To generalize to novel, user-defined deformations beyond input video without reliance on temporal data, we propose MPE-Net, which leverages inherent mesh information to bootstrap RMD-Net and RGD-Net. Due to the universality of meshes, MaGS is compatible with various deformation priors such as ARAP, SMPL, and soft physics simulation. Extensive experiments on the D-NeRF, DG-Mesh, and PeopleSnapshot datasets demonstrate that MaGS achieves state-of-the-art performance in both reconstruction and simulation. Project page: https://wcwac.github.io/MaGS-page/
MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion
Yikun Ma
Sun Yat-sen University
Yiqing Li
Sun Yat-sen University
Jiawei Wu
Sun Yat-sen University
Xing Luo
Peng Cheng Laboratory
Zhi Jin
Sun Yat-sen University
Abstract
Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex motions, such as rotation and stretching, and to ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex motion editing among multi-view images. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks. Code is available at https://github.com/MrMa-yikun/MotionDiff.
ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection
Hongchi Ma
Harbin Institute of Technology
Guanglei Yang
Harbin Institute of Technology
Debin Zhao
Harbin Institute of Technology
Yanli Ji
Sun Yat-Sen University
Wangmeng Zuo
Harbin Institute of Technology
Abstract
Industrial visual inspection is crucial for detecting defects in manufactured products, but it traditionally relies on human operators, leading to inefficiencies. Industrial Visual Anomaly Detection (IVAD) has emerged as a promising solution, with methods such as zero-shot, few-shot, and reconstruction-based techniques. However, zero-shot methods struggle with subtle anomalies, and reconstruction-based methods fail to capture fine-grained details. Few-shot methods, which use limited samples and prompts, offer a more efficient approach. Despite their promise, challenges remain in managing intra-class variation among references and in effectively extracting more representative anomaly features. This paper presents Retrieval-enhanced Multi-modal Prompt Fusion Anomaly Detection (ReMP-AD), a framework that introduces Intra-Class Token Retrieval (ICTR) to reduce noise in the memory bank and Vision-Language Prior Fusion (VLPF) to guide the encoder in capturing more distinctive and relevant features of anomalies. Experiments on the VisA and MVTec-AD datasets demonstrate that ReMP-AD outperforms existing methods, achieving 97.8%/94.1% performance in 4-shot anomaly segmentation and classification. Our approach also shows strong results on the PCB-Bank dataset, highlighting its effectiveness in few-shot industrial anomaly detection. Code is available at https://github.com/cshcma/ReMP-AD.git
On the Recovery of Cameras from Fundamental Matrices
Rakshith Madhavan
Politecnico di Milano
Federica Arrigoni
Politecnico di Milano
Abstract
The viewing graph is a compact tool to encode the geometry of multiple views: nodes represent uncalibrated cameras and edges represent fundamental matrices (when available). Most research focuses on theoretical analyses, exploring for which viewing graphs it is possible (in principle) to retrieve cameras from fundamental matrices, in the sense that the problem admits a unique solution for noiseless data. However, the practical task of recovering cameras from noisy fundamental matrices is still open, as available methods are limited to special graphs (such as those covered by triplets). In this paper, we develop the first method that can deal with the recovery of cameras from noisy fundamental matrices in a general viewing graph. Experimental results demonstrate the promise of the proposed approach on a variety of synthetic and real scenarios.
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
Subhajit Maity
University of Central Florida
Ayan Kumar Bhunia
University of Surrey
Subhadeep Koley
University of Surrey
Pinaki Nath Chowdhury
University of Surrey
Aneeshan Sain
University of Surrey
Yi-Zhe Song
University of Surrey
Abstract
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
Connor Malone
Queensland University of Technology
Somayeh Hussaini
Queensland University of Technology
Tobias Fischer
Queensland University of Technology
Michael Milford
Queensland University of Technology
Abstract
Visual Place Recognition (VPR) enables coarse localization by comparing query images to a reference database of geo-tagged images. Recent breakthroughs in deep learning architectures and training regimes have led to methods with improved robustness to factors like environment appearance change, but with the downside that the required training and/or matching compute scales with the number of distinct environmental conditions encountered. Here, we propose Hyperdimensional One Place Signatures (HOPS) to simultaneously improve the performance, compute and scalability of these state-of-the-art approaches by fusing the descriptors from multiple reference sets captured under different conditions. HOPS scales to any number of environmental conditions by leveraging the Hyperdimensional Computing framework. Extensive evaluations demonstrate that our approach is highly generalizable and consistently improves recall performance across all evaluated VPR methods and datasets by large margins. Arbitrarily fusing reference images without compute penalty enables numerous other useful possibilities, three of which we demonstrate here: improved performance with reduced dimensionality descriptors, stacking synthetic images, and coarse localization to an entire traverse or environmental section.
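A minimal sketch of the kind of hyperdimensional fusion the abstract describes: each condition's reference descriptor is mapped into a high-dimensional bipolar space, bound to a per-condition key vector, and bundled by summation into a single signature. The binding/bundling operators, random projection, and dimensionality are generic hyperdimensional-computing choices, not necessarily HOPS's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                   # hyperdimensional signature length

def random_bipolar(dim: int) -> np.ndarray:
    """Random +/-1 key vector, one per environmental condition."""
    return rng.choice([-1.0, 1.0], size=dim)

def project(desc: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map a VPR descriptor into the hyperdimensional space (sign of a random projection)."""
    return np.sign(proj @ desc)

def fuse_conditions(descs_per_condition, proj, condition_keys):
    """Bind each condition's descriptor to its key and bundle them into one signature."""
    sig = np.zeros(D)
    for desc, key in zip(descs_per_condition, condition_keys):
        sig += key * project(desc, proj)   # element-wise binding, additive bundling
    return np.sign(sig)

desc_dim = 512
proj = rng.normal(size=(D, desc_dim))
keys = [random_bipolar(D) for _ in range(3)]           # e.g. day / night / rain
refs = [rng.normal(size=desc_dim) for _ in range(3)]   # one reference descriptor per condition
signature = fuse_conditions(refs, proj, keys)

# Querying: a (noisy) day-time descriptor bound to the day key still correlates
# with the fused signature, because the other bound terms act as noise.
query = np.sign(keys[0] * project(refs[0] + 0.1 * rng.normal(size=desc_dim), proj))
print("similarity:", float(query @ signature) / D)
```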
AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion
Mao Mao
Zhejiang University
Xujie Shen
Zhejiang University
Guyuan Chen
Zhejiang University
Boming Zhao
Zhejiang University
Jiarui Hu
Zhejiang University
Hujun Bao
Zhejiang University
Zhaopeng Cui
Zhejiang University
Abstract
Neural 3D modeling and novel view synthesis with Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) typically require multi-view images with wide baselines and accurate camera poses as input. However, scenarios with accidental camera motions are rarely studied. In this paper, we propose AccidentalGS, the first method for neural 3D modeling and novel view synthesis from accidental camera motions. To achieve this, we present a novel joint optimization framework that considers geometric and photometric errors, using a simplified camera model for stability. We also introduce a novel online adaptive depth-consistency loss to prevent the overfitting of the Gaussian model to input images. Extensive experiments on both synthetic and real-world datasets show that AccidentalGS achieves more accurate camera poses and realistic novel views compared to existing methods, and supports 3D modeling and neural rendering even for the Moon with telescope-like images.
Tree Skeletonization from 3D Point Clouds by Denoising Diffusion
Elias Ariel Marks
University of Bonn
Lucas Nunes
University of Bonn
Federico Magistri
University of Bonn
Matteo Sodano
University of Bonn
Rodrigo Marcuzzi
University of Bonn
Lars Zimmermann
University of Bonn
Jens Behley
University of Bonn
Cyrill Stachniss
University of Bonn
Abstract
The natural world presents complex organic structures, such as tree canopies, that humans can interpret even when only partially visible. Understanding tree structures is key for forest monitoring, orchard management, and automated harvesting applications. However, reconstructing tree topologies from sensor data, called tree skeletonization, remains a challenge for computer vision approaches. Traditional methods for tree skeletonization rely on handcrafted features, regression, or generative models, whereas recent advances focus on deep learning approaches. Existing methods often struggle with occlusions caused by dense foliage, limiting their applicability over the annual vegetation cycle. Furthermore, the lack of real-world data with reference information limits the evaluation of these methods to synthetic datasets, which does not validate generalization to real environments. In this paper, we present a novel approach for tree skeletonization that combines a generative denoising diffusion probabilistic model for predicting node positions and branch directions with a classical minimum spanning tree algorithm to infer tree skeletons from 3D point clouds, even with strong occlusions. Additionally, we provide a dataset of an apple orchard with 280 trees scanned 10 times during the growing season with corresponding reference skeletons, enabling quantitative evaluation. Experiments show the superior performance of our approach on real-world data and competitive results compared to state-of-the-art approaches on synthetic benchmarks.
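The classical final step mentioned in the abstract, extracting a tree skeleton from predicted node positions with a minimum spanning tree, can be sketched with SciPy; connecting candidates through a k-nearest-neighbour graph (and the value of k) is an assumption about one reasonable way to build the input graph.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def skeleton_edges(nodes: np.ndarray, k: int = 8):
    """Connect predicted skeleton nodes with a distance-weighted k-NN graph
    and extract its minimum spanning tree as the tree skeleton."""
    tree = cKDTree(nodes)
    dists, idx = tree.query(nodes, k=k + 1)          # first neighbour is the node itself
    rows = np.repeat(np.arange(len(nodes)), k)
    cols = idx[:, 1:].ravel()
    weights = dists[:, 1:].ravel()
    graph = coo_matrix((weights, (rows, cols)), shape=(len(nodes), len(nodes)))
    mst = minimum_spanning_tree(graph).tocoo()
    return list(zip(mst.row.tolist(), mst.col.tolist()))

nodes = np.random.default_rng(0).normal(size=(50, 3))  # stand-in for diffusion-predicted nodes
edges = skeleton_edges(nodes)
print(len(edges), "skeleton edges")   # N-1 edges if the k-NN graph is connected
```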
LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
Juliette Marrie
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Romain Menegaux
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Michael Arbel
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Diane Larlus
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Julien Mairal
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK
Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object segmentation tasks, highlighting the versatility of our approach.
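A schematic sketch of the graph-diffusion step: per-Gaussian features (e.g., coarse mask scores) are repeatedly averaged over a k-NN graph weighted by feature similarity. The Gaussian-kernel affinity, neighbourhood size, and number of iterations are illustrative placeholders rather than the paper's exact operator.

```python
import numpy as np
from scipy.spatial import cKDTree

def diffuse_features(feats: np.ndarray, sims: np.ndarray, pos: np.ndarray,
                     k: int = 16, steps: int = 10, alpha: float = 0.5):
    """feats: (N, C) features to refine (e.g. coarse mask scores per Gaussian).
    sims:  (N, F) features defining pairwise similarity (e.g. DINOv2-like).
    pos:   (N, 3) Gaussian centres used to build the k-NN graph."""
    idx = cKDTree(pos).query(pos, k=k + 1)[1][:, 1:]           # neighbours, self excluded
    diff = sims[:, None, :] - sims[idx]                        # (N, k, F)
    w = np.exp(-np.linalg.norm(diff, axis=-1) ** 2)            # similarity weights
    w /= w.sum(axis=1, keepdims=True) + 1e-9
    out = feats.copy()
    for _ in range(steps):
        neighbour_avg = (w[..., None] * out[idx]).sum(axis=1)  # weighted neighbour average
        out = (1 - alpha) * out + alpha * neighbour_avg        # lazy diffusion step
    return out

rng = np.random.default_rng(0)
pos, sims = rng.normal(size=(500, 3)), rng.normal(size=(500, 64))
coarse = (rng.uniform(size=(500, 1)) > 0.5).astype(float)      # coarse binary mask scores
print(diffuse_features(coarse, sims, pos).shape)
```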
Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor R. Medeiros
ETS Montreal
Atif Belal
ETS Montreal
Srikanth Muralidharan
ETS Montreal
Eric Granger
ETS Montreal
Marco Pedersoli
ETS Montreal
Abstract
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and traditional detectors. Recently, vision-language detectors (VLDs), such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt VLDs to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality-prompt decoupled residual, facilitating a more robust adaptation. We empirically benchmark our method for modality adaptation on YOLO-World and Grounding DINO on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to full fine-tuning while preserving the models' zero-shot capability. Our code is available at https://github.com/heitorrapela/ModPrompt.
Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
Yapeng Meng
Tsinghua University
Yihan Lin
Tsinghua University
Taoyi Wang
Tsinghua University
Yuguo Chen
Tsinghua University
Lijian Wang
Tsinghua University
Rong Zhao
Tsinghua University
Abstract
Recording and reconstructing high-speed scenes poses a significant challenge. While high-speed cameras can capture fine temporal details, their extremely high bandwidth demands make continuous recording unsustainable. Conversely, traditional RGB cameras, typically operating at 30 FPS, rely on frame interpolation to synthesize high-speed motion, often introducing artifacts and motion blur. Sensors inspired by the human visual system, such as event cameras, offer high-speed sparse temporal or spatial variation data, partially alleviating these issues. However, existing methods still suffer from RGB blur, temporal aliasing, and loss of event information. To overcome these challenges, we leverage a novel complementary vision sensor, Tianmouc, which outputs high-speed, multi-bit, sparse spatio-temporal difference information with RGB frames. Building on this unique sensing modality, we introduce a Cascaded Bi-directional Recurrent Diffusion Model (CBRDM) that achieves accurate, sharp, color-rich video frame reconstruction. Our method outperforms state-of-the-art RGB interpolation algorithms in quantitative evaluations and surpasses event-based methods in real-world comparisons. Code and dataset are at https://github.com/Tianmouc/GenRec.
Temporal Rate Reduction Clustering for Human Motion Segmentation
Xianghan Meng
Beijing University of Posts and Telecommunications
Zhengyu Tong
Beijing University of Posts and Telecommunications
Zhiyuan Huang
Beijing University of Posts and Telecommunications
Chun-Guang Li
Beijing University of Posts and Telecommunications
Abstract
Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in videos capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering (TR2C), which jointly learns structured representations and affinity to segment the sequences of frames in video. Specifically, the structured representations learned by TR2C enjoy temporal consistency and align well with a UoS structure, which is favorable for addressing the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors. The code is available at: https://github.com/mengxianghan123/TR2C.
GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration
Li Mi
EPFL
Manon Béchaz
EPFL
Zeming Chen
EPFL
Antoine Bosselut
EPFL
Devis Tuia
EPFL
Abstract
Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. We validate these capabilities through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
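For readers unfamiliar with curiosity-driven intrinsic rewards, the sketch below shows one common instantiation: the reward is the prediction error of a learned forward dynamics model, so it is goal-agnostic by construction. The observation and action dimensions and the squared-error form are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Toy forward dynamics model used to compute a curiosity reward:
    predict the embedding of the next observation from the current one."""
    def __init__(self, obs_dim=128, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, obs_emb, action_onehot):
        return self.net(torch.cat([obs_emb, action_onehot], dim=-1))

def curiosity_reward(model, obs_emb, action_onehot, next_obs_emb):
    """Goal-agnostic intrinsic reward: large when the environment model is
    surprised, encouraging exploration of poorly modelled regions."""
    with torch.no_grad():
        pred = model(obs_emb, action_onehot)
        return ((pred - next_obs_emb) ** 2).mean(dim=-1)

model = ForwardModel()
r = curiosity_reward(model, torch.randn(8, 128), torch.eye(4)[torch.randint(0, 4, (8,))], torch.randn(8, 128))
```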
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
Cui Miao
National University of Defense Technology
Tao Chang
National University of Defense Technology
Meihan Wu
National University of Defense Technology
Hongbin Xu
Bytedance Seed
Chun Li
Shenzhen MSU-BIT University
Ming Li
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Xiaodong Wang
National University of Defense Technology
Abstract
Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction-Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
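A minimal sketch of a dual-gating mixture-of-experts layer is given below: a token-side router selects top-k experts, and each expert additionally applies its own gate deciding how strongly it contributes. The dimensions, the sigmoid self-gate, and the top-k rule are assumptions made for illustration; the paper's exact DGMoE design may differ.

```python
import torch
import torch.nn as nn

class DualGatingMoE(nn.Module):
    """Illustrative dual-gating MoE layer: tokens are routed to top-k experts
    (token-side gate), and each expert also has a learned self-gate that can
    suppress its own contribution (expert-side gate)."""
    def __init__(self, dim=256, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )
        self.self_gates = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_experts))
        self.k = k

    def forward(self, tokens):                        # tokens: (B, N, dim)
        scores = self.router(tokens).softmax(-1)      # token-side routing weights
        topk = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk.indices[..., slot]             # (B, N) chosen expert ids
            w = topk.values[..., slot].unsqueeze(-1)  # (B, N, 1) routing weight
            for e, (expert, gate) in enumerate(zip(self.experts, self.self_gates)):
                mask = (idx == e)
                if mask.any():
                    m = mask.unsqueeze(-1).float()
                    g = torch.sigmoid(gate(tokens))   # expert decides its activation
                    out = out + m * w * g * expert(tokens)
        return out

layer = DualGatingMoE()
y = layer(torch.randn(2, 16, 256))   # (2, 16, 256)
```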
Multi-view Gaze Target Estimation
Qiaomu Miao
Stony Brook University
Vivek Raju Golani
Stony Brook University
Jingyi Xu
Stony Brook University
Progga Paromita Dutta
Stony Brook University
Minh Hoai
The University of Adelaide
Dimitris Samaras
Stony Brook University
Abstract
This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at https://www3.cs.stonybrook.edu/~cvl/multiview_gte.html.
Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Ziliang Miao
The University of Hong Kong
Runjian Chen
The University of Hong Kong
Yixi Cai
KTH Royal Institute of Technology
Buwei He
KTH Royal Institute of Technology
Wenquan Zhao
Southern University of Science and Technology
Wenqi Shao
Shanghai AI Laboratory
Bo Zhang
Shanghai AI Laboratory
Fu Zhang
The University of Hong Kong
Abstract
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems such as self-driving vehicles. While previous supervised approaches rely on costly manual annotations, LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method designed to alleviate this annotation burden. TOP learns powerful spatiotemporal representations by predicting the occupancy states of temporal overlapping points that are commonly observed in current and adjacent scans. To further ground these representations in the current scene's geometry, we introduce an auxiliary pretraining objective of reconstructing the occupancy of the current scan. Extensive experiments on the nuScenes and SemanticKITTI datasets validate our method's effectiveness. TOP consistently outperforms existing supervised and self-supervised pre-training baselines across both point-level Intersection-over-Union (IoU) and object-level Recall metrics. Notably, it achieves a relative improvement of up to 28.77% over a training-from-scratch baseline and demonstrates strong transferability across LiDAR setups. Our code is publicly available at https://github.com/ZiliangMiao/TOP.
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
Xingyu Miao
Durham University
Haoran Duan
Tsinghua University
Quanhao Qian
DAMO Academy, Alibaba Group
Jiuniu Wang
DAMO Academy, Alibaba Group
Yang Long
Durham University
Ling Shao
UCAS-Terminus AI Lab, UCAS
Deli Zhao
DAMO Academy, Alibaba Group
Ran Xu
DAMO Academy, Alibaba Group
Gongjie Zhang
DAMO Academy, Alibaba Group
Abstract
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
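The geometric core of lifting a single image to 3D, once a metric depth map and camera intrinsics are available, is standard pinhole unprojection, sketched below. The depth estimation, camera calibration, and scale calibration stages described in the abstract are not reproduced, and the intrinsics in the example are placeholders.

```python
import numpy as np

def lift_depth_to_points(depth, fx, fy, cx, cy):
    """Unproject a metric depth map (H, W) into a 3D point cloud (N, 3)
    using a pinhole intrinsic model; invalid zero-depth pixels are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Example: dummy 480x640 depth map with assumed intrinsics.
pts = lift_depth_to_points(np.full((480, 640), 2.0),
                           fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```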
Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models
Mateusz Michalkiewicz
Rice University
Sheena Bai
Rice University
Mahsa Baktashmotlagh
The University of Queensland
Varun Jampani
Stability AI
Guha Balakrishnan
Rice University
Abstract
In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint - and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying accidental, stable, and other viewpoints using feature representations alone, without accessing the actual images at inference time. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of other viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.
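As a rough illustration of what viewpoint instability can mean operationally, the snippet below scores a set of features extracted from nearby viewpoints of the same object by one minus their mean pairwise cosine similarity. This is a generic stand-in metric, not necessarily the definition used by the authors.

```python
import torch

def viewpoint_instability(features):
    """Instability proxy for features of one object rendered from V nearby
    viewpoints, shaped (V, D): 1 - mean off-diagonal cosine similarity.
    Higher values indicate a less viewpoint-stable embedding."""
    f = torch.nn.functional.normalize(features, dim=-1)
    sim = f @ f.t()                                   # (V, V)
    v = sim.shape[0]
    off_diag = (sim.sum() - sim.diagonal().sum()) / (v * (v - 1))
    return 1.0 - off_diag

feats = torch.randn(8, 768)          # e.g. 8 renders around one object
score = viewpoint_instability(feats)
```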
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
Marko Mihajlovic
ETH Zürich
Siwei Zhang
ETH Zürich
Gen Li
ETH Zürich
Kaifeng Zhao
ETH Zürich
Lea Müller
UC Berkeley
Siyu Tang
ETH Zürich
Abstract
Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms the prior volumetric occupancy model COAP with 10× faster inference, 6× lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL's strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
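The idea of forming a compact decoder by blending a small bank of weight matrices with predicted coefficients can be sketched as below. Layer sizes, the softmax over coefficients, and the conditioning input are illustrative assumptions rather than the paper's exact NBW architecture.

```python
import torch
import torch.nn as nn

class NeuralBlendWeightsMLP(nn.Module):
    """Sketch of blending weight matrices: a small bank of weight bases is
    mixed with predicted, condition-dependent coefficients to form a compact
    per-query MLP that maps 3D points to an occupancy logit."""
    def __init__(self, in_dim=3, hidden=64, num_bases=8, cond_dim=16):
        super().__init__()
        self.bases_w1 = nn.Parameter(torch.randn(num_bases, in_dim, hidden) * 0.1)
        self.bases_w2 = nn.Parameter(torch.randn(num_bases, hidden, 1) * 0.1)
        self.coeff = nn.Linear(cond_dim, num_bases)   # shape/pose-dependent mixing

    def forward(self, xyz, cond):                     # xyz: (B, N, 3), cond: (B, cond_dim)
        a = self.coeff(cond).softmax(-1)              # (B, num_bases)
        w1 = torch.einsum("bk,kio->bio", a, self.bases_w1)
        w2 = torch.einsum("bk,kio->bio", a, self.bases_w2)
        h = torch.relu(torch.einsum("bni,bio->bno", xyz, w1))
        return torch.einsum("bni,bio->bno", h, w2)    # (B, N, 1) occupancy logits

model = NeuralBlendWeightsMLP()
logits = model(torch.randn(2, 1024, 3), torch.randn(2, 16))
```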
Discontinuity-aware Normal Integration for Generic Central Camera Models
Francesco Milano
ETH Zurich
Manuel López-Antequera
Meta
Naina Dhingra
Meta
Roland Siegwart
ETH Zurich
Robert Thiel
Meta
Abstract
Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
S2M2: Scalable Stereo Matching Model for Reliable Depth Estimation
Junhong Min
Samsung Electronics
Youngpil Jeon
Samsung Electronics
Jimin Kim
Samsung Electronics
Minyong Choi
Samsung Electronics
Abstract
The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with S²M²: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. S²M² establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.
R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception
Jonas Mirlach
XITASO GmbH
Lei Wan
Karlsruhe Institute of Technology
Andreas Wiedholz
XITASO GmbH
Hannan Ejaz Keen
XITASO GmbH
Andreas Eich
LiangDao GmbH
Abstract
In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of Vulnerable Road Users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across 150 traffic scenarios, with 7 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available at https://github.com/XITASO/r-livit.
PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
Clinton Ansun Mo
The University of Sydney
Kun Hu
Edith Cowan University
Chengjiang Long
Meta Reality Labs
Dong Yuan
The University of Sydney
Wan-Chi Siu
Hong Kong Polytechnic University
Zhiyong Wang
The University of Sydney
Abstract
Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and avoid the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture. The code is available at: https://github.com/MiniEval/PUMPS. Figure 1. Overview of PUMPS pre-training, zero-shot evaluation, and fine-tuning pipelines. PUMPS consists of an auto-encoder (encoder-decoder modules) and latent synthesis component, which are pre-trained successively.
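Linear assignment-based point pairing, mentioned above as the way the TPC reconstruction is optimised, can be illustrated by Hungarian matching between predicted and target points followed by averaging the matched distances. The Euclidean cost and the per-frame formulation are assumptions made for this sketch, not the authors' exact training loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def paired_reconstruction_loss(pred, target):
    """Pair reconstructed points with target points via linear assignment
    (Hungarian matching) and average the matched distances.
    pred, target: (N, 3) arrays for a single frame."""
    cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

loss = paired_reconstruction_loss(np.random.rand(128, 3), np.random.rand(128, 3))
```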
TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
Mohammad Mohammadi
University of Toronto
Ziyi Wu
University of Toronto
Igor Gilitschenski
University of Toronto
Abstract
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pretraining largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pretraining framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about the long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
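A generic way to accumulate events into a pseudo-grayscale target, in the spirit described above, is a leaky integrator over event polarities, sketched below. The decay factor, the split into ten temporal slices, and the event tuple layout are assumptions; the paper's accumulation scheme is more elaborate.

```python
import numpy as np

def accumulate_events(events, height, width, decay=0.97):
    """Leaky integration of an event stream into pseudo-grayscale frames.
    Each event row is (x, y, t, polarity in {-1, +1}); `decay` controls how
    fast old activity fades, so each frame reflects long-term history."""
    frame = np.zeros((height, width), dtype=np.float32)
    frames = []
    for chunk in np.array_split(events, 10):   # 10 temporal slices
        frame *= decay                         # forget old activity
        for x, y, _, p in chunk:
            frame[int(y), int(x)] += p
        frames.append(frame.copy())
    return np.stack(frames)

evts = np.column_stack([np.random.randint(0, 320, 1000),
                        np.random.randint(0, 240, 1000),
                        np.sort(np.random.rand(1000)),
                        np.random.choice([-1, 1], 1000)])
video = accumulate_events(evts, height=240, width=320)
```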
DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
Munish Monga
Sony Research India
Vishal Chudasama
Sony Research India
Pankaj Wasnik
Sony Research India
Biplab Banerjee
Indian Institute of Technology, Bombay
Abstract
Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD), only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a single measure. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
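Plain task-arithmetic merging, which DuET builds on, amounts to adding scaled task vectors (fine-tuned minus base weights) back onto the base model. The sketch below shows only this baseline; the sign-conflict mitigation and Directional Consistency Loss from the paper are not included, and the scaling factor is a placeholder.

```python
import torch

def merge_task_vectors(base_state, finetuned_states, alpha=0.5):
    """Task-arithmetic merging: for each floating-point parameter, add the
    scaled sum of task vectors (fine-tuned minus base) onto the base weights.
    Integer buffers (e.g. counters) are simply kept from the base model."""
    merged = {}
    for k, v in base_state.items():
        if v.is_floating_point():
            delta = sum(s[k] - v for s in finetuned_states)
            merged[k] = v + alpha * delta
        else:
            merged[k] = v.clone()
    return merged

# Usage with any torch model (hypothetical checkpoints):
# merged = merge_task_vectors(base.state_dict(),
#                             [task_a.state_dict(), task_b.state_dict()])
# model.load_state_dict(merged)
```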
Selective Contrastive Learning for Weakly Supervised Affordance Grounding
WonJun Moon
Sungkyunkwan University
Hyun Seok Seong
Sungkyunkwan University
Jae-Pil Heo
Sungkyunkwan University
Abstract
Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method.
DIMO: Diverse 3D Motion Generation for Arbitrary Objects
Linzhan Mou
University of Pennsylvania
Jiahui Lei
University of Pennsylvania
Chen Wang
University of Pennsylvania
Lingjie Liu
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
Diff2I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior
Juncheng Mu
Tsinghua University
Chengwei Ren
Tsinghua University
Weixiang Zhang
Tsinghua University
Liang Pan
Shanghai AI Laboratory
Xiao-Ping Zhang
Shenzhen Ubiquitous Data Enabling Key Lab
Yue Gao
Tsinghua University
Abstract
Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff2I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and the PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff2I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark. Code will be available at https://github.com/mujc2021/Diff2I2P.
O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
Lorenzo Mur-Labadia
University of Zaragoza
Maria Santos-Villafranca
University of Zaragoza
Jesus Bermudez-Cameo
University of Zaragoza
Alejandro Perez-Yus
University of Zaragoza
Ruben Martinez-Cantin
University of Zaragoza
Jose J. Guerrero
University of Zaragoza
Abstract
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22% and +76% in Ego2Exo and Exo2Ego IoU against the official challenge baselines, and gains of +13% and +6% compared with the SOTA while using 1% of the training parameters.
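The mask-pooling step behind an object-level descriptor, i.e. averaging a dense feature map over each candidate mask, can be written compactly as below. The feature dimensionality and mask count are placeholders, and the context encoding, cross-attention, and contrastive loss of O-MaMa are not shown.

```python
import torch

def mask_pooled_descriptors(dense_features, masks):
    """Pool a dense feature map (C, H, W) over binary mask candidates
    (M, H, W) to obtain one descriptor per object mask, as in mask-matching
    pipelines that pool DINOv2-style features over SAM-style masks."""
    c, h, w = dense_features.shape
    feats = dense_features.reshape(c, -1)                    # (C, H*W)
    m = masks.reshape(masks.shape[0], -1).float()            # (M, H*W)
    pooled = m @ feats.t()                                   # (M, C) summed features
    return pooled / m.sum(dim=1, keepdim=True).clamp(min=1)  # mean per mask

desc = mask_pooled_descriptors(torch.randn(384, 64, 64),
                               torch.randint(0, 2, (10, 64, 64)))
```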
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
Nithin Gopalakrishnan Nair
Johns Hopkins University
Srinivas Kaza
Google
Xuan Luo
Google
Vishal M. Patel
Johns Hopkins University
Stephen Lombardi
Google
Jungyeon Park
Google
Abstract
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.
PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Hyeongjin Nam
Seoul National University
Donghwan Kim
Seoul National University
Gyeongsik Moon
Korea University
Kyoung Mu Lee
Seoul National University
Abstract
The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction.
Hierarchical 3D Scene Graphs Construction Outdoors
Jon Nyffeler
ETH Zürich
Federico Tombari
Google
Daniel Barath
ETH Zürich
Abstract
Understanding and structuring outdoor environments in 3D is critical for numerous applications, including robotics, urban planning, and autonomous navigation. In this work, we propose a pipeline to construct hierarchical 3D scene graphs from outdoor data, consisting of posed images and 3D reconstructions. Our approach systematically extracts and organizes objects and their subcomponents, enabling representations that span from entire buildings to their facades and individual windows. By leveraging geometric and semantic relationships, our method efficiently groups objects into meaningful hierarchies while ensuring robust spatial consistency. We integrate efficient feature extraction, hierarchical object merging, and relationship inference to generate structured scene graphs that capture both global and local dependencies. Our approach scales to large outdoor environments while maintaining efficiency, and we demonstrate its effectiveness on real-world datasets. We also demonstrate that these constructed outdoor scene graphs are beneficial for downstream applications, such as 3D scene alignment. The code is available on GitHub.
PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
Sakuya Ota
Institute of Science Tokyo
Qing Yu
LY Corporation
Kent Fujiwara
LY Corporation
Satoshi Ikehata
National Institute of Informatics (NII)
Ikuro Sato
Institute of Science Tokyo
Abstract
Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into semantically relevant pairwise interactions, and leverages pretrained two-person interaction diffusion models to incrementally compose group interactions. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.
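One example of the kind of physics-based penalty used during noise optimization is a simple penetration term that grows when the joints of two characters come closer than a minimum distance, as sketched below. The joint representation and the threshold are assumptions; PINO's actual penalty terms are richer than this.

```python
import torch

def penetration_penalty(joints_a, joints_b, min_dist=0.15):
    """Penalize pairs of characters whose joints come closer than `min_dist`
    metres, discouraging overlap/penetration during noise optimization.
    joints_a, joints_b: (J, 3) joint positions of two characters."""
    d = torch.cdist(joints_a, joints_b)      # (J, J) pairwise distances
    return torch.relu(min_dist - d).sum()

pen = penetration_penalty(torch.randn(22, 3), torch.randn(22, 3))
```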
Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding
Shuyi Ouyang
Zhejiang University
Ziwei Niu
Zhejiang University
Hongyi Wang
Zhejiang University
Yen-Wei Chen
Ritsumeikan University
Lanfen Lin
Zhejiang University
Abstract
Referring Visual Grounding (RVG) tasks revolve around utilizing vision-language interactions to incorporate object information from language expressions, thereby enabling targeted object detection or segmentation within images. Transformer-based methods have enabled effective interaction through attention mechanisms, achieving notable performance in RVG tasks. However, existing strategies for RVG, which involve direct interaction between visual and linguistic features, face three key challenges: (i) tendency to focus on a single target, (ii) insufficient control over linguistic noise, and (iii) high computational cost. To address these challenges, we propose a Region-aware Anchoring Mechanism (RaAM) that mediates vision-language interactions. In RaAM, region-aware anchors engage in alternating interactions with vision and language modalities, acting as indicators for object presence across different regions within the image. RaAM (i) directs attention to multiple target regions for better localization, (ii) reduces cross-modal redundancy by using anchors as buffers, and (iii) lowers time complexity. In addition, we design region- and pixel-level loss functions to enhance object presence assessment and edge precision. We evaluate our RaAM-RVG on four benchmark datasets and integrate RaAM into various models by replacing their interaction design. Results show that RaAM outperforms state-of-the-art methods with lower computational cost.
Self-Supervised Sparse Sensor Fusion for Long Range Perception
Edoardo Palladin
Torc Robotics
Samuel Brucker
Torc Robotics
Filippo Ghilotti
Torc Robotics
Praveen Narayanan
Torc Robotics
Mario Bijelic
Torc Robotics
Felix Heide
Princeton University
Abstract
Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also allows autonomy to be extended from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pretraining scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and, compared to existing methods, achieves a 26.6% improvement in mAP for object detection and a 30.5% decrease in Chamfer Distance for LiDAR forecasting at distances of up to 250 meters.
Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions
Yuwen Pan
University of Science and Technology of China
Rui Sun
University of Science and Technology of China
Wangkai Li
University of Science and Technology of China
Tianzhu Zhang
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Abstract
Semantic segmentation under adverse conditions is critical for reliable visual perception in challenging weather environments. These extreme scenarios introduce distortions, such as low contrast and reduced visibility, making traditional segmentation models struggle. The scarcity of labeled data in such conditions makes it difficult to train models directly for these environments. Unsupervised domain adaptation (UDA) has been proposed as a solution to transfer knowledge from labeled source domains (normal weather) to unlabeled target domains (adverse weather). However, existing methods face significant challenges, particularly due to weather unawareness and feature heterogeneity. Many models fail to account for the unique characteristics of different weather conditions, and the significant feature discrepancies between normal and adverse weather images hinder effective adaptation. In this paper, we propose a novel weather-aware aggregation and adaptation network that leverages characteristic knowledge to achieve weather homogenization and enhance scene perception. Specifically, we introduce amplitude prompt aggregation to capture essential characteristics from the Fourier frequency domain that are indicative of different weather conditions. Additionally, we employ weather heterogeneity adaptation to mitigate the inter-domain heterogeneity, thereby achieving feature homogenization across diverse environments. Extensive experimental results on multiple challenging benchmarks demonstrate that our method achieves consistent improvements for semantic segmentation under adverse conditions.
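The Fourier amplitude cue referred to above can be extracted in a few lines: the amplitude spectrum mostly carries style and degradation statistics (fog, rain, night), while the phase preserves structure, which is why amplitude is a natural signal for weather-aware prompting. The snippet shows only this extraction step, not the prompt aggregation or adaptation modules.

```python
import torch

def amplitude_spectrum(image):
    """Return the Fourier amplitude of an image batch (B, C, H, W); the
    amplitude summarizes global appearance statistics indicative of the
    capture conditions, independent of most scene structure."""
    freq = torch.fft.fft2(image, norm="ortho")
    return torch.abs(freq)

amp = amplitude_spectrum(torch.randn(2, 3, 256, 256))   # (2, 3, 256, 256)
```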
Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds
Weihong Pan
Zhejiang University
Xiaoyu Zhang
SenseTime Research
Hongjia Zhai
Zhejiang University
Xiaojun Xiang
SenseTime Research
Hanqing Jiang
SenseTime Research
Guofeng Zhang
Zhejiang University
Abstract
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis and real-time rendering. However, it heavily relies on high-quality initial sparse points from Structure-from-Motion (SfM), which often struggles in textureless regions, degrading the geometry and visual quality of 3DGS. To address this limitation, we propose a novel initialization pipeline, achieving high-fidelity reconstruction from dense image sequences without relying on SfM-derived point clouds. Specifically, we first propose an effective depth alignment method that aligns the estimated monocular depth with the depth rendered from an under-optimized coarse Gaussian model using an unbiased depth rasterization approach, and then ensembles them. After that, to efficiently process dense image sequences, we incorporate a progressive segmented initialization process to generate the initial points. Extensive experiments demonstrate the superiority of our method over previous approaches and its compatibility with other advanced 3D Gaussian models. Notably, our method outperforms the SfM-based method by a 14.4% improvement in LPIPS on the Mip-NeRF360 dataset and a 30.7% improvement on the Tanks and Temples dataset.
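Aligning a monocular depth map to a depth map rendered from a coarse model is commonly done with a least-squares scale-and-shift fit, sketched below. This shows only the generic affine alignment step; the unbiased depth rasterization and ensembling described in the abstract are not reproduced.

```python
import numpy as np

def align_monocular_depth(mono_depth, rendered_depth, valid_mask):
    """Least-squares scale-and-shift alignment of a monocular depth map to a
    rendered depth map: solve rendered ~= s * mono + t over valid pixels and
    apply the fitted affine transform to the full monocular map."""
    m = mono_depth[valid_mask].reshape(-1)
    r = rendered_depth[valid_mask].reshape(-1)
    A = np.stack([m, np.ones_like(m)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * mono_depth + t

mono = np.random.rand(480, 640) + 0.5
rendered = 2.0 * mono + 0.1               # synthetic example with known scale/shift
aligned = align_monocular_depth(mono, rendered, rendered > 0)
```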
LookOut: Real-World Humanoid Egocentric Navigation
Boxiao Pan
Stanford University
Adam W. Harley
Stanford University
Francis Engelmann
Stanford University
C. Karen Liu
Stanford University
Leonidas J. Guibas
Stanford University
Abstract
The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR/AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting/slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.
Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification
Zhiqi Pang
Harbin Institute of Technology
Chunyu Wang
Harbin Institute of Technology
Lingling Zhao
Harbin Institute of Technology
Junjie Wang
Nanjing Medical University
Abstract
Color variations, a key challenge in the unsupervised visible-infrared person re-identification (UVI-ReID) task, have garnered significant attention. While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model's robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, in this paper, we focus on improving the robustness of pseudo-labels to color variations through data augmentation and propose an augmented and softened matching (ASM) method. Specifically, we first develop the cross-modality augmented matching (CAM) module, which performs channel augmentation on visible images to generate augmented images. Then, based on the fusion of the visible-infrared and augmented-infrared centroid similarity matrices, CAM establishes cross-modality correspondences that are robust to color variations. To increase training stability, we design a soft-labels momentum update (SMU) strategy, which converts traditional one-hot labels into soft-labels through momentum updates, thus adapting to CAM. During the optimization phase, we introduce the cross-modality soft contrastive loss and cross-modality hard contrastive loss to promote modality-invariant learning from the perspectives of shared and diversified features, respectively. Extensive experimental results validate the effectiveness of the proposed method, showing that ASM not only outperforms state-of-the-art unsupervised methods but also competes with some supervised methods.
ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting
Sandro Papais
University of Toronto
Letian Wang
University of Toronto
Brian Cheong
University of Toronto
Steven L. Waslander
University of Toronto
Abstract
We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.
A Unified Framework for Motion Reasoning and Generation in Human Interaction
Jeongeun Park
Korea University
Sungjoon Choi
Korea University
Sangdoo Yun
Naver AI Lab
Abstract
Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these coordinated interactions. Furthermore, a unified and versatile model is required to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To tackle these problems, we introduce MoLaM, the Interactive Motion-LAnguage Model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies primarily focusing on uni-directional tasks (e.g., text-to-motion or motion-to-text), MoLaM employs a unified architecture capable of simultaneously understanding and generating both motion and text modalities. Given the lack of an appropriate dataset to address this challenge, we introduce Inter-MT2, a large-scale instruction-tuning dataset containing 82.7K multi-turn interactive motion instructions, spanning 153K interactive motion samples. Inter-MT2 covers diverse instructional scenarios including editing, question answering, and story generation, with interactive motions leveraging off-the-shelf large language models and motion diffusion models. We extensively evaluate the versatility of MoLaM across multiple interactive motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. Remarkably, MoLaM is the first model capable of effectively addressing all these tasks with a single unified framework, achieving competitive performance compared to task-specific methods.
Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model
Daehee Park
DGIST
Monu Surana
Qualcomm Research
Pranav Desai
Qualcomm Research
Ashish Mehta
Qualcomm Research
Reuben MV John
Qualcomm Research
Kuk-Jin Yoon
KAIST
Abstract
While data-driven trajectory prediction has enhanced the reliability of autonomous driving systems, it still struggles with rarely observed long-tail scenarios. Prior works addressed this by modifying model architectures, such as using hypernetworks. In contrast, we propose refining the training process to unlock each model's potential without altering its structure. We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. It actively identifies rare tail samples where the model fails and augments these samples with a controllable diffusion model during training. In our framework, generating scenarios that are diverse, realistic, and preserve tail-case characteristics is paramount. Accordingly, we design a tail-aware generation method that applies tailored diffusion guidance to generate trajectories that both capture rare behaviors and respect traffic rules. Unlike prior simulation methods focused solely on scenario diversity, GALTraj is the first to show how simulator-driven augmentation benefits long-tail learning in trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park
Purdue University
Can Cui
Purdue University
Yunsheng Ma
Purdue University
Ahmadreza Moradipari
Toyota InfoTech Labs
Rohit Gupta
Toyota InfoTech Labs
Kyungtae Han
Toyota InfoTech Labs
Ziran Wang
Purdue University
Abstract
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. NuPlanQA is available at our GitHub repository.
SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Chaesong Park
Seoul National University
Eunbin Seo
Hyundai Motor Group
Jihyeon Hwang
Seoul National University
Jongwoo Lim
Seoul National University
Abstract
In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics - Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy - which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
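The three evaluation metrics named above (MAE, RMSE, and threshold-based accuracy) follow their usual definitions from depth and surface estimation; a compact implementation is given below. The threshold values in the example are placeholders, not the paper's settings.

```python
import numpy as np

def height_metrics(pred, gt, thresholds=(0.1, 0.2, 0.5)):
    """Compute MAE, RMSE, and threshold-based accuracy (fraction of pixels
    whose absolute error falls below each threshold, in metres) between a
    predicted and a ground-truth heightmap."""
    err = np.abs(pred - gt)
    mae = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    acc = {f"delta<{t}": (err < t).mean() for t in thresholds}
    return mae, rmse, acc

mae, rmse, acc = height_metrics(np.random.rand(100, 100), np.random.rand(100, 100))
```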
SFUOD: Source-Free Unknown Object Detection
Keon-Hee Park
Kyung Hee University
Seun-An Choe
Kyung Hee University
Gyeong-Moon Park
Korea University
Abstract
Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose Source-Free Unknown Object Detection (SFUOD), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axes-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performances on SFUOD benchmarks, and extensive experiments validate its effectiveness. Our code is available at SFUOD.
Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
Seongmin Park
Hanyang University
Hyungmin Kim
Hanyang University
Sangwoo Kim
Hanyang University
Wonseok Jeon
Hyundai Motor Company
Juyoung Yang
Hyundai Motor Company
Byeongwook Jeon
Hyundai Motor Company
Yoonseon Oh
Hanyang University
Jungwook Choi
Hanyang University
Abstract
Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning (SQIL), which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, SQIL preserves decision fidelity under low-bit precision. We validate SQIL's generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5x speedup and 2.5x energy savings on an edge GPU with minimal accuracy loss. These results underline SQIL's potential for efficiently deploying large IL-based policy models on resource-limited devices.
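The selective loss-weighting idea can be illustrated as a behaviour-cloning loss whose per-sample weight grows with a saliency score, as sketched below. The weighting rule, the squared-error objective, and the boost factor are assumptions for illustration; the saliency scoring and the quantization-aware training loop are not shown.

```python
import torch

def saliency_weighted_bc_loss(pred_actions, expert_actions, saliency, boost=2.0):
    """Behaviour-cloning loss with extra weight on mission-critical states:
    `saliency` is a per-sample score in [0, 1], and higher-saliency samples
    receive a larger multiplier so the low-bit policy keeps fidelity where
    decisions matter most."""
    per_sample = ((pred_actions - expert_actions) ** 2).mean(dim=-1)
    weights = 1.0 + boost * saliency
    return (weights * per_sample).mean()

loss = saliency_weighted_bc_loss(torch.randn(32, 7), torch.randn(32, 7), torch.rand(32))
```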
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
Byeongjun Park
KAIST
Hyojun Go
EverEx
Hyelin Nam
EverEx
Byung-Hoon Kim
Yonsei University
Hyungjin Chung
EverEx
Changick Kim
KAIST
Abstract
Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Chaitanya Patel
Stanford University
Hiroki Nakamura
Panasonic Holdings Corporation
Yuta Kyuragi
Panasonic R&D Company of America
Kazuki Kozuka
Panasonic Holdings Corporation
Juan Carlos Niebles
Stanford University
Ehsan Adeli
Stanford University
Abstract
Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene representations. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
Revisiting Point Cloud Completion: Are We Ready For The Real-World?
Stuti Pathak
UAntwerp
Prashant Kumar
IIT Delhi
Dheeraj Baiju
BITS Pilani
Nicholus Mboga
GIM
Gunther Steenackers
UAntwerp
Rudi Penne
UAntwerp
Abstract
Point clouds acquired in constrained, challenging, uncontrolled, and multi-sensor real-world settings are noisy, incomplete, and non-uniformly sparse. This presents acute challenges for the vital task of point cloud completion. Using tools from Algebraic Topology and Persistent Homology (PH), we demonstrate that current benchmark object point clouds lack the rich topological features that are an integral part of point clouds captured in realistic environments. To facilitate research in this direction, we contribute the first real-world industrial dataset for point cloud completion, RealPC, a diverse, rich and varied set of point clouds. It consists of ∼40,000 pairs across 21 categories of industrial structures in railway establishments. Benchmark results on several strong baselines reveal that existing methods fail in real-world scenarios. We make a striking observation: unlike current datasets, RealPC consists of multiple 0- and 1-dimensional PH-based topological features. We prove that integrating these topological priors into existing works helps improve completion. We present how 0-dimensional PH priors extract the global topology of a complete shape in the form of a 3D skeleton and assist a model in generating topologically consistent complete shapes. Since computing Homology is expensive, we present a simple, yet effective Homology Sampler guided network, BOSHNet, that bypasses the Homology computation by sampling proxy backbones akin to 0-dim PH. These backbones provide similar benefits of 0-dim PH right from the start of the training, unlike similar methods where accurate backbones are obtained only during later phases of the training. The code is available at https://github.com/stutipathak5/Point-CloudCompletion.
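For readers unfamiliar with 0-dimensional persistent homology, the sketch below relies on the standard fact that, under the usual Vietoris-Rips convention, 0-dimensional features of a point cloud are all born at radius 0 and die at the edge lengths of its Euclidean minimum spanning tree; this is generic background, not the paper's code.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def zero_dim_persistence(points: np.ndarray) -> np.ndarray:
    dists = squareform(pdist(points))     # dense pairwise distance matrix
    mst = minimum_spanning_tree(dists)    # sparse matrix holding the MST edges
    deaths = np.sort(mst.data)            # radii at which components merge
    return deaths                         # the single infinite bar is omitted

pts = np.random.rand(500, 3)
print(zero_dim_persistence(pts)[:5])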
MistSense: Versatile Online Detection of Procedural and Execution Mistakes
Constantin Patsch
Technical University of Munich
Yuankai Wu
Technical University of Munich
Marsil Zakour
Technical University of Munich
Driton Salihu
Technical University of Munich
Eckehard Steinbach
Technical University of Munich
Abstract
Online mistake detection is crucial across various domains, ranging from industrial automation to educational applications, since continuous inference on a video stream allows mistakes to be corrected by the human operator soon after they are detected. While prior research mainly addresses procedural errors that often relate to temporal and ordering information, identifying a broader range of error types is essential for real-world implementation. In this work, we present MistSense, a versatile approach for online mistake identification that considers both procedural errors, which involve incorrect action sequences, and execution errors, such as motor inaccuracies or improper equipment use. Our method integrates RGB and hand pose features to capture fine-grained contextual cues in order to detect a mistake. By jointly modeling spatial and sequential aspects of human actions, our framework enables robust and adaptive error detection in dynamic environments. Once a mistake has been detected, we leverage a large language model (LLM) which provides an error explanation that gives the user further insights into why an action has been identified as a mistake. The evaluation on common mistake detection benchmarks shows the effectiveness of our approach.
D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
Wenjie Pei
Harbin Institute of Technology, Shenzhen
Qizhong Tan
Harbin Institute of Technology, Shenzhen
Guangming Lu
Harbin Institute of Technology, Shenzhen
Jiandong Tian
Shenyang Institute of Automation, Chinese Academy of Sciences
Jun Yu
Harbin Institute of Technology, Shenzhen
Abstract
Adapting pre-trained image models to the video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning in image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D2ST-Adapter). It features a lightweight design, low adaptation overhead and powerful spatio-temporal feature adaptation capabilities. D2ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features within the adapter, seamlessly integrating into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of the D2ST-Adapter, incorporating the specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be individually tailored for the two pathways with anisotropic sampling densities along the spatial and temporal domains in 3D spatio-temporal space, enabling disentangled encoding of spatial and temporal features while maintaining a lightweight design. Extensive experiments by instantiating our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition. Code is available at https://github.com/qizhongtan/D2ST-Adapter.
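A minimal, hedged sketch of a dual-pathway spatio-temporal adapter is given below; it uses plain factorized 3D convolutions for the spatial and temporal branches, whereas the actual D2ST-Adapter employs anisotropic deformable spatio-temporal attention. All shapes and names are illustrative.

import torch
import torch.nn as nn

class DualPathAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # spatial branch mixes H and W; temporal branch mixes T
        self.spatial = nn.Conv3d(bottleneck, bottleneck, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(bottleneck, bottleneck, (3, 1, 1), padding=(1, 0, 0))
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, T, H, W, C) tokens from a frozen image backbone
        z = self.act(self.down(x)).permute(0, 4, 1, 2, 3)   # (B, b, T, H, W)
        z = self.spatial(z) + self.temporal(z)               # disentangled pathways
        z = z.permute(0, 2, 3, 4, 1)                          # back to (B, T, H, W, b)
        return x + self.up(self.act(z))                       # residual adapter output

adapter = DualPathAdapter(dim=768)
out = adapter(torch.randn(2, 8, 14, 14, 768))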
Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Muleilan Pei
HKUST
Shaoshuai Shi
Voyager Research, Didi Chuxing
Xuesong Chen
Voyager Research, Didi Chuxing
Xu Liu
Zhuoyu Technology
Shaojie Shen
HKUST
Abstract
Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a 'First Reasoning, Then Forecasting' strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent's behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
Simone Alberto Peirone
Politecnico di Torino
Francesca Pistilli
Politecnico di Torino
Giuseppe Averta
Politecnico di Torino
Abstract
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in a zero-shot setting. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision. Project page: sapeirone.github.io/HiERO.
A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
Jizong Peng
dConstruct Robotics
Tze Ho Elden Tse
National University of Singapore
Kai Xu
National University of Singapore
Wenchao Gao
dConstruct Robotics
Angela Yao
National University of Singapore
Abstract
3D Gaussian Splatting (3DGS) is a powerful reconstruction technique; however, it requires initialization from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate this, we propose two optimization constraints that are conditioned on the sensitivity of each parameter group and restrict the search space of each parameter. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks. Project webpage: https://eldentse.github.io/contrainedoptimization-3dgs.
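The pose decomposition can be illustrated with a few lines of NumPy (purely illustrative; the numbers are made up): a camera-to-world pose is composed from a tightly constrained camera-to-(device-)center transform and a loosely constrained (device-)center-to-world transform, so each parameter group can be optimized within its own restricted search space.

import numpy as np

def se3(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    # build a 4x4 rigid transform from rotation R and translation t
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# camera rigidly mounted on the device: small, tightly constrained correction
cam_to_center = se3(np.eye(3), np.array([0.1, 0.0, 0.05]))
# device trajectory in the world: the loosely constrained part
center_to_world = se3(np.eye(3), np.array([2.0, 0.0, 1.2]))

cam_to_world = center_to_world @ cam_to_center   # composed camera pose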
On the Provable Importance of Gradients for Autonomous Language-Assisted Image Clustering
Bo Peng
University of Technology Sydney
Jie Lu
University of Technology Sydney
Guangquan Zhang
University of Technology Sydney
Zhen Fang
University of Technology Sydney
Abstract
This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as special cases. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
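A simplified sketch of gradient-norm scoring follows (the uniform target and KL term here stand in for the paper's predicted target distribution and cross-entropy; shapes and the decision rule are assumptions): the magnitude of the gradient back-propagated to a noun embedding is used to rank candidate nouns.

import torch
import torch.nn.functional as F

def gradnorm_score(image_feats: torch.Tensor, noun_feat: torch.Tensor) -> float:
    # image_feats: (N, D) image embeddings; noun_feat: (D,) text embedding
    noun_feat = noun_feat.clone().detach().requires_grad_(True)
    logits = image_feats @ noun_feat                 # (N,) image-noun similarities
    probs = torch.softmax(logits, dim=0)
    target = torch.full_like(probs, 1.0 / probs.numel())   # uniform stand-in target
    loss = F.kl_div(probs.log(), target, reduction="sum")
    (grad,) = torch.autograd.grad(loss, noun_feat)
    return grad.norm().item()                        # score used to rank nouns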
DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
Emery Pierson
LIX, Ecole Polytechnique
Lei Li
Technical University of Munich
Angela Dai
Technical University of Munich
Maks Ovsjanikov
LIX, Ecole Polytechnique
Abstract
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches to scenarios where the assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong
University of North Carolina at Charlotte
Muhammad Usama Saleem
University of North Carolina at Charlotte
Korrawe Karunratanakul
ETH Zürich
Pu Wang
University of North Carolina at Charlotte
Hongfei Xue
University of North Carolina at Charlotte
Chen Chen
University of Central Florida
Chuan Guo
Snap Inc.
Junli Cao
Snap Inc.
Jian Ren
Snap Inc.
Sergey Tulyakov
Snap Inc.
Abstract
Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to combat the non-differentiable distribution sampling process encountered by the logits regularizer and logit optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by 77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
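Inference-time logit optimization can be sketched as below (decode_motion, the shapes, and the hyperparameters are hypothetical, not the released code): the predicted token logits are treated as free variables and refined by gradient descent so that the decoded motion matches the controlled joint positions.

import torch

def optimize_logits(logits, target_joints, joint_mask, decode_motion, steps=50, lr=0.1):
    # logits: predicted token logits; decode_motion: a differentiable decoder (assumed)
    logits = logits.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # a soft token distribution keeps the pipeline differentiable
        motion = decode_motion(torch.softmax(logits, dim=-1))
        loss = ((motion - target_joints)[joint_mask] ** 2).mean()
        loss.backward()
        opt.step()
    return logits.detach()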
SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
Maximilian Pittner
Bosch Mobility Solutions, Robert Bosch GmbH
Joel Janai
Bosch Mobility Solutions, Robert Bosch GmbH
Mario Faigle
Bosch Mobility Solutions, Robert Bosch GmbH
Alexandru Paul Condurache
Institute of Neuro- and Bioinformatics, University of Lübeck
Abstract
3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense bird's-eye-view (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which have the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experiments demonstrate the benefits of our contributions and show state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.
Long-Context State-Space Video World Models
Ryan Po
Stanford University
Yotam Nitzan
Adobe Research
Richard Zhang
Adobe Research
Berlin Chen
Princeton University
Tri Dao
Princeton University
Eli Shechtman
Adobe Research
Gordon Wetzstein
Stanford University
Xun Huang
Adobe Research
Abstract
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
Matteo Poggi
University of Bologna
Fabio Tosi
University of Bologna
Abstract
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art SEA-RAFT, as well as on the Spring and LayeredFlow datasets.
Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
Miroslav Purkrabek
Czech Technical University in Prague
Jiri Matas
Czech Technical University in Prague
Abstract
Human pose estimation methods work well on isolated people but struggle with multiple-bodies-in-proximity scenarios. Previous work has addressed this problem by conditioning pose estimation on detected bounding boxes or keypoints, but overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox-Mask-Pose (BMP) method uses three specialized models that improve each other's output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi-body scenes. MaskPose, a new mask-conditioned pose estimation model, is the best among top-down approaches on OCHuman. BBox-Mask-Pose pushes the SOTA on the OCHuman dataset in all three tasks: detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially good in scenes with large instance overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human-centered foundational models. Code and models are available on the project website: MiraPurkrabek.github.io/BBox-Mask-Pose/
COVTrack: Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion
Zekun Qian
College of Intelligence and Computing, Tianjin University
Ruize Han
Shenzhen University of Advanced Technology
Zhixiang Wang
College of Intelligence and Computing, Tianjin University
Junhui Hou
City University of Hong Kong
Wei Feng
College of Intelligence and Computing, Tianjin University
Abstract
Open-Vocabulary Multi-Object Tracking (OVMOT) aims to detect and track diverse object categories in videos, including both seen (base) and unseen (novel) categories. Current methods rely on appearance features from generated image pairs or utilize the discontinuous annotations of the video dataset (TAO) for training, primarily due to the lack of available continuous annotated video datasets for OVMOT. This limitation affects their effectiveness, since continuous target trajectories are necessary for robust tracker learning. In this work, we propose the CTAO dataset, which provides a continuous version of TAO, thereby constructing the first continuous annotated training dataset for OVMOT. This addresses the previous limitations in training data availability. Additionally, we introduce COVTrack, a unified framework that effectively integrates motion and semantic features with appearance features, in which the multi-cue feature aggregation strategy dynamically aggregates and balances these features, based on the confidence estimation from both intra-frame and inter-frame contexts. Our proposed framework significantly improves OVMOT performance, establishing COVTrack as a state-of-the-art solution on OVMOT benchmarks.
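A small, hedged sketch of confidence-weighted multi-cue association (cue names, weights, and shapes are illustrative, not the paper's implementation): appearance, motion, and semantic similarity matrices are blended according to per-cue confidences before Hungarian matching.

import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_and_match(app_sim, motion_sim, sem_sim, conf):
    # each *_sim: (num_tracks, num_dets); conf: per-cue confidences in [0, 1]
    w = np.array([conf["appearance"], conf["motion"], conf["semantic"]])
    w = w / (w.sum() + 1e-8)
    fused = w[0] * app_sim + w[1] * motion_sim + w[2] * sem_sim
    rows, cols = linear_sum_assignment(-fused)       # maximize total similarity
    return list(zip(rows, cols)), fused

matches, _ = fuse_and_match(np.random.rand(3, 4), np.random.rand(3, 4),
                            np.random.rand(3, 4),
                            {"appearance": 0.6, "motion": 0.3, "semantic": 0.5})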
PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
Kangan Qian
School of Vehicle and Mobility, Tsinghua University
Jinyu Miao
School of Vehicle and Mobility, Tsinghua University
Xinyu Jiao
School of Vehicle and Mobility, Tsinghua University
Ziang Luo
School of Vehicle and Mobility, Tsinghua University
Zheng Fu
School of Vehicle and Mobility, Tsinghua University
Yining Shi
School of Vehicle and Mobility, Tsinghua University
Yunlong Wang
School of Vehicle and Mobility, Tsinghua University
Kun Jiang
School of Vehicle and Mobility, Tsinghua University
Diange Yang
School of Vehicle and Mobility, Tsinghua University
Abstract
Reliable spatial and motion perception is essential for safe autonomous navigation. Recently, class-agnostic motion prediction on bird's-eye view (BEV) cell grids derived from LiDAR point clouds has gained significant attention. However, existing frameworks typically perform cell classification and motion prediction on a per-pixel basis, neglecting important motion field priors such as rigidity constraints, temporal consistency, and future interactions between agents. These limitations lead to degraded performance, particularly in sparse and distant regions. To address these challenges, we introduce PriorMotion, an innovative generative framework designed for class-agnostic motion prediction that integrates essential motion priors by modeling them as distributions within a structured latent space. Specifically, our method captures structured motion priors using raster-vector representations and employs a variational autoencoder with distinct dynamic and static components to learn future motion distributions in the latent space. Experiments on the nuScenes dataset demonstrate that PriorMotion outperforms state-of-the-art methods across both traditional metrics and our newly proposed evaluation criteria. Notably, we achieve improvements of approximately 15.24% in accuracy for fast-moving objects, a 3.59% increase in generalization, a reduction of 0.0163 in motion stability, and a 31.52% reduction in prediction errors in distant regions. Further validation on FMCW LiDAR sensors confirms the robustness of our approach.
VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking
Zekun Qian
College of Intelligence and Computing, Tianjin University
Ruize Han
Shenzhen University of Advanced Technology
Junhui Hou
City University of Hong Kong
Linqi Song
City University of Hong Kong
Wei Feng
College of Intelligence and Computing, Tianjin University
Abstract
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, not fully leveraging the video information. In this work, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video analysis standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object tracking (association). Experimental results underscore that VOVTrack establishes itself as a state-of-the-art solution for the open-vocabulary tracking task.
Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments
Liang Qin
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, China
Min Wang
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Peiwei Li
University of Science and Technology of China
Wengang Zhou
University of Science and Technology of China
Houqiang Li
University of Science and Technology of China
Abstract
Object Goal Navigation (ObjectNav) in unknown environments presents significant challenges, particularly in Open-Vocabulary Mobile Manipulation (OVMM), where robots must efficiently explore large spaces, locate small objects, and accurately position themselves for subsequent manipulation. Existing approaches struggle to meet these demands: rule-based methods offer structured exploration but lack adaptability, while reinforcement learning (RL)-based methods enhance adaptability but fail to ensure effective long-term navigation. Moreover, both approaches often overlook precise stopping positions, which are critical for successful manipulation. To address these challenges, we propose APRR (Active Perception meets Rule-guided RL), a two-phase framework, which designs a new rule-guided RL policy for the exploration phase and a novel active target perception policy for the last-mile navigation phase. Inspired by human search behavior, our rule-guided RL policy enables efficient and adaptive exploration by combining structured heuristics with learning-based decision-making. In the last-mile navigation phase, we introduce an RL-based policy enhanced with active target perception, allowing the robot to refine its position dynamically based on real-time detection feedback. Experimental results demonstrate that APRR improves the success rate by 13%, significantly outperforming existing methods. Furthermore, real-world experiments validate the practicality and effectiveness of APRR in real-world mobile manipulation scenarios, offering a robust and adaptable solution for precise object navigation. The code is available at https://github.com/qinliangql/APRR.
Learning on the Go: A Meta-learning Object Navigation Model
Xiaorong Qin
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinhang Song
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Sixian Zhang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinyao Yu
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Xinmiao Zhang
University of Chinese Academy of Sciences, Beijing
Shuqiang Jiang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing
Abstract
Object navigation tasks require an agent to locate a target object using visual observations in unseen environments, where unfamiliar layouts and novel object appearances can hinder navigation. Most existing methods lack the adaptability needed to handle these uncertainties, as their navigation models remain fixed during testing. In this paper, we address this challenge by examining object-conditioned trajectory distribution shifts in navigation caused by changes in environmental dynamics. We propose learning a central conditional distribution as a prior that approximates the specific distributions of diverse environments. To retain environment-specific information during navigation, we allow each environment-specific distribution to approximate this central distribution rather than relying on it directly. To implement this, we introduce a meta-learning mechanism that integrates with traditional navigation methods, offering tailored solutions for various types of navigation approaches. Our approach, Learning on the Go (LOG), enables agents to learn on the go, allowing for flexible, adaptive, real-time learning during navigation. Our theoretical analysis highlights the benefits of learning a central distribution for effective generalization across environments, and empirical results confirm the proposed method's effectiveness, demonstrating superior performance compared to existing approaches.
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Yiran Qin
Sun Yat-sen University
Li Kang
Shanghai Jiao Tong University
Xiufeng Song
Shanghai Jiao Tong University
Zhenfei Yin
Oxford
Xiaohong Liu
Shanghai Jiao Tong University
Xihui Liu
HKU
Ruimao Zhang
Sun Yat-sen University
Lei Bai
Shanghai Artificial Intelligence Laboratory
Abstract
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows
Xianglin Qiu
XJTLU
Xiaoyang Wang
Shanghai AI Laboratory
Zhen Zhang
iHorry
Jimin Xiao
XJTLU
Abstract
Weakly supervised semantic segmentation (WSSS) aims to generate dense labels using sparse annotations, such as image-level labels. Existing class activation map (CAM) generation methods have been able to locate rough objects. However, due to the limited information provided by image-level labels, the biased activation problem, including over-activation, becomes another key obstacle in WSSS. To rectify such biased activation, we attempt to mine pixel-level class feature distribution information from the entire dataset. Specifically, we propose to use normalizing flow to model the class feature distribution of all pixels across the entire dataset and design a Bias-Resilient WSSS framework based on Normalizing Flow (BRNF). Normalizing flow has the ability to map complex distributions to normal distributions. Building upon it, we design an additional Gaussian mixture classifier which classifies pixels from the perspective of feature distributions, providing supplementary information to the conventional MLP-based classifier. In addition, we use this distribution to sample low-bias features as positive anchors for contrastive learning, thereby encouraging feature optimization toward the correct low-bias direction. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be available at https://github.com/DpDark/BRNF.
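The Gaussian mixture classifier can be approximated by the following sketch (a simplification under assumed shapes, not the released code): pixel features mapped by the normalizing flow are scored by their log-likelihood under per-class Gaussians, and the scores are used as logits alongside the conventional classifier.

import torch

def gaussian_classifier_logits(z, class_means, class_logvars):
    # z: (N, D) latent features mapped by the normalizing flow
    # class_means, class_logvars: (K, D) per-class diagonal Gaussian parameters
    diff = z.unsqueeze(1) - class_means.unsqueeze(0)              # (N, K, D)
    log_prob = -0.5 * ((diff ** 2) / class_logvars.exp().unsqueeze(0)
                       + class_logvars.unsqueeze(0)).sum(-1)      # up to a constant
    return log_prob                                               # used as class logits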
Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models
Chang Qiu
Southeast University
Feipeng Da
Southeast University
Zilei Zhang
Southeast University
Abstract
The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning the model for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, the use of pre-training methods on point cloud data can also enhance the working performance and generalization ability of the model. Therefore, in this paper, we propose a pre-training framework based on a diffusion model, called PreDifPoint. It is able to accomplish the pre-training of the model's backbone network through a diffusion process of gradual denoising. We aggregate the potential features extracted from the backbone network, input them as conditions into the subsequent diffusion model, and direct the point-to-point mapping relationship of the noisy point clouds at neighboring time steps, so as to generate high-quality point clouds and at the same time better perform various downstream tasks on the point clouds. We also introduce a bi-directional covariate attention (DXCA-Attention) mechanism for capturing complex feature interactions, fusing local and global features, and improving the detail recovery of point clouds. In addition, we propose a density-adaptive sampling strategy, which can help the model dynamically adjust the sampling strategy between different time steps, and guide the model to pay more attention to the denser regions in the point cloud, thus improving the effectiveness of the model in point cloud recovery. Our PreDifPoint framework achieves more competitive results on various real-world datasets. Specifically, PreDifPoint achieves an overall accuracy of 87.96%, which is 0.35% higher than PointDif, on the classification task on PB-T50-RS, a variant of the ScanObjectNN dataset.
LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds
Lingteng Qiu
Tongyi Lab, Alibaba Group
Xiaodong Gu
Tongyi Lab, Alibaba Group
Peihao Li
Tongyi Lab, Alibaba Group
Qi Zuo
Tongyi Lab, Alibaba Group
Weichao Shen
Tongyi Lab, Alibaba Group
Junfei Zhang
Tongyi Lab, Alibaba Group
Kejie Qiu
Tongyi Lab, Alibaba Group
Weihao Yuan
Tongyi Lab, Alibaba Group
Guanying Chen
Tongyi Lab, Alibaba Group
Zilong Dong
Tongyi Lab, Alibaba Group
Liefeng Bo
Tongyi Lab, Alibaba Group
Abstract
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feedforward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for faces and hands, outperforming existing methods in both reconstruction accuracy and generalization ability. Our code is available at https://github.com/aigc3d/LHM
Multi-View 3D Point Tracking
Frano Rajič
ETH Zürich
Haofei Xu
ETH Zürich
Marko Mihajlovic
ETH Zürich
Siyuan Li
ETH Zürich
Irem Demir
ETH Zürich
Emircan Gündoğdu
ETH Zürich
Lei Ke
Carnegie Mellon University
Sergey Prokudin
ETH Zürich
Marc Pollefeys
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feedforward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks, Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page: https://ethz-vlg.github.io/mvtracker.
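The k-nearest-neighbors correlation step might look roughly like the following sketch (shapes and names are assumptions, not the released code): for each query point, features of its k nearest neighbors in the fused point cloud are correlated against the query descriptor.

import numpy as np
from scipy.spatial import cKDTree

def knn_correlation(points, feats, queries, query_feats, k=16):
    # points: (N, 3), feats: (N, C); queries: (M, 3), query_feats: (M, C)
    tree = cKDTree(points)
    _, idx = tree.query(queries, k=k)                    # (M, k) neighbor indices
    neigh = feats[idx]                                   # (M, k, C) neighbor features
    corr = np.einsum("mkc,mc->mk", neigh, query_feats)   # (M, k) correlation values
    return corr, idx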
AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction
Bin Rao
State Key Laboratory of Internet of Things for Smart City, University of Macau
Haicheng Liao
State Key Laboratory of Internet of Things for Smart City, University of Macau
Yanchen Guan
State Key Laboratory of Internet of Things for Smart City, University of Macau
Chengyue Wang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Bonan Wang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Jiaxun Zhang
State Key Laboratory of Internet of Things for Smart City, University of Macau
Zhenning Li
State Key Laboratory of Internet of Things for Smart City, University of Macau
Abstract
Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model's prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model's ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH/UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.
Beyond Perspective: Neural 360-Degree Video Compression
Andy Regensky
Friedrich-Alexander-Universität Erlangen-Nürnberg
Marc Windsheimer
Friedrich-Alexander-Universität Erlangen-Nürnberg
Fabian Brand
Friedrich-Alexander-Universität Erlangen-Nürnberg
André Kaup
Friedrich-Alexander-Universität Erlangen-Nürnberg
Abstract
Neural video codecs (NVCs) have seen fast-paced advancement in recent years and already perform close to state-of-the-art traditional video codecs like H.266/VVC. However, NVC investigations have so far focused on improving performance for classical perspective video, leaving the increasingly important 360-degree video format unexplored. In this paper, we address this issue and present how existing NVCs can be optimized for 360-degree video while also improving performance on perspective video. As no suitable datasets for neural 360-degree video compression exist, we publish a large-scale 360-degree video dataset consisting of more than 6000 user-generated 9-frame sequences with resolutions ranging from 0.5K to 8K. We propose a novel method for training data augmentation exploiting the spherical characteristics of 360-degree video, which proves crucial for achieving maximum compression performance. An additional positional feature encoding further supports the NVC in dynamic bitrate allocation, notably improving the performance for both 360-degree and perspective video. Overall, we achieve rate savings of almost 8% for 360-degree video and more than 3% for perspective video with minimal complexity overhead. The dataset is available at: https://huggingface.co/datasets/FAULMS/UGC360. Source code and pre-trained model weights are available at: https://github.com/FAU-LMS/NVC360.
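One spherical augmentation that is exactly valid for equirectangular frames is a random rotation about the vertical axis, which reduces to a circular shift along the width dimension; the sketch below shows this single case, and the paper's full augmentation scheme may differ.

import numpy as np

def random_yaw_roll(frames: np.ndarray, rng=np.random) -> np.ndarray:
    # frames: (T, H, W, 3) equirectangular video clip; a yaw rotation of the
    # sphere corresponds to a circular shift along the width axis, so no
    # resampling artifacts are introduced
    shift = rng.randint(0, frames.shape[2])
    return np.roll(frames, shift, axis=2)

clip = np.zeros((9, 512, 1024, 3), dtype=np.uint8)
augmented = random_yaw_roll(clip)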
GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination
Chengwei Ren
Tsinghua University
Fan Zhang
Shanghai AI Laboratory
Liangchao Xu
Nanjing University
Liang Pan
Shanghai AI Laboratory
Ziwei Liu
Nanyang Technological University
Wenping Wang
Texas A&M University
Xiao-Ping Zhang
Tsinghua University
Yuan Liu
The Hong Kong University of Science and Technology
Abstract
3D Gaussian Splatting (3DGS) is a prevailing technique to reconstruct large-scale 3D scenes from multiview images for novel view synthesis, like a room, a block, and even a city. Such large-scale scenes are not static, with changes constantly happening in these scenes, like a new building being built or a new decoration being set up. To keep the reconstructed 3D Gaussian fields up-to-date, a naive way is to reconstruct the whole scene after it changes, which is extremely costly and inefficient. In this paper, we propose a new method called GauUpdate that allows partially updating an old 3D Gaussian field with new objects from a new 3D Gaussian field. However, simply inserting the new objects leads to inconsistent appearances because the old and new Gaussian fields may have different lighting environments from each other. GauUpdate addresses this problem by applying inverse rendering techniques in 3DGS to recover both the materials and environmental lights. Based on the materials and lighting, we relight the new objects in the old 3D Gaussian field for consistent global illumination. For accurate estimation of the materials and lighting, we impose the additional constraint that the two fields share the same materials but have different environment lights, which improves the quality of both estimates. We conduct experiments on both synthetic scenes and real-world scenes, which demonstrate that GauUpdate achieves realistic object insertion in 3D Gaussian fields with consistent appearances.
Multi-modal Segment Anything Model for Camouflaged Scene Segmentation
Guangyu Ren
Xi'an Jiaotong-Liverpool University
Hengyan Liu
Xi'an Jiaotong-Liverpool University
Michalis Lazarou
Imperial College London
Tania Stathaki
Imperial College London
Abstract
Camouflaged scenes, where objects blend seamlessly into their environments, pose significant challenges to both human observers and computer vision systems. To address this, we propose a novel framework that leverages off-the-shelf foundation models to generate multi-modal prompts for the Segment Anything Model (SAM), thus eliminating the need for manual prompts and significantly improving overall performance on this downstream task. First, we generate an image caption using the BLIP model and obtain its text embedding through the use of a text encoder. We then generate a visual embedding through the vision encoder of the BLIP model and use both as inputs to SAM to provide additional semantic information about the image. Finally, we propose two architectural novelties: a) we effectively integrate the multi-modal information in SAM through a multi-level adapter and b) we replace the dense embedding of SAM with the image embedding of its image encoder. Our method achieves new state-of-the-art performance in 11 out of 12 metrics on three benchmark datasets for camouflaged detection. Additionally, our method can be successfully adapted to other tasks such as medical image segmentation, performing on par with or even outperforming state-of-the-art methods. Our code is available at https://github.com/icqialanqian/Vision-Language-SAM.
Neural Compression for 3D Geometry Sets
Siyu Ren
City University of Hong Kong
Junhui Hou
City University of Hong Kong
Weiyao Lin
Shanghai Jiao Tong University
Wenping Wang
Texas A&M University
Abstract
We present NeCGS, the first neural compression paradigm, which can compress a geometry set encompassing thousands of detailed and diverse 3D mesh models by up to 900 times with high accuracy and preservation of detailed geometric structures. Specifically, we first propose TSDF-Def, a new implicit representation that is capable of accurately representing irregular 3D mesh models with various structures into regular 4D tensors of uniform and compact size, where 3D surfaces can be extracted through the deformable marching cubes. Then we construct a quantization-aware auto-decoder network architecture to regress these 4D tensors to explore the local geometric similarity within each shape and across different shapes for redundancy removal, resulting in more compact representations, including an embedded feature of a smaller size associated with each 3D model and a network parameter shared by all models. We finally encode the resulting features and network parameters into bitstreams through entropy coding. Besides, our NeCGS can handle the dynamic scenario well, where new 3D models are constantly added to a compressed set. Extensive experiments and ablation studies demonstrate the significant advantages of our NeCGS over state-of-the-art methods both quantitatively and qualitatively. The source code is publicly available at https://github.com/rsy6318/NeCGS.
Seeing the Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation
Peng Ren
College of Computer Science and Technology, Jilin University
Tian Bai
College of Computer Science and Technology, Jilin University
Jing Sun
School of Information and Communication Engineering, Dalian Minzu University
Fuming Sun
School of Information and Communication Engineering, Dalian Minzu University
Abstract
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects of any category based on text descriptions. Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate the above limitation, we propose a framework for OVCOS, named SuCLIP. Specifically, we design a context-aware prompt scheme that leverages the internal knowledge of the CLIP visual encoder to enrich the text prompt and align it with local visual features, thereby enhancing the text prompt. To better align the visual semantic space and the text semantic space, we design a class-aware feature selection module to dynamically adjust text and visual embeddings, making them better matched to camouflaged objects. Meanwhile, we introduce a semantic consistency loss to mitigate the semantic deviation between the text prompt and visual features, ensuring semantic consistency between the segmentation results and the text prompt. Finally, we design a text query decoder that precisely maps textual semantics to pixel-level segmentation results, thereby achieving semantically and spatially consistent decoding. Experimental results show that SuCLIP significantly outperforms the advanced method OVCoser on the OVCamo dataset.
TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion
Ziyang Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Ping Wei
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Shangqi Deng
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Haowen Tang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Jiapeng Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Huan Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract
Pedestrian trajectory prediction is crucial for many intelligent tasks. While existing methods predict future trajectories from fixed-frame historical observations, they are limited by the observational perspective and the need for extensive historical information, resulting in prediction delays and inflexible generalization in real-time systems. In this paper, we propose a novel task called Transferable Online Pedestrian Trajectory Prediction (TOTP), which synchronously predicts future trajectories with variable observations and enables effective task transfer under different observation constraints. To advance TOTP modeling, we propose a Temporal-Adaptive Mamba Latent Diffusion (TAMLD) model. It utilizes the Social-Implicit Mamba Synthesizer to extract motion states with social interaction and refine temporal representations through Temporal-Aware Distillation. A Trend-Conditional Mamba Decomposer generates the motion latent distribution of the future motion trends and predicts future motion trajectories through sampling decomposition. We utilize Motion-Latent Mamba Diffusion to reconstruct the latent space disturbed by imbalanced temporal noise. Our method achieves state-of-the-art results on multiple datasets and tasks, showcasing temporal adaptability and generalization ability.
Fast Globally Optimal and Geometrically Consistent 3D Shape Matching
Paul Roetzer
University of Bonn
Florian Bernard
University of Bonn
Abstract
Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic graphs, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results. Our code is publicly available at https://github.com/paul0noah/geco.
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Korea University
Hwanhee Jung
Korea University
Jong Wook Kim
Korea University
Seunggwan Lee
Korea University
Innfarn Yoo
CNAPS.AI Inc.
Andreas Lugmayr
Google
Seunggeun Chi
Purdue University
Karthik Ramani
Purdue University
Sangpil Kim
Korea University
Abstract
Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. Unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from single-view image features. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under monocular settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
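As a rough sketch of the kind of text-to-image cross-attention the abstract describes, the following module (names such as TextConditionedFusion are placeholders, not CATSplat's actual layers) lets per-pixel image tokens attend to text embeddings from a vision-language model and adds the result back residually.

```python
import torch
import torch.nn as nn

class TextConditionedFusion(nn.Module):
    """Illustrative cross-attention block: image tokens (queries) attend to
    text embeddings (keys/values). Names and sizes are placeholders."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (B, H*W, C), text_tokens: (B, T, C)
        fused, _ = self.attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + fused)     # residual + norm

B, HW, T, C = 2, 1024, 32, 256
out = TextConditionedFusion()(torch.randn(B, HW, C), torch.randn(B, T, C))
print(out.shape)  # torch.Size([2, 1024, 256])
```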
HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Sara Rojas
KAUST
Matthieu Armando
NAVER LABS Europe
Bernard Ghanem
KAUST
Philippe Weinzaepfel
NAVER LABS Europe
Vincent Leroy
NAVER LABS Europe
Grégory Rogez
NAVER LABS Europe
Abstract
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feedforward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Meng Lan
Hong Kong University of Science and Technology
Qian Zhang
Horizon Robotics
Lefei Zhang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings along with multimodal class tokens. A mask prior generator is devised to utilize the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts, along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we propose a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of the proposed modules. The code is available at https://github.com/rongfu-dsb/MPG-SAM2.
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
Ciyu Ruan
Shenzhen International Graduate School, Tsinghua University
Ruishan Guo
Shenzhen International Graduate School, Tsinghua University
Zihang Gong
Harbin Institute of Technology
Jingao Xu
Carnegie Mellon University
Wenhan Yang
Pengcheng Laboratory
Xinlei Chen
Shenzhen International Graduate School, Tsinghua University
Abstract
Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions. Code and dataset: https://github.com/softword-tt/PRE-Mamba.
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Danila Rukhovich
SnT, University of Luxembourg
Elona Dupont
SnT, University of Luxembourg
Dimitrios Mallis
SnT, University of Luxembourg
Kseniya Cherenkova
Artec3D, Luxembourg
Anis Kacem
SnT, University of Luxembourg
Djamila Aouada
SnT, University of Luxembourg
Abstract
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and training dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained on a procedurally generated dataset of one million CAD sequences. CAD-Recode significantly outperforms existing methods across the DeepCAD, Fusion360, and real-world CC3D datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
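A hypothetical flavour of "CAD as Python code": the snippet below defines toy sketch and extrude helpers in plain numpy purely to illustrate how an executable program can stand in for a sketch-extrude sequence; it is not the library or code format CAD-Recode actually emits.

```python
import numpy as np

def sketch_rectangle(w, h, n=40):
    """Hypothetical sketch primitive: sample a closed rectangular profile in the XY plane."""
    xs = np.linspace(-w / 2, w / 2, n)
    ys = np.linspace(-h / 2, h / 2, n)
    top = np.stack([xs, np.full(n, h / 2)], axis=1)
    bottom = np.stack([xs[::-1], np.full(n, -h / 2)], axis=1)
    right = np.stack([np.full(n, w / 2), ys[::-1]], axis=1)
    left = np.stack([np.full(n, -w / 2), ys], axis=1)
    return np.concatenate([top, right, bottom, left])

def extrude(profile_2d, depth, steps=20):
    """Hypothetical extrude operation: sweep the 2D profile along +Z to get surface samples."""
    zs = np.linspace(0.0, depth, steps)
    return np.concatenate([np.c_[profile_2d, np.full(len(profile_2d), z)] for z in zs])

# "CAD as code": executing this program reconstructs surface samples of a plate-like solid.
points = extrude(sketch_rectangle(40.0, 20.0), depth=5.0)
print(points.shape)
```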
DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
Fatemeh Saleh
Microsoft, Cambridge
Sadegh Aliakbarian
Microsoft, Cambridge
Charlie Hewitt
Microsoft, Cambridge
Lohit Petikam
Microsoft, Cambridge
Xiao-Xian
Microsoft, Cambridge
Antonio Criminisi
Microsoft, Cambridge
Thomas J. Cashman
Microsoft, Cambridge
Tadas Baltrušaitis
Microsoft, Cambridge
Abstract
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates the accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAVi
MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
Mohammadreza Salehi
VIS Lab, UvA
Shashanka Venkataramanan
Valeo.ai
Ioana Simion
VIS Lab, UvA
Efstratios Gavves
VIS Lab, UvA
Cees G. M. Snoek
VIS Lab, UvA
Yuki M Asano
Fundamental AI Lab, UTN
Abstract
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve the state of the art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: github.com/SMSD75/MoSiC
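The optimal-transport clustering step can be approximated with a standard Sinkhorn-Knopp normalization of feature-to-prototype scores, as in the generic sketch below (hyperparameters and shapes are illustrative, not MoSiC's).

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp normalization of a feature-to-prototype score matrix,
    producing approximately doubly-stochastic soft cluster assignments."""
    Q = torch.exp(scores / eps)          # (N features, K prototypes)
    Q = Q / Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K   # equal mass per prototype
        Q = Q / Q.sum(dim=1, keepdim=True) / N   # equal mass per feature
    return Q * N                          # rows sum to ~1: soft assignment per feature

feats = torch.nn.functional.normalize(torch.randn(1024, 256), dim=1)
protos = torch.nn.functional.normalize(torch.randn(64, 256), dim=1)
assign = sinkhorn(feats @ protos.T)
print(assign.shape, assign.sum(dim=1)[:3])   # each row sums to ~1
```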
Correspondence-Free Fast and Robust Spherical Point Pattern Registration
Anik Sarker
Dept. of Mechanical Engineering, Virginia Tech
Alan T. Asbeck
Dept. of Mechanical Engineering, Virginia Tech
Abstract
Current methods to estimate the rotation between two spherical (S²) patterns typically rely on maximizing their spherical cross-correlation. However, these approaches exhibit computational complexities greater than cubic, O(n³), with respect to rotation space discretization. We propose a rotation estimation algorithm between two spherical patterns with linear time complexity O(n). Unlike existing methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., the Wahba problem for 3D unit vectors). We introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the S² domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the 'Robust Vector Alignment Dataset.' Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images. In the PCR task, our approach successfully registers point clouds exhibiting overlap ratios as low as 65%. In spherical image alignment, we show that our method robustly estimates rotations even under challenging conditions involving substantial clutter (over 19%) and large rotational offsets. Our results highlight the effectiveness and robustness of our algorithms in realistic, complex scenarios. Our dataset and code are available at: https://github.com/ARLab-VT/Robust-VectorSet-Alignment
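For context, the classical correspondence-based Wahba problem has a closed-form SVD solution; the sketch below implements that baseline in numpy (the paper's SPMC/FRS algorithms address the harder correspondence-free, outlier-contaminated setting).

```python
import numpy as np

def wahba_svd(a, b, w=None):
    """Classic SVD solution to the Wahba problem: find R minimizing
    sum_i w_i * ||b_i - R a_i||^2 for corresponding unit vectors a_i, b_i."""
    if w is None:
        w = np.ones(len(a))
    B = (w[:, None] * b).T @ a               # 3x3 attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt   # proper rotation (det = +1)

# toy check: recover a known rotation from noisy unit vectors
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 3)); a /= np.linalg.norm(a, axis=1, keepdims=True)
angle = np.deg2rad(30)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
b = a @ R_true.T + 0.01 * rng.normal(size=a.shape)
b /= np.linalg.norm(b, axis=1, keepdims=True)
print(np.allclose(wahba_svd(a, b), R_true, atol=1e-2))
```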
Lidar Waveforms are Worth 40x128x33 Words
Dominik Scheuble
Mercedes-Benz AG
Hanno Holzhüter
MicroVision
Steven Peters
Torc Robotics
Mario Bijelic
Torc Robotics
Felix Heide
Torc Robotics
Abstract
Lidar has become crucial for autonomous driving, providing high-resolution 3D scans that are key for accurate scene understanding. To this end, lidar sensors measure the time-resolved full waveforms from the returning laser light, which a subsequent digital signal processor (DSP) converts to point clouds by identifying peaks in the waveform. Conventional automotive lidar DSPs process each waveform individually, ignoring potentially valuable context from neighboring waveforms. As a result, lidar point clouds are prone to artifacts from low signal-to-noise ratio (SNR) regions, highly reflective objects, and environmental conditions like fog. While leveraging neighboring waveforms is investigated extensively in transient imaging, applications remain limited to scientific or experimental hardware. In this work, we propose a learned DSP that directly processes full waveforms using a transformer architecture, leveraging features from adjacent waveforms to generate high-fidelity multi-echo point clouds. To assess our method, we capture data in real-world driving scenarios and a weather chamber with a conventional automotive lidar. Trained on synthetic and real data, the method improves Chamfer distance by 32cm and 20cm compared to conventional peak finding and existing transient imaging approaches, respectively. This translates to maximum range improvements of up to 17m in fog and 14m in nominal real-world conditions.
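The conventional per-waveform DSP baseline mentioned above amounts to peak finding on each time-resolved waveform; a minimal sketch with scipy, using an assumed 1 ns bin width and a synthetic two-echo waveform, is shown below.

```python
import numpy as np
from scipy.signal import find_peaks

# Conventional per-waveform baseline: detect echo peaks in a single
# time-resolved waveform and convert time bins to range.
C = 3e8        # speed of light [m/s]
DT = 1e-9      # time bin width [s] (assumed 1 ns sampling)

t = np.arange(600)
waveform = (0.9 * np.exp(-0.5 * ((t - 120) / 4) ** 2)      # first echo
            + 0.4 * np.exp(-0.5 * ((t - 350) / 6) ** 2)    # second (multi-echo) return
            + 0.05 * np.random.default_rng(0).normal(size=t.size))

peaks, props = find_peaks(waveform, height=0.2, distance=10)
ranges_m = peaks * DT * C / 2.0        # two-way travel time -> distance
print(list(zip(peaks, np.round(ranges_m, 2))))
```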
Prior2Former - Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation
Sebastian Schmidt
Technical University of Munich
Julius Körner
Technical University of Munich
Dominik Fuchsgruber
Technical University of Munich
Stefano Gasperini
Technical University of Munich
Federico Tombari
Technical University of Munich
Stephan Günnemann
Technical University of Munich
Abstract
In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects, enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, P2F demonstrates state-of-the-art performance across the board.
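As a minimal illustration of evidential uncertainty from a Beta prior (not the exact P2F head), per-pixel Beta parameters yield both a mask probability (the mean) and an uncertainty estimate (the variance):

```python
import numpy as np

# Per-pixel Beta(alpha, beta) over a binary mask assignment. High variance marks
# pixels with weak or conflicting evidence, usable for flagging novel/OOD regions.
alpha = np.array([[9.0, 1.2], [1.1, 4.0]])   # pseudo-counts for "in mask"
beta  = np.array([[1.0, 1.1], [1.0, 4.0]])   # pseudo-counts for "not in mask"

mean = alpha / (alpha + beta)                                     # mask probability
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))   # uncertainty

print(np.round(mean, 3))
print(np.round(var, 3))   # largest where the pseudo-counts are small or balanced
```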
SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Liam Schoneveld
Woven by Toyota
Zhe Chen
Woven by Toyota
Davide Davoli
Toyota Motor Europe NV/SA
Jiapeng Tang
Technical University of Munich
Saimon Terazawa
Woven by Toyota
Ko Nishino
Kyoto University
Matthias Nießner
Technical University of Munich
Abstract
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming the state of the art in emotion classification.
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu
Google
Marta Tintore Gazulla
Google
Yongqin Xian
Google
Luc Van Gool
INSAIT, Sofia University, St. Kliment Ohridski
Federico Tombari
Google
Abstract
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
Etai Sella
Tel Aviv University
Noam Atia
Tel Aviv University
Ron Mokady
BRIA AI
Hadar Averbuch-Elor
Cornell University
Abstract
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to also encourage identity preservation within the locally edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and adherence to the textual description.
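A heavily simplified sketch of the blending idea, with a stand-in denoising step and an invented noise schedule: at each inference step the edited sample is kept inside the edit mask, while the rest is replaced by a re-noised copy of the original shape so the unedited region retains its identity.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
alphas = np.linspace(0.999, 0.95, T)             # toy noise schedule (illustrative)
abar = np.cumprod(alphas)

original = rng.normal(size=(2048, 3))            # original point coordinates
mask = (original[:, 0] > 0.5).astype(float)[:, None]   # 1 = editable region

x = rng.normal(size=original.shape)              # edited sample starts from noise
for t in reversed(range(T)):
    # stand-in for the model's denoising step (real code would call the diffusion model here)
    x = x - 0.05 * x
    # re-noise the original shape to the current noise level t
    orig_t = np.sqrt(abar[t]) * original + np.sqrt(1 - abar[t]) * rng.normal(size=original.shape)
    # blend: keep the sample inside the edit mask, the re-noised original outside it
    x = mask * x + (1 - mask) * orig_t

print(np.abs(x - original)[mask[:, 0] == 0].mean())   # unedited region stays close to original
```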
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Minkyun Seo
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Hyungtae Lim
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology
Kanghee Lee
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Luca Carlone
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology
Jaesik Park
Computer Science Engineering and Interdisciplinary Program of AI, Seoul National University
Abstract
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
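Farthest point sampling, used here to bypass learned keypoint detectors, can be written in a few lines of numpy:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy farthest point sampling: iteratively pick the point farthest
    from the already selected set. Returns indices of k samples."""
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    selected = [rng.integers(n)]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)          # distance to nearest selected point
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

pts = np.random.default_rng(1).uniform(size=(5000, 3))
idx = farthest_point_sampling(pts, 256)
print(idx.shape)  # (256,)
```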
MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
Yunzhe Shao
School of Software and BNRist, Tsinghua University
Xinyu Yi
School of Software and BNRist, Tsinghua University
Lu Yin
School of Informatics, Xiamen University
Shihui Guo
School of Informatics, Xiamen University
Junhai Yong
School of Software and BNRist, Tsinghua University
Feng Xu
School of Software and BNRist, Tsinghua University
Abstract
This paper proposes a novel method, named MagShield, designed to address the issue of magnetic disturbances in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Units (IMUs) are prone to orientation estimation errors in magnetically disturbed environments, limiting the practical application of inertial MoCap systems in real-world scenarios. To address this problem, MagShield employs a 'detect-then-correct' strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems. Code and dataset are available at https://github.com/YZ-Shiao/MagShield.
DM-EFS: Dynamically Multiplexed Expanded Features Set Form for Robust and Efficient Small Object Detection
Aashish Sharma
KLASS Engineering and Solutions
Abstract
In this paper, we address the problem of small object detection (SOD) by introducing our novel approach - Dynamically Multiplexed Expanded Features Set (DM-EFS) form. Detecting small objects is challenging as they usually suffer from inadequate feature representation. Hence, to address this, we propose the Expanded Features Set (EFS) form - a simple yet effective idea to improve the feature representation of small objects by utilizing the untapped higher resolution features from the shallower layers of the backbone module. We observe that the EFS form improves the SOD performance. However, due to processing of additional features, it has a higher computational cost which reduces inference efficiency. Hence, to address this, we propose Dynamic Feature Multiplexing (DFM) - a novel design that optimizes the usage of the EFS form during inference by dynamically multiplexing it to create our aforementioned DM-EFS form. Since our DM-EFS form is a multiplexed (or subsampled) optimal version of the EFS form, it improves the SOD performance like the EFS form but with a lower computational cost. Extensive experiments confirm the efficacy of our DM-EFS approach. Integrated with the YOLOv7 base model, our DM-EFS achieves state-of-the-art results on diverse SOD datasets outperforming the base model and SOD baselines, with on-par or even better inference efficiency.
GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
David G. Shatwell
Center for Research in Computer Vision, University of Central Florida
Ishan Rajendrakumar Dave
Adobe
Sirnam Swetha
Center for Research in Computer Vision, University of Central Florida
Mubarak Shah
Center for Research in Computer Vision, University of Central Florida
Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
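A small sketch of soft targets from cyclic time differences (hours on a 24-cycle, months on a 12-cycle), loosely following the toroidal idea; the normalization and Gaussian kernel are illustrative choices, not GT-Loc's exact objective.

```python
import numpy as np

def toroidal_soft_targets(hours, months, sigma=1.0):
    """Soft similarity targets from pairwise time differences on a torus:
    hours wrap at 24, months wrap at 12 (illustrative sketch)."""
    def cyclic_diff(x, period):
        d = np.abs(x[:, None] - x[None, :])
        return np.minimum(d, period - d)          # wrap-around distance
    dh = cyclic_diff(hours, 24.0) / 12.0          # normalize to [0, 1]
    dm = cyclic_diff(months, 12.0) / 6.0
    dist = np.sqrt(dh ** 2 + dm ** 2)             # distance on the torus
    sim = np.exp(-dist ** 2 / (2 * sigma ** 2))   # soft target in (0, 1]
    return sim / sim.sum(axis=1, keepdims=True)   # rows as target distributions

targets = toroidal_soft_targets(np.array([0., 1., 12., 23.]),
                                np.array([1., 1., 7., 12.]))
print(np.round(targets, 3))
```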
STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries
Tahira Shehzadi
DFKI
Khurram Azeem Hashmi
DFKI
Shalini Sarode
DFKI
Didier Stricker
DFKI
Muhammad Zeshan Afzal
DFKI
Abstract
This paper addresses key limitations in current Semi-Supervised Object Detection (SSOD) frameworks, focusing on issues related to pseudo-label quality, confidence bias, and inefficient query generation. Traditional methods, including CNN-based and DETR-based architectures, often face challenges such as noisy pseudo-labels, overfitting to common object categories, and consequently face difficulty detecting rare objects. Specifically, recent DETR-based SSOD approaches struggle with the one-to-many assignment strategy, which produces noisy pseudo-labels and overlapping predictions, resulting in suboptimal performance. To address these challenges, we propose STEP-DETR, a transformer-based SSOD framework. STEP-DETR introduces Super Teacher to generate higher-quality pseudo-labels and improve the student's learning process. Furthermore, STEP-DETR proposes Pseudo-Label Text Queries, which incorporate text embeddings from Super Teacher, balancing the student's confidence across common and rare categories, thereby mitigating confidence bias and enhancing generalization. Moreover, Denoising Text Guided Object Queries synthesizes query-label pairs for foreground and background using contrastive learning, enabling the model to better distinguish objects from background noise. To further boost performance and training efficiency, a Query Refinement Module is incorporated to filter out redundant denoising queries. On MS-COCO and Pascal VOC benchmarks, STEP-DETR outperforms state-of-the-art methods, demonstrating its effectiveness in improving semi-supervised object detection. Notably, with just 10% labeled data, it achieves 45.4 mAP, surpassing the baseline Semi-DETR by 1.9 mAP.
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
Yi-Ting Shen
University of Maryland, College Park
Sungmin Eum
DEVCOM Army Research Laboratory
Doheon Lee
University of Maryland, College Park
Rohit Shete
University of Maryland, College Park
Chiao-Yi Wang
University of Maryland, College Park
Heesung Kwon
DEVCOM Army Research Laboratory
Shuvra S. Bhattacharyya
University of Maryland, College Park
Abstract
Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
BlinkTrack: Feature Tracking over 80 FPS via Events and Images
Yichen Shen
State Key Lab of CAD&CG, Zhejiang University
Yijin Li
State Key Lab of CAD&CG, Zhejiang University
Shuo Chen
State Key Lab of CAD&CG, Zhejiang University
Guanglin Li
State Key Lab of CAD&CG, Zhejiang University
Zhaoyang Huang
Avolution AI
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Zhaopeng Cui
State Key Lab of CAD&CG, Zhejiang University
Guofeng Zhang
State Key Lab of CAD&CG, Zhejiang University
Abstract
Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking and effectively solves data association and fusion for asynchronous event and image data. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing methods, exceeding 80 FPS with multi-modality data and 100 FPS with preprocessed event data. Codes and dataset are available at https://github.com/ColieShen/BlinkTrack.
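For reference, the classical constant-velocity Kalman filter that BlinkTrack turns into a learnable, differentiable module looks like this in its plain form (matrices and noise levels below are toy values):

```python
import numpy as np

# Constant-velocity Kalman filter for a 2D feature track; state = [x, y, vx, vy].
dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)          # observe position only
Q, R = 1e-3 * np.eye(4), 1e-1 * np.eye(2)                   # toy noise covariances

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.5]), np.array([2.1, 1.0]), np.array([2.9, 1.6])]:
    x, P = F @ x, F @ P @ F.T + Q                            # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                           # Kalman gain
    x = x + K @ (z - H @ x)                                  # update with measurement
    P = (np.eye(4) - K @ H) @ P
print(np.round(x, 2))   # fused position + velocity estimate
```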
Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision
Tianma Shen
Santa Clara University
Aditya Puranik
Santa Clara University
James Vong
Santa Clara University
Vrushabh Deogirikar
Santa Clara University
Ryan Fell
Santa Clara University
Julianna Dietrich
Santa Clara University
Maria Kyrarini
Santa Clara University
Christopher Kitts
Santa Clara University
David C. Jeong
Santa Clara University
Abstract
Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's firstperson perspective. Although pose estimation techniques have been used to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We address this gap with Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. Further, we augment egocentric camera data with a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms state-of-the-art 3D HMR models. Code and data are available on our website.
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Johns Hopkins University, Baltimore, MD, USA
Bohan Liu
Johns Hopkins University, Baltimore, MD, USA
Chenjia Li
Johns Hopkins University, Baltimore, MD, USA
Lalithkumar Seenivasan
Johns Hopkins University, Baltimore, MD, USA
Mathias Unberath
Johns Hopkins University, Baltimore, MD, USA
Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where, given an implicit query, an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as 'just-in-time' because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different reasoning chain complexities. Experimental results demonstrate that our method performs best across all reasoning categories, suggesting that our just-in-time digital twin can bridge the gap between high-level reasoning and low-level perception in embodied AI. Benchmark is available at https://github.com/yiqings/jitbench/.
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
Hongyu Shen
Beijing Institute of Technology
Junfeng Ni
Tsinghua University
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI
Weishuo Li
State Key Laboratory of General Artificial Intelligence, BIGAI
Mingtao Pei
Beijing Institute of Technology
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI
Abstract
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
Yu Sheng
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Xinran Zhang
University of Science and Technology of China
Yu Zhang
University of Science and Technology of China
Bei Hua
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Jianmin Ji
University of Science and Technology of China
Abstract
A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer
Qingyu Shi
PKU
Jianzong Wu
PKU
Jinbin Bai
NUS
Jiangning Zhang
ZJU
Lu Qi
UC Merced
Yunhai Tong
PKU-Wuhan Institute for Artificial Intelligence
Xiangtai Li
NTU
Abstract
The motion transfer task aims to transfer motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within the 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
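The temporal-smoothing idea can be sketched with a simple box kernel applied along the frame axis of DiT-style tokens; the low-pass output approximates appearance while the residual carries the temporal variation tied to motion (kernel choice and shapes here are illustrative, not DeT's exact design):

```python
import torch
import torch.nn.functional as F

def temporal_smooth(feats, kernel_size=5):
    """Smooth tokens along the temporal axis with a box kernel.
    feats: (B, T, N, C) = batch, frames, spatial tokens, channels."""
    B, T, N, C = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(B * N * C, 1, T)     # 1D conv over time
    k = torch.ones(1, 1, kernel_size) / kernel_size
    x = F.conv1d(x, k, padding=kernel_size // 2)
    smoothed = x.reshape(B, N, C, T).permute(0, 3, 1, 2)
    residual = feats - smoothed        # high-frequency part, closely tied to motion
    return smoothed, residual

smooth, motion = temporal_smooth(torch.randn(1, 16, 64, 128))
print(smooth.shape, motion.shape)
```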
DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
Chen Shi
The Chinese University of Hong Kong, Shenzhen
Shaoshuai Shi
Didi Chuxing, China
Kehua Sheng
Didi Chuxing, China
Bo Zhang
Didi Chuxing, China
Li Jiang
The Chinese University of Hong Kong, Shenzhen
Abstract
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
Junyu Shi
The Hong Kong University of Science and Technology (Guangzhou)
Lijiang Liu
The Hong Kong University of Science and Technology (Guangzhou)
Yong Sun
The Hong Kong University of Science and Technology (Guangzhou)
Zhiyuan Zhang
The Hong Kong University of Science and Technology (Guangzhou)
Jinni Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Qiang Nie
The Hong Kong University of Science and Technology (Guangzhou)
Abstract
Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM3), a comprehensive framework designed to learn unified motion representations. GenM3 comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment the result with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM3 achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform
Guowei Shi
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Zian Mao
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Peisen Huang
UM-SJTU Joint Institute, Shanghai Jiao Tong University
Abstract
Ultra-precision estimation of 6DoF pose is essential in applications such as semiconductor manufacturing and nanoscale manipulation. Conventional vision-based techniques are often hampered by sensitivity to defocus and limited estimation accuracy. In this paper, we propose a novel two-dimensional interpolated Discrete Fourier Transform (2D-IpDFT) method for robust 6DoF pose estimation using periodic patterns. We further develop a mathematical framework that links image parameters (phase and frequency) to 6DoF pose, which is applicable to both orthographic and quasi-orthographic imaging systems. Extensive experiments on a low-cost setup, featuring an industrial camera and an etched checkerboard pattern, demonstrate translation estimation accuracy at the nanometer level and rotation estimation accuracy at the microradian level.
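A generic interpolated-DFT frequency estimator conveys the flavour of the approach: locate the FFT magnitude peak of the periodic pattern and refine it to sub-bin accuracy by parabolic interpolation (this is a textbook estimator, not necessarily the paper's exact 2D-IpDFT formulation):

```python
import numpy as np

def ipdft_peak_2d(img):
    """Estimate the dominant spatial frequency of a 2D periodic pattern with
    sub-bin accuracy: FFT peak plus parabolic interpolation of log-magnitude."""
    win = np.hanning(img.shape[0])[:, None] * np.hanning(img.shape[1])[None, :]
    S = np.abs(np.fft.rfft2((img - img.mean()) * win))
    S[0, 0] = 0.0
    ky, kx = np.unravel_index(np.argmax(S), S.shape)

    def refine(m_minus, m0, m_plus):
        l1, l2, l3 = np.log(m_minus + 1e-12), np.log(m0 + 1e-12), np.log(m_plus + 1e-12)
        return 0.5 * (l1 - l3) / (l1 - 2 * l2 + l3)   # sub-bin offset in (-0.5, 0.5)

    dy = refine(S[ky - 1, kx], S[ky, kx], S[ky + 1, kx])
    dx = refine(S[ky, kx - 1], S[ky, kx], S[ky, kx + 1])
    return (ky + dy) / img.shape[0], (kx + dx) / img.shape[1]   # cycles per pixel

y, x = np.mgrid[0:256, 0:256]
pattern = np.cos(2 * np.pi * (0.0743 * x + 0.0381 * y))
print(ipdft_peak_2d(pattern))   # approximately (0.0381, 0.0743)
```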
VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data
Jian Shi
KAUST
Peter Wonka
KAUST
Abstract
We present VoxelKP, a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data. The key challenge is that objects are distributed sparsely in 3D space, while human keypoint detection requires detailed local information wherever humans are present. First, we introduce a dual-branch fully sparse spatial-context block where the spatial branch focuses on learning the local spatial correlations between keypoints within each human instance, while the context branch aims to retain the global spatial information. Second, we use a spatially aware multi-scale BEV fusion technique to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view for better preservation of the global context of each human instance. We evaluate our method on the Waymo dataset and achieve an improvement of 27% on the MPJPE metric compared to the state-of-the-art, HUM3DIL, trained on the same data, and 12% against the state-of-the-art, GC-KPL, pretrained on a 25x larger dataset. To the best of our knowledge, VoxelKP is the first single-staged, fully sparse network that is specifically designed for addressing the challenging task of 3D keypoint estimation from LiDAR data, achieving state-of-the-art performance. Our code is available at https://github.com/shijianjian/VoxelKP.
Simultaneous Motion And Noise Estimation with Event Cameras
Shintaro Shiba
Keio University
Yoshimitsu Aoki
Keio University
Guillermo Gallego
Technische Universität Berlin
Abstract
Event cameras are emerging vision sensors whose noise is challenging to characterize. Existing denoising methods for event cameras are often designed in isolation and thus consider other tasks, such as motion estimation, separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the one-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while demonstrating effectiveness across motion estimation and intensity reconstruction tasks. Our approach advances event-data denoising theory and expands practical denoising use-cases via open-source code. Project page: https://github.com/tub-rip/ESMD
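The Contrast Maximization building block referenced above can be sketched in a few lines: warp events by a candidate velocity, accumulate an image of warped events, and score its variance; the true motion yields the sharpest (highest-contrast) image. The event data and grid search below are synthetic and purely illustrative.

```python
import numpy as np

def contrast(events, velocity, img_size=(64, 64)):
    """Contrast Maximization objective: warp events (x, y, t) by a candidate
    flow, accumulate an image of warped events, and score its variance."""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    xw = np.clip(np.round(x - velocity[0] * t), 0, img_size[1] - 1).astype(int)
    yw = np.clip(np.round(y - velocity[1] * t), 0, img_size[0] - 1).astype(int)
    iwe = np.zeros(img_size)
    np.add.at(iwe, (yw, xw), 1.0)        # image of warped events
    return iwe.var()

# synthetic events: an edge moving at (vx, vy) = (20, 0) px/s
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 2000)
y0 = rng.uniform(0, 64, 2000)
events = np.stack([10 + 20 * t, y0, t], axis=1)

candidates = [(v, 0.0) for v in np.linspace(0, 40, 21)]
best = max(candidates, key=lambda v: contrast(events, v))
print(best)   # ~ (20.0, 0.0): the true motion maximizes contrast
```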
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
Kota Shimomura
Chubu University
Masaki Nambata
Elith Inc.
Atsuya Ishikawa
Honda R&D Co., Ltd.
Ryota Mimura
Honda R&D Co., Ltd.
Koki Inoue
Elith Inc.
Takayoshi Yamashita
Honda R&D Co., Ltd.
Takayuki Kawabuchi
Honda R&D Co., Ltd.
Abstract
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructure is designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce a baseline approach (the OD-RASE model), which leverages the LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection
Sangyun Shin
Department of Computer Science, University of Oxford
Yuhang He
Microsoft Research
Xinyu Hou
Department of Computer Science, University of Oxford
Samuel Hodgson
Department of Computer Science, University of Oxford
Andrew Markham
Department of Computer Science, University of Oxford
Niki Trigoni
Department of Computer Science, University of Oxford
Abstract
The robustness of 3D object detection in large-scale outdoor point clouds degrades significantly when deployed in an unseen environment due to domain shifts. To minimize the domain gap, existing works on domain adaptive detection focus on several factors, including point density, object shape and sizes, to reduce false negative detections. However, the adaptation results indicate that there are still remaining challenges. We argue that this is due to the difficulty of recognizing comparably less distinctive regions on object surfaces caused by sparsity, occlusion, etc. In this work, we aim to reinforce those features by generating points on the object surface to make them straightforwardly recognizable. We draw our motivation from a common observation that detection proposals already contain accurate bounding boxes, but with relatively low objectness score predictions, which lead to false negatives. Given these box proposals, we densify sparse object points with a diffusion approach. As a result, our model DiffRefine can act as a simple additional module before second-stage refinement, which most existing two-stage detection models can use. Experimental results on domain adaptive detection show competitive performance across various detection architectures, especially on points that vanish due to distance.
Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images
Changha Shin
Yonsei University
Woong Oh Cho
Yonsei University
Seon Joo Kim
Yonsei University
Abstract
360° visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360° images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings, even from imperfect images, and outperforms existing 360° rendering models.
AnimalClue: Recognizing Animals by their Traces
Risa Shinoda
The University of Osaka
Nakamasa Inoue
Institute of Science Tokyo
Iro Laina
Visual Geometry Group, University of Oxford
Christian Rupprecht
Visual Geometry Group, University of Oxford
Hirokatsu Kataoka
National Institute of Advanced Industrial Science and Technology (AIST)
Abstract
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/
Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency
Yejun Shou
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Haocheng Wang
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Lingfeng Shen
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Qian Zheng
College of Computer Science and Technology, Zhejiang University
Gang Pan
College of Computer Science and Technology, Zhejiang University
Yanlong Cao
The State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University
Abstract
Point cloud registration is a fundamental task in 3D vision, playing a crucial role in various fields. With the rapid advancement of RGB-D sensors, unsupervised point cloud registration methods based on RGB-D sequences have demonstrated excellent performance. However, existing methods struggle in scenes with low overlap and photometric inconsistency. Low overlap results in numerous correspondence outliers, while photometric inconsistency hinders the model's ability to extract discriminative features. To address these challenges, we first propose the Overlapping Constraint for Inliers Detection (OCID) module, which filters and optimizes the initial correspondence set using an overlapping constraint. This module robustly selects reliable correspondences within the overlapping region while maintaining a balance between accuracy and efficiency. Additionally, we introduce a novel scene representation, 3DGS, which integrates both geometric and texture information, making it particularly well-suited for RGB-D registration tasks. Building on this, we propose the Gaussian Rendering for Photometric Adaptation (GRPA) module, which refines the geometric transformation and enhances the model's adaptability to scenes with inconsistent photometric information. Extensive experiments on ScanNet and ScanNet1500 demonstrate that our method achieves state-of-the-art performance. The code will be released at OG-UPCR.
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
Xincheng Shuai
Fudan University
Henghui Ding
Fudan University
Zhenyuan Qin
Fudan University
Hao Luo
DAMO Academy, Alibaba group
Xingjun Ma
Fudan University
Dacheng Tao
Nanyang Technological University
Abstract
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and the camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.
You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Hao Si
The University of Tokyo
Ehsan Javanmardi
The University of Tokyo
Manabu Tsukada
The University of Tokyo
Abstract
Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle must store models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
Carter Sifferman
University of Wisconsin-Madison
Yiquan Li
University of Wisconsin-Madison
Yiming Li
University of Wisconsin-Madison
Fangzhou Mu
University of Wisconsin-Madison
Michael Gleicher
University of Wisconsin-Madison
Mohit Gupta
University of Wisconsin-Madison
Yin Li
University of Wisconsin-Madison
Abstract
We aim to recover the geometry of 3D parametric scenes using very few depth measurements from low-cost, commercially available time-of-flight sensors. These sensors offer very low spatial resolution (i.e., a single pixel), but image a wide field-of-view per pixel and capture detailed time-of-flight data in the form of time-resolved photon counts. This time-of-flight data encodes rich scene information and thus enables recovery of simple scenes from sparse measurements. We investigate the feasibility of using a distributed set of few measurements (e.g., as few as 15 pixels) to recover the geometry of simple parametric scenes with a strong prior, such as estimating the 6D pose of a known object. To achieve this, we design a method that utilizes both feed-forward prediction to infer scene parameters, and differentiable rendering within an analysis-by-synthesis framework to refine the scene parameter estimate. We develop hardware prototypes and demonstrate that our method effectively recovers object pose given an untextured 3D model in both simulations and controlled real-world captures, and show promising initial results for other parametric scenes. We additionally conduct experiments to explore the limits and capabilities of our imaging solution. Our project webpage is available at cpsiff.github.io/recovering parametric scenes
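The analysis-by-synthesis step above can be illustrated with a minimal, self-contained sketch (not the authors' pipeline): a toy differentiable transient renderer for a single plane whose depth is refined by gradient descent to match a measured histogram. The Gaussian pulse model, bin width, and all constants are illustrative assumptions.

```python
# Toy analysis-by-synthesis refinement for a single scene parameter (plane depth),
# illustrating only the "differentiable rendering + gradient refinement" idea.
# The Gaussian-pulse transient model and all constants are assumptions.
import torch

C = 3e8          # speed of light (m/s)
BIN = 1e-10      # 100 ps time bins (assumed)
N_BINS = 128
SIGMA = 3e-10    # assumed pulse width (s)

def render_transient(depth):
    """Differentiable toy renderer: photon counts vs. time for a plane at `depth`."""
    t = torch.arange(N_BINS) * BIN
    t_return = 2.0 * depth / C               # round-trip time
    return torch.exp(-0.5 * ((t - t_return) / SIGMA) ** 2)

# "Measured" histogram from an unknown depth, plus a coarse feed-forward guess.
with torch.no_grad():
    measured = render_transient(torch.tensor(1.30))
depth = torch.tensor(1.10, requires_grad=True)

opt = torch.optim.Adam([depth], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render_transient(depth), measured)
    loss.backward()
    opt.step()

print(f"refined depth: {depth.item():.3f} m")   # converges toward 1.30
```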
Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation
Andrea Simonelli
Meta Reality Labs Zürich
Norman Müller
Meta Reality Labs Zürich
Peter Kontschieder
Meta Reality Labs Zürich
Abstract
The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet [3], ScanNet++ [35], S3DIS [1], and KITTI-360 [17], and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting [12]. The project page is available here: https://simonelli-andrea.github.io/easy3d.
MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations
Jan Skvrna
Czech Technical University in Prague
Lukas Neumann
Czech Technical University in Prague
Abstract
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve accuracy comparable to using human labels from a single dataset. The source code and model are available at https://github.com/jskvrna/MonoSOWA.
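One common way to compensate for focal length differences when aggregating datasets is to rescale metric depth to a canonical focal length, so that equal apparent (pixel) size corresponds to equal canonical depth. The sketch below is a generic normalization under that assumption, not necessarily the paper's exact scheme; `to_canonical_depth` and `CANONICAL_FOCAL` are hypothetical names and values.

```python
import numpy as np

CANONICAL_FOCAL = 700.0  # assumed reference focal length in pixels

def to_canonical_depth(depth_m: np.ndarray, focal_px: float) -> np.ndarray:
    """Rescale metric depths so that data captured with different focal lengths
    becomes geometrically comparable (same pixel size <-> same canonical depth).
    Generic normalization for illustration, not necessarily the paper's scheme."""
    return depth_m * (CANONICAL_FOCAL / focal_px)

# Example: an object at 20 m appears larger through a 1200 px lens than a 700 px one;
# mapping both to the canonical focal length aligns their apparent scales.
print(to_canonical_depth(np.array([20.0]), focal_px=1200.0))  # ~11.7 m canonical
```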
DMesh++: An Efficient Differentiable Mesh for Complex Shapes
Sanghyun Son
University of Maryland
Matheus Gadelha
Adobe Research
Yang Zhou
Adobe Research
Matthew Fisher
Adobe Research
Zexiang Xu
Adobe Research
Yi-Ling Qiao
University of Maryland
Ming C. Lin
University of Maryland
Yi Zhou
Adobe Research
Abstract
Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method that addresses this challenge and efficiently handles meshes with intricate structures. Our method reduces time complexity from O(N) to O(log N) and requires significantly less memory than previous approaches. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images.
MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
Eunjin Son
Jeonbuk National University
HyungGi Jo
Jeonbuk National University
Wookyong Kwon
Electronics and Telecommunications Research Institute (ETRI)
Sang Jun Lee
Jeonbuk National University
Abstract
Omnidirectional stereo matching (OSM) estimates 360° depth by performing stereo matching on multi-view fisheye images. Existing methods assume a unimodal depth distribution, matching each pixel to a single object. However, this assumption constrains the sampling range, causing oversmoothed depth artifacts, especially at object boundaries. To address these limitations, we propose MDP-Omni, a novel OSM network that leverages parameter-free multimodal depth priors. Specifically, we design a sampling strategy that adaptively adjusts the sampling range based on a multimodal probability distribution, without introducing any additional parameters. Furthermore, we present the azimuth-based multi-view volume fusion module to build a single cost volume. It mitigates false matches caused by occlusions in warped multi-view volumes. Experimental results demonstrate that MDP-Omni significantly improves existing methods, particularly in capturing fine details.
CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
Rui Song
Fraunhofer IVI
Chenwei Liang
Fraunhofer IVI
Yan Xia
USTC
Walter Zimmer
TU Munich
Hu Cao
TU Munich
Holger Caesar
TU Delft
Andreas Festag
TH Ingolstadt
Alois Knoll
TU Munich
Abstract
Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.
OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
Yeon-Ji Song
Seoul National University
Jaein Kim
Seoul National University
Suhyung Choi
Seoul National University
Jin-Hwa Kim
NAVER AI Lab
Byoung-Tak Zhang
Seoul National University
Abstract
Human perception involves decomposing complex multi-object scenes into time-static object appearance (i.e., size, shape, color) and time-varying object motion (i.e., position, velocity, acceleration). For machines to achieve human-like intelligence in real-world interactions, understanding these physical properties of objects is essential, forming the foundation for dynamic video prediction. While recent advancements in object-centric transformers have demonstrated potential in video prediction, they primarily focus on object appearance, often overlooking motion dynamics, which is crucial for modeling dynamic interactions and maintaining temporal consistency in complex environments. To address these limitations, we propose OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions, serving as an additional attribute beyond conventional appearance features to model dynamic scenes. The Object Kinematics are integrated into various OCK mechanisms, enabling spatiotemporal prediction of complex object interactions over long video sequences. Our model demonstrates superior performance in handling complex scenes with intricate object attributes and motions, highlighting its potential applicability to vision-related dynamics learning tasks.
A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
Hang Su
ShanghaiTech University
Yunlong Feng
ShanghaiTech University
Daniel Gehrig
University of Pennsylvania
Panfeng Jiang
ShanghaiTech University
Ling Gao
Amap, Alibaba Group
Xavier Lagorce
ShanghaiTech University
Laurent Kneip
Shanghai Engineering Research Center of Intelligent Vision and Imaging
Abstract
Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the 5-point and 8-point algorithms. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views, each representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
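The computational core shared by linear n-point-style solvers is stacking linear constraints into a homogeneous system and reading the solution off the right singular vector with smallest singular value. The sketch below shows only that generic SVD recipe on synthetic constraint rows; the `solve_homogeneous` helper and the synthetic construction are illustrative stand-ins, not the paper's incidence relation.

```python
import numpy as np

def solve_homogeneous(A: np.ndarray) -> np.ndarray:
    """Return the unit vector x minimizing ||A x||: the SVD null-space recipe
    used by linear n-point style solvers (last right singular vector of A)."""
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

# Synthetic check: build constraint rows that all annihilate a known ground-truth
# vector (standing in for stacked incidence constraints from timestamped tracks).
rng = np.random.default_rng(0)
x_true = rng.standard_normal(6)
x_true /= np.linalg.norm(x_true)
B = rng.standard_normal((40, 6))
A = B - np.outer(B @ x_true, x_true)   # project rows orthogonal to x_true

x_est = solve_homogeneous(A)
if x_est @ x_true < 0:                 # solution defined up to sign
    x_est = -x_est
print(np.allclose(x_est, x_true, atol=1e-6))   # True
```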
Dense Policy: Bidirectional Autoregressive Learning of Actions
Yue Su
Shanghai Jiao Tong University
Xinyu Zhan
Shanghai Jiao Tong University
Hongjie Fang
Shanghai Jiao Tong University
Han Xue
Shanghai Jiao Tong University
Hao-Shu Fang
Shanghai Jiao Tong University
Yong-Lu Li
Shanghai Jiao Tong University
Cewu Lu
Shanghai Jiao Tong University
Lixin Yang
Shanghai Jiao Tong University
Abstract
Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our model, data, and code are available at: https://selen-suyue.github.io/DspNet/.
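The coarse-to-fine, logarithmic-time unfolding described above can be sketched as iteratively doubling the action sequence by interpolation followed by a lightweight refinement, so a horizon T needs about log2(T) rounds. The tiny MLP refiner, shapes, and class name below are assumptions for illustration, not the released model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDensePolicy(nn.Module):
    """Coarse-to-fine action decoding sketch: start from a single action token and
    double the sequence length each round, so horizon T needs ~log2(T) rounds."""
    def __init__(self, action_dim=7, hidden=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, first_action: torch.Tensor, horizon: int) -> torch.Tensor:
        seq = first_action.unsqueeze(1)                 # (B, 1, action_dim)
        for _ in range(math.ceil(math.log2(horizon))):
            # Double the temporal resolution by linear interpolation ...
            seq = F.interpolate(seq.transpose(1, 2), scale_factor=2,
                                mode="linear", align_corners=False).transpose(1, 2)
            # ... then refine every coarse action with a shared lightweight head.
            seq = seq + self.refine(seq)
        return seq[:, :horizon]                         # (B, horizon, action_dim)

policy = ToyDensePolicy()
actions = policy(torch.zeros(2, 7), horizon=16)
print(actions.shape)   # torch.Size([2, 16, 7])
```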
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Haisheng Su
Shanghai Jiao Tong University
Junjie Zhang
Xi'an Jiaotong University
Feixiang Song
SenseAuto Research
Sanping Zhou
Xi'an Jiaotong University
Wei Wu
SenseAuto Research
Junchi Yan
Shanghai Jiao Tong University
Nanning Zheng
Xi'an Jiaotong University
Abstract
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory, such as depth discontinuity at object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes
Mai Su
Peking University
Zhongtao Wang
Peking University
Huishan Au
Peking University
Yilong Li
Peking University
Xizhe Cao
Peking University
Chengwei Pan
Institute of Artificial Intelligence, BUAA
Yisong Chen
Peking University
Guoping Wang
Peking University
Abstract
3DGS is an emerging and increasingly popular technology in the field of novel view synthesis. Its highly realistic rendering quality and real-time rendering capabilities make it promising for various applications. However, when applied to large-scale aerial urban scenes, 3DGS methods suffer from issues such as excessive memory consumption, slow training times, prolonged partitioning processes, and significant degradation in rendering quality due to the increased data volume. To tackle these challenges, we introduce HUG, a novel approach that enhances data partitioning and reconstruction quality by leveraging a hierarchical neural Gaussian representation. We first propose a visibility-based data partitioning method that is simple yet highly efficient, significantly outperforming existing methods in speed. Then, we introduce a novel hierarchical weighted training approach, combined with other optimization strategies, to substantially improve reconstruction quality. Our method achieves state-of-the-art results on one synthetic dataset and four real-world datasets.
OVA-Fields: Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
Heng Su
Chongqing University
Mengying Xie
Chongqing University
Nieqing Cao
Xi'an Jiaotong-Liverpool University
Yan Ding
Shanghai AI Lab
Beichen Shao
Chongqing University
Xianlei Long
Chongqing University
Fuqiang Gu
Chongqing University
Chao Chen
Chongqing University
Abstract
In recent years, affordance detection has become essential for robotic manipulation in real-world scenes, where robots must autonomously interpret commands and perform actions. Current methods often focus on individual point cloud objects or simple semantic queries, limiting their effectiveness in diverse scenes and complex instructions. To address this, we introduce OVA-Fields, a framework for affordance detection in 3D scenes with complex semantics. By integrating multilevel geometric encoding and enhanced semantic affordance embeddings, OVA-Fields maps user commands directly to operational parts, embedding enriched affordance information into the 3D scene. Experimental results demonstrate that OVA-Fields achieves 52.4% mIoU on complex semantic real-world scenes and a 90% success rate in real-world robot manipulation tasks (e.g., 'take out some food from the refrigerator') using RGB-D sensing. Our approach enables the precise identification of operational parts, transforming natural language queries into targeted manipulations in real-world environments. Our codes are available at: https://github.com/vlasu19/OVA-Fields
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
Edgar Sucar
Visual Geometry Group (VGG), University of Oxford
Zihang Lai
Visual Geometry Group (VGG), University of Oxford
Eldar Insafutdinov
Visual Geometry Group (VGG), University of Oxford
Andrea Vedaldi
Visual Geometry Group (VGG), University of Oxford
Abstract
DUSt3R has recently demonstrated that many tasks in multiview geometry, including estimating camera intrinsics and extrinsics, reconstructing 3D scenes, and establishing image correspondences, can be reduced to predicting a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. While this formulation is elegant and powerful, it is limited to static scenes. To overcome this limitation, we introduce the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key insight is that, when time is introduced, several possible spatial and temporal references can be used to define the point maps. We identify a minimal subset of these combinations that can be regressed by a network to solve the aforementioned tasks. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks, including video depth prediction, dynamic point cloud reconstruction, 3D scene flow, and object pose tracking, achieving state-of-the-art performance. Additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-pointmaps/.
SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
Gencer Sumbul
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Chang Xu
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Emanuele Dalsasso
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Devis Tuia
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Abstract
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.
ARMO: Autoregressive Rigging for Multi-Category Objects
Mingze Sun
Tsinghua Shenzhen International Graduate School
Shiwei Mao
Tsinghua Shenzhen International Graduate School
Keyi Chen
Tsinghua Shenzhen International Graduate School
Yurun Chen
Tsinghua Shenzhen International Graduate School
Shunlin Lu
The Chinese University of Hong Kong, Shenzhen
Jingbo Wang
Shanghai AI Laboratory
Junting Dong
Shanghai AI Laboratory
Ruqi Huang
Tsinghua Shenzhen International Graduate School
Abstract
Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potential dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made available at https://armo-omnirig.github.io/.
AnnofreeOD: Detecting All Classes at Low Frame Rates Without Human Annotations
Boyi Sun
Institute of Automation, Chinese Academy of Sciences
Yuhang Liu
Institute of Automation, Chinese Academy of Sciences
Houxin He
Institute of Automation, Chinese Academy of Sciences
Yonglin Tian
Institute of Automation, Chinese Academy of Sciences
Fei-Yue Wang
Institute of Automation, Chinese Academy of Sciences
Abstract
Manual annotation of 3D bounding boxes in large-scale 3D scenes is expensive and time-consuming. This motivates the exploration of annotation-free 3D object detection using unlabeled point cloud data. Existing unsupervised 3D detection frameworks predominantly identify moving objects via scene flow, which has significant limitations: (1) limited detection classes (≤3), (2) difficulty in detecting stationary objects, and (3) reliance on high frame rates. To address these limitations, we propose AnnofreeOD, a novel Annotation-free Object Detection framework based on 2D-to-3D knowledge distillation. First, we explore an effective strategy to generate high-quality pseudo boxes using single-frame 2D knowledge. Second, we observe the noise from the previous step and introduce Noise-Resistant Regression (NRR) based on Box Augmentation (BA). AnnofreeOD achieves state-of-the-art performance across multiple experiments. On the nuScenes dataset, we established the first annotation-free 10-class object detection baseline, achieving 40% of fully supervised performance. Furthermore, in 3-class and class-agnostic object detection tasks, our approach surpasses prior state-of-the-art methods by +9.3% mAP (+12.2% NDS) and +6.0% AP (+4.1% NDS), significantly improving precision. Our codes will be released at https://github.com/sbysbysbys/AnnofreeAD.
Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations
Jianhua Sun
School of Artificial Intelligence, Shanghai Jiao Tong University
Yuxuan Li
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Jiude Wei
School of Artificial Intelligence, Shanghai Jiao Tong University
Longfei Xu
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Nange Wang
School of Artificial Intelligence, Shanghai Jiao Tong University
Yining Zhang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Cewu Lu
School of Artificial Intelligence, Shanghai Jiao Tong University
Abstract
The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose the Articulated Object Procedural Generation toolbox, a.k.a. the Arti-PG toolbox. The Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects' point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 categories of articulated objects and provides annotations across a wide range of both vision and manipulation tasks, and we provide exhaustive experiments which fully demonstrate its advantages. Our code is released at https://github.com/Analytic-Concept-Group/ArtiPG.
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Zeyi Sun
Shanghai Jiaotong University
Tong Wu
Stanford University
Pan Zhang
Shanghai Artificial Intelligence Laboratory
Yuhang Zang
Shanghai Artificial Intelligence Laboratory
Xiaoyi Dong
Shanghai Artificial Intelligence Laboratory
Yuanjun Xiong
Adobe
Dahua Lin
Shanghai Artificial Intelligence Laboratory
Jiaqi Wang
Shanghai Artificial Intelligence Laboratory
Abstract
Recent years have witnessed remarkable progress in multiview diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates filtered multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MVLLaVA for filtering data and rewriting inaccurate captions. Leveraging this pipeline, we have generated large-scale synthetic multi-view images with dense descriptive captions. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and view consistency.
CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Lin Sun
Tianjin University
Jiale Cao
Tianjin University
Jin Xie
Chongqing University
Xiaoheng Jiang
Zhengzhou University
Yanwei Pang
Tianjin University
Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on image-level tasks, motivating research on adapting CLIP for open-vocabulary semantic segmentation without training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map based on a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion and a fine-grained compensation. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on eight segmentation datasets. Our CLIPer achieves state-of-the-art performance on these datasets. With ViT-L and sliding-window inference, CLIPer achieves mIoU of 72.2% and 44.7% on VOC and Object, outperforming ProxyCLIP by 11.6% and 5.5%. Our code is available at https://github.com/linsun449/cliper.code.
Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts
Yanguang Sun
Nanjing University of Science and Technology
Jiawei Lian
Nanjing University of Science and Technology
Jian Yang
Nankai University
Lei Luo
Nanjing University of Science and Technology
Abstract
Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through full-parameter fine-tuning, the enormous number of updated parameters often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output the expert priors required for subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact with and restructure frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach, demonstrating excellent segmentation performance compared to 31 state-of-the-art (SOTA) methods and adaptability to multiple binary object segmentation tasks.
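The idea of mixing heterogeneous local priors with a gating network can be sketched as a few convolutional "experts" with different kernel sizes whose outputs are combined by a softmax gate. The kernel sizes, pooling, gate architecture, and class name below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyLocalPriorMoE(nn.Module):
    """Heterogeneous conv 'experts' produce local priors; a gating network predicts
    per-image weights and the priors are mixed accordingly (illustrative only)."""
    def __init__(self, channels=32, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.gate(x)                                    # (B, E)
        priors = torch.stack([e(x) for e in self.experts], 1)     # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * priors).sum(1)  # (B, C, H, W)

moe = ToyLocalPriorMoE()
out = moe(torch.randn(2, 32, 24, 24))
print(out.shape)   # torch.Size([2, 32, 24, 24])
```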
Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection
Jiachen Sun
Xidian University
De Cheng
Xidian University
Xi Yang
Xidian University
Nannan Wang
Xidian University
Abstract
Domain incremental object detection in remote sensing addresses the challenge of adapting to continuously emerging domains with distinct characteristics. Unlike natural images, remote sensing data vary significantly due to differences in sensors, altitudes, and geographic locations, leading to data distribution shifts and feature misalignments. These challenges make it difficult for models to generalize across domains while retaining knowledge from previous tasks, requiring effective adaptation strategies to mitigate catastrophic forgetting. To address these challenges, we propose the Dual Domain Control via Active Learning (Active-DDC) method, which integrates active learning strategies to handle data distribution and model feature shifts. The first component, the Data-based Active Learning Example Replay (ALER) module, combines a high-information sample selection strategy from active learning with the characteristic extreme foreground-background ratio in remote sensing images, enabling the selection of highly representative samples for storage in a memory bank. The second component, the Query-based Active Domain Shift Control (ADSC) module, leverages the query vector, a key element for DETR-based detectors, to implement query active preselection and optimal transport matching, thus facilitating effective cross-domain knowledge transfer. Our method achieves optimal performance in domain incremental tasks across four remote sensing datasets, and ablation studies further validate the effectiveness of both components.
EVDM: Event-based Real-world Video Deblurring with Mamba
Zhijing Sun
University of Science and Technology of China
Senyan Xu
University of Science and Technology of China
Kean Liu
University of Science and Technology of China
Runze Tian
University of Science and Technology of China
Xueyang Fu
University of Science and Technology of China
Zheng-Jun Zha
University of Science and Technology of China
Abstract
Existing event-based video deblurring methods face limitations in extracting and fusing long-range spatiotemporal motion information from events, primarily due to restricted receptive fields or low computational efficiency, resulting in suboptimal deblurring performance. To address these issues, we introduce the state space model, which leverages linear complexity and global receptive fields for long-range modeling, and propose EVDM, a novel Event-based Video Deblurring framework with Mamba. The framework consists of: (1) Motion Clue Extraction Mamba (MCEM), which employs an event self-reconstruction loss to ensure the completeness of details when extracting long-range motion information. (2) Motion-aware Intra-frame Fusion Mamba (MIFM) and Inter-frame Temporal Propagation Mamba (ITPM), which utilize the motion-aware state space to perform cross-modal fusion and inter-frame information exchange guided by motion clues. Consequently, EVDM achieves superior detail restoration in blurred regions while ensuring temporal motion consistency across frames. Additionally, to overcome the limitation of fixed exposure ratios in existing event-frame paired datasets, we introduce T-RED, a high-quality, high-resolution dataset with varying exposure time ratios. T-RED provides more realistic and complex data for event-based video deblurring research. Experiments on multiple datasets demonstrate that EVDM outperforms previous SOTA methods.
Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
Hongyang Sun
Zhejiang University
Qinglin Yang
Zhejiang University
Jiawei Wang
UESTC
Zhen Xu
Zhejiang University
Chen Liu
Li Auto Inc.
Yida Wang
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
Recent advances in differentiable rendering have significantly improved dynamic street scene reconstruction. However, the complexity of large-scale scenarios and dynamic elements, such as vehicles and pedestrians, remains a substantial challenge. Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, sub-scenes level, and primitive level, using Unified Gaussian Primitive (UGP) defined in 4D space as the representation. The root level serves as the entry point to the hierarchy. At the sub-scenes level, the scene is spatially divided into multiple sub-scenes, with various elements extracted. At the primitive level, each element is modeled with UGPs, and its global pose is controlled by a motion prior related to time. This hierarchical design greatly enhances the model's capacity, enabling it to model large-scale scenes. Additionally, our UGP allows for the reconstruction of both rigid and non-rigid dynamics. We conducted experiments on Dynamic City, our proprietary large-scale dynamic street scene dataset, as well as the public Waymo dataset. Experimental results demonstrate that our method achieves state-of-the-art performance. We plan to release the accompanying code and the Dynamic City dataset as open resources to further research within the community.
Low-Light Image Enhancement Using Event-Based Illumination Estimation
Lei Sun
INSAIT, Sofia University 'St. Kliment Ohridski'
Yuhan Bao
Zhejiang University
Jiajun Zhai
Zhejiang University
Jingyun Liang
Alibaba Group
Yulun Zhang
Shanghai Jiao Tong University
Kaiwei Wang
Zhejiang University
Danda Pani Paudel
INSAIT, Sofia University 'St. Kliment Ohridski'
Luc Van Gool
INSAIT, Sofia University 'St. Kliment Ohridski'
Abstract
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., 'motion events' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using 'temporal-mapping' events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated for realistic training data synthesis. To address the lack of datasets under this regime, we construct a beamsplitter setup and collect the EvLowLight dataset that includes images, temporal-mapping events, and motion events. Experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RETINEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frames per second on a 640 x 480 image. Codes and datasets: https://github.com/AHupuJR/RetinEV.
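The timestamp-to-brightness conversion at the heart of this idea can be illustrated under an assumed linear transmittance ramp: a pixel fires an event once the ramped transmittance times its radiance reaches a threshold, so inverting the ramp at the event time yields a relative brightness. The ramp model, constants, and the `timestamps_to_brightness` helper below are assumptions, not the paper's calibrated model.

```python
import numpy as np

def timestamps_to_brightness(t_event: np.ndarray, ramp_rate: float = 1.0,
                             threshold: float = 0.1) -> np.ndarray:
    """Invert an assumed linear transmittance ramp T(t) = ramp_rate * t:
    an event fires when T(t) * L reaches `threshold`, so L = threshold / (ramp_rate * t).
    Earlier timestamps therefore map to brighter pixels (illustrative model only)."""
    t = np.clip(t_event, 1e-6, None)          # guard against zero timestamps
    return threshold / (ramp_rate * t)

# Per-pixel first-event timestamps (seconds) -> relative illumination map.
timestamps = np.array([[0.02, 0.20],
                       [0.05, 0.50]])
illum = timestamps_to_brightness(timestamps)
print(illum / illum.max())                    # brightest pixel normalized to 1
```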
Mitigating Geometric Degradation in Fast DownSampling via FastAdapter for Point Cloud Segmentation
Shuofeng Sun
Beijing University of Posts and Telecommunications
Haibin Yan
Beijing University of Posts and Telecommunications
Abstract
Farthest Point Sampling (FPS) is widely used in existing point-based models because it effectively preserves structural integrity during downsampling. However, it incurs significant computational overhead, severely impacting the model's inference efficiency. Random sampling and grid sampling are considered faster downsampling methods; however, these fast downsampling methods may lose geometric information during the downsampling process due to their overly simplistic and fixed rules, which can negatively affect model performance. To address this issue, we propose FastAdapter, which aggregates local contextual information through a small number of anchor points and facilitates interactions across spatial and layer dimensions, ultimately feeding this information back into the downsampled point cloud to mitigate the information degradation caused by fast downsampling methods. In addition to using FastAdapter to enhance model performance in methods that already employ fast downsampling, we aim to explore a more challenging yet valuable application scenario. Specifically, we focus on pre-trained models that utilize FPS, embedding FastAdapter and replacing FPS with random sampling for lightweight fine-tuning. This approach aims to significantly improve inference speed while keeping performance relatively unchanged. Experimental results on ScanNet, S3DIS, and SemanticKITTI demonstrate that our method effectively mitigates the geometric information degradation caused by fast downsampling.
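The anchor-point mechanism can be sketched as pooling features from the downsampled points onto a handful of anchors and scattering the aggregated context back as a residual. Random anchor selection, soft inverse-distance pooling, and the hypothetical `anchor_context_residual` helper below stand in for the paper's learned interactions.

```python
import torch

def anchor_context_residual(xyz: torch.Tensor, feats: torch.Tensor,
                            num_anchors: int = 16) -> torch.Tensor:
    """xyz: (N, 3) downsampled coordinates, feats: (N, C) their features.
    Aggregate features onto a few random anchors, then feed the pooled context
    back to every point via its nearest anchor (illustrative sketch only)."""
    idx = torch.randperm(xyz.shape[0])[:num_anchors]
    anchors = xyz[idx]                                       # (A, 3)
    dist = torch.cdist(anchors, xyz)                         # (A, N)
    weights = torch.softmax(-dist, dim=1)                    # soft local pooling
    anchor_feats = weights @ feats                           # (A, C)
    nearest = dist.argmin(dim=0)                             # (N,) nearest anchor id
    return feats + anchor_feats[nearest]                     # residual context

points = torch.randn(1024, 3)
features = torch.randn(1024, 32)
print(anchor_context_residual(points, features).shape)      # torch.Size([1024, 32])
```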
Moment Quantization for Video Temporal Grounding
Xiaolong Sun
Xi'an Jiaotong University
Le Wang
Xi'an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Liushuai Shi
Xi'an Jiaotong University
Kun Xia
Xi'an Jiaotong University
Mengnan Liu
Xi'an Jiaotong University
Yabing Wang
Xi'an Jiaotong University
Gang Hua
Amazon Alexa AI
Abstract
Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant from irrelevant moments. Previous methods, which focus on learning continuous features, exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process rather than hard assignment to discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination. Code is available at https://github.com/TensorsSun/MQVTG.
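The clustering-style matching against a learnable moment codebook can be sketched as a soft (softmax) assignment of each moment feature to codewords instead of a hard argmin quantization. The cosine similarity, temperature, sizes, and class name below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMomentCodebook(nn.Module):
    """Learnable codebook; each moment feature is matched to codewords softly
    (a clustering-style assignment) rather than by hard argmin quantization."""
    def __init__(self, num_codewords=64, dim=256, temperature=0.07):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codewords, dim))
        self.temperature = temperature

    def forward(self, moment_feats: torch.Tensor) -> torch.Tensor:
        # moment_feats: (B, T, dim) per-moment video features
        sim = F.normalize(moment_feats, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        assign = torch.softmax(sim / self.temperature, dim=-1)   # (B, T, K)
        return assign @ self.codebook                            # soft codeword mix

codebook = ToyMomentCodebook()
print(codebook(torch.randn(2, 75, 256)).shape)   # torch.Size([2, 75, 256])
```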
RobAVA: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding
Baoli Sun
Dalian University of Technology
Ning Wang
Dalian University of Technology
Xinzhu Ma
The Chinese University of Hong Kong
Anqi Zou
Dalian University of Technology
Yihang Lu
Dalian University of Technology
Chuixuan Fan
Dalian University of Technology
Zhihui Wang
Dalian University of Technology
Kun Lu
Dalian University of Technology
Zhiyong Wang
The University of Sydney
Abstract
Understanding the behaviors of robotic arms is essential for various robotic applications such as logistics management and automated manufacturing. However, the lack of large-scale and diverse datasets significantly hinders progress in video-based robotic arm action understanding. To fill this gap, we introduce RobAVA, which contains 40k video sequences with video-level fine-grained annotations, covering basic actions such as picking, pushing, and placing, as well as their combinations in different orders and interactions with various objects. In contrast to existing action recognition benchmarks, RobAVA includes instances of both normal and anomalous executions for each action category. The main challenge in robotic arm action recognition is that a complete action is composed of fundamental, atomic behaviors, requiring models to learn their inter-relationships. To this end, we propose a novel baseline approach, AGPT-Net, which re-defines the problem of understanding robotic arm actions as a task of aligning video sequences with atomic attributes. To enhance AGPT-Net's ability to distinguish normal and anomalous action instances, we introduce a joint semantic space constraint between category and attribute semantics, thereby amplifying the separation between normal and anomalous attribute representations for each action. We conduct extensive experiments to demonstrate AGPT-Net's superiority over other mainstream recognition models. Please see the project page at https://github.com/Sunbaoli/RobAVA.
Towards Efficient General Feature Prediction in Masked Skeleton Modeling
Shengkai Sun
Hefei University of Technology
Zefan Zhang
Jilin University
Jianfeng Dong
Zhejiang Gongshang University
Zhiyong Cheng
Hefei University of Technology
Xiaojun Chang
University of Science and Technology of China
Meng Wang
Hefei University of Technology
Abstract
Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: Computational efficiency (with 6.2x faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.
Two Losses, One Goal: Balancing Conflict Gradients for Semi-supervised Semantic Segmentation
Rui Sun
Shenzhen International Graduate School, Tsinghua University
Huayu Mai
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Wangkai Li
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Yujia Chen
National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Yuan Wang
University of Science and Technology of China
Abstract
Semi-supervised semantic segmentation has attracted considerable attention as it alleviates the need for extensive pixel-level annotations. However, existing methods often overlook the potential optimization conflict between supervised and unsupervised learning objectives, leading to suboptimal performance. In this paper, we identify this underexplored issue and propose a novel Pareto Optimization Strategy (POS) to tackle it. POS aims to find a descent gradient direction that benefits both learning objectives, thereby facilitating model training. By dynamically assigning weights to the gradients at each iteration based on the model's learning status, POS effectively reconciles the intrinsic tension between the two objectives. Furthermore, we analyze POS from the perspective of gradient descent in random batch sampling and propose the Magnitude Enhancement Operation (MEO) to further unleash its potential by considering both direction and magnitude during gradient integration. Extensive experiments on challenging benchmarks demonstrate that integrating POS into existing semi-supervised segmentation methods yields consistent improvements across different data splits and architectures (CNN, Transformer), showcasing its effectiveness.
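For two objectives, a standard way to obtain a direction that decreases both losses is the closed-form min-norm combination of their gradients. The sketch below shows that classic two-task weighting only as an illustration of dynamically balancing supervised and unsupervised gradients; the paper's exact weighting and the magnitude enhancement step are not reproduced, and `pareto_descent_direction` is a hypothetical helper.

```python
import torch

def pareto_descent_direction(g_sup: torch.Tensor, g_unsup: torch.Tensor) -> torch.Tensor:
    """Classic two-objective min-norm combination: returns a direction with
    non-negative alignment to both gradients whenever one exists (illustrative
    stand-in for the paper's dynamic weighting; the magnitude step is omitted)."""
    diff = g_sup - g_unsup
    denom = diff.dot(diff).clamp_min(1e-12)
    alpha = ((g_unsup - g_sup).dot(g_unsup) / denom).clamp(0.0, 1.0)
    return alpha * g_sup + (1.0 - alpha) * g_unsup

# Two conflicting gradients (negative inner product): the combined direction
# still has non-negative alignment with each objective's gradient.
g1 = torch.tensor([1.0, 0.2])
g2 = torch.tensor([-0.6, 1.0])
d = pareto_descent_direction(g1, g2)
print(d, d.dot(g1).item() >= 0, d.dot(g2).item() >= 0)
```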
Uncertainty-Aware Gradient Stabilization for Small Object Detection
Huixin Sun
School of Electronic Information Engineering, Beihang University
Yanjing Li
School of Electronic Information Engineering, Beihang University
Linlin Yang
State Key Laboratory of Media Convergence and Communication, CUC
Xianbin Cao
School of Electronic Information Engineering, Beihang University
Baochang Zhang
School of Artificial Intelligence, Beihang University
Abstract
Despite advances in generic object detection, there remains a performance gap in detecting small objects compared to normal-scale objects. We reveal that conventional object localization methods suffer from gradient instability in small objects due to sharper loss curvature, leading to a convergence challenge. To address the issue, we propose Uncertainty-Aware Gradient Stabilization (UGS), a framework that reformulates object localization as a classification task to stabilize gradients. UGS quantizes continuous labels into non-uniform discrete interval representations. Under a classification-based objective, the localization branch generates bounded and confidence-driven gradients, mitigating instability. Furthermore, UGS integrates an uncertainty minimization (UM) loss that reduces prediction variance and an uncertainty-guided refinement (UR) module that identifies and refines high-uncertainty regions via perturbations. Evaluated on four benchmarks, UGS consistently improves anchor-based, anchor-free, and leading small object detectors. Notably, UGS enhances DINO-5scale by 2.6 AP on VisDrone, surpassing prior state-of-the-art performance.
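As a rough illustration of recasting localization regression as classification over non-uniform intervals, the sketch below quantizes a normalized offset into discrete bins and trains with cross-entropy; the bin spacing, decoding rule, and all names are assumptions rather than the paper's UGS implementation.

```python
import torch
import torch.nn.functional as F

# Non-uniform bin edges for a normalized localization offset in [0, 1]:
# finer near small values, coarser for large ones (assumed spacing).
edges = torch.cat([torch.linspace(0.0, 0.2, 9), torch.linspace(0.25, 1.0, 16)])
centers = 0.5 * (edges[:-1] + edges[1:])          # one class per interval

def offsets_to_classes(offsets: torch.Tensor) -> torch.Tensor:
    """Quantize continuous offsets into discrete interval labels."""
    return torch.bucketize(offsets.clamp(0.0, 1.0), edges[1:-1])

def classification_loc_loss(logits: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Bounded, confidence-driven localization loss (cross-entropy over bins)."""
    return F.cross_entropy(logits, offsets_to_classes(offsets))

def decode(logits: torch.Tensor) -> torch.Tensor:
    """Expected offset under the predicted bin distribution."""
    return (logits.softmax(dim=-1) * centers).sum(dim=-1)

logits = torch.randn(4, centers.numel(), requires_grad=True)
offsets = torch.tensor([0.03, 0.12, 0.4, 0.9])
loss = classification_loc_loss(logits, offsets)
loss.backward()                                    # gradients bounded by softmax probabilities
print(loss.item(), decode(logits).shape)
```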
Visual Intention Grounding for Egocentric Assistants
Pengzhan Sun
National University of Singapore
Junbin Xiao
National University of Singapore
Tze Ho Elden Tse
National University of Singapore
Yicong Li
National University of Singapore
Arjun Akula
Google DeepMind
Angela Yao
National University of Singapore
Abstract
Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts - inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training on normal descriptions and egocentric intentions through a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions. Our code and model are available at https://github.com/pengzhansun/EgoIntention.
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
Wei Suo
Northwestern Polytechnical University
Ji Ma
Northwestern Polytechnical University
Mengyang Sun
Northwestern Polytechnical University
Lin Yuanbo Wu
Swansea University
Peng Wang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of acceleration scenarios. The code for this work is publicly available at https://github.com/ASGO-MM/Pruning-All-Rounder.
Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
Francesco Taioli
Polytechnic of Turin
Edoardo Zorzi
University of Verona
Gianni Franchi
U2IS, ENSTA Paris
Alberto Castellini
University of Verona
Alessandro Farinelli
University of Verona
Marco Cristani
University of Verona
Yiming Wang
Fondazione Bruno Kessler
Abstract
Language-driven instance object navigation assumes that a human initiates the task by providing a detailed description of the target to the embodied agent. While this description is crucial for distinguishing the target from other visually similar instances, providing it prior to navigation can be demanding for humans. We thus introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free and open-ended dialogues with the human, minimizing user input. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy, and focuses on the human-agent interaction reasoning using Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates internal self-dialogues within the agent to obtain a complete and accurate observation with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, whereas existing language-driven instance navigation methods struggle in multi-instance scenes.
ReTracker: Exploring Image Matching for Robust Online Any Point Tracking
Dongli Tan
Zhejiang University
Xingyi He
Zhejiang University
Sida Peng
Zhejiang University
Yiqing Gong
Zhejiang University
Xing Zhu
Ant Research
Jiaming Sun
Zhejiang University
Ruizhen Hu
Shenzhen University
Yujun Shen
Zhejiang University
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Abstract
This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. Recent methods leverage future frames to achieve smooth point tracking at the current frame, but they still struggle to find points with significant viewpoint changes after long-term occlusions and inherently cannot achieve online tracking. To overcome these challenges, we develop a novel online tracking framework, named ReTracker, that integrates two advances in image matching with tracking-specific designs. First, a decoder network with a global receptive field is incorporated with a temporal attention module to robustly track points undergoing large location changes. Second, the decoder network is adapted to pretrain on large-scale two-view matching data, which offers significantly greater diversity and volume than tracking data, to learn general matching priors. This pretraining strategy effectively enhances our tracker's ability to handle viewpoint and appearance variations after long-term occlusions. Experiments demonstrate that our method outperforms recent online trackers across multiple benchmarks and achieves competitive or superior performance compared to offline methods. Furthermore, we collect an ego-centric, occlusion-heavy dataset to illustrate the retracking capabilities of our approach. Project page: re-tracker.github.io.
Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
Jieyi Tan
Wuhan University
Chengwei Zhang
University of Cambridge
Bo Dang
Wuhan University
Yansheng Li
Wuhan University
Abstract
Traditional Remote Sensing Foundation Models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking collaboration is a promising way to resolve this dilemma, where multiple institutions can collaboratively train RSFMs without sharing private data. In this paper, we propose FedSense, a novel privacy-preserved pre-training framework that enables such collaboration. However, this is a non-trivial task hindered by a vicious cycle, which results from model drift caused by remote sensing data heterogeneity and high communication overhead. To break this vicious cycle, we introduce federated mutual-guidance learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide clients' updates towards globally flat optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server by low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.
What You Have is What You Track: Adaptive and Robust Multimodal Tracking
Yuedong Tan
TeleAI, China Telecom
Jiawei Shao
TeleAI, China Telecom
Eduard Zamfir
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Ruanjun Li
ShanghaiTech University
Zhaochong An
University of Copenhagen
Chao Ma
AI Institute, Shanghai Jiao Tong University
Danda Paudel
INSAIT, Sofia University
Luc Van Gool
INSAIT, Sofia University
Radu Timofte
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Zongwei Wu
Computer Vision Lab, CAIDAS & IFI, University of Wurzburg
Abstract
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness - critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be made publicly available at https://github.com/supertyd/FlexTrack.
RnGCam: High-speed video from rolling & global shutter measurements
Kevin Tandi
University of California, San Diego
Xiang Dai
University of California, San Diego
Chinmay Talegaonkar
University of California, San Diego
Gal Mishne
University of California, San Diego
Nick Antipa
University of California, San Diego
Abstract
Compressive video capture encodes a short high-speed video into a single measurement using a low-speed sensor, then computationally reconstructs the original video. Prior implementations rely on expensive hardware and are restricted to imaging sparse scenes with empty backgrounds. We propose RnGCam, a system that fuses measurements from low-speed consumer-grade rolling-shutter (RS) and global-shutter (GS) sensors into video at kHz frame rates. The RS sensor is combined with a pseudorandom optic, called a diffuser, which spatially multiplexes scene information. The GS sensor is coupled with a conventional lens. The RS-diffuser provides low spatial detail and high temporal detail, complementing the GS-lens system's high spatial detail and low temporal detail. We propose a reconstruction method using implicit neural representations (INR) to fuse the measurements into a high-speed video. Our INR method separately models the static and dynamic scene components, while explicitly regularizing dynamics. In simulation, we show that our approach significantly outperforms previous RS compressive video methods, as well as state-of-the-art frame interpolators. We validate our approach in a dual-camera hardware setup, which generates 230 frames of video at 4,800 frames per second for dense scenes, using hardware that costs 10x less than previous compressive video systems.
Closed-Loop Transfer for Weakly-supervised Affordance Grounding
Jiajin Tang
Zhengxuan Wei
Ge Zheng
Sibei Yang
Abstract
Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body. All models and codes will be made publicly available.
CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
Zongheng Tang
Hangzhou International Innovation Institute, Beihang University
Yi Liu
School of Artificial Intelligence, Beihang University
Yifan Sun
School of Artificial Intelligence, Beihang University
Yulu Gao
Hangzhou International Innovation Institute, Beihang University
Jinyu Chen
School of Artificial Intelligence, Beihang University
Runsheng Xu
University of California, Los Angeles
Si Liu
School of Artificial Intelligence, Beihang University
Abstract
Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception framework that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth. Code will be available at https://github.com/tzhhhh123/CoST.
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
Yingqi Tang
Nullmax
Zhuoran Xu
Nullmax
Zhaotie Meng
Nullmax
Erkang Cheng
Nullmax
Abstract
Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, there remains an unsatisfactory performance on closed-loop evaluation. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, enhancing precise closed-loop control for the ego vehicle. Additionally, we explicitly utilize the geometric properties of planning trajectories to effectively retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes.
G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
Chengyu Tao
The Hong Kong University of Science and Technology
Xuanming Cao
The Hong Kong University of Science and Technology (Guangzhou)
Juan Du
The Hong Kong University of Science and Technology
Abstract
Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of cross-modal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Dynamically evolving from Euclidean metrics, we propose a novel Geometry-Guided Score Fusion (G2SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and a score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD and Eyecandies datasets demonstrate the state-of-the-art detection performance of our method, and detailed ablation analysis validates each component's contribution. Our code is available at https://github.com/ctaoaa/G2SF.
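To make the isotropic-versus-anisotropic distinction concrete, the sketch below contrasts a plain memory-bank score with one that applies per-dimension, direction-aware scaling factors; the scales here are random placeholders standing in for a learned predictor such as the paper's LSPN, and the memory bank is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(512, 64))      # stored normal features for one modality

def isotropic_score(f: np.ndarray) -> float:
    """Conventional memory-bank score: Euclidean distance to the nearest normal feature."""
    d = np.linalg.norm(memory_bank - f, axis=1)
    return float(d.min())

def anisotropic_score(f: np.ndarray, scales: np.ndarray) -> float:
    """Direction-aware score: per-dimension scaling makes the local metric anisotropic.

    `scales` stands in for the output of a learned scale-prediction network
    (hypothetical here); stretching low-variance directions sharpens the
    separation between normal and anomalous patterns.
    """
    d = np.sqrt((((memory_bank - f) * scales) ** 2).sum(axis=1))
    return float(d.min())

f = rng.normal(size=64)
scales = np.abs(rng.normal(loc=1.0, scale=0.1, size=64))  # placeholder predictions
print(isotropic_score(f), anisotropic_score(f, scales))
```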
GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
Ye Tao
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Jiawei Zhang
SenseTime Research
Yahao Shi
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Dongqing Zou
SenseTime Research
Bin Zhou
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pretrained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. Our code is available at https://github.com/MOMOYATW/GSV3D.
MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
Peilin Tao
Institute of Automation, Chinese Academy of Sciences
Hainan Cui
Institute of Automation, Chinese Academy of Sciences
Diantao Tu
Institute of Automation, Chinese Academy of Sciences
Shuhan Shen
Institute of Automation, Chinese Academy of Sciences
Abstract
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.
RoboPearls: Editable Video Simulation for Robot Manipulation
Tang Tao
Shenzhen Campus of Sun Yat-sen University
Likui Zhang
Sun Yat-sen University
Youpeng Wen
Shenzhen Campus of Sun Yat-sen University
Kaidong Zhang
Sun Yat-sen University
Jia-Wang Bian
Bytedance Seed
Xia Zhou
Li Auto Inc.
Tianyi Yan
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Peng Jia
Li Auto Inc.
Hefeng Wu
Sun Yat-sen University
Liang Lin
Sun Yat-sen University
Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University
Abstract
The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by proposed modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, demonstrating satisfactory simulation performance. More information can be found on our Project Page.
Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection
Romain Thoreau
CNES
Valerio Marsocci
European Space Agency Φ-Lab
Dawa Derksen
CNES
Abstract
As large-scale heterogeneous data sets become increasingly available, adapting foundation models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low 'intrinsic rank' of parameter updates during adaptation. In this paper, we argue that incorporating stronger inductive biases on both the data and the models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. Specifically, the pretrained parameters of GFMs serve as a strong prior for the spatial structure of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environment-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10x fewer parameters for classification and segmentation tasks. The code is available at https://github.com/VMarsocci/DEFLECT.
DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception
Chengchang Tian
Southeast University
Jianwei Ma
Southeast University
Yan Huang
Southeast University
Zhanye Chen
Southeast University
Honghao Wei
Washington State University
Hui Zhang
Southeast University
Wei Hong
Southeast University
Abstract
Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at https://github.com/ChengchangTian/DATA.
DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation
Haitao Tian
University of Ottawa
Abstract
In this paper, a new contrastive representation learning framework is proposed to enhance action segmentation via pretraining using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that develop isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, 'Shuffle and Warp', which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pretrained on a trimmed skeleton dataset and evaluated on an untrimmed dataset where it demonstrates a significant boost over state-of-the-art methods in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
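One plausible reading of the 'Shuffle and Warp' augmentation is sketched below: trimmed single-action clips are time-warped and concatenated in random orders, producing multi-action permutations whose action order supplies targets for the two surrogate tasks. Clip shapes, warp ranges, and function names are assumptions, not the paper's specification.

```python
import numpy as np

def time_warp(clip: np.ndarray, factor: float) -> np.ndarray:
    """Linearly resample a (T, J, C) skeleton clip to round(T * factor) frames."""
    t_new = max(2, int(round(len(clip) * factor)))
    idx = np.linspace(0, len(clip) - 1, t_new)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(clip) - 1)
    w = (idx - lo)[:, None, None]
    return (1 - w) * clip[lo] + w * clip[hi]

def shuffle_and_warp(clips, rng):
    """Build one multi-action permutation from trimmed single-action clips.

    Returns the concatenated sequence and the action order, which supply the
    targets for cross-permutation contrasting and relative-order reasoning.
    """
    order = rng.permutation(len(clips))
    warped = [time_warp(clips[i], rng.uniform(0.7, 1.3)) for i in order]
    return np.concatenate(warped, axis=0), order

rng = np.random.default_rng(0)
clips = [rng.normal(size=(rng.integers(40, 80), 25, 3)) for _ in range(4)]
seq_a, order_a = shuffle_and_warp(clips, rng)
seq_b, order_b = shuffle_and_warp(clips, rng)   # a second permutation of the same actions
print(seq_a.shape, order_a, order_b)
```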
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Javier Tirado-Garín
University of Zaragoza
Javier Civera
University of Zaragoza
Abstract
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to pinhole, Brown-Conrady, and Kannala-Brandt. Our approach also applies to edited (cropped and stretched) images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at https://github.com/javrtg/AnyCalib.
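For the pinhole case, recovering intrinsics from per-pixel rays reduces to two 1D linear least-squares problems, which the sketch below verifies on synthetic data; this is only one instance of the general ray-to-intrinsics recovery, and other camera models require different solvers.

```python
import numpy as np

def pinhole_intrinsics_from_rays(pixels: np.ndarray, rays: np.ndarray):
    """Recover (fx, fy, cx, cy) from pixel coordinates and their unit rays.

    For a pinhole camera, u = fx * (x/z) + cx and v = fy * (y/z) + cy, so each
    axis is a linear least-squares fit in (f, c).
    """
    xn = rays[:, 0] / rays[:, 2]                 # normalized image coordinates
    yn = rays[:, 1] / rays[:, 2]
    Ax = np.stack([xn, np.ones_like(xn)], axis=1)
    Ay = np.stack([yn, np.ones_like(yn)], axis=1)
    (fx, cx), *_ = np.linalg.lstsq(Ax, pixels[:, 0], rcond=None)
    (fy, cy), *_ = np.linalg.lstsq(Ay, pixels[:, 1], rcond=None)
    return fx, fy, cx, cy

# Synthetic check with a known camera.
fx, fy, cx, cy = 500.0, 480.0, 320.0, 240.0
uv = np.random.default_rng(0).uniform([0, 0], [640, 480], size=(1000, 2))
rays = np.stack([(uv[:, 0] - cx) / fx, (uv[:, 1] - cy) / fy, np.ones(len(uv))], axis=1)
rays /= np.linalg.norm(rays, axis=1, keepdims=True)   # unit-norm, as a network would output
print(pinhole_intrinsics_from_rays(uv, rays))          # ~ (500, 480, 320, 240)
```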
GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
Shaowen Tong
ShanghaiTech University
Zimin Xia
École Polytechnique Fédérale de Lausanne (EPFL)
Alexandre Alahi
École Polytechnique Fédérale de Lausanne (EPFL)
Xuming He
ShanghaiTech University
Yujiao Shi
ShanghaiTech University
Abstract
Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with aerial images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self Distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a full-view image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at https://github.com/tongshw/GeoDistill.
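A minimal sketch of FoV-based masking plus teacher-student alignment follows, assuming the localizer outputs a probability map over aerial locations; the equirectangular column masking and the KL objective are assumptions rather than the paper's exact losses, and `localizer` is a placeholder for whichever cross-view model is used.

```python
import torch
import torch.nn.functional as F

def fov_mask(pano: torch.Tensor, fov_deg: float, center_deg: float) -> torch.Tensor:
    """Zero out panorama columns outside a limited horizontal field of view."""
    _, _, _, W = pano.shape
    yaw = torch.linspace(-180.0, 180.0, W, device=pano.device)
    delta = (yaw - center_deg + 180.0) % 360.0 - 180.0     # wrapped angular offset
    keep = (delta.abs() <= fov_deg / 2).to(pano.dtype)
    return pano * keep.view(1, 1, 1, W)

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between student and (detached) teacher location distributions."""
    t = teacher_logits.detach().flatten(1).log_softmax(dim=1)
    s = student_logits.flatten(1).log_softmax(dim=1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Tiny demo with random tensors; in practice the logits come from the localizer.
pano = torch.rand(2, 3, 64, 256)
masked = fov_mask(pano, fov_deg=90.0, center_deg=30.0)
teacher_logits = torch.randn(2, 1, 32, 32)                 # heat map over the aerial image
student_logits = torch.randn(2, 1, 32, 32, requires_grad=True)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
print(masked.shape, float(loss))
```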
EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
Dmitrii Torbunov
Brookhaven National Laboratory
Yihui Ren
Brookhaven National Laboratory
Animesh Ghose
Brookhaven National Laboratory
Odera Dim
Brookhaven National Laboratory
Yonggang Cui
Brookhaven National Laboratory
Abstract
Event-based cameras (EBCs) have emerged as a bioinspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for EBCs. The current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. We introduce I2EvDet (Image-to-Event Detection), a novel adaptation framework that bridges mainstream object detection with temporal event data processing. First, we demonstrate that a Real-Time DEtection TRansformer, or RT-DETR, a state-of-the-art natural image detector, trained on a simple image-like representation of the EBC data achieves performance comparable to specialized EBC methods. Next, as part of our framework, we develop an efficient adaptation technique that transforms image-based detectors into event-based detection models by modifying their frozen latent representation space via minimal architectural additions. The resulting EvRT-DETR model reaches state-of-the-art performance on the standard benchmark datasets Gen1 (mAP +2.3) and 1Mpx/Gen4 (mAP +1.4). These results demonstrate a fundamentally new approach to EBC object detection through principled adaptation of mainstream architectures, offering an efficient alternative with potential applications to other temporal visual domains. The code is available at: https://github.com/realtimeintelligence/evrt-detr.
Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering
Siddharth Tourani
IIIT Hyderabad
Jayaram Reddy
IIIT Hyderabad
Akash Kumbar
IIIT Hyderabad
Satyajit Tourani
IIIT Hyderabad
Nishant Goyal
IIT Kharagpur
Madhava Krishna
IIIT Hyderabad
N Dinesh Reddy
VLM Run
Muhammad Haris Khan
MBZUAI
Abstract
Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object-agnostic priors in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects, can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics on urban scenes even without LiDAR data. When incorporating LiDAR, our approach further improves in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and composition.
Head2Body: Body Pose Generation from Multi-sensory Head-mounted Inputs
Minh Tran
University of Southern California
Hongda Mao
Amazon
Qingshuang Chen
Amazon
Yelin Kim
Amazon
Abstract
Generating body pose from head-mounted, egocentric inputs is essential for immersive VR/AR and assistive technologies, as it supports more natural interactions. However, the task is challenging due to limited visibility of body parts in first-person views and the sparseness of sensory data, with only a single device placed on the head. To address these challenges, we introduce Head2Body, a novel framework for body pose estimation that effectively combines head IMU and egocentric visual data. First, we introduce a pretrained IMU encoder, trained on over 1,700 hours of Ego4D IMU data from head-mounted devices, to better capture detailed temporal motion cues given limited labeled egocentric pose data. For visual processing, we leverage large vision-language models (LVLMs) to segment body parts that appear sporadically in video frames to improve visual feature extraction. To better guide pose generation from sparse head-mounted signals, we incorporate a residual Vector Quantized Variational Autoencoder (VQ-VAE) to represent poses with discrete tokens, capturing high-frequency motion patterns and improving over direct continuous regression, which often lacks structure and temporal consistency. Our experiments demonstrate the effectiveness of the proposed approach, yielding 6-13% gains over state-of-the-art baselines on three datasets: AMASS, KinPoly, and EgoExo4D. By capturing subtle temporal dynamics and leveraging complementary sensory data, our approach advances accurate egocentric body pose estimation and sets a new benchmark for multi-modal, first-person motion tracking.
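The residual VQ step can be illustrated generically: each stage quantizes the residual left by the previous codebooks, yielding a short sequence of discrete pose tokens per feature vector. Codebook sizes and feature dimensions below are placeholders, not the paper's configuration.

```python
import torch

def residual_vq(z: torch.Tensor, codebooks):
    """Quantize pose features with a stack of codebooks (residual vector quantization).

    z: (B, D) continuous pose features; each codebook: (K, D) code vectors.
    Returns the discrete tokens per stage and the reconstructed quantized vector.
    """
    residual, tokens, quantized = z, [], torch.zeros_like(z)
    for cb in codebooks:
        dist = torch.cdist(residual, cb)          # (B, K) distances to code vectors
        idx = dist.argmin(dim=1)                  # nearest-code token for this stage
        picked = cb[idx]
        tokens.append(idx)
        quantized = quantized + picked
        residual = residual - picked              # the next stage models what is left
    return tokens, quantized

torch.manual_seed(0)
codebooks = [torch.randn(256, 48) for _ in range(4)]   # 4 stages, 256 codes each (assumed)
z = torch.randn(8, 48)
tokens, z_q = residual_vq(z, codebooks)
print([t.shape for t in tokens], float((z - z_q).norm() / z.norm()))
```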
More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Luong Tran
FPT Software AI Center
Thieu Vo
National University of Singapore
Anh Nguyen
University of Liverpool
Sang Dinh
Hanoi University of Science and Technology
Van Nguyen
FPT Software AI Center
Abstract
Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.
PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
Fu-Jen Tsai
National Tsing Hua University
Yan-Tsung Peng
National Chengchi University
Yen-Yu Lin
National Yang Ming Chiao Tung University
Chia-Wen Lin
National Tsing Hua University
Abstract
Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models' performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage loss to enhance PHATNet's disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets. The source code is available at https://github.com/pp00704831/PHATNet.
Auto-Vocabulary Semantic Segmentation
Osman Ülger
University of Amsterdam
Maksymilian Kulicki
Institute of Fundamental Technological Research, Polish Academy of Science
Yuki Asano
University of Technology Nuremberg
Martin R. Oswald
University of Amsterdam
Abstract
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated classes and their corresponding segments. With AVS, our method sets new benchmarks on datasets PASCAL VOC, Context, ADE20K, and Cityscapes, while showing competitive performance to OVS methods that require specified class names. All code is released here.
Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation
Maximilian Ulmer
German Aerospace Center (DLR)
Wout Boerdijk
German Aerospace Center (DLR)
Rudolph Triebel
German Aerospace Center (DLR)
Maximilian Durner
Technical University of Munich
Abstract
This paper presents Object-Conditioned Diffusion Transformer (OC-DiT), a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks. Code is available at https://github.com/DLR-RM/oc-dit.
Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
Yuki Urakawa
Institute of Science Tokyo
Yoshihiro Watanabe
Institute of Science Tokyo
Abstract
Among structured-light methods, the phase-shifting approach enables high-resolution and high-accuracy measurements using a minimum of three patterns. However, its performance is significantly affected when dynamic and complex-shaped objects are measured, as motion artifacts and phase inconsistencies can degrade accuracy. In this study, we propose an enhanced phase-shifting method that incorporates neural inverse rendering to enable the 3D measurement of moving objects. To effectively capture object motion, we introduce a displacement field into the rendering model, which accurately represents positional changes and mitigates motion-induced distortions. Additionally, to achieve high-precision reconstruction with fewer phase-shifting patterns, we design a multi-view rendering framework that utilizes multiple cameras in conjunction with a single projector. Comparisons with state-of-the-art methods and various ablation studies demonstrated that our method accurately reconstructs the shapes of moving objects, even with a small number of patterns, using only simple, well-known phase-shifting patterns.
Uncalibrated Structure from Motion on a Sphere
Jonathan Ventura
California Polytechnic State University
Viktor Larsson
Lund University
Fredrik Kahl
Chalmers University of Technology
Abstract
Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemispherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion. Code and data for our method are available at https://github.com/jonathanventura/spherical-sfm.
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
Zengyu Wan
University of Science and Technology of China
Wei Zhai
University of Science and Technology of China
Yang Cao
University of Science and Technology of China
Zhengjun Zha
University of Science and Technology of China
Abstract
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from depth variation induced spatio-temporal motion inconsistencies, disrupting the assumptions of local spatial or temporal motion smoothness in previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce Event Kymograph - an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
Event-aided Dense and Continuous Point Tracking: Everywhere and Anytime
Zhexiong Wan
Northwestern Polytechnical University
Jianqin Luo
Northwestern Polytechnical University
Yuchao Dai
Northwestern Polytechnical University
Gim Hee Lee
National University of Singapore
Abstract
Recent point tracking methods have made great strides in recovering the trajectories of any point (especially key points) in long video sequences associated with large motions. However, the spatial and temporal granularities of point trajectories remain constrained by limited motion estimation accuracy and video frame rate. Leveraging the high temporal resolution and motion sensitivity of event cameras, we introduce event data for the first time to recover spatially dense and temporally continuous trajectories of every point at any time. Specifically, we define the dense and continuous point trajectory representation as estimating multiple control points of curves for each pixel and model the movement of sparse events triggered along continuous point trajectories. Building on this, we propose a novel multi-frame iterative streaming framework that first estimates local inter-frame motion representations from two consecutive frames with inter-frame events, then aggregates them into a global long-term motion representation to utilize the full input video and event data with an arbitrary number of frames. Extensive experiments on simulated and real data demonstrate the significant improvement of our framework over state-of-the-art methods and the crucial role of introducing events to model continuous point trajectories.
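To illustrate how per-pixel control points give trajectories queryable at any continuous time, the sketch below evaluates Bezier curves from control points; the paper's actual curve family and control-point estimation may differ, and the shapes are assumed.

```python
import numpy as np
from math import comb

def bezier_eval(ctrl: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Evaluate Bezier trajectories at continuous times t in [0, 1].

    ctrl: (N, K, 2) control points for N pixels, K per curve; returns (N, T, 2).
    """
    K = ctrl.shape[1]
    basis = np.stack(
        [comb(K - 1, k) * t**k * (1 - t) ** (K - 1 - k) for k in range(K)], axis=0
    )  # (K, T) Bernstein basis
    return np.einsum("kt,nkd->ntd", basis, ctrl)

rng = np.random.default_rng(0)
ctrl = rng.uniform(0, 256, size=(4, 5, 2))     # 4 pixels, 5 control points each (assumed)
t = np.linspace(0.0, 1.0, 11)                  # query at arbitrary intermediate times
traj = bezier_eval(ctrl, t)
print(traj.shape, np.allclose(traj[:, 0], ctrl[:, 0]), np.allclose(traj[:, -1], ctrl[:, -1]))
```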
AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
Zhaonan Wang
Shandong University
Manyi Li
Shandong University
Changhe Tu
Shandong University
Abstract
3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG2aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
Authentic 4D Driving Simulation with a Video Generation Model
Lening Wang
Beihang University
Wenzhao Zheng
Tsinghua University
Dalong Du
PhiGent Robotics
Yunpeng Zhang
unknown
Yilong Ren
Beihang University
Han Jiang
Beihang University
Zhiyong Cui
Beihang University
Haiyang Yu
Beihang University
Jie Zhou
Tsinghua University
Shanghang Zhang
Peking University
Abstract
Simulating driving environments in 4D is crucial for developing accurate and immersive autonomous driving systems. Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. This approach builds continuous 4D point cloud scenes by leveraging surround-view data from autonomous vehicles. By separating the spatial and temporal elements, it creates smooth keyframe sequences. Furthermore, video generation techniques are employed to produce lifelike 4D simulation videos from any given perspective. To extend the range of possible viewpoints, we incorporate training using decomposed camera poses, which allows for enhanced modeling of distant scenes. Additionally, we merge camera trajectory data to synchronize 3D points across consecutive frames, fostering a richer understanding of the evolving scene. With training across multiple scene levels, our method is capable of simulating scenes from any viewpoint and offers deep insight into the evolution of scenes over time in a consistent spatial-temporal framework. In comparison with current methods, this approach excels in maintaining consistency across views, background coherence, and overall accuracy, significantly contributing to the development of more realistic autonomous driving simulations.
C4D: 4D Made from 3D through Dual Correspondences
Shizun Wang
National University of Singapore
Zhenxiang Jiang
National University of Singapore
Xingyi Yang
The Hong Kong Polytechnic University
Xinchao Wang
National University of Singapore
Abstract
Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D
Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence
Weihao Wang
Tongji University
Yu Lan
Tongji University
Mingyu You
Tongji University
Bin He
Tongji University
Abstract
3D assembly completion represents a fundamental task in 3D computer vision and robotics. This task aims to retrieve the missing parts from a set of candidates and predict their 6DoF poses to make the partial assembly complete. However, due to the inherent uncertainty in completion and the similarity among candidates, even humans struggle to achieve precise completion without external guidance. To address this challenge, we introduce an auxiliary image depicting the complete assembly from a specific view. The primary challenge lies in the lack of correspondence or grounding between the partial assembly and the image, leading to ambiguities in identifying missing parts and ineffective guidance for completion. Moreover, this correspondence heavily depends on the view of the image, which, unfortunately, is often unknown in real-world scenarios. To this end, we propose a novel cross-modal 3D assembly completion framework. At its core is missing-oriented feature fusion augmented by self-supervised view alignment to establish view-consistent 2D-3D correspondence between the image and the partial assembly, which effectively captures clues of missing parts from the image and provides targeted guidance for completion. Extensive experiments demonstrate our state-of-the-art performance on the PartNet dataset and show the framework's generalization capabilities in two downstream applications: component suggestion and furniture restoration.
Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
Weida Wang
Tongji University
Changyong He
Tongji University
Jin Zeng
Tongji University
Di Qiu
Google
Abstract
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code is available at https://github.com/davidweidawang/GIGA-ToF.
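For orientation only, a graph-regularized maximum a posteriori objective of the general form described above can be written as

    \hat{x} \;=\; \arg\min_{x} \; (y - x)^{\top} W \, (y - x) \;+\; \mu \, x^{\top} L \, x,

where y is the noisy ToF depth, W is a diagonal fidelity weighting (assumed here to come from the sensor noise model), L is the Laplacian of the fused cross-frame graph, and \mu balances fidelity against smoothness. The paper's exact fidelity term, graph construction, and unrolling scheme are not reproduced here.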
Continuous-Time Human Motion Field from Event Cameras
Ziyun Wang
University of Pennsylvania
Ruijun Zhang
University of Pennsylvania
Zi-Yan Liu
University of Pennsylvania
Yufu Wang
University of Pennsylvania
Kostas Daniilidis
University of Pennsylvania
Abstract
This paper addresses the challenges of estimating a continuous-time human motion field from a stream of events. Existing human motion estimation methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field directly from events, by leveraging a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. Prior state-of-the-art event-based methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, we present the first work that replaces traditional discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. Despite the promises of event cameras, few benchmarks have tested the limits of high-speed human motion estimation. We introduce the Beam-splitter Event Agile Human Motion Dataset, a hardware-synchronized high-speed human dataset, to fill this gap. (Work completed while Ziyun Wang was at the University of Pennsylvania.) On this new data, our method improves joint errors by 23.8% compared to previous event-based human motion methods, while reducing computation time by 69%. More details of the work can be found on the project page: ziyunclaudewang.github.io/evhuman.html.
Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
Haoran Wang
Nanjing University
Zekun Li
Nanjing University
Jian Zhang
Nanjing University
Lei Qi
Southeast University
Yinghuan Shi
Nanjing University
Abstract
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting large vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Our implementation is available at https://github.com/wanghr64/cav-sam.
DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
Rui Wang
ETH Zürich
Quentin Lohmeyer
ETH Zürich
Mirko Meboldt
ETH Zürich
Siyu Tang
ETH Zürich
Abstract
Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments.
Debiasing Trace Guidance: Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
Xingjian Wang
Zhejiang University
Li Chai
Zhejiang University
Jiming Chen
Zhejiang University
Abstract
The leak of anomalous information from the input condition poses a great challenge to reconstruction-based anomaly detection. Recent diffusion-based methods respond to this issue by suppressing anomaly information for condition injection or in-sampling inversion. However, since they treat conditions as a time-invariant prior, they fall into a trade-off problem between anomaly suppression and normal pattern consistency. To address this problem, we propose the Debiasing Trace Guidance (DTG) framework based on Flow Matching towards debiasing generation for more accurate unsupervised multi-class anomaly detection. Generally, DTG distills a low-dimensional generation sub-trace robust to anomalies by Top-down Trace Distillation, and then utilizes its time-varying velocity features to guide a debiasing generation by Bottom-up Velocity Alignment. The trace distillation filters out high-frequency anomalies via learnable wavelet filters and preserves structural information by keeping global consistency across samples using the Sinkhorn Distance. Subsequently, the velocity field of the original trace is aligned with that of the sub-trace through a KV-Injection Attention mechanism. The model is forced to generate normal details from corresponding low-dimensional contexts via an Alignment Mask. Experimental results on several benchmarks and corresponding ablation studies have demonstrated the effectiveness of the proposed method.
Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
Zhichuan Wang
Huazhong Agricultural University
Yang Zhou
Shenzhen University
Zhe Liu
The University of Hong Kong
Rui Yu
University of Louisville
Song Bai
ByteDance
Yulong Wang
Huazhong Agricultural University
Xinwei He
Huazhong Agricultural University
Xiang Bai
Huazhong University of Science and Technology
Abstract
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior art by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
Deterministic Object Pose Confidence Region Estimation
Jinghao Wang
National University of Defense Technology
Zhang Li
National University of Defense Technology
Zi Wang
National University of Defense Technology
Banglei Guan
National University of Defense Technology
Yang Shang
National University of Defense Technology
Qifeng Yu
National University of Defense Technology
Abstract
6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases; 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling. It provides compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
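A minimal sketch of the split (inductive) conformal calibration step described above, assuming Gaussian 2D keypoint predictions with per-axis standard deviations; the paper's exact nonconformity score and the propagation to 6D pose regions via the implicit function theorem are not shown, and all names below are illustrative.

    import numpy as np

    def conformal_keypoint_threshold(residuals, sigmas, alpha=0.1):
        """Calibrate a score threshold q on held-out data so that the region
        {z : max_k |z_k - mu_k| / sigma_k <= q} covers the true keypoint with
        probability about 1 - alpha (finite-sample corrected).
        residuals: (N, 2) prediction errors; sigmas: (N, 2) predicted stds."""
        scores = np.max(np.abs(residuals) / sigmas, axis=1)   # nonconformity scores
        n = len(scores)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return np.sort(scores)[min(k, n) - 1]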
DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
Youzhuo Wang
ShanghaiTech University
Jiayi Ye
ShanghaiTech University
Chuyang Xiao
ShanghaiTech University
Yiming Zhong
ShanghaiTech University
Heng Tao
ShanghaiTech University
Hang Yu
ShanghaiTech University
Yumeng Liu
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Yuexin Ma
ShanghaiTech University
Abstract
Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. Additionally, to ensure natural and human-like dexterous motions, we utilize teleoperation for data collection, enabling the robot's movements to align with human behaviors and habits, which is a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advancements in human-to-robot handover research by offering a high-quality dataset, effective solutions, and comprehensive evaluation metrics. Project is at dexh2r.github.io/.
End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation
Liwei Wang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Yanduo Zhang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Tao Lu
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Fang Liu
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Huiqin Zhang
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Jiayi Ma
Wuhan University
Huabing Zhou
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology
Abstract
Dynamic Scene Graph Generation (DSGG) aims to comprehensively understand videos by abstracting them into visual triplets <subject, predicate, object>. Most existing methods focus on capturing temporal dependencies, but overlook crucial visual relationship dependencies between entities and predicates, as well as among predicate subclasses. These dependencies are essential for a deeper contextual understanding of scenarios. Additionally, current approaches do not support end-to-end training and instead rely on a two-stage pipeline, which incurs higher computational costs. To address these issues, we propose an end-to-end Association Reasoning Network (ARN) for DSGG. ARN leverages CLIP's semantic priors to model fine-grained triplet cues to generate scene graphs. In addition, we design a Predicate Association Parsing (PAP) module that employs a conditional weight mapping mechanism to structure entity and predicate representations. We further introduce a Hierarchical Attention (HA) mechanism to integrate spatio-temporal context with entity and predicate representations, enabling effective associative reasoning. Extensive experiments on the Action Genome dataset demonstrate significant performance improvements over existing methods. The source code is available at https://github.com/wlw951226/ARN.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
Taowen Wang
Rochester Institute of Technology
Cheng Han
University of Missouri - Kansas City
James Liang
U.S. Naval Research Laboratory
Wenhao Yang
Lamar University
Dongfang Liu
Rochester Institute of Technology
Luna Xinyu Zhang
Rochester Institute of Technology
Qifan Wang
Meta AI
Jiebo Luo
University of Rochester
Ruixiang Tang
Rutgers University
Abstract
Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. Despite their significant capabilities, VLA models introduce new attack surfaces. This paper systematically evaluates their robustness. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, we advance both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for continuously developing robust defense strategies prior to physical-world deployments.
Faster and Better 3D Splatting via Group Training
Chengbo Wang
School of Design, Hunan University
Guozheng Ma
Nanyang Technological University
Yifei Xue
School of Design, Hunan University
Yizhen Lao
School of Design, Hunan University
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, demonstrating remarkable capability in high-fidelity scene reconstruction through its Gaussian primitive representations. However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. This approach shows universal compatibility with existing 3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently achieving accelerated training while maintaining superior synthesis quality. Extensive experiments reveal that our straightforward Group Training strategy achieves up to 30% faster convergence and improved rendering quality across diverse scenarios. Project Website: https://chengbo-wang.github.io/3DGSwith-Group-Training/.
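A minimal sketch of the grouping idea, assuming a plain random partition of primitive indices that is cycled through during optimization; the paper's actual grouping criterion and schedule may differ, and all names below are hypothetical.

    import numpy as np

    def make_groups(num_gaussians, num_groups, seed=0):
        """Randomly partition Gaussian primitive indices into groups."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_gaussians), num_groups)

    # Hypothetical training loop: optimize one group of primitives per iteration.
    # groups = make_groups(num_gaussians=500_000, num_groups=4)
    # active = groups[iteration % len(groups)]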
From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
Sen Wang
East China Normal University
Shao Zeng
Tencent Youtu Lab
Tianjun Gu
East China Normal University
Zhizhong Zhang
East China Normal University
Ruixin Zhang
Tencent Youtu Lab
Shouhong Ding
Tencent Youtu Lab
Jingyun Zhang
Tencent WeChat Pay Lab
Jun Wang
Tencent WeChat Pay Lab
Xin Tan
East China Normal University
Yuan Xie
East China Normal University
Lizhuang Ma
East China Normal University
Abstract
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation. The code is available at GEFU.
HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
Yulin Wang
Southeast University
Mengting Hu
Southeast University
Hongli Li
Purdue University
Chen Luo
Southeast University
Abstract
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms them across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
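An illustrative sketch of densifying 2D-3D correspondences by sampling object-space coordinates between predicted front and back surfaces before running PnP; the HCCE encoding itself is not shown, and the function name, layer count, and inputs are assumptions.

    import numpy as np
    import cv2

    def dense_pnp(front_xyz, back_xyz, mask, uv, K, n_layers=8):
        """front_xyz, back_xyz: (H, W, 3) predicted object-space coordinates;
        mask: (H, W) bool foreground mask; uv: (H, W, 2) pixel coordinates;
        K: (3, 3) camera intrinsics."""
        obj, img = [], []
        for t in np.linspace(0.0, 1.0, n_layers):        # interpolate front -> back
            obj.append((1.0 - t) * front_xyz[mask] + t * back_xyz[mask])
            img.append(uv[mask])
        obj = np.concatenate(obj).astype(np.float32)
        img = np.concatenate(img).astype(np.float32)
        ok, rvec, tvec, _ = cv2.solvePnPRansac(obj, img, K, None)
        return (rvec, tvec) if ok else None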
Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection
Hanshi Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Jin Gao
School of Artificial Intelligence, University of Chinese Academy of Sciences
Weiming Hu
School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Abstract
We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retaining complete scene information. Inspired by recent advances in state-space models (SSMs) [8] and linear attention [35, 43], we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with the top-tier NDS score of 75.0 on the nuScenes [2] validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods. Code is available at https://github.com/AutoLab-SAI-SJTU/MambaFusion
HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
Yida Wang
Li Auto Inc.
Xueyang Zhang
Li Auto Inc.
Kun Zhan
Li Auto Inc.
Peng Jia
Li Auto Inc.
Xianpeng Lang
Li Auto Inc.
Abstract
Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. To resolve these issues through a unified pipeline, we introduce: 1) Differential visibility verification through SDF-guided ray tracing, resolving reflection ambiguities via continuous occlusion modeling; 2) Planar-conformal regularization via ray-aligned geometry patches that enforce local surface coherence while preserving sharp edges through adaptive appearance weighting; and 3) Physically-grounded Eikonal relaxation that dynamically modulates geometric constraints based on local radiance gradients, enabling detail preservation without sacrificing global regularity. Unlike prior methods that handle these aspects through sequential optimizations or isolated modules, our approach achieves cohesive integration where appearance-geometry constraints evolve synergistically throughout training. Comprehensive evaluations across synthetic and real-world datasets demonstrate SotA performance, including a 21.4% reduction in Chamfer distance over reflection-aware baselines and a 2.32 dB PSNR improvement against neural rendering counterparts. Qualitative analyses reveal superior capability in recovering specular instruments, urban layouts with centimeter-scale infrastructure, and low-textured surfaces without local patch collapse. The method's generalizability is further validated through successful application to inverse rendering tasks, including material decomposition and view-consistent relighting. The project is hosted here; the urban and vehicle reconstruction modules are excluded from the open-sourced code due to legal concerns.
HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
Yu Wang
School of Remote Sensing and Information Engineering, Wuhan University
Bo Dang
School of Remote Sensing and Information Engineering, Wuhan University
Wanchun Li
School of Remote Sensing and Information Engineering, Wuhan University
Wei Chen
School of Remote Sensing and Information Engineering, Wuhan University
Yansheng Li
School of Remote Sensing and Information Engineering, Wuhan University
Abstract
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at github.com/vvangfaye/HoliTracer
LA-MOTR: End-to-End Multi-Object Tracking by Learnable Association
Peng Wang
School of Information, Renmin University of China
Yongcai Wang
School of Information, Renmin University of China
Hualong Cao
School of Information, Renmin University of China
Wang Chen
School of Information, Renmin University of China
Deying Li
School of Information, Renmin University of China
Abstract
This paper proposes LA-MOTR, a novel Tracking-by-Learnable-Association framework that resolves the competing optimization objectives between detection and association in end-to-end Tracking-by-Attention (TbA) Multi-Object Tracking. Current TbA methods employ shared decoders for simultaneous object detection and tracklet association, often resulting in task interference and suboptimal accuracy. By contrast, our end-to-end framework decouples these tasks into two specialized modules: Separated Object-Tracklet Detection (SOTD) and Spatial-Guided Learnable Association (SGLA). This decoupled design offers flexibility and explainability. In particular, SOTD independently detects new objects and existing tracklets in each frame, while SGLA associates them via a Spatial-Weighted Learnable Attention module guided by relative spatial cues. Temporal coherence is further maintained through a Tracklet Updates Module. The learnable association mechanism resolves the inherent suboptimal association issues in decoupled frameworks, avoiding the task interference commonly observed in joint approaches. Evaluations on the DanceTrack, MOT17, and SportMOT datasets demonstrate state-of-the-art performance. Extensive ablation studies validate the effectiveness of the designed modules. Code is available at https://github.com/PenK1nG/LA-MOTR.
LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
Zijie Wang
Sun Yat-sen University
Weiming Zhang
Baidu Inc.
Wei Zhang
Baidu Inc.
Xiao Tan
Baidu Inc.
Hongxing Liu
Baidu Inc.
Yaowei Wang
Harbin Institute of Technology, Shenzhen
Abstract
Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task. Code will be available at: https://github.com/ZJWang9928/LaneDiffusion.
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang
City University of Hong Kong
Longguang Wang
Shenzhen Campus, Sun Yat-sen University
Chenghao Zhang
Chinese Academy of Sciences
Yongjian Zhang
Shenzhen Campus, Sun Yat-sen University
Zhanjie Zhang
Zhejiang University
Ao Ma
JD.com
Chenyou Fan
South China Normal University
Tin Lun Lam
The Chinese University of Hong Kong, Shenzhen
Junjie Hu
The Chinese University of Hong Kong, Shenzhen
Abstract
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
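As a rough PyTorch analogue of the MoE-LoRA idea (mixing low-rank experts of different ranks on top of a frozen layer via a learned router), the sketch below is an assumption-laden illustration; the ranks, routing, decision network, and placement used in SMoEStereo may differ.

    import torch
    import torch.nn as nn

    class MoELoRALinear(nn.Module):
        """Frozen linear layer plus a softmax-weighted mixture of LoRA experts."""
        def __init__(self, base: nn.Linear, ranks=(2, 4, 8), alpha=1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                 # keep the VFM weights frozen
            d_in, d_out = base.in_features, base.out_features
            self.downs = nn.ModuleList(nn.Linear(d_in, r, bias=False) for r in ranks)
            self.ups = nn.ModuleList(nn.Linear(r, d_out, bias=False) for r in ranks)
            for up in self.ups:
                nn.init.zeros_(up.weight)               # experts start as a no-op
            self.router = nn.Linear(d_in, len(ranks))
            self.alpha = alpha

        def forward(self, x):                           # x: (B, N, d_in) tokens
            gates = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (B, E)
            out = self.base(x)
            for e, (down, up) in enumerate(zip(self.downs, self.ups)):
                out = out + self.alpha * gates[:, e, None, None] * up(down(x))
            return out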
LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
Jingjing Wang
State Key Lab of CAD&CG, Zhejiang University
Qirui Hu
State Key Lab of CAD&CG, Zhejiang University
Chong Bao
State Key Lab of CAD&CG, Zhejiang University
Yuke Zhu
State Key Lab of CAD&CG, Zhejiang University
Hujun Bao
State Key Lab of CAD&CG, Zhejiang University
Zhaopeng Cui
State Key Lab of CAD&CG, Zhejiang University
Guofeng Zhang
State Key Lab of CAD&CG, Zhejiang University
Abstract
Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, over 50K images at varying scales spanning street-level and aerial perspectives, and rich properties such as depth, normals, material components, light and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research. Project page: https://zju3dv.github.io/lightcity/.
MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration
Tao Wang
Nanjing University
Peiwen Xia
Nanjing University
Bo Li
vivo Mobile Communication Co., Ltd
Peng-Tao Jiang
vivo Mobile Communication Co., Ltd
Zhe Kong
Shenzhen Campus of Sun Yat-sen University
Kaihao Zhang
Harbin Institute of Technology (Shenzhen)
Tong Lu
Nanjing University
Wenhan Luo
The Hong Kong University of Science and Technology
Abstract
Adverse weather conditions, such as rain, snow, and haze, introduce complex degradations that present substantial challenges for effective image restoration. Existing all-in-one models often rely on fixed network structures, limiting their ability to adapt to the varying characteristics of different weather conditions. Moreover, these models typically lack the iterative refinement process that human experts use for progressive image restoration. In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions. Our method incorporates two core types of experts, i.e., channel-wise modulation and spatial modulation experts, to address task-specific degradation characteristics while minimizing task interference. In addition, inspired by human expertise, we frame the optimization process as a sequential, progressive problem, allowing the network to refine its parameters progressively and adapt to specific weather conditions. Extensive experiments demonstrate the efficacy and superiority of our proposed method.
MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Shibo Wang
The Hong Kong University of Science and Technology (Guangzhou)
Haonan He
The Hong Kong University of Science and Technology (Guangzhou)
Maria Parelli
ETH Zürich
Christoph Gebhardt
The Hong Kong University of Science and Technology (Guangzhou)
Zicong Fan
ETH Zürich
Jie Song
The Hong Kong University of Science and Technology
Abstract
Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align the hand to the object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction. († Prior to joining the University of Tübingen and Tübingen AI Center.)
Mamba-3VL: Taming State Space Model for 3D Vision Language Learning
Yuan Wang
Tsinghua University
Yuxin Chen
ARC Lab, Tencent PCG
Zhongang Qi
ARC Lab, Tencent PCG
Lijun Liu
UCAS
Jile Jiao
Deepeleph
Xuetao Feng
Deepeleph
Yujia Liang
HUST
Ying Shan
ARC Lab, Tencent PCG
Zhipeng Zhang
School of Artificial Intelligence, SJTU
Abstract
3D vision-language (3D-VL) reasoning, connecting natural language with the 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limit effective modeling of long-range 3D-VL dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSM) have emerged as promising linear-complexity alternatives for sequential data processing, while their inherent selection mechanism offers notable capability for spatial modeling. Despite this potential, a straightforward adoption of Mamba for 3D-VL tasks encounters two obstacles: (1) how to perceive the position of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergies of multi-modal features. (This work was done during an internship at Tencent. The code is released at https://github.com/wangyuan123ac/Mamba-3VL.) In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework to model complex intra- and inter-modality correlations and enhance spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, the Mamba Mixer explicitly models 3D-VL interaction via channel twisting and a relation-prioritized spatial scanning policy. It maximally retains the spatial relations of object-centric features. To further provide precise spatial encoding for Mamba, we develop an Instance-aware Dynamic Position Adapter (IDPA) to dynamically adjust instance-specific positional embeddings and enhance local spatial relations of 3D objects. Extensive results validate that Mamba-3VL outperforms other competitors on seven 3D-VL benchmarks and showcases versatile potential for challenging Embodied AI tasks.
MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Zihan Wang
Carnegie Mellon University
Jeff Tan
Carnegie Mellon University
Tarasha Khurana
Carnegie Mellon University
Neehar Peri
Carnegie Mellon University
Deva Ramanan
Carnegie Mellon University
Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on Github.
Monocular Semantic Scene Completion via Masked Recurrent Networks
Xuzhi Wang
Tianjin Normal University
Xinran Wu
Tianjin Normal University
Song Wang
Zhejiang University
Lingdong Kong
National University of Singapore
Ziping Zhao
Tianjin Normal University
Abstract
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available at: https://github.com/alanWXZ/MonoMRN.
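A small sketch in the spirit of a recurrence restricted to masked (occupied) positions; the actual MS-GRU gating, sparsity, and mask-update rule are not reproduced, and the class below is purely illustrative.

    import torch
    import torch.nn as nn

    class MaskedGRU(nn.Module):
        """Update hidden states only at positions selected by a mask."""
        def __init__(self, dim):
            super().__init__()
            self.cell = nn.GRUCell(dim, dim)

        def forward(self, h, x, mask):
            # h, x: (N, dim) hidden/input features over N voxels; mask: (N,) bool
            h_new = h.clone()
            if mask.any():
                h_new[mask] = self.cell(x[mask], h[mask])
            return h_new                                # unmasked states pass through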
Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang
Northwestern Polytechnical University
Yifei Su
University of Chinese Academy of Sciences
Chenhui Li
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Yan Huang
University of Chinese Academy of Sciences
Xuelong Li
TeleAI
Bin Zhao
Northwestern Polytechnical University
Abstract
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely used datasets, demonstrating the versatility and effectiveness of our method. Code is available here.
PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
Haotian Wang
Xi'an Jiaotong University
Aoran Xiao
Nanyang Technological University
Xiaoqin Zhang
Zhejiang University of Technology
Meng Yang
Xi'an Jiaotong University
Shijian Lu
Nanyang Technological University
Abstract
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
Precise Action-to-Video Generation Through Visual Action Prompts
Yuang Wang
Zhejiang University
Chao Wen
Fudan University
Haoyu Guo
Zhejiang University
Sida Peng
Zhejiang University
Minghan Qin
Tsinghua University
Hujun Bao
Zhejiang University
Xiaowei Zhou
Zhejiang University
Ruizhen Hu
Xiangjiang Lab
Abstract
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to 'render' actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight finetuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid [64], RT-1 [11] and DROID [35] demonstrate the effectiveness of our proposed approach.
ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts
Xiaoqi Wang
Bosch Research North America
Clint Sebastian
Bosch Center for Artificial Intelligence (BCAI)
Wenbin He
Bosch Research North America
Liu Ren
Bosch Research North America
Abstract
The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at the boundaries of target regions due to a suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5i and COCO-20i datasets, providing a more robust solution for visual reference segmentation.
Recognizing Actions from Robotic View for Natural Human-Robot Interaction
Ziyi Wang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Peiming Li
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Zhichao Deng
Sun Yat-sen University
Can Wang
Kiel University
Jun Liu
Lancaster University
Junsong Yuan
State University of New York at Buffalo
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Abstract
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
Liuyi Wang
Tongji University
Xinyuan Xia
Shanghai AI Laboratory
Hui Zhao
Shanghai AI Laboratory
Hanqing Wang
Shanghai AI Laboratory
Tai Wang
Shanghai AI Laboratory
Yilun Chen
Shanghai AI Laboratory
Chengju Liu
Tongji University
Qijun Chen
Tongji University
Jiangmiao Pang
Shanghai AI Laboratory
Abstract
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io.
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang
NLPR, MAIS, CASIA
Yucheng Zhao
Dexmal
Tiancai Wang
Dexmal
Haoqiang Fan
Dexmal
Xiangyu Zhang
MEGVII Technology
Zhaoxiang Zhang
NLPR, MAIS, CASIA
Abstract
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (ROSS3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, ROSS3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
S3E: Self-Supervised State Estimation for Radar-Inertial System
Shengpeng Wang
Huazhong University of Science and Technology
Yulong Xie
Huazhong University of Science and Technology
Qing Liao
Harbin Institute of Technology
Wei Wang
Wuhan University
Abstract
Millimeter-wave radar for state estimation is gaining significant attention for its affordability and reliability in harsh conditions. Existing localization solutions typically rely on post-processed radar point clouds as landmark points. Nonetheless, the inherent sparsity of radar point clouds, ghost points from multi-path effects, and the limited angle resolution of single-chirp radar severely degrade state estimation performance. To address these issues, we propose S3E, a Self-Supervised State Estimator that employs more richly informative radar signal spectra to bypass sparse points and fuses complementary inertial information to achieve accurate localization. S3E fully explores the association between the exteroceptive radar and proprioceptive inertial sensors to achieve complementary benefits. To deal with limited angle resolution, we introduce a novel cross-fusion technique that enhances spatial structure information by exploiting subtle rotational shift correlations across heterogeneous data. The experimental results demonstrate that our method achieves robust and accurate performance without relying on localization ground truth supervision. To the best of our knowledge, this is the first attempt to achieve state estimation by fusing radar spectra and inertial data in a complementary self-supervised manner.
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Xiyao Wang
University of Maryland, College Park
Zhengyuan Yang
Microsoft
Linjie Li
Microsoft
Hongjin Lu
University of Maryland, College Park
Yuancheng Xu
University of Maryland, College Park
Chung-Ching Lin
Microsoft
Kevin Lin
Microsoft
Furong Huang
University of Maryland, College Park
Lijuan Wang
Microsoft
Abstract
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the generated sentence at the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods guided by other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
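To make the search procedure described above concrete, the following is a minimal sketch of sentence-level, value-guided decoding under stated assumptions: sample_candidates and value_model are hypothetical stand-ins for the VLM sampler and the learned value model, not the authors' code.

import random

def sample_candidates(prefix, k):
    # Stand-in for a VLM that proposes k candidate next sentences given a prefix.
    return [f"{prefix}<candidate sentence {i}> " for i in range(k)]

def value_model(image, response):
    # Stand-in for a learned value model scoring long-term response quality.
    return random.random()

def guided_decode(image, num_steps=3, k=4):
    response = ""
    for _ in range(num_steps):
        candidates = sample_candidates(response, k)
        # Keep the candidate with the highest predicted long-term value.
        response = max(candidates, key=lambda c: value_model(image, c))
    return response

print(guided_decode(image=None))

In practice, the value model would score image-conditioned partial captions rather than random placeholders; the loop structure is the point of the sketch.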
Shape of Motion: 4D Reconstruction from a Single Video
Qianqian Wang
UC Berkeley
Vickie Ye
UC Berkeley
Hang Gao
UC Berkeley
Weijia Zeng
UC San Diego
Jake Austin
UC Berkeley
Zhengqi Li
Adobe Research
Angjoo Kanazawa
UC Berkeley
Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
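As a rough illustration of expressing per-point motion as a weighted combination of a small set of rigid motion bases, here is a minimal numpy sketch; the linear blend of transformed points and all tensor shapes are expository assumptions, not the paper's exact SE(3) parameterization.

import numpy as np

def blend_motion(points, rotations, translations, weights):
    """points: (N,3); rotations: (B,3,3); translations: (B,3); weights: (N,B)."""
    # Transform every point by every rigid basis: (B, N, 3)
    transformed = np.einsum('bij,nj->bni', rotations, points) + translations[:, None, :]
    # Soft-blend the B rigid candidates per point with the per-point weights.
    return np.einsum('nb,bni->ni', weights, transformed)

N, B = 5, 3
points = np.random.randn(N, 3)
rotations = np.stack([np.eye(3)] * B)          # identity rotations for simplicity
translations = np.random.randn(B, 3) * 0.1
weights = np.random.rand(N, B)
weights /= weights.sum(axis=1, keepdims=True)  # convex combination per point
print(blend_motion(points, rotations, translations, weights).shape)  # (5, 3)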
StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
Chuxin Wang
University of Science and Technology of China
Yixin Zha
University of Science and Technology of China
Wenfei Yang
University of Science and Technology of China
Tianzhu Zhang
University of Science and Technology of China
Abstract
Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging the State Space Model (SSM) with its efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: destroying the adjacency of 3D points during SSM processing, and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains SOTA accuracy of 95.1% on ModelNet40 and 92.75% on the most challenging split of ScanObjectNN without a voting strategy.
The Source Image is the Best Attention for Infrared and Visible Image Fusion
Song Wang
School of Computer Science and Technology, North University of China
Xie Han
School of Computer Science and Technology, North University of China
Liqun Kuang
School of Computer Science and Technology, North University of China
Boying Wang
School of Computer Science and Technology, North University of China
Zhongyu Chen
School of Computer Science and Technology, North University of China
Zherui Qiao
School of Computer Science and Technology, North University of China
Fan Yang
School of Computer Science and Technology, North University of China
Xiaoxia Liu
School of Computer Science and Technology, North University of China
Bingyu Zhang
School of Computer Science and Technology, North University of China
Zhixun Wang
School of Computer Science and Technology, North University of China
Abstract
Infrared and visible image fusion (IVF) endeavors to engineer composite outputs by blending the optimal virtues of divergent modalities. This paper is the first to reveal the intrinsic 'attention properties' of infrared images, which arise directly from their physical characteristics (i.e., heat distribution) and can be linked naturally to attention mechanisms, as observed in gradient-weighted class activation mapping (Grad-CAM) visualizations of image classification models. To incorporate this property into IVF for better fusion, we propose source infrared cross attention (I-SCA) and further extend it to the visible modality, introducing source visible cross attention (V-SCA). The joint use of I-SCA and V-SCA greatly alleviates longstanding issues in IVF, such as insufficient and incomplete multimodal feature interaction and fusion. Moreover, an auxiliary component for I-SCA and V-SCA, termed CBSM, is employed to boost the channel and spatial map space and to suppress redundant and misleading information in the source images. Specifically, we treat the CBSM-processed raw image directly as the query, while the intermediate features of the other modality serve as keys and values in I-SCA and V-SCA. Unlike attention mechanisms that divide images into patches or limit computation to local windows, our cross attention modules achieve smoother and more robust IVF through true global modeling across the entire image space with linear complexity. Comparisons with current SOTA methods on three popular public datasets confirm its superiority.
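For readers unfamiliar with cross attention in which the raw source image supplies the queries, the torch sketch below illustrates the idea; it uses standard softmax attention for brevity (the paper targets a linear-complexity global formulation), and the module names and shapes are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class SourceCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, raw_tokens, other_feats):
        """raw_tokens: (B, N, D) from the raw source image; other_feats: (B, M, D)."""
        q, k, v = self.q(raw_tokens), self.k(other_feats), self.v(other_feats)
        # Scaled dot-product attention: the raw-image queries attend to the
        # intermediate features of the other modality (keys and values).
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                            # (B, N, D)

layer = SourceCrossAttention(dim=32)
print(layer(torch.randn(2, 64, 32), torch.randn(2, 49, 32)).shape)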
TopicGeo: An Efficient Unified Framework for Geolocation
Xin Wang
Xidian University
Xinlin Wang
Xidian University
Shuiping Gou
Xidian University
Abstract
Vision-based geolocation techniques that establish spatial correspondences between smaller query images and larger georeferenced images have gained significant attention. Existing approaches typically employ a separate 'retrieve-then-match' paradigm, but such paradigms suffer from either computational inefficiency or precision limitations. To this end, we propose TopicGeo, a unified framework for direct and precise query-to-reference image matching via three key innovations. Textual object semantics, called topics, distilled from CLIP prompt learning are embedded into the geolocation framework to eliminate intra-class and inter-class distribution discrepancies while also enhancing processing efficiency. Center-based adaptive label assignment and outlier rejection mechanisms serve as a joint retrieval-matching optimization strategy, ensuring task-coherent feature learning and precise spatial correspondences. A multi-level fine matching pipeline is introduced to refine matching in terms of both quality and quantity. Evaluations on large-scale synthetic and real-world datasets illustrate that TopicGeo achieves state-of-the-art performance in retrieval recall and matching accuracy while remaining computationally efficient.
TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
Mengmeng Wang
Zhejiang University of Technology
Haonan Wang
Zhejiang University of Technology
Yulong Li
Zhejiang University of Technology
Xiangjie Kong
Zhejiang University of Technology
Jiaxin Du
Zhejiang University of Technology
Guojiang Shen
Zhejiang University of Technology
Feng Xia
RMIT University
Abstract
3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.
UAVScenes: A Multi-Modal Dataset for UAVs
Sijie Wang
Nanyang Technological University
Siqi Li
Nanyang Technological University
Yawei Zhang
Nanyang Technological University
Shangshu Yu
School of Computer Science and Engineering, Northeastern University
Shenghai Yuan
Nanyang Technological University
Rui She
Beihang University
Quanjiang Guo
University of Electronic Science and Technology of China
Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Yuping Wang
University of California, Riverside
Xiangyu Huang
University of Wisconsin, Madison
Xiaokang Sun
University of Michigan
Mingxuan Yan
University of California, Riverside
Shuo Xing
Texas A&M University
Zhengzhong Tu
Texas A&M University
Jiachen Li
University of California, Riverside
Abstract
We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and novel per-voxel flow annotations. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment of additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.
V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception
Bowen Wang
Shanghai Jiao Tong University
Yafei Wang
Shanghai Jiao Tong University
Wei Gong
Shanghai Jiao Tong University
Siheng Chen
Shanghai AI Laboratory
Genjia Liu
Shanghai Jiao Tong University
Minhao Xiong
Shanghai Jiao Tong University
Chin Long Ng
Shanghai Jiao Tong University
Abstract
Whether autonomous driving can effectively handle challenging scenarios such as bad weather and complex traffic environments remains in doubt. One critical difficulty is that single-view perception struggles to obtain complementary perceptual information in multi-condition scenes, for example under occlusion and congestion. To investigate the advantages of collaborative perception in high-risk driving scenarios, we construct a multiple-challenging-conditions dataset for large-range vehicle-infrastructure cooperative perception, called V2XScenes, which includes seven typical multi-modal layouts along successive road sections. In particular, each selected scene is labeled with a specific condition description, and we provide unique object tracking numbers across the entire road section and sequential frames to ensure consistency. Comprehensive cooperative perception benchmarks for 3D object detection and tracking in large-range roadside scenes are summarized, and quantitative results based on state-of-the-art methods demonstrate the effectiveness of collaborative perception in challenging scenes. The data and benchmark code of V2XScenes will be released.
VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference
Meiqi Wang
Tsinghua University
Han Qiu
Tsinghua University
Abstract
In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language models (VLMs) to enhance open-vocabulary capability. However, adopting them on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs are insufficient for complex models' inference. We reveal their lack of a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. Also, VISO suppresses irrelevant regions, enabling highly sparse inference to accelerate speed on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, improving AP by 34.1% while reducing FLOPs by 27x, and surpasses specialist models in supervised object detection and object referring, improving AP by 2.3%. When sparsifying VISO to a comparable AP, FLOPs can be further reduced by up to 8.5x. Real-world tests reveal that VISO achieves a 2.8-4.8x FPS speed-up on satellites' embedded GPUs.
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University
Haoyi Zhu
Shanghai AI Lab
Mingyu Liu
USTC
Jiange Yang
ZJU
Hao-Shu Fang
NJU
Tong He
SJTU
Abstract
In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly; most notably, it achieves up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains.
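As a minimal illustration of vector-quantized action tokenization, the sketch below maps continuous action chunks to discrete codebook indices by nearest-neighbour lookup; the codebook size, action dimensionality, and function names are assumptions for exposition, not the released tokenizer.

import numpy as np

def quantize_actions(actions, codebook):
    """actions: (T, D) continuous actions; codebook: (K, D) code vectors.
    Returns integer tokens (T,) and the reconstructed (de-tokenized) actions."""
    # Squared distances between every action and every code vector: (T, K)
    d2 = ((actions[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.random.randn(256, 7)      # e.g. 7-DoF actions, 256 codes (assumed)
actions = np.random.randn(20, 7)        # a short action trajectory
tokens, recon = quantize_actions(actions, codebook)
print(tokens[:5], np.abs(actions - recon).mean())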
VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders
Qi Wang
School of Mathematics and Computer Sciences, Nanchang University
Zeyu Zhang
School of Mathematics and Computer Sciences, Nanchang University
Dong Wang
School of Software, Nanchang University
Di Gai
School of Mathematics and Computer Sciences, Nanchang University
Xin Xiong
The First Affiliated Hospital, Jiangxi Medical College, Nanchang University
Jiyang Xu
School of Mathematics and Computer Sciences, Nanchang University
Ruihua Zhou
School of Software, Nanchang University
Abstract
Large-scale pre-training technology has achieved remarkable performance in diversified object re-identification (Re-ID) downstream tasks. Nevertheless, to the best of our knowledge, a pre-training model specifically for vehicle Re-ID, which focuses on tackling the challenge of multi-view variations, has not been fully investigated. In this paper, we first leverage a diffusion model to build a large-scale vehicle Re-ID benchmark dataset, dubbed 'DiffVERI', containing over 1700K images with abundant multi-view annotations. Based on this dataset, we further present VehicleMAE, a novel masked image modeling pre-training paradigm that learns view-invariant representations by performing mutual distillation in a self-supervised manner. To be specific, the pipeline of VehicleMAE comprises two core modules, i.e., view-asymmetry masked image modeling (VMIM) and past-to-present mutual-distillation (PPMD). Technically, VMIM consists of two homogeneous masked autoencoders (MAE) that simultaneously reconstruct the RGB pixels and multi-view semantic information of the specific vehicle body region via paired asymmetric mask sampling strategies. To progressively distill the knowledge of the model itself, PPMD treats the two MAEs in the current epoch and the previous one as the student models and the teacher models, respectively, and leverages the knowledge learned by the current student and the historical teacher for mutual feature-level distillation. Extensive experimental results verify that the proposed pre-training paradigm on DiffVERI yields compelling downstream task performance for vehicle Re-ID.
YOLOE: Real-Time Seeing Anything
Ao Wang
School of Software, Tsinghua University
Lihao Liu
School of Software, Tsinghua University
Hui Chen
BNRist, Tsinghua University
Zijia Lin
School of Software, Tsinghua University
Jungong Han
Department of Automation, Tsinghua University
Guiguang Ding
School of Software, Tsinghua University
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose a Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present a Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce a Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3x less training cost and 1.4x inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP^b and 0.4 AP^m gains over closed-set YOLOv8-L with nearly 4x less training time. Code and models are available here.
You Think, You ACT: The New Task of Arbitrary Text to Motion Generation
Runqi Wang
National Engineering Research Center for Multimedia Software, Wuhan University
Caoyuan Ma
School of Computer Science, Wuhan University
Guopeng Li
StepFun
Hanrui Xu
School of Computer Science, Wuhan University
Yuke Li
University of Maryland College Park
Zheng Wang
National Engineering Research Center for Multimedia Software, Wuhan University
Abstract
Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include action labels (e.g., 'walk, bend'), which limits flexibility and practicability in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, newly introduced Scene Texts may yield multiple reasonable output results, causing significant challenges for existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HUMANML3D++, by extending the texts of the well-annotated dataset HUMANML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction. More details are available at https://github.com/RunqiWang77/TAAT.github.io.
ZeroStereo: Zero-shot Stereo Matching from Single Images
Xianqi Wang
Huazhong University of Science and Technology
Hao Yang
Huazhong University of Science and Technology
Gangwei Xu
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Min Lin
Huazhong University of Science and Technology
Yong Deng
Autel Robotics
Jinliang Zang
Autel Robotics
Yurui Chen
Autel Robotics
Xin Yang
Optics Valley Laboratory
Abstract
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets, with only a dataset volume comparable to Scene Flow. Code: https://github.com/Windsrain/ZeroStereo.
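The core warping step, synthesizing a right view from a single image and a pseudo-disparity map while leaving occluded holes for a subsequent inpainting model, can be sketched as follows; this naive forward warp ignores depth ordering and is purely illustrative, not the authors' pipeline.

import numpy as np

def forward_warp_right(left, disparity):
    """left: (H, W, 3); disparity: (H, W) non-negative, in pixels."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    hole_mask = np.ones((H, W), dtype=bool)   # True where nothing lands (occlusions)
    for y in range(H):
        for x in range(W):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < W:
                right[y, xr] = left[y, x]
                hole_mask[y, xr] = False
    return right, hole_mask

left = np.random.rand(8, 16, 3)
disp = np.full((8, 16), 2.0)
right, holes = forward_warp_right(left, disp)
print(right.shape, holes.sum(), "hole pixels to inpaint")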
MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation
Syed Talal Wasim
University of Bonn
Hamid Suleman
University of Bonn
Olga Zatsarynna
University of Bonn
Muzammal Naseer
Khalifa University
Juergen Gall
University of Bonn
Abstract
We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. While recent State Space Models (SSMs) like Mamba have shown promise through input-dependent selectivity on three key parameters, the critical forget-gate (A matrix) controlling temporal memory remains static. We address this limitation by introducing a mixture of experts approach that dynamically selects contextually relevant A matrices based on input features, enhancing representational capacity without sacrificing computational efficiency. Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets demonstrate that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Our results highlight the importance of input-dependent forget-gate mechanisms for reliable prediction of human behavior in diverse real-world scenarios. The project page is available at https://talalwasim.github.io/MixANT/.
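A minimal sketch of routing over several candidate forget-gate parameter sets is given below; the plain leaky recurrence, the sigmoid squashing, and the tensor shapes are illustrative assumptions and do not reproduce the Mamba selective-scan machinery.

import torch
import torch.nn as nn

class MixtureForgetGate(nn.Module):
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_experts, dim) * 0.1)  # per-expert decay params
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        """x: (B, T, D) input features; returns a simple gated running state."""
        B, T, D = x.shape
        h = torch.zeros(B, D)
        for t in range(T):
            gate = torch.softmax(self.router(x[:, t]), dim=-1)      # (B, E) routing weights
            A_t = torch.einsum('be,ed->bd', gate, self.A)           # input-dependent mixed A
            decay = torch.sigmoid(A_t)                              # keep the gate in (0, 1)
            h = decay * h + (1.0 - decay) * x[:, t]                 # leaky recurrence
        return h

layer = MixtureForgetGate(dim=16)
print(layer(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 16])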
3D Test-time Adaptation via Graph Spectral Driven Point Shift
Xin Wei
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Qin Yang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Yijie Fang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Mingrui Zhu
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Nannan Wang
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University
Abstract
While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by the irregular and unordered structure of the data. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in the target domain are represented as outlier-aware graphs and transformed into the graph spectral domain by the Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud's energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, which outperforms existing TTA methods for 3D point cloud classification.
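The sketch below illustrates the spectral pipeline described above: build a point-cloud graph, take the graph Fourier transform, perturb only the lowest 10% of frequency components, and invert. The kNN graph construction and the random update standing in for the learned optimization are assumptions for exposition.

import numpy as np

def knn_laplacian(points, k=8):
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]          # k nearest neighbours (skip self)
    W = np.zeros((len(points), len(points)))
    rows = np.repeat(np.arange(len(points)), k)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                            # symmetrize the adjacency
    return np.diag(W.sum(1)) - W                      # combinatorial graph Laplacian

points = np.random.randn(64, 3)
L = knn_laplacian(points)
eigvals, U = np.linalg.eigh(L)                        # GFT basis (columns of U)
spec = U.T @ points                                   # graph Fourier coefficients, (N, 3)

low = max(1, int(0.1 * len(points)))                  # lowest 10% of frequencies
spec[:low] += 0.01 * np.random.randn(low, 3)          # stand-in for the learned update
adapted = U @ spec                                    # inverse GFT -> shifted point cloud
print(adapted.shape)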
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance
Yi-Lin Wei
School of Computer Science and Engineering, Sun Yat-sen University
Mu Lin
School of Computer Science and Engineering, Sun Yat-sen University
Yuhao Lin
School of Computer Science and Engineering, Sun Yat-sen University
Jian-Jian Jiang
School of Computer Science and Engineering, Sun Yat-sen University
Xiao-Ming Wu
School of Computer Science and Engineering, Sun Yat-sen University
Ling-An Zeng
School of Computer Science and Engineering, Sun Yat-sen University
Wei-Shi Zheng
School of Computer Science and Engineering, Sun Yat-sen University
Abstract
Language-guided robot dexterous grasp generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intention and to execute grasping for unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose an Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging the gap with a new generalizable-instructive affordance representation. This affordance can generalize to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon the affordance, our framework introduces Affordance Flow Matching (AFM) for affordance generation with language as input, and Grasp Flow Matching (GFM) for generating dexterous grasps with affordance as input. To evaluate our framework, we build an open-set table-top language-guided dexterous grasp dataset. Extensive experiments in both simulation and the real world show that our framework surpasses all previous methods in open-set generalization.
EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
Xiaobao Wei
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Qingpo Wuwu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhongyu Zhao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhuangzhe Wu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Nan Huang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ming Lu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ningning Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shanghang Zhang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians either in a supervised manner (e.g., with 3D bounding boxes) or in a self-supervised manner (e.g., without 3D bounding boxes). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed plug-and-play EMD module compensates for the lack of motion modeling in self-supervised street Gaussian splatting methods. We also introduce tailored training strategies to extend EMD to supervised approaches. Comprehensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art novel view synthesis performance in self-supervised settings. The code is available at: https://qingpowuwu.github.io/emd.
Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
Xiangbin Wei
Shenzhen University
Yuanfeng Wang
Quantum Science Center of Guangdong-Hong Kong-Macao Greater Bay Area
Ao XU
Research Institute of Tsinghua University in Shenzhen
Lingyu Zhu
YunJi Intelligent Engineering Co., Ltd.
Dongyong Sun
YunJi Intelligent Engineering Co., Ltd.
Keren Li
Shenzhen University
Yang Li
Nanjing University
Qi Qin
City University of Hong Kong
Abstract
Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie's formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods and thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance among unsupervised learning methods on standard benchmarks in terms of Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond training datasets. By addressing the generalization issue and the challenge of the absence of clean data in learning-based methods, our method paves the way for learning-based point cloud denoising in real-world applications.
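Single-step denoising with Tweedie's formula, x_hat = y + sigma^2 * s_theta(y), can be sketched as follows; the score function here is a stand-in for the learned network, and sigma is the assumed Gaussian noise level.

import numpy as np

def score_fn(noisy_points):
    # Stand-in for a learned score network s_theta(y) approximating grad_y log p(y).
    return -0.1 * noisy_points

def tweedie_denoise(noisy_points, sigma):
    # One-step posterior mean estimate under Gaussian noise via Tweedie's formula.
    return noisy_points + (sigma ** 2) * score_fn(noisy_points)

noisy = np.random.randn(100, 3)
print(tweedie_denoise(noisy, sigma=0.05).shape)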
PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
Yu Wei
Nanyang Technological University
Jiahui Zhang
Nanyang Technological University
Xiaoqin Zhang
Zhejiang University of Technology
Ling Shao
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Shijian Lu
Nanyang Technological University
Abstract
COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories, characterized by drastic rotation and translation across adjacent camera views, leading to degraded camera pose estimation and to local minima in the joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization, which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization, which exploits discrepancies in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
Passing the Driving Knowledge Test
Maolin Wei
Boston University
Wanzhou Liu
Washington University in St. Louis
Eshed Ohn-Bar
Boston University
Abstract
If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
Shenxing Wei
The Hong Kong Polytechnic University
Jinxi Li
The Hong Kong Polytechnic University
Yafei Yang
The Hong Kong Polytechnic University
Siyuan Zhou
The Hong Kong Polytechnic University
Bo Yang
The Hong Kong Polytechnic University
Abstract
In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or from 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods, which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called the raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass on unseen datasets at test time. Our code and datasets are available at https://github.com/vLAR-group/RayletDF.
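A raylet distance field turns a predicted distance along each query ray into a surface point, and multiple raylet predictions can then be blended; the sketch below illustrates this with a simple confidence-weighted average, which is an assumption for exposition rather than the paper's multi-raylet blender.

import numpy as np

def raylet_surface_points(origins, directions, distances, weights):
    """origins, directions: (R, M, 3) for M raylets per ray (unit directions);
    distances, weights: (R, M) predicted distances and confidences."""
    candidates = origins + distances[..., None] * directions      # candidate surface points
    w = weights / weights.sum(axis=1, keepdims=True)              # normalize confidences
    return (w[..., None] * candidates).sum(axis=1)                # blended points, (R, 3)

R, M = 4, 3
origins = np.zeros((R, M, 3))
directions = np.tile(np.array([0.0, 0.0, 1.0]), (R, M, 1))
distances = np.random.rand(R, M) + 1.0
weights = np.random.rand(R, M)
print(raylet_surface_points(origins, directions, distances, weights))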
Object-level Correlation for Few-Shot Segmentation
Chunlin Wen
School of Computer Science and Engineering, Southeast University
Yu Zhang
School of Computer Science and Engineering, Southeast University
Jie Fan
Samsung Electronics (China) R&D Centre
Hongyuan Zhu
Institute for Infocomm Research (I2R), A*STAR Singapore
Xiu-Shen Wei
School of Computer Science and Engineering, Southeast University
Yijun Wang
School of Computer Science and Engineering, Southeast University
Zhiqiang Kou
School of Computer Science and Engineering, Southeast University
Shuzhou Sun
Shanghai AI Laboratory
Abstract
Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build an image-level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to overfitting of the background. To address this limitation, we imitate the biological vision process by identifying novel objects from object-level information. Identifying the target among general objects is more reliable than searching the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) that establishes an object-level correlation between the support target object and query general objects, mainly composed of a General Object Mining Module (GOMM) and a Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include both irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-5^i and COCO-20^i show that our model achieves state-of-the-art performance.
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
Tianci Wen
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Zhiang Liu
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Yongchun Fang
IRAIS, tjKLIR, College of Artificial Intelligence, Nankai University
Abstract
3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages highly structured point clouds to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of 19.86% in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras.
Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation
Hongyu Wen
Department of Computer Science, Princeton University
Yiming Zuo
Department of Computer Science, Princeton University
Venkat Subramanian
Department of Computer Science, Princeton University
Patrick Chen
Department of Computer Science, Princeton University
Jia Deng
Department of Computer Science, Princeton University
Abstract
Transparent objects are common in daily life, and understanding their multi-layer depth information, i.e., perceiving both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset produce good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increasing from 55.14% to 75.20%. All images and validation annotations are available under CC0 at https://layereddepth.cs.princeton.edu.
ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
Lena Wild
KTH Royal Institute of Technology
Rafael Valencia
TRATON
Patric Jensfelt
KTH Royal Institute of Technology
Abstract
Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://KTH-RPL.github.io/ArgoTweak/.
Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
Conghao Wong
Huazhong University of Science and Technology
Ziqian Zou
Huazhong University of Science and Technology
Beihao Xia
Huazhong University of Science and Technology
Abstract
Learning to forecast trajectories of intelligent agents has attracted increasing attention recently. However, it remains a challenge to accurately account for agents' intentions and social behaviors when forecasting, and in particular, to simulate the unique randomness within each of those components in an explainable and decoupled way. Inspired by vibration systems and their resonance properties, we propose the Resonance (Re for short) model to encode and forecast pedestrian trajectories in the form of 'co-vibrations'. It decomposes trajectory modifications and randomness into multiple vibration portions to simulate agents' reactions to each single cause, and forecasts trajectories as the superposition of these independent vibrations. Also, benefiting from such vibrations and their spectral properties, representations of social interactions can be learned by emulating the resonance phenomenon, further enhancing explainability. Experiments on multiple datasets have verified its usefulness both quantitatively and qualitatively.
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
Tianhao Wu
Nanyang Technological University
Chuanxia Zheng
University of Oxford
Frank Guan
Singapore Institute of Technology
Andrea Vedaldi
University of Oxford
Tat-Jen Cham
Nanyang Technological University
Abstract
Most existing image-to-3D models assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional image-to-3D model designed to reconstruct plausible 3D geometry and appearance from partial observations. We extend a 'foundation' 3D generator by introducing a visible mask-weighted attention mechanism and an occlusion-aware attention layer that explicitly leverage visible and occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms state-of-the-art methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Jianyu Wu
Shanghai Jiao Tong University
Yizhou Wang
The Chinese University of Hong Kong
Xiangyu Yue
The Chinese University of Hong Kong
Xinzhu Ma
Shanghai Artificial Intelligence Laboratory
Jinyang Guo
Beihang University
Dongzhan Zhou
Shanghai Artificial Intelligence Laboratory
Wanli Ouyang
Shanghai Artificial Intelligence Laboratory
Shixiang Tang
Shanghai Artificial Intelligence Laboratory
Abstract
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset aspects. First, we propose a cascade MAR [25] with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the 'edge-counters-surface' priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC [21] in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC.
Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling
Qirui Wu
Simon Fraser University
Denys Iliash
Simon Fraser University
Daniel Ritchie
Brown University
Manolis Savva
Simon Fraser University
Angel X. Chang
Simon Fraser University
Abstract
Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce better solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to real-world internet images and the text-to-scene task.
Efficient Spiking Point Mamba for Point Cloud Analysis
Peixi Wu
University of Science and Technology of China
Bosong Chai
Zhejiang University
Menghua Zheng
Tsingmao Intelligence
Wei Li
University of Science and Technology of China
Zhangchi Hu
University of Science and Technology of China
Jie Chen
University of Science and Technology of China
Zheyu Zhang
University of Science and Technology of China
Hebei Li
University of Science and Technology of China
Xiaoyan Sun
University of Science and Technology of China
Abstract
Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Naively adapting Mamba to 3D SNNs, though, is hindered by temporal dynamics mismatch and spike-induced information loss. Thus, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism. Then, we propose the Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing the information loss caused by spikes. Finally, to further boost performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with previous state-of-the-art SNN models, SPM improves overall accuracy by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIoU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at most 12.6x lower than that of its ANN counterpart. Code: https://github.com/PeppaWu/SPM.
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu
Tsinghua University
Wenzhao Zheng
Tsinghua University
Sicheng Zuo
Tsinghua University
Yuanhui Huang
Tsinghua University
Jie Zhou
Tsinghua University
Jiwen Lu
Tsinghua University
Abstract
3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update the local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our EmbodiedOcc outperforms existing methods by a large margin and accomplishes embodied occupancy prediction with high accuracy and efficiency. Code: https://github.com/YkiWu/EmbodiedOcc.
FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
Abstract
Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets - short segments of human motion sequences - and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the body part movements of entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text.
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
Wenhan Wu
University of North Carolina at Charlotte
Zhishuai Guo
Northern Illinois University
Chen Chen
University of Central Florida
Hongfei Xue
University of North Carolina at Charlotte
Aidong Lu
University of North Carolina at Charlotte
Abstract
Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition. Our project is publicly available at: https://github.com/wenhanwu95/FS-VAE.
Human-Object Interaction from Human-Level Instructions
Zhen Wu
Stanford University
Jiaman Li
Stanford University
Pei Xu
Stanford University
C. Karen Liu
Stanford University
Abstract
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.
LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
Jiahao Wu
Peking University
Rui Peng
Peking University
Jianbo Jiao
University of Birmingham
Jiayu Yang
Peking University
Luyang Tang
Peking University
Kaiqiang Xiong
Peking University
Jie Liang
Peking University
Jinbo Yan
Peking University
Runling Liu
Peking University
Ronggang Wang
Peking University
Abstract
Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.
Measuring the Impact of Rotation Equivariance on Aerial Object Detection
Xiuyu Wu
Xidian University
Xinhao Wang
Xidian University
Xiubin Zhu
Xidian University
Lan Yang
Xidian University
Jiyuan Liu
National University of Defense Technology
Xingchen Hu
National University of Defense Technology
Abstract
Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count.
Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer
Hai Wu
Xiamen University
Hongwei Lin
Xiamen University
Xusheng Guo
Xiamen University
Xin Li
Texas A&M University
Mingming Wang
Tsinghua University
Cheng Wang
Xiamen University
Chenglu Wen
Xiamen University
Abstract
The performance of unsupervised 3D object classification and bounding box regression relies heavily on the quality of initial pseudo-labels. Traditionally, the labels for classification and regression are represented by a single set of candidate boxes generated by motion or geometry heuristics. However, because many objects resemble the background in shape or lack motion, these labels often fail to achieve high accuracy in both tasks simultaneously. Using such labels to directly train the network results in decreased detection performance. To address this challenge, we introduce Motal, which performs unsupervised 3D object detection by modality- and task-specific knowledge transfer. Motal decouples the pseudo-labels into two sets of candidates, from which it discovers classification knowledge from motion and image appearance priors, and box regression knowledge from geometry priors. Motal finally transfers all knowledge to a single student network via a TMT (Task-specific Masked Training) scheme, attaining high performance in both classification and regression. Motal can greatly enhance various unsupervised methods, roughly doubling their mAP. For example, on the WOD test set, Motal improves the state-of-the-art CPD by 21.56% mAP L1 (from 20.54% to 42.10%) and 19.90% mAP L2 (from 18.18% to 38.08%). These achievements highlight the significance of our method.
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation
Yiming Wu
The University of Hong Kong
Huan Wang
Westlake University
Zhenghao Chen
The University of Newcastle
Jianxin Pang
UBTech Robotics Corp.
Dong Xu
The University of Hong Kong
Abstract
Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and a large memory footprint. In this paper, we propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis of existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome the performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline that explicitly optimizes the model's post-pruning recoverability. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on the standard datasets, i.e., PushT, Robomimic, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments. Extensive real-world experiments also show that the proposed LightDP can achieve performance comparable to state-of-the-art Diffusion Policies.
Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
Mingxuan Wu
University of California, Berkeley
Huang Huang
University of California, Berkeley
Justin Kerr
University of California, Berkeley
Chung Min Kim
University of California, Berkeley
Anthony Zhang
University of California, Berkeley
Brent Yi
University of California, Berkeley
Angjoo Kanazawa
University of California, Berkeley
Abstract
Humans can resort to long-form inspection to build intuition for predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multi-view mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Dongming Wu
The Chinese University of Hong Kong
Yanping Fu
Institute of Computing Technology, Chinese Academy of Sciences
Saike Huang
Dexmal
Yingfei Liu
Dexmal
Fan Jia
Dexmal
Nian Liu
Mohamed bin Zayed University of Artificial Intelligence
Feng Dai
Institute of Computing Technology, Chinese Academy of Sciences
Tiancai Wang
Dexmal
Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence
Fahad Shahbaz Khan
Mohamed bin Zayed University of Artificial Intelligence
Jianbing Shen
University of Macau
Abstract
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of reasoning-based, large-scale affordance prediction data, raising considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with affordance maps, while the difficulty of the language instructions is greatly increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pretrained on our massive affordance data and a grasping network conditioned on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at this link.
TARS: Traffic-Aware Radar Scene Flow Estimation
Jialong Wu
University of Wuppertal
Marco Braun
Aptiv Services Deutschland GmbH
Dominic Spata
Aptiv Services Deutschland GmbH
Matthias Rottmann
Osnabrück University
Abstract
Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel Traffic-Aware Radar Scene Flow (TARS) estimation method, which utilizes motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. From this, we construct a Traffic Vector Field (TVF) in the feature space to achieve holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.
UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
Yan Wu
ETH Zurich
Korrawe Karunratanakul
ETH Zurich
Zhengyi Luo
Carnegie Mellon University
Siyu Tang
ETH Zurich
Abstract
Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.
Visual Textualization for Image Prompted Object Detection
Yongjian Wu
Beihang University
Yang Zhou
Beihang University
Jiya Saiyin
Beihang University
Bingzheng Wei
ByteDance Inc.
Yan Xu
Beihang University
Abstract
We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pretraining data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of the OVLM, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with the OVLM's pre-training data and achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at VisTex-OVLM.
Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
Philipp Wulff
Technical University of Munich
Felix Wimbauer
Technical University of Munich
Dominik Muhle
Technical University of Munich
Daniel Cremers
Technical University of Munich
Abstract
Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes. For more details and code, please check out our project page.
ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
Chong Xia
Tsinghua University
Shengjun Zhang
Tsinghua University
Fangfu Liu
Tsinghua University
Chang Liu
Tsinghua University
Khodchaphun Hirunyaratsameewong
Yueqi Duan
Tsinghua University
Abstract
Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable to long-term video synthesis and 3D scene reconstruction. Existing methods follow a 'navigate-and-imagine' fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from a semantic drift issue arising from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which guides the outpainter toward consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences.
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes
Yan Xia
University of Science and Technology of China
Yunxiang Lu
Technical University of Munich
Rui Song
Technical University of Munich
Oussema Dhaouadi
Technical University of Munich
João F. Henriques
University of Oxford
Daniel Cremers
Technical University of Munich
Abstract
We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to better separate 2D patch and 3D group features within each modality, and introduce Dense Training Alignment (DTA) with soft-argmax for improving position regression. Extensive experiments show that our TrafficLoc greatly improves the performance over SOTA I2P methods (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on the KITTI and NuScenes datasets, demonstrating its superiority across both in-vehicle and traffic cameras. Our project page is publicly available at https://tumluk.github.io/projects/trafficloc/.
DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Jijun Xiang
Huazhong University of Science and Technology
Xuan Zhu
Huazhong University of Science and Technology
Xianqi Wang
Huazhong University of Science and Technology
Yu Wang
Honor Device Co., Ltd
Hong Zhang
Honor Device Co., Ltd
Fei Guo
Honor Device Co., Ltd
Xin Yang
Huazhong University of Science and Technology
Abstract
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our code is available at https://github.com/ShadowBbBb/Depthor
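As a rough illustration of the kind of dToF simulation such a training strategy relies on, the snippet below degrades a dense ground-truth depth map into a coarse zone grid with range noise and random dropouts; the 8x8 zone layout, noise model, and function name are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def simulate_dtof(depth_gt, zones=8, noise_std=0.03, dropout_p=0.1):
    """Degrade dense ground-truth depth into a coarse, noisy dToF-like signal:
    per-zone average depth, multiplicative range noise, and random zone dropouts.
    Zone count and noise model are placeholders, not the paper's simulation."""
    zone = F.adaptive_avg_pool2d(depth_gt, (zones, zones))      # [B, 1, zones, zones]
    zone = zone * (1.0 + noise_std * torch.randn_like(zone))    # range noise
    keep = (torch.rand_like(zone) > dropout_p).float()          # anomalous / missing zones
    return zone * keep

depth_gt = 2.0 + torch.rand(1, 1, 192, 256)                     # metres, dense GT depth
print(simulate_dtof(depth_gt).shape)                            # torch.Size([1, 1, 8, 8])
```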
ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
Binbin Xiang
Norwegian Institute of Bioeconomy Research (NIBIO)
Maciej Wielgosz
Norwegian Institute of Bioeconomy Research (NIBIO)
Stefano Puliti
Norwegian Institute of Bioeconomy Research (NIBIO)
Kamil Král
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Martin Krůček
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Azim Missarov
Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Rasmus Astrup
Norwegian Institute of Bioeconomy Research (NIBIO)
Abstract
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.
SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion
Zhengkang Xiang
The University of Melbourne
Zizhao Li
The University of Melbourne
Amir Khodabandeh
The University of Melbourne
Kourosh Khoshelham
The University of Melbourne
Abstract
Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models, and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Lixing Xiao
Zhejiang University
Shunlin Lu
The Chinese University of Hong Kong (Shenzhen)
Huaijin Pi
The University of Hong Kong
Ke Fan
Shanghai Jiao Tong University
Liang Pan
The University of Hong Kong
Yueer Zhou
Zhejiang University
Ziyong Feng
DeepGlint
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Jingbo Wang
Shanghai AI Laboratory
Abstract
This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
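As a toy illustration of the streaming loop described above, the sketch below autoregressively predicts continuous motion latents one step at a time, conditioning each step on the text embedding and the causal history carried in a recurrent state. The GRU and linear head stand in for the paper's causal transformer and diffusion head; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class StreamingLatentAR(nn.Module):
    """Toy autoregressive predictor over continuous motion latents: each step
    sees only the text condition and the causal history (no future frames).
    Purely illustrative; not the MotionStreamer architecture."""

    def __init__(self, latent_dim=64, text_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + text_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    @torch.no_grad()
    def stream(self, text_emb, num_steps=30):
        # text_emb: [B, text_dim]; returns [B, num_steps, latent_dim]
        B = text_emb.size(0)
        latent = torch.zeros(B, 1, self.head.out_features)
        state, outputs = None, []
        for _ in range(num_steps):
            inp = torch.cat([latent, text_emb.unsqueeze(1)], dim=-1)
            h, state = self.rnn(inp, state)      # causal: only past history in `state`
            latent = self.head(h)                # next continuous motion latent
            outputs.append(latent)
        return torch.cat(outputs, dim=1)

model = StreamingLatentAR()
motion_latents = model.stream(torch.randn(2, 128), num_steps=16)
print(motion_latents.shape)                      # torch.Size([2, 16, 64])
```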
RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
Baihui Xiao
Meituan
Chengjian Feng
Meituan
Zhijian Huang
Meituan
Feng Yan
Meituan
Yujie Zhong
Meituan
Lin Ma
Shenzhen Campus of Sun Yat-sen University
Abstract
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim, which improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn challenging real-world driving skills from HASS by adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by ∼50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios.
SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
Liwen Xiao
Huazhong University of Science and Technology
Zhiyu Pan
Huazhong University of Science and Technology
Zhicheng Wang
Huazhong University of Science and Technology
Zhiguo Cao
Huazhong University of Science and Technology
Wei Li
Nanyang Technological University
Abstract
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at 'soft intersection points'. Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement. Codes are available at https://github.com/LiwenXiao/SRefiner.
SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion
Yuxi Xiao
Zhejiang University
Jianyuan Wang
Oxford
Nan Xue
Ant Group
Nikita Karaev
Pixelwise AI
Yuri Makarov
Pixelwise AI
Bingyi Kang
Bytedance Seed
Xing Zhu
Ant Group
Hujun Bao
Zhejiang University
Yujun Shen
Ant Group
Xiaowei Zhou
Zhejiang University
Abstract
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing, feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50x faster.
AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
Liuyue Xie
Carnegie Mellon University
Jiancong Guo
Google
Ozan Cakmakci
Google
Andre Araujo
Google DeepMind
László A. Jeni
Carnegie Mellon University
Zhiheng Jia
Google
Abstract
Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ∼8.2° and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie
University of California, Irvine
Lingdong Kong
Shanghai AI Laboratory
Yuhao Dong
Shanghai AI Laboratory
Chonghao Sima
Shanghai AI Laboratory
Wenwei Zhang
Shanghai AI Laboratory
Qi Alfred Chen
University of California, Irvine
Ziwei Liu
Shanghai AI Laboratory
Liang Pan
Shanghai AI Laboratory
Abstract
Recent advancements in Vision-Language Models (VLMs) have fueled interest in autonomous driving applications, particularly for interpretable decision-making. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that existing VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs possess inherent corruption-awareness but only explicitly acknowledge these issues when directly prompted. Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs' corruption awareness and agentic planning with external tools to enhance perception reliability for a diverse set of downstream tasks. Our study challenges existing evaluation paradigms and provides a road map toward more robust and interpretable autonomous driving systems.
GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
Yusen Xie
The Hong Kong University of Science and Technology (Guangzhou)
Zhenmin Huang
The Hong Kong University of Science and Technology
Jin Wu
The Hong Kong University of Science and Technology
Jun Ma
The Hong Kong University of Science and Technology
Abstract
In this paper, we introduce GS-LIVM, a real-time photo-realistic LiDAR-Inertial-Visual mapping framework with Gaussian Splatting tailored for outdoor scenes. Compared to existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), our approach enables real-time photo-realistic mapping while ensuring high-quality image rendering in large-scale unbounded outdoor environments. In this work, Gaussian Process Regression (GPR) is employed to mitigate the issues resulting from sparse and unevenly distributed LiDAR observations. The voxel-based 3D Gaussian map representation facilitates real-time dense mapping in large outdoor environments with acceleration governed by custom CUDA kernels. Moreover, the overall framework is designed in a covariance-centered manner, where the estimated covariance is used to initialize the scale and rotation of 3D Gaussians, as well as update the parameters of the GPR. We evaluate our algorithm on several outdoor datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of mapping efficiency and rendering quality. The source code is available on GitHub.
Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
Binjian Xie
Institute of Automation, Chinese Academy of Sciences
Pengju Zhang
Institute of Automation, Chinese Academy of Sciences
Hao Wei
Institute of Automation, Chinese Academy of Sciences
Yihong Wu
Institute of Automation, Chinese Academy of Sciences
Abstract
Single-view 3D reconstruction is a fundamental problem in computer vision, having a significant impact on downstream tasks such as autonomous driving, virtual reality and augmented reality. However, existing single-view reconstruction methods are unable to reconstruct the regions outside the input field-of-view or the areas occluded by visible parts. In this paper, we propose Hi-Gaussian, which employs feed-forward 3D Gaussians for efficient and generalizable single-view 3D reconstruction. A Normalized Spherical Projection module is introduced following an Encoder-Decoder network in our model, assigning a larger range to the transformed spherical coordinates, which can enlarge the field of view during scene reconstruction. Besides, to reconstruct occluded regions behind the visible part, we introduce a novel Hierarchical Gaussian Sampling strategy, utilizing two layers of Gaussians to hierarchically represent 3D scenes. We first use a pre-trained monocular depth estimation model to provide depth initialization for leader Gaussians, and then leverage the leader Gaussians to estimate the distribution followed by follower Gaussians, which can flexibly move into occluded areas. Extensive experiments show that our method outperforms other methods for scene reconstruction and novel view synthesis, on both outdoor and indoor datasets.
Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling
Christopher Xie
Meta Reality Labs
Armen Avetisyan
Meta Reality Labs
Henry Howard-Jenkins
Meta Reality Labs
Yawar Siddiqui
Meta Reality Labs
Julian Straub
Meta Reality Labs
Richard Newcombe
Meta Reality Labs
Vasileios Balntas
Meta Reality Labs
Jakob Engel
Meta Reality Labs
Abstract
We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript [3], a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as 'infilling', a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction 'one-click fix' workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.
PVMamba: Parallelizing Vision Mamba via Dynamic State Aggregation
Fei Xie
Shanghai Jiao Tong University
Zhongdao Wang
Huawei Noah's Ark Lab
Weijia Zhang
Shanghai Jiao Tong University
Chao Ma
Shanghai Jiao Tong University
Abstract
Mamba, an architecture with RNN-like sequence modeling based on the State Space Model (SSM), has demonstrated promising capabilities in long-range modeling with high efficiency. However, Mamba models struggle with structured 2D visual data due to their sequential computing, thereby lagging behind their attention-based counterparts. In this paper, we propose Parallel Vision Mamba (PVMamba), a novel SSM architecture tailored for visual data. PVMamba encompasses two key designs: 1) Based on the sparsity and adjacency of visual signals, we parallelize the sequential computing through three core steps, termed Dynamic State Aggregation (DSA), i.e., parallelization, alignment, and aggregation. DSA generates the hidden state in SSM by a feasible spatial aggregation, thereby overcoming the inherent sequential constraints. 2) Along with maintaining linear computational complexity, we apply a dynamic operator to learn the spatial samplings for each hidden state. To further boost the local modeling capability, we restrict the dynamic operator to the neighboring pixels in shallow layers. We also devise a layer multiplexing technique to stabilize the training and reduce learning redundancy. PVMamba is a versatile backbone network with dynamic operators for various vision tasks, such as image classification and dense prediction. Extensive experiments show that PVMamba achieves state-of-the-art performance on a range of benchmarks. The code is available at https://github.com/VISION-SJTU/PVMamba.
SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
Mengwei Xie
Alibaba Group
Shuang Zeng
Alibaba Group
Xinyuan Chang
Alibaba Group
Xinran Liu
Alibaba Group
Zheng Pan
Alibaba Group
Mu Xu
Alibaba Group
Xing Wei
Xi'an Jiaotong University
Abstract
Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures, such as loops and bidirectional lanes, that are prevalent in real-world road networks. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph G = (V, E), with intersections (V) and centerlines (E), SeqGrowGraph incrementally constructs this graph by introducing one vertex at a time. At each step, an adjacency matrix (A) expands from n x n to (n+1) x (n+1) to encode connectivity, while a geometric matrix (M) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on the nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
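The per-step bookkeeping behind the chain of expansions is simple to sketch: adding a vertex pads the adjacency matrix by one row and one column that record edges to and from the existing vertices. The numpy helper below is illustrative only (the function name and 0/1 edge encoding are assumptions); the paper's transformer predicts such expansions autoregressively and additionally emits a geometric matrix of Bézier control points, which is omitted here.

```python
import numpy as np

def expand_adjacency(A, new_row, new_col, self_loop=0):
    """Grow an n x n adjacency matrix to (n+1) x (n+1) when a vertex is added.

    new_row[i] = 1 if an edge leaves the new vertex toward existing vertex i,
    new_col[i] = 1 if an edge enters the new vertex from existing vertex i.
    (Hypothetical helper illustrating one expansion step of the chain.)"""
    n = A.shape[0]
    A_new = np.zeros((n + 1, n + 1), dtype=A.dtype)
    A_new[:n, :n] = A                # keep the existing connectivity
    A_new[n, :n] = new_row           # edges out of the new vertex
    A_new[:n, n] = new_col           # edges into the new vertex
    A_new[n, n] = self_loop
    return A_new

# Start from a single intersection and grow the lane graph vertex by vertex.
A = np.zeros((1, 1), dtype=np.int64)
A = expand_adjacency(A, new_row=[1], new_col=[0])        # 2 vertices, edge 1 -> 0
A = expand_adjacency(A, new_row=[0, 1], new_col=[1, 0])  # 3 vertices
print(A)
```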
Efficient Track Anything
Yunyang Xiong
Meta AI Research
Chong Zhou
Meta AI Research
Xiaoyu Xiang
Meta AI Research
Lemeng Wu
Meta AI Research
Chenchen Zhu
Meta AI Research
Zechun Liu
Meta AI Research
Saksham Suri
Meta AI Research
Balakrishnan Varadarajan
Meta AI Research
Ramya Krishna Akula
Meta AI Research
Forrest Iandola
Meta AI Research
Raghuraman Krishnamoorthi
Meta AI Research
Bilge Soran
Meta AI Research
Vikas Chandra
Meta AI Research
Abstract
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. Our idea is based on adopting a lightweight Vision Transformer (ViT) as the image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity of both frame feature extraction and memory computation for current frame segmentation. We build EfficientTAMs from vanilla lightweight ViTs and the efficient memory module, and train the models on the SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with lightweight ViT performs comparably to the SAM 2 model (SAM 2-HieraB+) with ∼1.6x speedup on A100 and ∼2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over the original SAM with ∼20x speedup on A100 and ∼20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAM can run at ∼28 FPS for near real-time video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories
Jingqiao Xiu
National University of Singapore
Yicong Li
National University of Singapore
Na Zhao
Singapore University of Technology and Design
Han Fang
National University of Singapore
Xiang Wang
University of Science and Technology of China
Angela Yao
National University of Singapore
Abstract
View-Guided Point Cloud Completion (VG-PCC) aims to reconstruct complete point clouds from partial inputs by referencing single-view images. While existing VG-PCC models perform well on in-class predictions, they exhibit significant performance drops when generalizing to unseen categories. We identify two key limitations underlying this challenge: (1) Current encoders struggle to bridge the substantial modality gap between images and point clouds. Consequently, their learned representations often lack robust cross-modal alignment and over-rely on superficial class-specific patterns. (2) Current decoders refine global structures holistically, overlooking local geometric patterns that are class-agnostic and transferable across categories. To address these issues, we present a novel generalizable VG-PCC framework for unseen categories based on Geometric Alignment and Prior Modulation (GAPM). First, we introduce a Geometry Aligned Encoder that lifts reference images into 3D space via depth maps for natural alignment with partial point clouds. This reduces dependency on class-specific RGB patterns that hinder generalization to unseen classes. Second, we propose a Prior Modulated Decoder that incorporates class-agnostic local priors to reconstruct shapes on a regional basis. This allows the adaptive reuse of learned geometric patterns that promote generalization to unseen classes. Extensive experiments validate that GAPM outperforms existing models on both seen and, notably, unseen categories, establishing a new benchmark for unseen-category generalization in VG-PCC.
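The geometric alignment step, lifting the reference image into the same 3D space as the partial point cloud through its depth map, reduces to standard pinhole backprojection; the minimal sketch below shows that lift (the function name and intrinsics are made up for illustration).

```python
import torch

def backproject_depth(depth, K):
    """Lift a depth map into a 3D point cloud with pinhole intrinsics K.
    Illustrates the image-to-3D alignment idea only; the paper's encoder
    then processes this aligned geometry together with the partial cloud."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    uv1 = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)  # [H, W, 3]
    rays = uv1 @ torch.linalg.inv(K).T                                   # pixel -> camera rays
    return (rays * depth.unsqueeze(-1)).reshape(-1, 3)                   # [H*W, 3] points

K = torch.tensor([[320., 0., 160.], [0., 320., 120.], [0., 0., 1.]])     # assumed intrinsics
points = backproject_depth(torch.full((240, 320), 2.0), K)
print(points.shape)                                                      # torch.Size([76800, 3])
```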
A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu
Spatialtemporal AI
Jian Zhang
MBZUAI
Minghao Guo
MBZUAI
Youpeng Wen
Sun Yat-sen University
Haoting Yang
Southern University of Science and Technology
Min Lin
Sun Yat-sen University
Jianzheng Huang
Southern University of Science and Technology
Zhe Li
Southern University of Science and Technology
Kaidong Zhang
Southern University of Science and Technology
Liqiong Wang
Southern University of Science and Technology
Yuxuan Kuang
MBZUAI
Meng Cao
MBZUAI
Feng Zheng
Spatialtemporal AI
Xiaodan Liang
MBZUAI
Abstract
Robotic manipulation faces critical challenges in understanding spatial affordances, the 'where' and 'how' of object interactions, which are essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
Jiawei Xu
Nankai University
Kai Deng
Nankai University
Zexin Fan
Nankai University
Shenlong Wang
University of Illinois Urbana-Champaign
Jin Xie
Nanjing University
Jian Yang
Nankai University
Abstract
Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately or decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches. Project Page: https://jiaweixu8.github.io/AD-GS-web/
Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
Lizhen Xu
Xi'an Jiaotong University
Xiuxiu Bai
Xi'an Jiaotong University
Xiaojun Jia
Nanyang Technological University
Jianwu Fang
Xi'an Jiaotong University
Shanmin Pang
Xi'an Jiaotong University
Abstract
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient execution on edge devices. Existing pruning and distillation methods either require retraining or are designed for ViT models, making them hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification scores and multiply them with the attention map to obtain an importance score for each key, then prune keys after each transformer layer according to these scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at https://github.com/iseri27/tg_gbc.
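One straightforward reading of this pruning rule: weight each query's attention row by that query's classification score, sum over queries to score every key, and keep only the top-ranked keys for the next decoder layer. The sketch below uses hypothetical shapes and a fixed keep ratio and is not the released tgGBC code.

```python
import torch

def prune_keys_by_classification(attn, cls_scores, keep_ratio=0.5):
    """Rank keys by a classification-weighted attention mass and keep the top ones.

    attn:       [num_queries, num_keys] attention map from one decoder layer.
    cls_scores: [num_queries] max classification score of each query.
    Illustrative reading of the "classification score x attention map"
    importance in the abstract, with an assumed fixed keep ratio."""
    importance = (cls_scores.unsqueeze(1) * attn).sum(dim=0)    # [num_keys]
    num_keep = max(1, int(importance.numel() * keep_ratio))
    keep_idx = torch.topk(importance, num_keep).indices.sort().values
    return keep_idx

attn = torch.rand(900, 4000)          # queries x keys (made-up sizes)
cls_scores = torch.rand(900)
keep = prune_keys_by_classification(attn, cls_scores, keep_ratio=0.25)
print(keep.shape)                     # indices of keys retained for the next layer
```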
BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Gangwei Xu
Huazhong University of Science and Technology
Jiaxin Liu
Huazhong University of Science and Technology
Xianqi Wang
Huazhong University of Science and Technology
Junda Cheng
Huazhong University of Science and Technology
Yong Deng
Autel Robotics
Jinliang Zang
Autel Robotics
Yurui Chen
Autel Robotics
Xin Yang
Optics Valley Laboratory
Abstract
State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. Code: https://github.com/gangweix/BANet.
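To make the bilateral split concrete, here is a skeletal PyTorch sketch under simplifying assumptions (a precomputed correlation cost volume, illustrative layer widths and kernel sizes, made-up module names): a spatial attention map predicted from the left image divides the cost volume into a detail branch and a smooth branch, each aggregated with plain 2D convolutions before fusion and a soft-argmin readout.

```python
import torch
import torch.nn as nn

class BilateralAggregation2D(nn.Module):
    """Minimal sketch of the bilateral idea: an attention map splits the cost
    volume into 'detailed' and 'smooth' parts, each aggregated with 2D convs,
    then fused. Layer sizes are illustrative, not the paper's architecture."""

    def __init__(self, max_disp=64):
        super().__init__()
        self.attn = nn.Sequential(                 # spatial attention from the left image
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        self.detail_agg = nn.Sequential(           # small receptive field: keep edges
            nn.Conv2d(max_disp, max_disp, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(max_disp, max_disp, 3, padding=1))
        self.smooth_agg = nn.Sequential(           # large receptive field: textureless areas
            nn.Conv2d(max_disp, max_disp, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(max_disp, max_disp, 7, padding=3))
        self.max_disp = max_disp

    def forward(self, cost_volume, left_image):
        # cost_volume: [B, max_disp, H, W], left_image: [B, 3, H, W]
        a = self.attn(left_image)                          # [B, 1, H, W] in (0, 1)
        detailed = self.detail_agg(cost_volume * a)
        smooth = self.smooth_agg(cost_volume * (1.0 - a))
        fused = detailed + smooth                          # fused aggregated cost
        prob = torch.softmax(-fused, dim=1)                # soft-argmin over disparities
        idx = torch.arange(self.max_disp, device=fused.device).view(1, -1, 1, 1)
        return (prob * idx).sum(dim=1)                     # [B, H, W] disparity map

vol = torch.randn(1, 64, 96, 160)
img = torch.randn(1, 3, 96, 160)
print(BilateralAggregation2D()(vol, img).shape)            # torch.Size([1, 96, 160])
```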
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
Xiang Xu
Nanjing University of Aeronautics and Astronautics
Lingdong Kong
National University of Singapore
Song Wang
Zhejiang University
Chuanwei Zhou
Nanjing University of Posts and Telecommunications
Qingshan Liu
Nanjing University of Posts and Telecommunications
Abstract
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer-range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
DAA*: Deep Angular A Star for Image-based Path Planning
Zhiwei Xu
The University of Melbourne
Abstract
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), which incorporates the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF explores the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness in imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
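As a rough illustration of trading off path shortening against smoothing, the toy grid A* below adds a penalty on the change of move angle to the usual cost; the penalty form and weight are assumptions for illustration, not the paper's PAF formulation.

```python
import heapq, itertools, math

def angular_astar(grid, start, goal, w_angle=0.3):
    """Tiny 8-connected grid A* with an angular penalty on direction changes,
    a loose sketch of balancing heuristic distance (shortening) and smoothness."""
    H, W = len(grid), len(grid[0])
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    tie = itertools.count()                      # heap tie-breaker
    open_set = [(h(start), next(tie), 0.0, start, None, [start])]
    best = {}
    while open_set:
        _, _, g, pos, din, path = heapq.heappop(open_set)
        if pos == goal:
            return path
        if best.get((pos, din), float("inf")) <= g:
            continue
        best[(pos, din)] = g
        for dx, dy in moves:
            nx, ny = pos[0] + dx, pos[1] + dy
            if not (0 <= nx < H and 0 <= ny < W) or grid[nx][ny]:
                continue                          # out of bounds or obstacle
            turn = 0.0
            if din is not None:                   # penalize change of move angle (smoothness)
                a = math.atan2(dy, dx) - math.atan2(din[1], din[0])
                turn = abs(math.atan2(math.sin(a), math.cos(a)))
            ng = g + math.hypot(dx, dy) + w_angle * turn
            heapq.heappush(open_set,
                           (ng + h((nx, ny)), next(tie), ng, (nx, ny), (dx, dy), path + [(nx, ny)]))
    return None

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(angular_astar(grid, (0, 0), (2, 3)))
```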
Diffusion-Based Imaginative Coordination for Bimanual Manipulation
Huilin Xu
Fudan University
Jian Ding
King Abdullah University of Science and Technology
Jiakun Xu
ETH Zurich
Ruixiang Wang
The Chinese University of Hong Kong, Shenzhen
Jun Chen
King Abdullah University of Science and Technology
Jinjie Mai
King Abdullah University of Science and Technology
Yanwei Fu
Fudan University
Bernard Ghanem
King Abdullah University of Science and Technology
Feng Xu
Fudan University
Mohamed Elhoseiny
King Abdullah University of Science and Technology
Abstract
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at Diffusion based imaginative Coordination.
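The unidirectional conditioning can be pictured as an attention mask in which video tokens may attend to action tokens but not vice versa, which is what allows the video branch to be dropped at inference; the sketch below is a generic illustration with made-up token counts, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def unidirectional_mask(n_action, n_video):
    """Illustrative attention mask: video tokens may attend to action tokens, while
    action tokens never attend to video tokens. True = allowed to attend."""
    n = n_action + n_video
    allow = torch.ones(n, n, dtype=torch.bool)
    allow[:n_action, n_action:] = False      # action queries cannot see video keys
    return allow

mask = unidirectional_mask(n_action=4, n_video=6)
print(mask.int())

# usage with PyTorch's scaled dot-product attention (boolean mask, True = participate)
q = k = v = torch.randn(1, 1, 10, 32)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)
```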
Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation
Xiaolong Xu
Sichuan University
Lei Zhang
Sichuan University
Jiayi Li
Sichuan University
Lituan Wang
Sichuan University
Yifan Guan
Sichuan University
Yu Yan
Sichuan University
Leyi Zhang
Sichuan University
Hao Song
Sichuan University
Abstract
Video semantic segmentation aims to assign a class label to each pixel in every video frame. Existing methods predominantly follow the reference-target interaction paradigm, focusing on extracting local temporal contexts while neglecting the integration of global temporal information. Moreover, complex dynamics and varying lighting conditions introduce inter-frame intra-class discrepancies in feature representations, leading to unstable predictions. In this paper, we propose a novel framework, the Dual-Temporal Exemplar Representation Network (DTERN), which utilizes the strong representational capability of cluster centers, i.e., exemplars, to effectively model both local and global temporal information. DTERN consists of two core modules: 1) the Local Temporal Exemplar Module (LTEM), which constructs local exemplars to capture local temporal contexts, ensuring stable and reliable predictions; and 2) the Global Temporal Exemplar Module (GTEM), which introduces learnable global exemplars to dynamically model global temporal information, thereby improving the effective consistency of segmentation. Furthermore, we observe that the existing Video Consistency (VC) metric fails to evaluate segmentation accuracy and lacks sensitivity to small-object segmentation. To this end, we propose Video Effective Consistency (VEC) to comprehensively evaluate temporal consistency and segmentation effectiveness. Experiments on VSPW and Cityscapes demonstrate that DTERN outperforms state-of-the-art methods. The code is available at https://github.com/zlxilo/DTERN.
Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
Wenhao Xu
University of Science and Technology of China
Wenming Weng
University of Science and Technology of China
Yueyi Zhang
MiroMind
Ruikang Xu
University of Science and Technology of China
Zhiwei Xiong
University of Science and Technology of China
Abstract
Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
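For readers unfamiliar with why threshold modeling matters, the snippet below implements the standard event-generation model: a pixel emits an event when its log-intensity change exceeds a contrast threshold, so a mis-estimated threshold directly corrupts the motion signal the reconstruction relies on. The threshold value here is arbitrary and not taken from the paper.

```python
import numpy as np

def simulate_events(log_I_prev, log_I_curr, threshold=0.2):
    """Standard event-camera generation model, used only as a toy illustration:
    a pixel fires +1/-1 when its log-intensity change exceeds the contrast threshold."""
    diff = log_I_curr - log_I_prev
    polarity = np.sign(diff) * (np.abs(diff) >= threshold)   # +1 / -1 / 0 per pixel
    return polarity.astype(np.int8)

prev = np.log(np.random.rand(4, 4) + 1e-3)
curr = prev + np.random.randn(4, 4) * 0.3
print(simulate_events(prev, curr))
```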
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
Jiale Xu
ARC Lab, Tencent PCG
Shenghua Gao
The University of Hong Kong
Abstract
Sparse-view reconstruction models typically require precise camera poses, yet obtaining these parameters from sparse-view images remains challenging. We introduce FreeSplatter, a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images while estimating camera parameters within seconds. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange among multi-view image tokens, decoding them into pixel-aligned 3D Gaussian primitives within a unified reference frame. This representation enables both high-fidelity 3D modeling and efficient camera parameter estimation using off-the-shelf solvers. We develop two specialized variants, for object-centric and scene-level reconstruction, trained on comprehensive datasets. Remarkably, FreeSplatter outperforms several pose-dependent Large Reconstruction Models (LRMs) by a notable margin while achieving comparable or even better pose estimation accuracy than the state-of-the-art pose-free reconstruction approach MASt3R on challenging benchmarks. Beyond technical benchmarks, FreeSplatter streamlines text/image-to-3D content creation pipelines, eliminating the complexity of camera pose management while delivering exceptional visual fidelity.
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Tian-Xing Xu
Tsinghua University
Xiangjun Gao
HKUST
Wenbo Hu
ARC Lab, Tencent PCG
Xiaoyu Li
ARC Lab, Tencent PCG
Song-Hai Zhang
Qinghai University
Ying Shan
ARC Lab, Tencent PCG
Abstract
Despite remarkable advancements in video depth estimation, existing methods fall short in geometric fidelity due to their affine-invariant predictions, restricting their applicability in reconstruction and other metrically grounded downstream tasks. We propose a novel point map Variational Autoencoder (VAE) for encoding and decoding unbounded point maps. Notably, its latent space is agnostic to video latent distributions of video diffusion models, allowing us to leverage generation priors to model the distribution of point map sequences conditioned on the input videos. Thus, we can recover high-fidelity point map sequences with temporal coherence from open-world videos, facilitating accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. Extensive evaluations on diverse datasets demonstrate that our method achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
Yunjiang Xu
School of Computer Science and Technology, Soochow University
Lingzhi Li
School of Computer Science and Technology, Soochow University
Jin Wang
School of Future Science and Engineering, Soochow University
Yupeng Ouyang
School of Computer Science and Technology, Soochow University
Benyuan Yang
School of Future Science and Engineering, Soochow University
Abstract
Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous works prove that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method improves accuracy by 13.23%/33.08% on DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.
Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
Xiaogang Xu
The Chinese University of Hong Kong
Jiafei Wu
The University of Hong Kong
Qingsen Yan
Northwestern Polytechnical University
Jiequan Cui
Hefei University of Technology
Richang Hong
Hefei University of Technology
Bei Yu
The Chinese University of Hong Kong
Abstract
A major challenge in Low-Light Image Enhancement (LLIE) is its ill-posed nature: low-light images often lack sufficient information to align with normal-light ones (e.g., not all training data can be fully fitted to the ground truth). Numerous studies have attempted to bridge the gap between low- and normal-light data by introducing effective additional information, which is called 'references' in this paper. However, existing methods overlook the valuable references hidden within the training dataset itself. In this work, we propose a novel LLIE strategy that simultaneously learns image-specific features by neural networks while formulating effective common features from the training data as the reference. These common features are correlated with the samples that are not fully fitted by the LLIE network itself, and they are represented as a set of Learnable Feature Patches and Vectors (LFPVs) in the hidden feature space. LFPVs are updated through two mechanisms: the sample-updater, which extracts useful features from training samples to refine LFPVs, and the mutual-updater, which propagates information across LFPVs to mutually update them. LFPVs can be adaptively aligned with image-specific features via our designed query-and-fusion procedure, boosting the LLIE performance. Our proposed method can be integrated into any LLIE framework, improving both enhancement quality and downstream task performance. Extensive experiments on various benchmarks demonstrate the effectiveness of our approach.
MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction
Zikun Xu
Tsinghua University
Shaobing Xu
Tsinghua University
Abstract
LiDAR-based 3D occupancy prediction algorithms have evolved rapidly with the advent of large-scale datasets. However, the full potential of the existing diverse datasets remains underutilized, as they are typically employed in isolation. Models trained on a single dataset often suffer considerable performance degradation when deployed to real-world scenarios or datasets involving disparate LiDARs. To address this limitation, we introduce MergeOcc, a generalized pipeline designed to handle different LiDARs by leveraging multiple datasets concurrently. The gaps among LiDAR datasets primarily manifest in geometric disparities and semantic inconsistencies, which correspond to the fundamental components of datasets: data and labels. In response, MergeOcc incorporates a novel model architecture that features a geometric realignment and a semantic label mapping to facilitate multi-dataset training (MDT). The effectiveness of MergeOcc is validated through extensive experiments on two prominent datasets for autonomous vehicles: OpenOccupancy-nuScenes and SemanticKITTI. The results demonstrate its enhanced robustness and performance improvements across both types of LiDARs, outperforming several SOTA methods. Additionally, despite using an identical model architecture and hyper-parameter set, MergeOcc can significantly surpass the baselines thanks to its ability to learn from diverse datasets. To the best of our knowledge, this work presents the first cross-dataset 3D occupancy prediction pipeline that effectively bridges the domain gap for seamless deployment across heterogeneous platforms.
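The semantic label mapping component can be illustrated as a simple lookup-table remap from each dataset's label space onto a shared one; the class ids and table entries below are made up for illustration and do not reflect the paper's actual taxonomy.

```python
import numpy as np

# Hypothetical per-dataset label maps onto a shared label space (ids are made up).
SHARED = {"vehicle": 0, "pedestrian": 1, "vegetation": 2, "ground": 3, "other": 4}
NUSC_TO_SHARED = np.array([0, 0, 1, 2, 3, 4, 4])      # nuScenes-style ids -> shared ids
KITTI_TO_SHARED = np.array([4, 0, 1, 1, 2, 3, 4, 4])  # SemanticKITTI-style ids -> shared ids

def remap_labels(labels, table):
    """Map a dataset-specific semantic label volume onto the shared label space."""
    return table[labels]

occ = np.random.randint(0, len(NUSC_TO_SHARED), size=(8, 8, 4))   # toy occupancy labels
print(remap_labels(occ, NUSC_TO_SHARED).shape)
```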
NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
Peiran Xu
Peking University
Xicheng Gong
Peking University
Yadong Mu
Peking University
Abstract
In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn general knowledge regarding the layout and object relations within indoor scenes. This model generates a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking that action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original history-based scores, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding
Tianrun Xu
Department of Automation, Tsinghua University
Guanyu Chen
Department of Automation, Tsinghua University
Ye Li
School of Software, Xinjiang University
Yuxin Xi
School of Artificial Intelligence, Beijing Normal University
Zeyu Mu
Department of Automation, Tsinghua University
Ruichen Wang
Department of Automation, Tsinghua University
Tianren Zhang
Department of Automation, Tsinghua University
Haichuan Gao
Department of Automation, Tsinghua University
Feng Chen
Department of Automation, Tsinghua University
Abstract
Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotations or on high-performing closed-source models, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a dataset of 1.4M image-task instances. We further utilize this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform various vision-related tasks. Our OURO model, fine-tuned on Qwen2-VL-7B-Instruct using LoRA, achieves substantial improvements over both the base model and similarly sized counterparts across multiple multimodal benchmarks. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets are available at https://github.com/tinnel123666888/OURO.git.
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
Liang Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Chengqun Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Zili Lin
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Fei Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yifan Liu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Congsheng Xu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yiyi Zhang
MoE Key Lab of AI, School of Computer Science, Shanghai Jiao Tong University
Jie Qin
Nanjing University of Aeronautics and Astronautics
Xingdong Sheng
Lenovo
Yunhui Liu
Lenovo
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Yichao Yan
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Abstract
Learning action models from real-world human-centric interaction datasets is important for building general-purpose intelligent assistants efficiently. However, most existing datasets offer only specialist interaction categories and ignore that AI assistants perceive and act from a first-person perspective. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions, and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future work on building AI agents in the physical world.
SAM4D: Segment Anything in Camera and LiDAR Streams
Jianyun Xu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Song Wang
Zhejiang University
Ziqian Ni
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Chunyong Hu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Sheng Yang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Jianke Zhu
Zhejiang University
Qiang Li
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
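One way to picture a unified multi-modal positional encoding is to lift camera pixels into 3D and then apply the same sinusoidal encoding used for LiDAR points, as in the hedged sketch below; the encoding form, frequencies, and unprojection details are assumptions, not the actual UMPE design.

```python
import torch

def sin_pos_enc_3d(xyz, num_freqs=8):
    """Sinusoidal encoding of 3D coordinates shared by both modalities (illustrative)."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32) * torch.pi   # (F,)
    ang = xyz[..., None] * freqs                                             # (..., 3, F)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)             # (..., 6F)

def unproject_pixels(uv, depth, K):
    """Lift camera pixels to 3D (camera frame) so they can share the LiDAR encoding.
    uv: (N, 2) pixel coords, depth: (N,), K: (3, 3) intrinsics."""
    ones = torch.ones(len(uv), 1)
    rays = (torch.linalg.inv(K) @ torch.cat([uv, ones], dim=1).T).T          # (N, 3)
    return rays * depth[:, None]

lidar_xyz = torch.randn(100, 3) * 10
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
cam_xyz = unproject_pixels(torch.rand(50, 2) * 640, torch.rand(50) * 20, K)
pe_lidar, pe_cam = sin_pos_enc_3d(lidar_xyz), sin_pos_enc_3d(cam_xyz)
print(pe_lidar.shape, pe_cam.shape)   # both live in the same positional-encoding space
```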
Sequential Gaussian Avatars with Hierarchical Motion Context
Wangze Xu
Shanghai Artificial Intelligence Laboratory
Yifan Zhan
Shanghai Artificial Intelligence Laboratory
Zhihang Zhong
Shanghai Artificial Intelligence Laboratory
Xiao Sun
Shanghai Artificial Intelligence Laboratory
Abstract
The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, with the recently popular 3DGS technique enabling real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex mapping from pose to appearance during fitting. In this paper, we propose SeqAvatar, which exploits the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of the proposed motion conditions, we adopt a spatiotemporal multi-scale sampling strategy to hierarchically integrate more motion cues to model human avatars. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, while delivering performance that is at least comparable and often superior. Project page: https://zezeaaa.github.io/projects/SeqAvatar/
Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
Mutian Xu
SSE, CUHKSZ
Chongjie Ye
FNii-Shenzhen
Haolin Liu
Tencent Hunyuan3D
Yushuang Wu
ByteDance Games
Jiahao Chang
SSE, CUHKSZ
Xiaoguang Han
SSE, CUHKSZ
Abstract
3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress on this solution has stagnated in recent studies. This work explores a new solution path for data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage fine-tunes Stable Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where the diffusion loss is adjusted to prioritize the distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training a network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: mutianxu.github.io/stable-sim2real.
TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu
Wangxuan Institute of Computer Technology, Peking University
Ting Lei
Wangxuan Institute of Computer Technology, Peking University
Zhimin Li
Tencent Inc.
Guan Wang
Baidu Inc.
Qingchao Chen
National Institute of Health Data Science, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Yang Liu
Wangxuan Institute of Computer Technology, Peking University
Abstract
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow between neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git
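The Inter-frame Attention Augmentation step can be approximated by warping a neighboring frame's attention maps to the current frame with optical flow and fusing them with the current maps; the sketch below uses bilinear warping and an element-wise maximum as a stand-in fusion rule, which is an assumption rather than the paper's exact strategy.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(attn_nbr, flow):
    """Warp a neighboring frame's attention maps to the current frame with optical flow.
    attn_nbr: (B, C, H, W) attention maps of a neighboring frame
    flow:     (B, 2, H, W) flow from current frame to the neighbor, in pixels
    """
    B, _, H, W = attn_nbr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(attn_nbr.device)      # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                    # sampling positions
    # normalize to [-1, 1] for grid_sample (x then y)
    coords[:, 0] = 2 * coords[:, 0] / (W - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    grid = coords.permute(0, 2, 3, 1)                                    # (B, H, W, 2)
    return F.grid_sample(attn_nbr, grid, align_corners=True)

attn_cur = torch.rand(1, 5, 32, 32)
attn_nbr = torch.rand(1, 5, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                                         # zero flow for the demo
fused = torch.maximum(attn_cur, warp_with_flow(attn_nbr, flow))          # motion-aware fusion
print(fused.shape)
```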
Training-Free Industrial Defect Generation with Diffusion Models
Ruyi Xu
National Taiwan University
Yen-Tzu Chiu
National Taiwan University
Tai-I Chen
National Taiwan University
Oscar Chew
ASUS
Yung-Yu Chuang
National Taiwan University
Wen-Huang Cheng
National Taiwan University
Abstract
Anomaly generation has become essential in addressing the scarcity of defective samples in industrial anomaly inspection. However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. We propose a Feature Alignment strategy that provides fine-grained appearance guidance by minimizing the distributional gap between generated and real defects with high complexity. Additionally, we introduce an Adaptive Anomaly Mask mechanism to mitigate the issue of defects with small regions being ignored during the generation process, enhancing consistency between synthetic defects and their corresponding masks. Finally, we incorporate a Texture Preservation module that extracts background information from anomaly-free images, ensuring that the visual properties of synthetic defects are seamlessly integrated into the image. Extensive experiments demonstrate the effectiveness of our method in generating accurate and diverse anomalies, further leading to superior performance in downstream anomaly inspection tasks. Our code is available at https://github.com/rubymiaomiao/TF-IDG.
ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
Xiwei Xuan
University of California, Davis
Ziquan Deng
University of California, Davis
Kwan-Liu Ma
University of California, Davis
Abstract
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or by the suboptimal quality of their reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available here.
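A hedged sketch of similarity-based retrieval in this spirit: each query segment embedding retrieves its nearest reference segments and is scored against class-name embeddings through the paired text embeddings. The aggregation rule (top-k mean) and all shapes are illustrative assumptions, not necessarily ReME's exact formulation.

```python
import numpy as np

def assign_labels(seg_emb, ref_emb, ref_text_emb, class_text_emb, k=5):
    """Each query segment retrieves its k nearest reference segments, averages their
    paired text embeddings, and is classified against class-name embeddings.
    All embeddings are assumed L2-normalized."""
    sims = seg_emb @ ref_emb.T                          # (n_seg, n_ref)
    topk = np.argsort(-sims, axis=1)[:, :k]             # indices of nearest references
    retrieved = ref_text_emb[topk].mean(axis=1)         # (n_seg, d)
    retrieved /= np.linalg.norm(retrieved, axis=1, keepdims=True)
    class_scores = retrieved @ class_text_emb.T         # (n_seg, n_class)
    return class_scores.argmax(axis=1)

d, n_seg, n_ref, n_cls = 64, 10, 200, 7
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
seg = norm(np.random.randn(n_seg, d)); ref = norm(np.random.randn(n_ref, d))
ref_txt = norm(np.random.randn(n_ref, d)); cls_txt = norm(np.random.randn(n_cls, d))
print(assign_labels(seg, ref, ref_txt, cls_txt))
```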
Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
Ying Xue
ETH Zürich
Jiaxi Jiang
ETH Zürich
Rayan Armani
ETH Zürich
Dominik Hollidt
ETH Zürich
Yi-Chi Liao
ETH Zürich
Christian Holz
ETH Zürich
Abstract
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individual people, as inertial cues are inherently self-referential and provide no direct spatial reference about others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors, both on each individual and across different people. Our method, Group Inertial Poser, estimates these absolute distances between pairs of sensors from ultra-wideband (UWB) ranging and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world captures, showing the promise of IMU+UWB-based multi-human motion capture in the wild. [Code & Dataset]
SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer
Yujie Xue
Hunan University
Huilong Pi
Hunan University
Jiapeng Zhang
Hunan University
Yunchuan Qin
Hunan University
Zhuo Tang
Hunan University
Kenli Li
Hunan University
Ruihui Li
Hunan University
Abstract
Vision-based semantic scene completion (SSC) is able to predict complex scene information from limited 2D images, which has attracted widespread attention. Current SSC methods typically construct unified voxel features containing both geometry and semantics, which leads to different depth positions in occluded regions sharing the same 2D semantic information, resulting in ambiguous semantic segmentation. To address this problem, we propose SDFormer, a novel SAM-assisted Dual-channel Voxel Transformer framework for SSC. We decouple the task based on its multi-objective nature and construct two parallel sub-networks: a semantic constructor (SC) and a geometric refiner (GR). The SC utilizes the Segment Anything Model (SAM) to construct dense semantic voxel features from reliable visible semantic information in the image. The GR accurately predicts depth positions and then further adjusts the semantic output of SAM. Additionally, we design a Semantic Calibration Affinity to enhance semantic-aware transformations in the SC. Within the GR, a Shape Segments Interactive and Learnable Mask Generation module emphasizes the spatial location of semantics to obtain fine-grained voxel information. Extensive qualitative and quantitative results on the SemanticKITTI and SSCBench-KITTI-360 datasets show that our method outperforms state-of-the-art approaches.
Adversarial Attention Perturbations for Large Object Detection Transformers
Zachary Yahn
Georgia Institute of Technology
Selim Furkan Tekin
Georgia Institute of Technology
Fatih Ilhan
Georgia Institute of Technology
Sihao Hu
Georgia Institute of Technology
Tiansheng Huang
Georgia Institute of Technology
Yichang Xu
Georgia Institute of Technology
Margaret Loper
Georgia Tech Research Institute
Ling Liu
Georgia Institute of Technology
Abstract
Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at: Link.
MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost
Taiga Yamane
NTT Human Informatics Laboratories, NTT Corporation
Ryo Masumura
NTT Human Informatics Laboratories, NTT Corporation
Satoshi Suzuki
NTT Human Informatics Laboratories, NTT Corporation
Shota Orihashi
NTT Human Informatics Laboratories, NTT Corporation
Abstract
Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird's eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces trajectory motion cost and trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs calculate which pedestrians at the current and each past timestamp are likely identical based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that current pedestrian with the correct past trajectory based on other past timestamps. In addition, MVTrajecter effectively captures the relationships between multiple timestamps leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms the previous state-of-the-art methods.
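A simplified, non-learned version of combining trajectory motion and appearance costs over multiple past timestamps is sketched below, with a Hungarian assignment standing in for the model's association mechanism; weights, distances, and shapes are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cur_pos, cur_app, traj_pos, traj_app, w_motion=1.0, w_app=1.0):
    """Illustrative association mixing motion and appearance cues over T past timestamps.
    cur_pos:  (N, 2)     current BEV positions     cur_app:  (N, D)    appearance embeddings
    traj_pos: (M, T, 2)  past positions per track  traj_app: (M, T, D) past appearance per track
    """
    motion = np.linalg.norm(cur_pos[:, None, None, :] - traj_pos[None], axis=-1).mean(-1)  # (N, M)
    cur_n = cur_app / np.linalg.norm(cur_app, axis=-1, keepdims=True)
    trk_n = traj_app / np.linalg.norm(traj_app, axis=-1, keepdims=True)
    app = 1.0 - np.einsum("nd,mtd->nmt", cur_n, trk_n).mean(-1)                            # (N, M)
    cost = w_motion * motion + w_app * app
    rows, cols = linear_sum_assignment(cost)          # detection rows[i] -> track cols[i]
    return list(zip(rows, cols))

N, M, T, D = 4, 3, 5, 16
print(associate(np.random.rand(N, 2), np.random.rand(N, D),
                np.random.rand(M, T, 2), np.random.rand(M, T, D)))
```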
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan
Meituan
Fanfan Liu
Meituan
Yiyang Huang
Meituan
Zechao Guan
Meituan
Liming Zheng
Meituan
Yufeng Zhong
Meituan
Chengjian Feng
Meituan
Lin Ma
Meituan
Abstract
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani, on one hand, enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it further incorporates Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers a complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
Shaocheng Yan
Wuhan University
Pengcheng Shi
Wuhan University
Zhenjun Zhao
University of Zaragoza
Kaixin Wang
Beijing University of Technology
Kuang Cao
Wuhan University
Ji Wu
Wuhan University
Jiayuan Li
Wuhan University
Abstract
Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC2 scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates 208.22x faster than 3DMAC while also achieving higher recall. Our code is accessible at TurboReg.
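The 3-clique idea can be sketched as follows: build a length-consistency compatibility graph over correspondences, pick high-scoring pairs as pivots, and close each pivot edge into 3-cliques via common neighbors. In this sketch the pivot score is a plain common-neighbor count rather than the paper's SC2 measure, and transformation estimation is omitted, so it is only a rough illustration.

```python
import numpy as np

def turbo_cliques(src, dst, tau=0.1, n_pivots=20):
    """Rough sketch of pivot-guided 3-clique search on a correspondence compatibility graph."""
    N = len(src)
    # pairwise length consistency: |d_src(i,j) - d_dst(i,j)| is small for inlier pairs
    d_src = np.linalg.norm(src[:, None] - src[None], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None], axis=-1)
    compat = (np.abs(d_src - d_dst) < tau) & ~np.eye(N, dtype=bool)
    # score every compatible pair by its number of common neighbors, keep the top pivots
    common = (compat.astype(int) @ compat.astype(int)) * compat
    ii, jj = np.unravel_index(np.argsort(-common, axis=None)[:n_pivots], common.shape)
    cliques = []
    for i, j in zip(ii, jj):
        for k in np.flatnonzero(compat[i] & compat[j]):   # third vertex closes the 3-clique
            cliques.append((int(i), int(j), int(k)))
    return cliques

src = np.random.rand(50, 3)
R = np.linalg.qr(np.random.randn(3, 3))[0]                # random orthogonal transform
dst = src @ R.T + np.array([0.5, -0.2, 0.1])              # rigidly moved copy (all inliers)
print(len(turbo_cliques(src, dst)))
```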
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
Yung-Hsu Yang
ETH Zürich
Luigi Piccinelli
ETH Zürich
Mattia Segu
ETH Zürich
Siyuan Li
ETH Zürich
Rui Huang
ETH Zürich
Yuqian Fu
INSAIT
Marc Pollefeys
ETH Zürich
Hermann Blum
ETH Zürich
Zuria Bauer
ETH Zürich
Abstract
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior to improve the generalization of 3D estimation across diverse scenes. To further improve performance, we design a canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D → Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.
AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Dejie Yang
Peking University
Zijing Zhao
Peking University
Yang Liu
Peking University
Abstract
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multimodal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity. Code available at https://github.com/idejie/ar.
Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions
Mengyu Yang
Georgia Institute of Technology
Yiming Chen
Georgia Institute of Technology
Haozheng Pei
Georgia Institute of Technology
Siddhant Agarwal
Georgia Institute of Technology
Arun Balajee Vasudevan
Carnegie Mellon University
James Hays
Georgia Institute of Technology
Abstract
Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.
CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
Feng Yang
Southeast University
Yichao Cao
Southeast University
Xiu Su
Central South University
Dan Niu
Southeast University
Xuanpeng Li
Southeast University
Abstract
Understanding real-world 3D point clouds is challenging due to domain shifts. The key challenge is disentangling domain-invariant semantics from domain-specific geometric variations, as point clouds exhibit local inconsistency and global redundancy, making direct alignment ineffective. To address this, we propose CounterPC, a counterfactual intervention-based framework, which formulates domain adaptation within a causal latent space, identifying category-discriminative features entangled with intra-class geometric variation confounders. Through counterfactual interventions, we generate counterfactual target samples that retain domain-specific characteristics while improving class separation, mitigating domain bias for optimal feature transfer. To achieve this, we introduce two key modules: i) Joint Distribution Alignment, which leverages 3D foundation models (3D-FMs) and a self-supervised autoregressive generative prediction task to unify feature alignment, and ii) Counterfactual Feature Realignment, which employs Optimal Transport to align category-relevant and category-irrelevant feature distributions, ensuring robust sample-level adaptation while preserving domain properties. CounterPC outperforms current methods on PointDA and GraspNetPC-10 with significant improvements.
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Xuemeng Yang
Shanghai Artificial Intelligence Laboratory
Licheng Wen
Shanghai Artificial Intelligence Laboratory
Tiantian Wei
Technical University of Munich
Yukai Ma
Zhejiang University
Jianbiao Mei
Zhejiang University
Xin Li
Shanghai Artificial Intelligence Laboratory
Wenjie Lei
Zhejiang University
Daocheng Fu
Shanghai Artificial Intelligence Laboratory
Pinlong Cai
Shanghai Artificial Intelligence Laboratory
Min Dou
Shanghai Artificial Intelligence Laboratory
Liang He
East China Normal University
Yong Liu
Zhejiang University
Botian Shi
Shanghai Artificial Intelligence Laboratory
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Abstract
This paper introduces DRIVEARENA, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. DRIVEARENA comprises two core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any global street map, and World Dreamer, a high-fidelity conditional generative model with infinite auto-regression. DRIVEARENA supports closed-loop simulation using road networks from cities worldwide, enabling the generation of diverse traffic scenarios with varying styles. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DRIVEARENA's simulated environment. Furthermore, DRIVEARENA features a flexible, modular architecture, allowing for multiple implementations of its core components and driving agents. Serving as a highly realistic arena for these players, our work provides a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DRIVEARENA takes a significant leap forward in leveraging generative models for driving simulation platforms, opening new avenues for closed-loop evaluation of autonomous driving systems.
Driving View Synthesis on Free-form Trajectories with Generative Prior
Zeyu Yang
Fudan University
Zijie Pan
Fudan University
Yuankun Yang
Fudan University
Xiatian Zhu
University of Surrey
Li Zhang
Fudan University
Abstract
Driving view synthesis along free-form trajectories is essential for realistic driving simulations, enabling closed-loop evaluation of end-to-end driving policies. Existing methods excel at view interpolation along recorded paths but struggle to generalize to novel trajectories due to the limited viewpoints in driving videos. To tackle this challenge, we propose DriveX, a novel free-form driving view synthesis framework that progressively distills generative priors into the 3D Gaussian model during its optimization. Within this framework, we utilize a video diffusion model to refine the degraded novel-trajectory renderings from the in-training Gaussian model, while the restored videos in turn serve as additional supervision for optimizing the 3D Gaussians. Concretely, we craft an inpainting-based video restoration task, which disentangles the identification of degraded regions from the generative capability of the diffusion model and removes the need to simulate specific degradation patterns when training the diffusion model. To further enhance the consistency and fidelity of the generated content, the pseudo ground truth is progressively updated with the gradually improved novel-trajectory renderings, allowing both components to co-adapt and reinforce each other while minimizing disruption to the optimization. By tightly integrating 3D scene representation with generative priors, DriveX achieves high-quality view synthesis beyond recorded trajectories in real time, unlocking new possibilities for flexible and realistic driving simulations on free-form trajectories.
GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views
Hang Yang
Nanjing University of Science and Technology
Le Hui
Northwestern Polytechnical University
Jianjun Qian
Nanjing University of Science and Technology
Jin Xie
Nanjing University
Jian Yang
Nanjing University of Science and Technology
Abstract
Generalizable surface reconstruction aims to recover the surface of a scene from a sparse set of images in a feed-forward manner. Existing volume rendering-based methods evaluate numerous points along camera rays to infer the geometry, resulting in inefficient reconstruction. Recently, 3D Gaussian Splatting has offered an alternative efficient scene representation and inspired a series of surface reconstruction methods. However, these methods require dense views and cannot generalize to new scenes. In this paper, we propose a novel surface reconstruction method with Gaussian splatting, named GSRecon, which leverages the advantages of rasterization-based rendering to achieve efficient reconstruction. To obtain an accurate geometry representation, we propose a geometry-aware cross-view enhancement module to improve the unreliable geometry estimation in the current view by incorporating accurate geometric information from other views. To generate fine-grained Gaussian primitives, we propose a hybrid cross-view feature aggregation module that integrates an efficient voxel branch and a fine-grained point branch to jointly capture cross-view geometric information. Subsequently, per-view depth maps are rendered using these Gaussian primitives and fused to obtain the final 3D surface. Extensive experiments on the DTU, BlendedMVS, and Tanks and Temples datasets validate that GSRecon achieves state-of-the-art performance efficiently. Code is available at https://github.com/hyangwinter/GSRecon.
HFD-Teacher: High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion
Zhiyuan Yang
Nanyang Technological University
Anqi Cheng
Nanyang Technological University
Haiyue Zhu
SIMTech, A*STAR
Tianjiao Li
Nanyang Technological University
Pey Yuen Tao
SIMTech, A*STAR
Kezhi Mao
Nanyang Technological University
Abstract
Depth completion, the task of reconstructing dense depth maps from sparse depth and RGB images, plays a critical role in 3D scene understanding. However, existing methods often struggle to recover high-frequency details, such as regions with fine structures or weak signals, since depth sensors may fail to capture accurate depth maps in those regions, leading to imperfect supervision ground truth. To overcome this limitation, it is essential to introduce an alternative training source for the models. Emerging depth foundation models excel at producing high-frequency details from RGB images, yet their depth maps suffer from inconsistent scaling. Therefore, we propose a novel teacher-student framework that enhances depth completion by distilling high-frequency knowledge from depth foundation models across multiple scales. Our approach introduces two key innovations: Adaptive Local Wavelet Decomposition, which dynamically adjusts the wavelet decomposition level based on local complexity for efficient feature extraction, and Topological Constraints, which apply persistent homology to enforce structural coherence and suppress spurious depth edges. Experimental results demonstrate that our method outperforms state-of-the-art methods, preserving high-frequency details and overall depth fidelity.
InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
Zhuoran Yang
University of Science and Technology of China
Xi Guo
SenseAuto
Chenjing Ding
SenseAuto
Chiyu Wang
SenseAuto
Wei Wu
SenseAuto
Yanyong Zhang
University of Science and Technology of China
Abstract
Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems.
InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
Zesong Yang
Zhejiang University
Bangbang Yang
ByteDance
Wenqi Dong
Zhejiang University
Chenxuan Cao
Zhejiang University
Liyuan Cui
Zhejiang University
Yuewen Ma
Zhejiang University
Zhaopeng Cui
Zhejiang University
Hujun Bao
Zhejiang University
Abstract
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
Xiuyu Yang
UT Austin
Shuhan Tan
UT Austin
Philipp Krähenbühl
UT Austin
Abstract
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for the initial agents in a scene. This is problematic for long-term simulation, since agents enter and exit the scene as the ego vehicle moves into new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation modes, enabling stable long-term rollout simulation. InfGen achieves state-of-the-art performance in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen.
PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
ChangHe Yang
LG Electronics
Hyeonseop Song
LG Electronics
Seokhun Choi
LG Electronics
Seungwoo Lee
LG Electronics
Jaechul Kim
LG Electronics
Hoseok Do
LG Electronics
Abstract
Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. PoseSyn comprises two key components: the Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model, aligned with challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.
RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness
Yuyang Yang
Xiamen University
Wen Li
Xiamen University
Sheng Ao
Xiamen University
Qingshan Xu
Nanyang Technological University
Shangshu Yu
Northeastern University
Yu Guo
Xiamen University
Yin Zhou
GAC R&D Center
Siqi Shen
Xiamen University
Cheng Wang
Xiamen University
Abstract
LiDAR localization is a fundamental task in autonomous driving and robotics. Scene Coordinate Regression (SCR) exhibits leading pose accuracy, achieving impressive results in learning-based localization. We observe that real-world LiDAR scans captured from different viewpoints usually cause a catastrophic collapse of SCR. However, existing LiDAR localization methods have largely overlooked the issue of rotation sensitivity in SCR. In this paper, we present RALoc, an outdoor LiDAR localization method with rotation awareness to achieve accurate localization. The key to our approach is a Point Cloud Canonicalization module, which leverages powerful equivariant key feature aggregation to transform the input LiDAR scan towards a consistent orientation, effectively eliminating the adverse effects of rotation. This module has promising scalability and can be seamlessly integrated with existing LiDAR localization networks. Moreover, we propose the Bidirectional LiDAR Localization (BiLiLo) dataset as a benchmark to evaluate the performance of various methods in large outdoor scenes with significant rotation changes. Extensive experiments show that RALoc significantly improves localization performance in scenarios with large rotation changes, and also achieves competitive performance on the Oxford Radar RobotCar dataset. Our project is available at https://etheryangyy.github.io/raloc.github.io.
STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
Xiaohang Yang
Queen Mary University of London
Qing Wang
Queen Mary University of London
Jiahao Yang
Queen Mary University of London
Gregory Slabaugh
Queen Mary University of London
Shanxin Yuan
Queen Mary University of London
Abstract
Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration, while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code page: https://github.com/XiaohangYang829/StaR.
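As a rough illustration of a multi-level trajectory smoothness term, the sketch below penalizes first- and second-order temporal differences of a joint-position sequence; the actual temporal consistency constraint in STaR is not specified in the abstract, so the weights and form here are assumptions.

```python
import torch

def temporal_consistency_loss(joints: torch.Tensor,
                              w_vel: float = 1.0,
                              w_acc: float = 0.5) -> torch.Tensor:
    """Penalize jittery motion on a (T, J, 3) joint-position sequence.

    A minimal stand-in for a multi-level smoothness constraint:
    first-order (velocity) and second-order (acceleration) differences.
    """
    vel = joints[1:] - joints[:-1]          # (T-1, J, 3)
    acc = vel[1:] - vel[:-1]                # (T-2, J, 3)
    return w_vel * vel.pow(2).mean() + w_acc * acc.pow(2).mean()

# Toy usage: a retargeted sequence of 60 frames and 24 joints.
seq = torch.randn(60, 24, 3, requires_grad=True)
loss = temporal_consistency_loss(seq)
loss.backward()
```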
SpikeDiff: Zero-shot High-Quality Video Reconstruction from Chromatic Spike Camera and Sub-millisecond Spike Streams
Siqi Yang
Institute for Artificial Intelligence, Peking University
Jinxiu Liang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Zhaojun Huang
National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
Yeliduosi Xiaokaiti
National Engineering Research Center of Visual Technology, School of Computer Science, Peking University
Yakun Chang
Institute of Information Science, Beijing Jiaotong University
Zhaofei Yu
Institute for Artificial Intelligence, Peking University
Boxin Shi
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Abstract
High-speed video reconstruction from neuromorphic spike cameras offers a promising alternative to traditional frame-based imaging, providing superior temporal resolution and dynamic range with reduced power consumption. Nevertheless, reconstructing high-quality colored videos from spikes captured in ultra-short time intervals (sub-millisecond) remains challenging due to the inherently noisy nature of spikes. While some existing methods extend the temporal capture window to improve reconstruction quality, they inevitably compromise the temporal resolution advantages of spike cameras. In this paper, we introduce SpikeDiff, the first zero-shot framework that leverages pretrained diffusion models to reconstruct high-quality colored videos from sub-millisecond (0.5ms) chromatic spike streams. By incorporating physics-based guidance into the diffusion sampling process, SpikeDiff bridges the domain gap between chromatic spikes and conventional images, enabling high-fidelity reconstruction without requiring domain-specific training data. Extensive experiments demonstrate that SpikeDiff achieves impressive reconstruction quality while maintaining ultra-high temporal resolution, outperforming existing methods across diverse challenging scenarios in both perceptual quality and structural preservation.
Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
Songru Yang
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Zhenwei Shi
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Zhengxia Zou
Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University
Abstract
Understanding movements in multi-agent scenarios is a fundamental problem in intelligent systems. Previous research assumes complete and synchronized observations. However, real-world partial observation caused by occlusions leads to inevitable model failure, which demands a unified framework for coexisting trajectory prediction, imputation, and recovery. Unlike previous attempts that handled observed and unobserved behaviors in a coupled manner, we explore a decoupled denoising diffusion modeling paradigm with a unidirectional information valve to separate out the interference from uncertain behaviors. Building on this, we propose a Unified Masked Trajectory Diffusion model (UniMTD) for arbitrary levels of missing observations. We design a unidirectional attention mechanism as a valve unit to control the direction of information flow between the observed and masked areas, gradually refining the missing observations toward the real-world distribution. We further organize it into a unidirectional MoE structure to handle varying proportions of missing observations. A Cached Diffusion model is also designed to improve generation quality while reducing computation and time overhead. Our method achieves substantial gains on both human motion and vehicle traffic data. UniMTD efficiently achieves a 74% improvement in minADE20 and reaches SOTA with advantages of 91%, 66%, 69%, and 58% across four fidelity metrics covering out-of-boundary, velocity, and trajectory length.
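One plausible reading of the unidirectional valve is an attention mask that lets masked (unobserved) tokens attend to observed ones but not the reverse. The sketch below builds such a mask and plugs it into standard scaled dot-product attention; it is illustrative only and not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def valve_attention_mask(observed: torch.Tensor) -> torch.Tensor:
    """Return an (N, N) boolean mask where True marks *blocked* attention edges.

    `observed` is a (N,) bool tensor: True for observed trajectory tokens,
    False for masked/unobserved ones. Observed queries are blocked from
    attending to unobserved keys, so uncertainty cannot leak back, while
    unobserved queries may attend everywhere (an assumed reading of the valve).
    """
    obs_q = observed[:, None]   # queries
    obs_k = observed[None, :]   # keys
    return obs_q & ~obs_k       # block: observed query -> unobserved key

# Toy usage with PyTorch's scaled dot-product attention.
N, d = 6, 16
observed = torch.tensor([True, True, True, False, False, False])
q = k = v = torch.randn(1, 1, N, d)
blocked = valve_attention_mask(observed)
# For a boolean attn_mask, True means "allowed to attend", so invert `blocked`.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=~blocked)
```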
Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Chengtang Yao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Lidong Yu
NVIDIA
Zhidan Liu
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Jiaxi Zeng
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Yuwei Wu
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Yunde Jia
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Abstract
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from vision foundation models (VFMs) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Second, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth into the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching effectively and efficiently. Experiments show a significant improvement in performance when generalizing from SceneFlow to the Middlebury and Booster datasets, with barely any loss in efficiency.
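To see why a binary local ordering map can unify relative and absolute depth, the sketch below converts a monocular depth map into per-neighbor ordering bits; since only the sign of local differences is kept, the map is unchanged under affine rescaling. The neighbor choice and hard thresholding are illustrative simplifications, not the paper's exact construction.

```python
import numpy as np

def binary_local_ordering(depth: np.ndarray) -> np.ndarray:
    """Binary relative-ordering maps from a (relative) monocular depth map.

    For each pixel, record whether it is deeper than its right and bottom
    neighbors. Keeping only the sign of local differences makes the map
    invariant to affine rescaling of the monocular depth.
    """
    right = np.roll(depth, -1, axis=1)   # neighbor at (i, j+1)
    down = np.roll(depth, -1, axis=0)    # neighbor at (i+1, j)
    order_r = (depth > right).astype(np.uint8)
    order_d = (depth > down).astype(np.uint8)
    order_r[:, -1] = 0                   # last column wraps around; mark invalid
    order_d[-1, :] = 0                   # last row wraps around; mark invalid
    return np.stack([order_r, order_d], axis=0)

# Toy usage: the map should match for a depth map and an affine transform of it.
d = np.random.rand(8, 8).astype(np.float32)
print(np.array_equal(binary_local_ordering(d), binary_local_ordering(2.5 * d + 1.0)))
```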
MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
Xingbo Yao
Hong Kong University of Science and Technology (Guangzhou)
Xuanmin Wang
Tianjin University
Hao Wu
Hong Kong University of Science and Technology
Chengliang Ping
Hong Kong University of Science and Technology
Doudou Zhang
Hong Kong University of Science and Technology
Hui Xiong
Hong Kong University of Science and Technology
Abstract
Directly generating 3D cities from satellite imagery opens up new possibilities for gaming and mapping services. However, this task remains challenging due to the limited information in satellite views, making it difficult for existing methods to achieve both photorealistic textures and geometric accuracy. To address these challenges, we propose MagicCity, a novel large-scale generative model for photorealistic 3D city generation with geometric consistency. Given a satellite image, our framework first extracts 3D geometric information and encodes it alongside textural features using a dual encoder. These features then guide a multi-branch diffusion model to generate city-scale, geometrically consistent multi-view images. To further enhance texture consistency across different viewpoints, we propose an Inter-Frame Cross Attention mechanism that enables feature sharing across different frames. Additionally, we incorporate a Hierarchical Geometric-Aware Module and a Consistency Evaluator to improve overall scene consistency. Finally, the generated images are fed into our robust 3D reconstruction pipeline to produce visually high-quality and geometrically consistent 3D cities. Moreover, we contribute CityVista, a high-quality dataset comprising 500 3D city scenes along with corresponding multi-view images and satellite imagery to advance research in 3D city generation. Experimental results demonstrate that MagicCity surpasses state-of-the-art methods in both geometric consistency and visual quality. Our project page: https://github.com/YaoXingbo/MagicCity
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Xuan Yao
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA)
Junyu Gao
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)
Changsheng Xu
Peng Cheng Laboratory
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that NavMorph achieves notable performance improvements on popular VLN-CE benchmarks. Our code is available at https://github.com/Feliciaxyao/NavMorph.
UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
Siyuan Yao
Sun Yat-sen University
Rui Zhu
Beijing University of Posts and Telecommunications
Ziqi Wang
Beijing University of Posts and Telecommunications
Wenqi Ren
Sun Yat-sen University
Yanyang Yan
University of Chinese Academy of Sciences
Xiaochun Cao
Sun Yat-sen University
Abstract
Visual object tracking has gained promising progress in past decades. Most existing approaches focus on learning target representations from well-conditioned daytime data, while in unconstrained real-world scenarios with adverse weather conditions, e.g., nighttime or foggy environments, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% of the frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and sets new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.
Unsupervised Visible-Infrared Person Re-identification under Unpaired Settings
Haoyu Yao
School of Computer Science, Wuhan University
Bin Yang
School of Computer Science, Wuhan University
Wenke Huang
School of Computer Science, Wuhan University
Bo Du
School of Computer Science, Wuhan University
Mang Ye
School of Computer Science, Wuhan University
Abstract
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to train a cross-modality retrieval model without labels, reducing the reliance on expensive cross-modality manual annotation. However, existing USL-VI-ReID methods rely on artificially paired cross-modality data as implicit supervision, which is also expensive to annotate and contrary to the setting of unsupervised tasks. In addition, this full alignment of identities across modalities is inconsistent with real-world scenarios, where unpaired settings are prevalent. To this end, we study the USL-VI-ReID task under unpaired settings, which uses cross-modality unpaired and unlabeled data to train a VI-ReID model. We propose a novel Mapping and Collaborative Learning (MCL) framework. Specifically, we first design a simple yet effective Cross-modality Feature Mapping (CFM) module to map and generate fake cross-modality positive feature pairs, constructing a cross-modal pseudo-identity space for feature alignment. Then, a Static-Dynamic Collaborative (SDC) learning strategy is proposed to align cross-modality correspondences through a collaborative approach, eliminating inter-modality discrepancies across different aspects, i.e., cluster-level and instance-level, in scenarios with cross-modal identity mismatches. Extensive experiments on the SYSU-MM01 and RegDB benchmarks under paired and unpaired settings demonstrate that our proposed MCL significantly outperforms existing unsupervised methods, facilitating the real-world deployment of USL-VI-ReID.
GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
Shunsuke Yasuki
Rikkyo University
Taiki Miyanishi
The University of Tokyo
Nakamasa Inoue
Institute of Science Tokyo
Shuhei Kurita
National Institute of Informatics
Koya Sakamoto
The University of Tokyo
Daichi Azuma
The University of Tokyo
Masato Taki
Rikkyo University
Yutaka Matsuo
The University of Tokyo
Abstract
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.
Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Cloud Classification via Token Purging
Moslem Yazdanpanah
LIVIA, ÉTS Montréal
Ali Bahri
International Laboratory on Learning Systems (ILLS)
Mehrdad Noori
International Laboratory on Learning Systems (ILLS)
Sahar Dastani
International Laboratory on Learning Systems (ILLS)
Gustavo Adolfo Vargas Hakim
International Laboratory on Learning Systems (ILLS)
David Osowiechi
International Laboratory on Learning Systems (ILLS)
Ismail Ben Ayed
International Laboratory on Learning Systems (ILLS)
Christian Desrosiers
International Laboratory on Learning Systems (ILLS)
Abstract
Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4x faster and 5.5x more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at https://github.com/MosyMosy/Purge-Gate
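A minimal sketch of source-statistics-based token purging (in the spirit of PG-SP) is shown below: tokens whose features deviate most from the source-domain mean and standard deviation are dropped before attention. The z-score deviation measure and the fixed purge ratio are assumptions for illustration, not the paper's exact criterion.

```python
import torch

def purge_tokens(tokens: torch.Tensor,
                 source_mean: torch.Tensor,
                 source_std: torch.Tensor,
                 purge_ratio: float = 0.2) -> torch.Tensor:
    """Drop the tokens that deviate most from source statistics.

    tokens:      (B, N, D) point-cloud tokens entering the attention layers.
    source_mean: (D,) per-dimension mean of source-domain token features.
    source_std:  (D,) per-dimension std  of source-domain token features.
    """
    b, n, _ = tokens.shape
    z = (tokens - source_mean) / (source_std + 1e-6)                 # (B, N, D)
    deviation = z.abs().mean(dim=-1)                                  # (B, N)
    n_keep = max(1, int(n * (1.0 - purge_ratio)))
    keep_idx = deviation.topk(n_keep, dim=1, largest=False).indices   # least deviant
    keep_idx = keep_idx.sort(dim=1).values                            # preserve token order
    return torch.gather(tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Toy usage: 2 clouds, 128 tokens, 64-dim features -> 102 tokens kept per cloud.
tok = torch.randn(2, 128, 64)
kept = purge_tokens(tok, tok.mean(dim=(0, 1)), tok.std(dim=(0, 1)))
print(kept.shape)
```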
ESCNet: Edge-Semantic Collaborative Network for Camouflaged Object Detection
Sheng Ye
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xin Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yan Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xianming Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Camouflaged object detection (COD) faces unique challenges where target boundaries are intrinsically ambiguous due to their textural similarity to backgrounds. Existing methods relying on single-modality features often produce fragmented predictions due to insufficient boundary constraints. To address this, we propose ESCNet with dynamically coupled edge-texture perception. Our framework introduces three core innovations that work in concert: 1) an Adaptive Edge-Texture Perceptor (AETP), which makes edge and texture predictions mutually reinforcing by combining multi-scale image features with the global semantic context of the Transformer; 2) a Dual-Stream Feature Augmentor (DSFA), which dynamically adjusts kernel sampling positions according to local texture complexity and edge orientation, thereby accurately enhancing features at fractal boundaries and amorphous texture regions; and 3) a Multi-Feature Modulation Module (MFMM), which establishes incremental fine-grained improvements for feature calibration and model prediction through enhanced edge-aware characterisation and hierarchical integration of multiple textures. This interconnected system forms a feedback loop in which enhanced edge-aware representations improve texture prediction and vice versa. ESCNet demonstrates significant performance advantages on all three authoritative datasets.
GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting
Baijun Ye
IIIS, Tsinghua University
Minghui Qin
IIIS, Tsinghua University
Saining Zhang
AIR, Tsinghua University
Moonjun Goon
IIIS, Tsinghua University
Shaoting Zhu
IIIS, Tsinghua University
Hao Zhao
AIR, Tsinghua University
Hang Zhao
IIIS, Tsinghua University
Abstract
Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and require additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) the ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. This highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for scalable auto-labeling. Project Page.
Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Chongjie Ye
SSE, CUHKSZ
Yushuang Wu
ByteDance Games
Ziteng Lu
SSE, CUHKSZ
Jiahao Chang
SSE, CUHKSZ
Xiaoyang Guo
ByteDance Games
Jiaqing Zhou
ByteDance Games
Hao Zhao
AIR, Tsinghua University
Xiaoguang Han
SSE, CUHKSZ
Abstract
With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples low- and high-frequency image patterns with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
Junyan Ye
Sun Yat-Sen University
Jun He
Sun Yat-Sen University
Weijia Li
Sun Yat-Sen University
Zhutao Lv
Sun Yat-Sen University
Yi Lin
Sun Yat-Sen University
Jinhua Yu
Sun Yat-Sen University
Haote Yang
Shanghai AI Laboratory
Conghui He
Shanghai AI Laboratory
Abstract
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street-view images, utilizing a diffusion model and the Bird's-Eye View (BEV) paradigm. The CurvedBEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. SkyDiffusion then uses a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster-scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets, and more information on this work can be found at https://opendatalab.github.io/skydiffusion/.
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Sun Yat-Sen University
Honglin Lin
Shanghai AI Laboratory
Leyan Ou
Sun Yat-Sen University
Dairong Chen
Wuhan University
Zihao Wang
Sun Yat-Sen University
Qi Zhu
Sun Yat-Sen University
Conghui He
Shanghai AI Laboratory
Weijia Li
Sun Yat-Sen University
Abstract
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More can be found at https://github.com/yejy53/CVG-Text.
Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
Qi Xun Yeo
Department of Computer Science, National University of Singapore
Yanyan Li
Department of Computer Science, National University of Singapore
Gim Hee Lee
Department of Computer Science, National University of Singapore
Abstract
Modern 3D semantic scene graph estimation methods utilize ground-truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground-truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy pseudo point-based geometry reconstructed from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and with neighboring relations. We obtain semantic masks to guide feature aggregation to filter out background features and design a novel method to incorporate neighboring node information to aid the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors computed from training-set summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods that purely use multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.
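As an illustration of statistical confidence rescoring, the sketch below blends a node's predicted class distribution with a prior row-averaged from a training-set co-occurrence matrix over its one-hop neighbors; the exact form of the priors and the blending rule in the paper may differ.

```python
import numpy as np

def rescore_node(pred_probs: np.ndarray,
                 neighbor_labels: list,
                 cooccurrence: np.ndarray,
                 alpha: float = 0.7) -> np.ndarray:
    """Blend a node's predicted class distribution with a statistical prior.

    pred_probs:    (C,) softmax scores for the node.
    cooccurrence:  (C, C) row-normalized matrix estimating P(node class | neighbor class),
                   counted on the training set (an assumed form of the priors).
    The prior is the average of the rows selected by the one-hop neighbors'
    current labels; alpha controls how much the network prediction is trusted.
    """
    if not neighbor_labels:
        return pred_probs
    prior = cooccurrence[neighbor_labels].mean(axis=0)   # (C,)
    rescored = alpha * pred_probs + (1.0 - alpha) * prior
    return rescored / rescored.sum()

# Toy usage with 4 classes and two neighbors labeled 2 and 3.
rng = np.random.default_rng(0)
cooc = rng.random((4, 4)); cooc /= cooc.sum(axis=1, keepdims=True)
probs = np.array([0.1, 0.2, 0.3, 0.4])
print(rescore_node(probs, neighbor_labels=[2, 3], cooccurrence=cooc))
```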
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Chandan Yeshwanth
Technical University of Munich
Dávid Rozenberszki
Technical University of Munich
Angela Dai
Technical University of Munich
Abstract
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail and do not capture fine-grained details of the parts of objects. In order to produce varying levels of detail capturing both coarse object-level information and detailed part-level descriptions, we propose the task of expressive 3D captioning. Given an input 3D scene, the task is to describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage consistency between the multiple levels of descriptions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level details generated by ExCap3D are more expressive than those produced by state-of-the-art methods, with a CIDEr score improvement of 17% and 124% for object- and part-level details, respectively. Our code, dataset and models will be made publicly available.
LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
Xunpeng Yi
Electronic Information School, Wuhan University
Yibing Zhang
Electronic Information School, Wuhan University
Xinyu Xiang
Electronic Information School, Wuhan University
Qinglong Yan
Electronic Information School, Wuhan University
Han Xu
School of Automation, Southeast University
Jiayi Ma
Electronic Information School, Wuhan University
Abstract
Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability on real-time fusion devices. In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. First, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well suited to multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization-based LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time of current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even on low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.
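To show why look-up-table fusion is fast at inference, here is a toy learnable 2D LUT in which the quantized (visible, infrared) intensity pair indexes a fusion weight; the real LUT-Fuse structure additionally uses a joint contextual scene encoding and is distilled from MM-Net, so this is only a schematic stand-in with assumed bin counts and a placeholder teacher.

```python
import torch
import torch.nn as nn

class TinyFusionLUT(nn.Module):
    """A minimal learnable 2D look-up table for IR/visible fusion.

    The quantized (visible, infrared) intensity pair indexes a table of fusion
    weights, so inference is a memory lookup plus a blend.
    """
    def __init__(self, bins: int = 17):
        super().__init__()
        self.bins = bins
        self.table = nn.Parameter(torch.zeros(bins, bins))  # raw logits of the weight

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, 1, H, W) intensities in [0, 1]
        iv = (vis * (self.bins - 1)).round().long().clamp(0, self.bins - 1)
        ii = (ir * (self.bins - 1)).round().long().clamp(0, self.bins - 1)
        w = torch.sigmoid(self.table)[iv, ii]                # (B, 1, H, W) in (0, 1)
        return w * vis + (1.0 - w) * ir

# Toy usage: distill toward a placeholder "teacher" fusion output with an L1 loss.
lut = TinyFusionLUT()
vis, ir = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
teacher = 0.6 * vis + 0.4 * ir                               # stand-in for MM-Net output
loss = (lut(vis, ir) - teacher).abs().mean()
loss.backward()
```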
ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
Xiangyu Yin
University of Pittsburgh
Boyuan Yang
University of Pittsburgh
Weichen Liu
University of Pittsburgh
Qiyao Xue
University of Pittsburgh
Abrar Alamri
University of Pittsburgh
Goeran Fiedler
University of Pittsburgh
Wei Gao
University of Pittsburgh
Abstract
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. The ProGait dataset is available at https://huggingface.co/datasets/ericyxy98/ProGait, and the source code of our benchmark tasks is available at https://github.com/pittisl/ProGait.
MOVE: Motion-Guided Few-Shot Video Object Segmentation
Kaining Ying
Fudan University
Hengrui Hu
Fudan University
Henghui Ding
Fudan University
Abstract
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion-Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
Haiyang Ying
University of Maryland, College Park
Matthias Zwicker
University of Maryland, College Park
Abstract
Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image loss can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations to reduce redundant edges and apply them along with the sketch optimization, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
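The core primitive, sampling points along a parametric sketch so that image-space gradients can reach its control points, can be sketched as follows for a cubic Bezier curve; the scale and opacity attributes and the Gaussian rasterizer itself are omitted, so this is only an illustrative fragment of the pipeline.

```python
import torch

def sample_cubic_bezier(ctrl: torch.Tensor, n_samples: int = 32) -> torch.Tensor:
    """Sample 3D points along a cubic Bezier sketch.

    ctrl: (4, 3) control points defining one parametric edge. In a
    SketchSplat-style pipeline each sampled point would be splatted as a
    Gaussian onto 2D edge images, so image-space gradients can flow back
    to the control points.
    """
    t = torch.linspace(0.0, 1.0, n_samples, dtype=ctrl.dtype).unsqueeze(1)  # (S, 1)
    b0 = (1 - t) ** 3
    b1 = 3 * (1 - t) ** 2 * t
    b2 = 3 * (1 - t) * t ** 2
    b3 = t ** 3
    return b0 * ctrl[0] + b1 * ctrl[1] + b2 * ctrl[2] + b3 * ctrl[3]        # (S, 3)

# Toy usage: gradients reach the control points through the sampled points.
ctrl = torch.randn(4, 3, requires_grad=True)
pts = sample_cubic_bezier(ctrl)
pts.sum().backward()
```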
Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Seungju Yoo
Yonsei University
Hyuk Kwon
Yonsei University
Joong-Won Hwang
ETRI
Kibok Lee
Yonsei University
Abstract
Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
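A minimal reading of the two PCR ingredients can be sketched as follows: spatial consistency is measured as the IoU between each kept (post-NMS) box and its overlapping pre-NMS candidates, and reliability as the mean confidence of those candidates. The final aggregation into a single score is an assumption, not the paper's exact formula.

```python
import numpy as np

def iou_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) boxes in xyxy format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (a[:, 2:] - a[:, :2]).prod(axis=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def pcr_score(pre_nms_boxes, pre_nms_scores, post_nms_boxes, iou_thr: float = 0.5):
    """Label-free detection-quality proxy in the spirit of PCR (illustrative)."""
    ious = iou_matrix(post_nms_boxes, pre_nms_boxes)          # (K, N)
    consistencies, reliabilities = [], []
    for k in range(post_nms_boxes.shape[0]):
        overlap = ious[k] >= iou_thr
        if overlap.any():
            consistencies.append(ious[k, overlap].mean())     # spatial consistency
            reliabilities.append(pre_nms_scores[overlap].mean())  # confidence reliability
    if not consistencies:
        return 0.0
    return float(np.mean(consistencies) * np.mean(reliabilities))

# Toy usage: three candidates collapse to one kept box after NMS.
pre = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [200, 200, 240, 240]], float)
scores = np.array([0.9, 0.8, 0.3])
post = np.array([[10, 10, 50, 50]], float)
print(pcr_score(pre, scores, post))
```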
S4M: Boosting Semi-Supervised Instance Segmentation with SAM
Heeji Yoon
KAIST AI
Heeseong Shin
KAIST AI
Eunbeen Hong
KAIST AI
Hyunwook Choi
Korea University
Hansang Cho
Samsung Electro-Mechanics
Daun Jeong
Samsung Electro-Mechanics
Seungryong Kim
KAIST AI
Abstract
Semi-supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher-student frameworks still suffer from performance constraints due to unreliable pseudo-label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM to this task introduces challenges such as class-agnostic predictions and potential over-segmentation. To address these complexities, we carefully integrate SAM into the semi-supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo-label refinement as well as a specialized data augmentation with the refined pseudo-labels, resulting in superior performance. We establish state-of-the-art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.
MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction
Yusuke Yoshiyasu
National Institute of Advanced Industrial Science and Technology (AIST)
Leyuan Sun
National Institute of Advanced Industrial Science and Technology (AIST)
Ryusuke Sagawa
National Institute of Advanced Industrial Science and Technology (AIST)
Abstract
In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
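The serialization step can be illustrated in a few lines: vertices are grouped by body-part label and ordered within each part by template coordinates, producing a fixed permutation applied to every mesh. The concrete sort keys below are assumptions consistent with the description above, not the paper's exact recipe.

```python
import numpy as np

def serialize_vertices(template_xyz: np.ndarray, part_labels: np.ndarray) -> np.ndarray:
    """Return a permutation serializing mesh vertices into an SSM-friendly order.

    Vertices are grouped by body-part label and, within each part, ordered by
    the template's z, then y, then x coordinate, so nearby vertices of the same
    part become contiguous tokens.
    """
    z, y, x = template_xyz[:, 2], template_xyz[:, 1], template_xyz[:, 0]
    # np.lexsort sorts by the *last* key first, so part_labels is the primary key.
    return np.lexsort((x, y, z, part_labels))

# Toy usage: the same permutation is applied to every mesh in the dataset.
template = np.random.rand(10_000, 3).astype(np.float32)
parts = np.random.randint(0, 24, size=10_000)            # e.g. SMPL-style part ids
order = serialize_vertices(template, parts)
serialized = template[order]                              # (10000, 3) tokens in SSM order
```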
BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
Yuanhong Yu
Zhejiang University
Xingyi He
Zhejiang University
Chen Zhao
EPFL
Junhao Yu
Chongqing University
Jiaqi Yang
Northwestern Polytechnical University
Ruizhen Hu
Shenzhen University
Yujun Shen
Ant Group
Xing Zhu
Ant Group
Xiaowei Zhou
Zhejiang University
Sida Peng
Zhejiang University
Abstract
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce the corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As semantic points of the object, the corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.
DADet: Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection
Hongwei Yu
University of Science and Technology Beijing
Xinlong Ding
University of Science and Technology Beijing
Jiawei Li
University of Science and Technology Beijing
Jinlong Wang
University of Science and Technology Beijing
Yudong Zhang
Tsinghua University
Rongquan Wang
University of Science and Technology Beijing
Huimin Ma
University of Science and Technology Beijing
Jiansheng Chen
University of Science and Technology Beijing
Abstract
While image conditional diffusion models demonstrate impressive generation capabilities, they exhibit high vulnerability when facing backdoor and adversarial attacks. In this paper, we define a scenario named diffusion anomaly where the generated results of a reverse process under attack deviate significantly from the normal ones. By analyzing the underlying formation mechanism of the diffusion anomaly, we reveal how perturbations are amplified during the reverse process and accumulated in the results. Based on the analysis, we reveal the phenomena of divergence and homogeneity, which cause the diffusion process to deviate significantly from the normal process and to decline in diversity. Leveraging these two phenomena, we propose a method named Diffusion Anomaly Detection (DADet) to effectively detect both backdoor and adversarial attacks. Extensive experiments demonstrate that our proposal achieves excellent defense performance against backdoor and adversarial attacks. Specifically, for the backdoor attack detection, our method achieves an F1 score of 99% on different datasets, including MS COCO and CIFAR-10. For the detection of adversarial samples, the F1 score exceeds 84% across three adversarial attacks and two different tasks, evaluated on the MS COCO and Places365 datasets, respectively.
DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
Rui Yu
East China University of Science and Technology
Xianghang Zhang
SenseAuto Research
Runkai Zhao
The University of Sydney
Huaicheng Yan
East China University of Science and Technology
Meng Wang
East China University of Science and Technology
Abstract
End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge-distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at https://github.com/YuruiAI/DistillDrive
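One simple way to use diversified teacher planning instances as multi-objective targets is a winner-takes-all imitation loss, sketched below; DistillDrive's full training additionally involves reinforcement learning and generative latent modeling, so this covers only the imitation part and its details are assumed.

```python
import torch

def multi_mode_distillation_loss(student_modes: torch.Tensor,
                                 teacher_plans: torch.Tensor) -> torch.Tensor:
    """Winner-takes-all imitation of diversified teacher planning instances.

    student_modes: (B, M, T, 2) M candidate ego trajectories from the student.
    teacher_plans: (B, K, T, 2) K diversified plans from the teacher model.
    Each teacher plan supervises its closest student mode (an assumed reading
    of "diversified planning instances as multi-objective learning targets").
    """
    # Pairwise average displacement error between student modes and teacher plans.
    diff = student_modes[:, :, None] - teacher_plans[:, None]   # (B, M, K, T, 2)
    ade = diff.norm(dim=-1).mean(dim=-1)                        # (B, M, K)
    return ade.min(dim=1).values.mean()                         # best mode per teacher plan

# Toy usage: 6 student modes distilled toward 3 teacher plans over 8 time steps.
student = torch.randn(4, 6, 8, 2, requires_grad=True)
teacher = torch.randn(4, 3, 8, 2)
loss = multi_mode_distillation_loss(student, teacher)
loss.backward()
```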
Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
Abstract
We present ViTaM-D, a novel visual-tactile framework for reconstructing dynamic hand-object interaction with distributed tactile sensing to enhance contact modeling. Existing methods, relying solely on visual inputs, often fail to capture occluded interactions and object deformation. To address this, we introduce DF-Field, a distributed force-aware contact representation leveraging kinetic and potential energy in hand-object interactions. ViTaM-D first reconstructs interactions using a visual network with a contact constraint, then refines contact details through force-aware optimization, improving object deformation modeling. To evaluate deformable object reconstruction, we introduce the HOT dataset, featuring 600 hand-object interaction sequences in a high-precision simulation environment. Experiments on the DexYCB and HOT datasets show that ViTaM-D outperforms state-of-the-art methods in reconstruction accuracy for both rigid and deformable objects. DF-Field also proves more effective in refining hand poses and enhancing contact modeling than previous refinement methods.
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Chuang Yu
Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences
Jinmiao Zhao
Shenyang Institute of Automation, Chinese Academy of Sciences
Yunpeng Liu
Shenyang Institute of Automation, Chinese Academy of Sciences
Sicheng Zhao
Tsinghua University
Yimian Dai
Nankai University
Xiangyu Yue
MMLab, The Chinese University of Hong Kong
Abstract
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting the embedded network's performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework, which drives existing SIRST detection networks to progressively and actively recognize and learn harder samples. Specifically, to avoid an early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model acquire basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework achieve state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code is available at https://github.com/YuChuang1205/PAL
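The decay factor balancing expansion and contraction of pseudo-labels could look roughly like the sketch below; the update rule, thresholds, and soft-label representation are illustrative assumptions and not the authors' PAL code.

```python
# An illustrative pseudo-label update with a decay factor that balances
# expansion (adding confident new pixels) against contraction (fading stale ones).
import numpy as np

def update_pseudo_label(prev_label, pred_prob, decay=0.9, expand_thr=0.7, keep_thr=0.3):
    """prev_label: (H, W) soft pseudo-label in [0, 1].
       pred_prob:  (H, W) current network prediction in [0, 1]."""
    # Decay the previous label so regions no longer supported by the model fade out.
    decayed = decay * prev_label
    # Expand with confidently predicted pixels.
    expanded = np.maximum(decayed, (pred_prob > expand_thr) * pred_prob)
    # Contract: drop pixels whose support has fallen below the keep threshold.
    expanded[expanded < keep_thr] = 0.0
    return expanded
```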
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Kelin Yu
University of Maryland, College Park
Sheng Zhang
University of Maryland, College Park
Harshit Soora
University of Maryland, College Park
Furong Huang
University of Maryland, College Park
Heng Huang
University of Maryland, College Park
Pratap Tokekar
University of Maryland, College Park
Ruohan Gao
University of Maryland, College Park
Abstract
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GENFLOWRL, which derives shaped rewards from generated flow trained on diverse cross-embodiment datasets. This enables learning generalizable and robust policies from diverse demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GENFLOWRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios. Our Project Page: https://colinyu1.github.io/genflowrl/.
Language Driven Occupancy Prediction
Zhu Yu
Zhejiang University
Bowen Pang
Zhejiang University
Lizhe Liu
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Runmin Zhang
Zhejiang University
Qiang Li
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Si-Yuan Cao
Ningbo Global Innovation Center, Zhejiang University
Maochun Luo
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Mingxia Chen
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Sheng Yang
Unmanned Vehicle Dept., CaiNiao Inc., Alibaba Group
Hui-Liang Shen
Zhejiang University
Abstract
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset.
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Yating Yu
Northwestern Polytechnical University
Congqi Cao
Northwestern Polytechnical University
Yifan Zhang
Institute of Automation, Chinese Academy of Sciences
Yanning Zhang
Northwestern Polytechnical University
Abstract
Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for in-context generalization in open-vocabulary action recognition. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective on generalization, Open-MeDe adopts a meta-learning approach to improve 'known-to-open generalizing' and 'image-to-video debiasing' in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, the absence of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios. Code is released at https://github.com/Mia-YatingYu/Open-MeDe.
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Mark Yu
ARC Lab, Tencent PCG
Wenbo Hu
The Chinese University of Hong Kong
Jinbo Xing
The Chinese University of Hong Kong
Ying Shan
The Chinese University of Hong Kong
Abstract
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets through our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.
ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching
Yuxuan Yuan
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Luyao Tang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Yixin Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Chaoqi Chen
Shenzhen University
Yue Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Xinghao Ding
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract
Although existing Single-Domain Generalized Object Detection (Single-DGOD) methods enable models to generalize to unseen domains, most assume that the training and testing data share the same label space. In real-world scenarios, unseen domains often introduce previously unknown objects, a challenge that has been largely overlooked. In this paper, we tackle the practical problem of Single-domain Generalizable Open-Set Object Detection (SG-OSOD), which addresses both unseen domains and unknown classes. We identify two key challenges: (1) detecting unknown classes with only known-class data, and (2) learning robust features to mitigate domain shift. To address these challenges, we propose the framework termed ASGS, which leverages adaptive subgraph structures to enhance the understanding of unknown scenes and classes. ASGS consists of Subgraph-wise Unknown-class Learning (SUL) and Class-wise Embedding Compaction (CEC). SUL employs non-parametric methods to detect unknown samples and performs Adaptive Subgraph Searching (ASS) for high-order structural feature extraction, enabling domain-robust unknown class learning. Moreover, the CEC module enhances class discrimination robustness through contrastive learning, which results in more compact class clusters in unknown scenarios. Experimental results demonstrate the effectiveness of the proposed ASGS.
CAT: A Unified Click-and-Track Framework for Realistic Tracking
Yongsheng Yuan
Dalian University of Technology, China
Jie Zhao
Dalian University of Technology, China
Dong Wang
Dalian University of Technology, China
Huchuan Lu
Dalian University of Technology, China
Abstract
Modern visual trackers have achieved robust performance with precisely initialized target bounding boxes. However, providing high-precision initial annotations is a process both labor-intensive and error-prone in real-world scenarios. Interactive initialization (e.g., click-based, scribble-based) presents a more practical alternative. In this paper, we introduce a unified Click-and-Track (CAT) framework for full-process tracking, eliminating the need for auxiliary models or complex initialization pipelines. We present a novel fine-tuning paradigm that bridges the information gap inherent in click-based initialization through two key innovations: 1) The proposed click-based localization and joint spatial-visual prompt refinement are sequentially performed to compensate for the geometric information loss (e.g., boundary ambiguity, shape uncertainty) inherent in click-based initialization. 2) We design a parameter-efficient module called CTMoE to leverage the tracker's inherent capabilities when fine-tuning. The proposed CTMoE enables the foundation model to learn different matching patterns, unifying click-based initialization and tracking within a unified architecture. Extensive experimental results demonstrate state-of-the-art performance of our click-based tracking method on the LaSOT benchmark (70.5% AUC) while maintaining parameter efficiency, surpassing existing click-based tracking frameworks by a large margin and even outperforming some bounding-box-initialized trackers. The code and models are available at https://github.com/ysyuann/CAT.
Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
Zhensheng Yuan
Jinan University
Haozhi Huang
Jinan University
Zhen Xiong
Jinan University
Di Wang
Jinan University
Guanghua Yang
Jinan University
Abstract
We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.
Scaling 3D Compositional Models for Robust Classification and Pose Estimation
Xiaoding Yuan
Johns Hopkins University
Guofeng Zhang
Johns Hopkins University
Prakhar Kaushik
Johns Hopkins University
Artur Jesslen
University of Freiburg
Adam Kortylewski
University of Freiburg
Alan Yuille
Johns Hopkins University
Abstract
Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. Recently, a class of Neural Mesh Models have been developed where objects are represented in terms of 3D meshes with learned features at the vertices. These models have shown robustness in small-scale settings, involving 10 objects, but it is unclear whether they can be scaled up to hundreds of object classes. The main problem is that their training involves contrastive learning among the vertices of all object classes, which scales quadratically with the number of classes. We present a strategy which exploits the compositionality of the objects, i.e., the independence of the feature vectors of the vertices, which greatly reduces the training time while also improving the performance of the algorithms. We first restructure the per-vertex contrastive learning into contrasting within class and between classes. Then we propose a process that dynamically decouples the contrast between classes which are rarely confused, and enhances the contrast between the vertices of classes that are most confused. Our large-scale 3D compositional model not only achieves state-of-the-art performance on the task of predicting classification and pose estimation simultaneously, surpassing Neural Mesh Models and standard DNNs, but is also more robust to out-of-distribution testing including occlusion, weather conditions, synthetic data, and generalization to unknown classes.
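A hedged sketch of the restructured contrastive scheme is shown below: within-class contrast keeps a class's vertex features distinct, while between-class contrast is applied only to class pairs whose estimated confusion exceeds a threshold; the specific penalties, tensor shapes, and confusion threshold are assumptions for illustration.

```python
# Illustrative within-class / confused-pair-only between-class contrast.
import torch
import torch.nn.functional as F

def compositional_contrast(vertex_feats, confusion, conf_thr=0.05):
    """vertex_feats: (C, V, D) learned features for V vertices of each of C classes.
       confusion:    (C, C) estimated class-confusion rates."""
    C, V, _ = vertex_feats.shape
    f = F.normalize(vertex_feats, dim=-1)
    eye = torch.eye(V)

    # Within-class: discourage different vertices of the same class from collapsing together.
    within = 0.0
    for c in range(C):
        sim = f[c] @ f[c].t()
        within = within + ((sim - eye).clamp(min=0.0) ** 2).mean()

    # Between-class: only contrast vertices of class pairs that are frequently confused.
    between, n_pairs = 0.0, 0
    for a in range(C):
        for b in range(C):
            if a != b and confusion[a, b] > conf_thr:
                between = between + (f[a] @ f[b].t()).clamp(min=0.0).pow(2).mean()
                n_pairs += 1
    return within / C + between / max(n_pairs, 1)
```

Restricting the between-class term to confused pairs is what avoids the quadratic blow-up in the number of class comparisons as C grows.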
Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos
Chengbo Yuan
Institute for Interdisciplinary Information Sciences, Tsinghua University
Geng Chen
Shanghai Qi Zhi Institute
Li Yi
Institute for Interdisciplinary Information Sciences, Tsinghua University
Yang Gao
Institute for Interdisciplinary Information Sciences, Tsinghua University
Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce EgoMono4D, a novel model that unifies the estimation of multiple variables necessary for Egocentric Monocular 4D reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point-cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point-cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. Interactive visualizations, code, and trained models are released at https://egomono4d.github.io/.
WalkVLM: Aid Visually Impaired People Walking by Vision Language Model
Abstract
Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), applying VLMs to offer walking guidance has become popular. However, the existing methods of walking guidance are mainly based on self-curated question-answering datasets that are not publicly accessible, without a standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, making VLMs struggle due to excessive responses and low inference efficiency. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems that help visually impaired individuals walk. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for the blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs.
MOSCATO: Predicting Multiple Object State Change Through Actions
Parnian Zameni
Northeastern University
Yuhan Shen
Northeastern University
Ehsan Elhamifar
Northeastern University
Abstract
We introduce MOSCATO: a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. While prior work in object state prediction has typically focused on a single object undergoing one or a few state changes, real-world tasks require tracking many objects whose states evolve over multiple actions. Given the high cost of gathering framewise object-state labels for many videos, we develop a weakly-supervised multiple object state prediction framework, which only uses action labels during training. Specifically, we propose a novel Pseudo-Label Acquisition (PLA) pipeline that integrates large language models, vision-language models, and action segment annotations to generate fine-grained, per-frame object-state pseudo-labels for training a Multiple Object State Prediction (MOSP) network. We further devise a State-Action Interaction (SAI) module that explicitly models the correlations between actions and object states, thereby improving MOSP. To facilitate comprehensive evaluation, we create the MOSCATO benchmark by augmenting three egocentric video datasets with framewise object-state annotations. Experiments show that our multi-stage pseudo-labeling approach and SAI module significantly boost performance over zero-shot VLM baselines and naive extensions of existing methods, underscoring the importance of holistic action-state modeling for fine-grained procedural video understanding.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Tatiana Zemskova
AIRI
Dmitry Yudin
AIRI
Abstract
A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
Yuhui Zeng
Xiamen University
Haoxiang Wu
Xiamen University
Wenjie Nie
Xiamen University
Guangyao Chen
Peking University
Xiawu Zheng
Peking University
Yunhang Shen
Tencent Youtu Lab
Jun Peng
Xiamen University
Yonghong Tian
Peking University
Rongrong Ji
Xiamen University
Abstract
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism exploring relationship patterns among detected entities and (ii) an LLM-guided strategy that guides the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains. We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities (75% AUROC, +8.36% improvement), construction safety violations (+15.77%), and abnormal crowd behaviors (+23.16%). Code is available here.
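To illustrate how a discovered symbolic rule can turn detector outputs into an event decision, here is a toy sketch with a hand-written rule for the illegal-fishing example; in the paper such rules are found via LLM-guided symbolic regression, so the predicate, labels, and distance threshold below are purely hypothetical.

```python
# Toy example: evaluate a symbolic rule over open-vocabulary detector outputs.
from dataclasses import dataclass

@dataclass
class Det:
    label: str
    box: tuple  # (x1, y1, x2, y2)

def center(b):
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def near(a: Det, b: Det, max_dist=100.0):
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < max_dist

def illegal_fishing_rule(dets):
    """Fires when a person and a fishing net are both close to the same boat (hypothetical rule)."""
    boats = [d for d in dets if d.label == "boat"]
    people = [d for d in dets if d.label == "person"]
    nets = [d for d in dets if d.label == "net"]
    return any(near(p, b) and near(n, b) for b in boats for p in people for n in nets)

dets = [Det("boat", (0, 0, 80, 40)), Det("person", (60, 10, 90, 60)), Det("net", (30, 20, 70, 50))]
print(illegal_fishing_rule(dets))  # True
```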
GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
Lin Zeng
Zhejiang University
Boming Zhao
Zhejiang University
Jiarui Hu
Zhejiang University
Xujie Shen
Zhejiang University
Ziqiang Dang
Zhejiang University
Hujun Bao
Zhejiang University
Zhaopeng Cui
Zhejiang University
Abstract
Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. The experiments on the benchmark dataset demonstrate that our method achieves superior and real-time rendering with the capability of visualizing changes over different times. Please refer to our project webpage for more information: https://zju3dv.github.io/GaussianUpdate.
AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction
Xuying Zhang
Nankai University
Yupeng Zhou
Nankai University
Kai Wang
Nankai University
Yikai Wang
Tsinghua University
Zhen Li
Nankai University
Shaohui Jiao
ByteDance Inc.
Daquan Zhou
ByteDance Inc.
Qibin Hou
Nankai University
Ming-Ming Cheng
Nankai University
Abstract
Novel view synthesis (NVS) is a cornerstone for image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority, based on our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
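A schematic sketch of the next-view prediction loop follows: target poses are ordered by angular distance to the input view and synthesized one at a time, with previously generated views appended to the conditioning context; `generate_view` is a placeholder for the conditional diffusion call and is not a real API.

```python
# Near-to-far autoregressive view generation (schematic).
import numpy as np

def angular_distance(R_a, R_b):
    """Rotation angle between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def generate_views_near_to_far(input_img, input_pose, target_poses, generate_view):
    """target_poses: list of 3x3 rotations; generate_view(context, pose) is a placeholder."""
    order = sorted(range(len(target_poses)),
                   key=lambda i: angular_distance(input_pose, target_poses[i]))
    context = [(input_img, input_pose)]          # generated views accumulate as context
    outputs = [None] * len(target_poses)
    for i in order:
        view = generate_view(context, target_poses[i])
        outputs[i] = view
        context.append((view, target_poses[i]))  # closer views condition farther ones
    return outputs
```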
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
Youliang Zhang
Tsinghua University
Ronghui Li
Tsinghua University
Yachao Zhang
Xiamen University
Liang Pan
The University of Hong Kong
Jingbo Wang
Shanghai AI Laboratory
Yebin Liu
Tsinghua University
Xiu Li
Tsinghua University
Abstract
Extracting physically plausible 3D human motion from videos is a critical task. Although existing simulation-based motion imitation methods can enhance the physical quality of daily motions estimated from monocular video capture, extending this capability to high-difficulty motions remains an open challenge. This can be attributed to some flawed motion clips in video-based motion capture results and the inherent complexity in modeling high-difficulty motions. Therefore, recognizing the advantage of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video masks to repair flawed motions, and we propose a physics-based motion transfer module (PTM), which employs a prior-injected pretrain-and-adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. Our approach is designed as a plug-and-play module to physically refine video motion capture results, and it also excels in motion generation tasks. Finally, we collected a challenging in-the-wild test set to establish a benchmark, and our method has demonstrated effectiveness on both the new benchmark and existing public datasets. Our project page is: https://physicalmotionrestoration.github.io/
AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
Ruifei Zhang
The Chinese University of Hong Kong, Shenzhen
Junlin Xie
The Chinese University of Hong Kong, Shenzhen
Wei Zhang
Baidu Inc.
Weikai Chen
Guangdong Key Laboratory of Big Data Analysis and Processing
Xiao Tan
Baidu Inc.
Xiang Wan
Shenzhen Research Institute of Big Data
Guanbin Li
Sun Yat-sen University
Abstract
Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency. Code is available at https://github.com/ReaFly/AdaDrive.
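The slow-fast gating idea can be sketched roughly as below, where a scalar gate derived from scene complexity and planner confidence decides whether the LLM branch is invoked and how strongly its plan is blended in; the gating features, sigmoid form, and blend rule are illustrative assumptions, not AdaDrive's learned activation loss.

```python
# A hedged sketch of adaptive LLM activation and continuous plan fusion.
import torch

def adaptive_fuse(fast_plan, scene_complexity, planner_conf, llm_fn, act_thr=0.5):
    """fast_plan: (T, 2) trajectory from the conventional (fast) planner.
       scene_complexity, planner_conf: floats in [0, 1].
       llm_fn: placeholder callable for the slow LLM branch."""
    gate = torch.sigmoid(torch.as_tensor(4.0 * (scene_complexity - planner_conf),
                                         dtype=torch.float32))  # higher = more LLM help
    if gate.item() < act_thr:
        return fast_plan                        # skip the LLM entirely in easy scenes
    llm_plan = llm_fn(fast_plan)                # slow branch, invoked only when needed
    return (1.0 - gate) * fast_plan + gate * llm_plan
```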
Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
Xiaojie Zhang
Beijing University of Posts and Telecommunications
Yuanfei Wang
Peking University
Ruihai Wu
Peking University
Kunqi Xu
Peking University
Yu Li
Beijing University of Posts and Telecommunications
Liuyu Xiang
Beijing University of Posts and Telecommunications
Hao Dong
Peking University
Zhaofeng He
Beijing University of Posts and Telecommunications
Abstract
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG's strong generalization ability across novel articulated object categories.
Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
Runmin Zhang
Zhejiang University
Zhu Yu
Zhejiang University
Si-Yuan Cao
Ningbo Global Innovation Center, Zhejiang University
Lingyu Zhu
City University of Hong Kong
Guangyi Zhang
Zhejiang University
Xiaokai Bai
Zhejiang University
Hui-Liang Shen
Zhejiang University
Abstract
This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
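As a minimal sketch of the sparse volume construction step, one could predict a per-voxel occupancy probability and refine only the top-scoring fraction of voxels; the shapes, linear occupancy head, and keep ratio below are assumptions for illustration.

```python
# Select high-occupancy voxels for refinement; free space is skipped.
import torch

def select_voxels_for_refinement(voxel_feats, occ_head, keep_ratio=0.1):
    """voxel_feats: (N, C) features of N voxels in the 3D volume.
       occ_head:    module mapping (N, C) -> (N, 1) occupancy logits."""
    occ_prob = torch.sigmoid(occ_head(voxel_feats)).flatten()       # (N,)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    keep = torch.topk(occ_prob, k).indices
    return keep, voxel_feats[keep]                                  # refine only these voxels

# Example with a simple linear occupancy head.
feats = torch.randn(1000, 64)
head = torch.nn.Linear(64, 1)
idx, sparse_feats = select_voxels_for_refinement(feats, head)
```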
Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
Qingwang Zhang
Shenzhen University
Yingying Zhu
Shenzhen University
Abstract
This paper addresses the limitations of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These 'rectangular shackles' inherently struggle to precisely define objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which are critical for applications like urban planning and agricultural monitoring. We introduce the CVOGL-Seg dataset specifically to support and evaluate the new CVOS scheme. To tackle CVOS challenges, we propose Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained by learning a heterogeneous task. Second, the SAM Prompt Stage (SPS) utilizes SAM's zero-shot segmentation capability, guided by HTTS outputs, to generate precise masks. Extensive experiments on both the CVOGL and CVOGL-Seg datasets demonstrate that our approach achieves state-of-the-art performance, effectively breaking the rectangular shackles and unlocking new possibilities for fine-grained object geo-localization. Our project page: https://zqwlearning.github.io/CVOS.
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang
South China University of Technology
Fagui Liu
Pengcheng Laboratory
Quan Tang
Pengcheng Laboratory
Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features' spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvements across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Codes are available at: https://github.com/zdk258/CorrCLIP.
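The idea of reconstructing the scope and value of patch correlations can be sketched as follows: SAM-style region ids restrict which patches may interact, while self-supervised features supply the similarity values; this is an illustrative re-implementation of the idea, not the released CorrCLIP code.

```python
# Restrict patch correlations to SAM regions, with values from self-supervised features.
import torch
import torch.nn.functional as F

def restricted_patch_correlation(ssl_feats, region_ids, tau=0.1):
    """ssl_feats:  (N, D) self-supervised (e.g., DINO-like) patch features.
       region_ids: (N,)   long tensor, id of the SAM region each patch falls in."""
    f = F.normalize(ssl_feats, dim=-1)
    sim = f @ f.t() / tau                                 # similarity values
    same_region = region_ids.unsqueeze(0) == region_ids.unsqueeze(1)
    sim = sim.masked_fill(~same_region, float('-inf'))    # restrict the interaction scope
    return torch.softmax(sim, dim=-1)                     # row-normalized correlation

# The resulting correlation matrix can then aggregate CLIP value features,
# e.g. out = restricted_patch_correlation(dino_feats, ids) @ clip_values.
```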
Detect Anything 3D in the Wild
Hanxue Zhang
OpenDriveLab at Shanghai AI Laboratory
Haoran Jiang
Fudan University
Qingsong Yao
Stanford University
Yanan Sun
OpenDriveLab at Shanghai AI Laboratory
Renrui Zhang
CUHK MMLab
Hao Zhao
Tsinghua University
Hongyang Li
OpenDriveLab at Shanghai AI Laboratory
Hongzi Zhu
Shanghai Jiao Tong University
Zetong Yang
OpenDriveLab at Shanghai AI Laboratory
Abstract
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which stabilizes early training in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at our code repository.
DiffPCI: Large Motion Point Cloud frame Interpolation with Diffusion Model
Tianyu Zhang
Nankai University
Haobo Jiang
Nanyang Technological University
Jian Yang
Nankai University
Jin Xie
Nanjing University
Abstract
Point cloud interpolation aims to recover intermediate frames for temporally smoothing a point cloud sequence. However, real-world challenges, such as uneven or large scene motions, cause existing methods to struggle with limited interpolation precision. To address this, we introduce DiffPCI, a novel diffusion interpolation model that formulates the frame interpolation task as a progressive denoising diffusion process. Training DiffPCI involves two key stages: a forward interpolation diffusion process and a reverse interpolation denoising process. In the forward process, the clean intermediate frame is progressively transformed into a noisy one through continuous Gaussian noise injection. The reverse process then focuses on training a denoiser to gradually refine this noisy frame back to the ground-truth frame. In particular, we derive a point cloud interpolation-specific variational lower bound as our optimization objective for denoiser training. Furthermore, to alleviate the interpolation error, especially in highly dynamic scenes, we develop a novel full-scale, dual-branch denoiser that enables more comprehensive front-back frame information fusion for robust bi-directional interpolation. Extensive experiments demonstrate that DiffPCI significantly outperforms current state-of-the-art frame interpolation methods (e.g., 27% and 860% reductions in the Chamfer Distance and Earth Mover's Distance on nuScenes).
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
Shengyuan Zhang
Zhejiang University
An Zhao
Zhejiang University
Ling Yang
Peking University
Zejian Li
Zhejiang University
Chenye Meng
Zhejiang University
Haoran Xu
Zhejiang Green Zhixing Technology co., ltd
Tianrun Chen
Zhejiang University
AnYang Wei
Zhejiang Green Zhixing Technology co., ltd
Perry Pengyun GU
Zhejiang Green Zhixing Technology co., ltd
Lingyun Sun
Zhejiang University
Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame (>5x) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our model and code are publicly available on https://github.com/happyw1nd/ScoreLiDAR.
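A hedged sketch of a two-term structural loss is given below, with a scene-wise Chamfer term over the full completion and a point-wise term matching the pairwise configuration of landmark points; the landmark selection, Chamfer form, and weights are assumptions rather than the exact ScoreLiDAR loss.

```python
# Illustrative scene-wise + point-wise structural loss for distillation.
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets (Na, 3) and (Nb, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def structural_loss(student_pts, teacher_pts, landmark_idx, w_scene=1.0, w_point=0.5):
    """student_pts, teacher_pts: (N, 3) completed scenes from distilled / teacher model.
       landmark_idx: indices of key landmark points (shared ordering assumed)."""
    scene_term = chamfer(student_pts, teacher_pts)          # holistic structure
    s, t = student_pts[landmark_idx], teacher_pts[landmark_idx]
    # Relative configuration: match pairwise distances among landmarks.
    point_term = (torch.cdist(s, s) - torch.cdist(t, t)).abs().mean()
    return w_scene * scene_term + w_point * point_term
```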
EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching
Pengjie Zhang
Beijing Institute of Technology
Lin Zhu
Beijing Institute of Technology
Xiao Wang
Anhui University
Lizhi Wang
Beijing Normal University
Hua Huang
Beijing Normal University
Abstract
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching, with many specialized architectures. However, existing works only focus on event data within the confines of task-specific domains, overlooking the correlations between tasks across the temporal and spatial domains. In this paper, we propose a novel matching-based framework for event cameras to estimate flow and disparity simultaneously in a shared representation space, reformulating them as a unified pixel-wise correspondence matching problem. Specifically, our method utilizes a Temporal Recurrent Network to aggregate asynchronous event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared pixel-wise feature similarities module, our network performs optical flow estimation from temporal event segments and stereo matching from spatial event segments simultaneously. Our unified model inherently supports multi-task unification and cross-task transfer, which facilitates training and streamlines deployment. Without the need for retraining on specific tasks, our model can effectively handle both event-based flow and stereo estimation, achieving state-of-the-art performance on both tasks. Our code is publicly available at https://github.com/BIT-Vision/EMatch.
Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration
Sitao Zhang
The Pennsylvania State University
Hongda Mao
Amazon
Qingshuang Chen
Amazon
Yelin Kim
Amazon
Abstract
Visual place recognition is crucial for autonomous navigation and robotic mapping. Current methods struggle with perceptual aliasing and computational inefficiency. We present SemVPR, a novel approach integrating multimodal semantic knowledge into VPR. By leveraging a pre-trained vision-language model as a teacher during the training phase, SemVPR learns local visual and semantic descriptors simultaneously, effectively mitigating perceptual aliasing through semantic-aware aggregation without extra inference cost. The proposed nested descriptor learning strategy generates a series of ultra-compact global descriptors, reduced by approximately 66x compared to state-of-the-art methods, in a coarse-to-fine manner, eliminating the need for offline dimensionality reduction or training multiple models. Extensive experiments across various VPR benchmarks demonstrate that SemVPR consistently outperforms state-of-the-art methods with significantly lower computational costs, rendering its feasibility for latency-sensitive scenarios in real-world applications.
Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
Mingfang Zhang
The University of Tokyo
Ryo Yonetani
CyberAgent AI Lab
Yifei Huang
The University of Tokyo
Liangyang Ouyang
The University of Tokyo
Ruicong Liu
The University of Tokyo
Yoichi Sato
The University of Tokyo
Abstract
This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions captured by the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment with vision-language guidance. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. The learning process is enhanced using concurrently collected vision and language signals to improve multimodal alignment. The learned encoders are then used to reason over the IMU data and the point cloud across time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines. Project page: https://github.com/mf-zhang/Ego-Inertial-Localization.
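The contrastive alignment step could be sketched as a symmetric InfoNCE between paired IMU action-window embeddings and local point-cloud embeddings, as below; the encoders, batching, and temperature are placeholders, not the authors' implementation.

```python
# Symmetric InfoNCE aligning IMU action windows with local point-cloud regions.
import torch
import torch.nn.functional as F

def imu_pointcloud_infonce(imu_emb, pc_emb, tau=0.07):
    """imu_emb, pc_emb: (B, D) paired embeddings
       (the i-th IMU window was recorded at the i-th local region)."""
    a = F.normalize(imu_emb, dim=-1)
    b = F.normalize(pc_emb, dim=-1)
    logits = a @ b.t() / tau                        # (B, B) similarity matrix
    target = torch.arange(a.shape[0])               # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))
```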
Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
Haihao Zhang
Institute of Information Engineering, Chinese Academy of Sciences
Yunjian Zhang
Tsinghua University
Jianing Li
Tsinghua University
Lin Zhu
Beijing Institute of Technology
Meng Lv
Beijing Institute of Technology
Yao Zhu
Tsinghua University
Yanwei Liu
Tsinghua University
Xiangyang Ji
Tsinghua University
Abstract
Accurate stereo matching under fast motion and extreme lighting conditions is a challenge for many vision applications. Event cameras have the advantages of low latency and high dynamic range, thus providing a reliable solution to this challenge. However, since events are sparse, obtaining dense disparity using only events is an ill-posed problem. In this work, we propose a novel framework for event-based dense stereo via cross-sensor knowledge distillation. Specifically, a multi-level intensity-to-event distillation strategy is designed to maximize the potential of long-range information, local texture details, and task-related knowledge of the intensity images. Simultaneously, to enforce cross-view consistency, an intensity-event joint left-right consistency module is proposed. With our framework, extensive dense and structural information contained in intensity images is distilled to the event branch. Therefore, retaining only the events can predict dense disparities during inference, preserving the low latency characteristics of the events. Adequate experiments conducted on the MVSEC and DSEC datasets demonstrate that our method exhibits superior stereo matching performance compared to baselines, both quantitatively and qualitatively.
Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention
Shiwei Zhang
Xi'an Jiaotong University
Qi Zhou
Xi'an Jiaotong University
Wei Ke
Pengcheng Laboratory
Abstract
Text-guided zero-shot object counting leverages vision-language models (VLMs) to count objects of an arbitrary class given by a text prompt. Existing approaches for this challenging task only utilize local patch-level features to fuse with the text feature, ignoring the important influence of the global image-level feature. In this paper, we propose a universal strategy that can exploit both local patch-level features and the global image-level feature simultaneously. Specifically, to improve the localization ability of VLMs, we propose Text-guided Local Ranking. Based on the prior knowledge that foreground patches have higher similarity with the text prompt, a new local-text rank loss is designed to increase the differences between the similarity scores of foreground and background patches, pushing foreground and background patches apart. To enhance the counting ability of VLMs, Number-evoked Global Attention is introduced to first align the global image-level feature with multiple number-conditioned text prompts. Then, the one with the highest similarity is selected to compute cross-attention with the global image-level feature. Through extensive experiments on widely used datasets and methods, the proposed approach has demonstrated superior advancements in performance, generalization, and scalability. Furthermore, to better evaluate text-guided zero-shot object counting methods, we propose a dataset named ZSC-8K, which is larger and more challenging, to establish a new benchmark. Codes and dataset are released at https://github.com/zaqai/LGCount.
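The two ingredients can be sketched as follows: a margin-based local-text rank loss that pushes foreground patch-text similarities above background ones, and a selection of the number-conditioned prompt most similar to the global feature; the margin value and the foreground/background split are assumptions, not the paper's exact formulation.

```python
# Illustrative local-text ranking loss and number-prompt selection.
import torch
import torch.nn.functional as F

def local_text_rank_loss(patch_feats, text_feat, fg_mask, margin=0.2):
    """patch_feats: (N, D) patch-level features; text_feat: (D,); fg_mask: (N,) bool."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)   # (N,)
    fg, bg = sim[fg_mask], sim[~fg_mask]
    # Every foreground patch should beat every background patch by the margin.
    diff = margin - (fg.unsqueeze(1) - bg.unsqueeze(0))
    return diff.clamp(min=0.0).mean()

def select_number_prompt(global_feat, number_text_feats):
    """Pick the number-conditioned prompt closest to the global image feature."""
    sim = F.normalize(number_text_feats, dim=-1) @ F.normalize(global_feat, dim=-1)
    return int(sim.argmax())
```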
Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation
Shaobo Zhang
Northwest University
Yuhang Huang
National University of Defense Technology
Wanqing Zhao
Northwest University
Wei Zhao
Xidian University
Ziyu Guan
Northwest University
Jinye Peng
Northwest University
Abstract
This paper introduces EA6D, a novel diffusion-based framework for 6D pose estimation that operates effectively in any environment. Traditional pose estimation methods struggle with the variability and complexity of real-world scenarios, often leading to overfitting on controlled datasets and poor generalization to new scenes. To address these challenges, we propose a generative pose estimation paradigm that generates environment-independent object representations for pose estimation, which are robust to environmental variations such as illumination, occlusion, and background clutter. Specifically, we propose the novel Environment Decoupling Diffusion Model (EDDM), which separates object representations from environmental factors while enabling efficient few-step sampling by leveraging input image priors instead of pure noise initialization. We validate our approach on four standard benchmarks and a self-made dataset, DiverseScenes. The results demonstrate that EA6D, trained using only synthetic data, can outperform the state-of-the-art methods with both synthetic and realistic data. In particular, for fair comparisons with synthetic data, we exceed the previous SOTA by 18.1% and 33.5% on the Linemod and Linemod-Occluded datasets respectively. Project page: https://github.com/acmff22/EA6D
Epona: Autoregressive Diffusion World Model for Autonomous Driving
Kaiwen Zhang
Horizon Robotics
Zhenyu Tang
Tsinghua University
Xiaotao Hu
Tsinghua University
Xingang Pan
Nanyang Technological University
Xiaoyang Guo
Horizon Robotics
Yuan Liu
Hong Kong University of Science and Technology
Jingwei Huang
Tencent Hunyuan
Li Yuan
Shenzhen Graduate School, Peking University
Qian Zhang
Horizon Robotics
Xiao-Xiao Long
Nanjing University
Xun Cao
Nanjing University
Wei Yin
Horizon Robotics
Abstract
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with a 7.4% FVD improvement and minutes-longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks.
Function-centric Bayesian Network for Zero-Shot Object Goal Navigation
Sixian Zhang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Xinyao Yu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Xinhang Song
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Yiyao Wang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Shuqiang Jiang
Institute of Intelligent Computing Technology, Suzhou
Abstract
Object goal navigation requires an agent to navigate to a specified target in unseen environments without an explicit map, which demands an understanding of object-scene context to infer the target's location based on partial observations. The function of an object plays a crucial role in its categorization and naming. Analyzing an object's functional role within a given scene enhances the understanding of its contextual relationships, thereby aiding in goal inference. In this paper, we propose the Function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. FBN is designed to uncover the functions that observed objects afford individually or collaboratively with other objects, as well as the functional semantics contained within the observed scenes. The probabilistic directed edges in FBN describe the object-function and scene-function relationships, which are derived by prompting LLMs with the proposed CounterfactCoT. Leveraging FBN with Bayesian inference, the probability of each function group and the probability map of goal occurrence are computed. The waypoint is then selected based on the obtained probability map. Experiments on MP3D and HM3D demonstrate that FBN effectively captures object-scene-function relationships and improves zero-shot ObjectNav performance.
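A much-simplified sketch of the Bayesian aggregation is shown below: object-function and scene-function evidence is combined into a posterior over function groups, which then scores how likely the goal is to co-occur at a candidate location; all probability tables here are hypothetical placeholders (in the paper they come from prompting LLMs with CounterfactCoT).

```python
# Simplified function-posterior and goal-scoring sketch.
def function_posterior(obj_func, scene_func, observed_objs, observed_scene):
    """obj_func:   dict object -> {function: p(function | object)}.
       scene_func: dict scene  -> {function: p(function | scene)}."""
    scores = {}
    for obj in observed_objs:
        for fn, p in obj_func.get(obj, {}).items():
            scores[fn] = scores.get(fn, 0.0) + p
    for fn, p in scene_func.get(observed_scene, {}).items():
        scores[fn] = scores.get(fn, 0.0) + p
    total = sum(scores.values()) or 1.0
    return {fn: s / total for fn, s in scores.items()}  # posterior over function groups

def goal_score(posterior, goal_func):
    """goal_func: {function: p(goal co-occurs with function)}."""
    return sum(posterior.get(fn, 0.0) * p for fn, p in goal_func.items())

post = function_posterior({"oven": {"cooking": 0.9}},
                          {"kitchen": {"cooking": 0.8, "washing": 0.2}},
                          ["oven"], "kitchen")
print(goal_score(post, {"cooking": 0.7}))
```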
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Mengchen Zhang
Zhejiang University
Tong Wu
Stanford University
Jing Tan
The Chinese University of Hong Kong
Ziwei Liu
Nanyang Technological University
Gordon Wetzstein
Stanford University
Dahua Lin
The Chinese University of Hong Kong
Abstract
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.
Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
Ruiyang Zhang
University of Macau, China
Hu Zhang
CSIRO Data61, Australia
Zhedong Zheng
University of Macau, China
Abstract
Unsupervised 3D object detection aims to identify objects of interest from unlabeled raw data, such as LiDAR points. Recent approaches usually adopt pseudo 3D bounding boxes (3D bboxes) from clustering algorithms to initialize the model training. However, pseudo bboxes inevitably contain noise, and such inaccuracies accumulate in the final model, compromising the performance. Therefore, in an attempt to mitigate the negative impact of inaccurate pseudo bboxes, we introduce a new uncertainty-aware framework for unsupervised 3D object detection, dubbed UA3D. In particular, our method consists of two phases: uncertainty estimation and uncertainty regularization. (1) In the uncertainty estimation phase, we incorporate an extra auxiliary detection branch alongside the original primary detector. The prediction disparity between the primary and auxiliary detectors could reflect fine-grained uncertainty at the box coordinate level. (2) Based on the assessed uncertainty, we adaptively adjust the weight of every 3D bbox coordinate via uncertainty regularization, refining the training process on pseudo bboxes. For pseudo bbox coordinates with high uncertainty, we assign a relatively low loss weight. Extensive experiments verify that UA3D is robust against the noisy pseudo bboxes, yielding substantial improvements on nuScenes and Lyft compared to existing approaches, with increases of +3.9% AP_BEV and +1.5% AP_3D on nuScenes, and +2.3% AP_BEV and +1.8% AP_3D on Lyft.
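The following is a minimal sketch, under stated assumptions, of the uncertainty-regularized loss the abstract describes: the per-coordinate gap between the primary and auxiliary detectors is turned into a soft weight that down-weights noisy pseudo-box coordinates. The function name and tensor layout are illustrative, not the official UA3D implementation.

```python
# Minimal sketch (assumptions, not the official UA3D code): per-coordinate
# uncertainty from the primary/auxiliary prediction gap, used to down-weight
# the regression loss on noisy pseudo-box coordinates.
import torch
import torch.nn.functional as F

def uncertainty_weighted_box_loss(primary_pred, auxiliary_pred, pseudo_boxes,
                                  temperature=1.0):
    """
    primary_pred, auxiliary_pred, pseudo_boxes: (N, 7) tensors of
    (x, y, z, w, l, h, yaw) box parameters.
    """
    # Coordinate-level uncertainty: disagreement between the two detectors.
    uncertainty = (primary_pred - auxiliary_pred).abs().detach()
    # Higher uncertainty -> lower weight (softly, via a negative exponential).
    weights = torch.exp(-uncertainty / temperature)
    per_coord_loss = F.smooth_l1_loss(primary_pred, pseudo_boxes, reduction="none")
    return (weights * per_coord_loss).mean()
```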
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Wenyao Zhang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Hongsi Liu
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Bohan Li
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Jiawei He
CASIA
Zekun Qi
Tsinghua University
Yunnan Wang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Shengyang Zhao
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xinqiang Yu
CASIA
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Abstract
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.
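A minimal sketch of the close-vs-distant proxy task mentioned above, assuming a CLIP-style contrastive formulation in which patch features are aligned with "close"/"far" text embeddings; the exact loss used in Hybrid-depth may differ.

```python
# Minimal sketch (assumed formulation per the abstract, not the official code):
# a close-vs-distant proxy task that aligns patch features with "close"/"far"
# text embeddings from CLIP-style encoders.
import torch
import torch.nn.functional as F

def close_distant_proxy_loss(close_patch_feat, far_patch_feat,
                             text_close, text_far, temperature=0.07):
    """close/far patch features: (B, D); text embeddings: (D,)."""
    feats = torch.stack([close_patch_feat, far_patch_feat], dim=1)   # (B, 2, D)
    texts = torch.stack([text_close, text_far], dim=0)               # (2, D)
    feats = F.normalize(feats, dim=-1)
    texts = F.normalize(texts, dim=-1)
    logits = feats @ texts.t() / temperature                         # (B, 2, 2)
    # Close patches should match the "close" prompt (index 0), far ones index 1.
    target = torch.arange(2, device=logits.device).expand(logits.shape[0], 2)
    return F.cross_entropy(logits.reshape(-1, 2), target.reshape(-1))
```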
HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration
Xiyu Zhang
Northwestern Polytechnical University
Jiayi Ma
Wuhan University
Jianwei Guo
Chinese Academy of Sciences
Wei Hu
Peking University
Zhaoshuai Qi
Northwestern Polytechnical University
Fei Hui
Chang'an University
Jiaqi Yang
Northwestern Polytechnical University
Yanning Zhang
Northwestern Polytechnical University
Abstract
Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, posing great challenges for handcrafted geometric constraints to render consistency. To overcome this, we propose HyperGCT, a flexible dynamic Hyper-GNN-learned geometric ConstrainT that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, HyperGCT is robust to graph noise, demonstrating a significant advantage in terms of generalization.
KinMo: Kinematic-aware Human Motion Understanding and Generation
Pengfei Zhang
University of California, Irvine
Pinxin Liu
University of Rochester
Pablo Garrido
Flawless AI
Hyeongwoo Kim
Imperial College, London
Bindita Chaudhuri
Flawless AI
Abstract
Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as 'run', fails to capture details such as variations in speed, limb positioning, and kinematic dynamics, leading to ambiguities between text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical describable motion representation that extends beyond global actions by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset and offering a scalable and cost-efficient solution for dataset enrichment. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment that progressively integrates additional motion details, thereby improving semantic motion understanding. Furthermore, we introduce a coarse-to-fine motion generation procedure to leverage enhanced spatial understanding to improve motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance and enabling more fine-grained motion generation and editing capabilities. Project Page: https://andypinxinliu.github.io/KinMo
Learning Beyond Still Frames: Scaling Vision-Language Models with Video
Yiyuan Zhang
MMLab, CUHK
Handong Li
School of Artificial Intelligence, UCAS
Jing Liu
School of Artificial Intelligence, UCAS
Xiangyu Yue
MMLab, CUHK
Abstract
High-quality image-text data is critical for Vision-Language Models (VLMs), yet traditional image-based pretraining is resource-intensive and fails to capture the temporal dynamics needed for video understanding. To address this, we introduce video pretraining to enhance VLMs with temporal reasoning. We propose Causal Hierarchical Aggregation, a novel method that efficiently processes video by separating computationally heavy spatial encoding from lightweight temporal propagation. This technique builds hierarchical receptive fields, enabling effective learning from large-scale video data. Scaling our method to over 100 billion video tokens, we achieve state-of-the-art performance and high throughput on both image and video understanding tasks (Figure 1). Our approach offers a scalable solution to advance multimodal learning for dynamic contexts. Our code and pretrained models will be released at https://github.com/invictus717/LLaVAPrime.
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
Xinjie Zhang
iComAI Lab, The Hong Kong University of Science and Technology
Zhening Liu
iComAI Lab, The Hong Kong University of Science and Technology
Yifan Zhang
Skywork AI
Xingtong Ge
iComAI Lab, The Hong Kong University of Science and Technology
Dailan He
The Chinese University of Hong Kong
Tongda Xu
Institute for AI Industry Research (AIR), Tsinghua University
Yan Wang
Institute for AI Industry Research (AIR), Tsinghua University
Zehong Lin
iComAI Lab, The Hong Kong University of Science and Technology
Shuicheng Yan
National University of Singapore
Jun Zhang
iComAI Lab, The Hong Kong University of Science and Technology
Abstract
4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190x and 125x on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field. Code is available at https://github.com/Xinjie-Q/MEGA.
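As a sketch of the opacity-based entropy regularizer described above (an assumed formulation, not the released MEGA code), binary entropy on per-Gaussian opacities pushes them toward 0 or 1, so near-transparent Gaussians can be pruned to limit the Gaussian count.

```python
# Minimal sketch, assuming an opacity-entropy regularizer of the kind the
# abstract describes (not the authors' exact formulation): push per-Gaussian
# opacities toward 0 or 1 so that near-transparent Gaussians can be pruned.
import torch

def opacity_entropy_loss(opacity_logits, eps=1e-6):
    """opacity_logits: (N,) raw per-Gaussian opacity parameters."""
    a = torch.sigmoid(opacity_logits).clamp(eps, 1.0 - eps)
    # Binary entropy is maximal at 0.5 and zero at {0, 1}; minimizing it
    # discourages "half-transparent" Gaussians that inflate the model size.
    return -(a * a.log() + (1.0 - a) * (1.0 - a).log()).mean()

def prune_mask(opacity_logits, threshold=0.05):
    """Gaussians whose opacity collapsed below the threshold can be dropped."""
    return torch.sigmoid(opacity_logits) > threshold
```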
Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
Jiahao Zhang
The Australian National University
Anoop Cherian
Mitsubishi Electric Research Labs
Cristian Rodriguez
The Australian Institute for Machine Learning
Weijian Deng
The Australian National University
Stephen Gould
The Australian National University
Abstract
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, thus making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order, and infers the 6D pose of each part by relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Pingrui Zhang
Fudan University
Xianqiang Gao
Shanghai AI Laboratory
Yuhan Wu
University of Science and Technology of China
Kehui Liu
Northwestern Polytechnical University
Dong Wang
Shanghai AI Laboratory
Zhigang Wang
Shanghai AI Laboratory
Bin Zhao
Northwestern Polytechnical University
Yan Ding
Shanghai AI Laboratory
Xuelong Li
TeleAI, China Telecom Corp Ltd
Abstract
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: https://momakitchen.github.io/.
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
Songyan Zhang
Nanyang Technological University, Singapore
Yongtao Ge
Zhejiang University, China
Jinyuan Tian
Zhejiang University, China
Guangkai Xu
Zhejiang University, China
Hao Chen
Zhejiang University, China
Chen Lv
Nanyang Technological University, Singapore
Chunhua Shen
Zhejiang University, China
Abstract
Recent approaches to 3D reconstruction in dynamic scenes primarily rely on the integration of separate geometry estimation and matching modules, where the latter plays a critical role in distinguishing dynamic regions and mitigating the interference caused by moving objects. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the pointmap representation proposed in DUSt3R has suggested a potential solution to unify both geometry estimation and matching in 3D space, effectively reducing computational overhead by eliminating the need for redundant auxiliary modules. However, it still struggles with ambiguous correspondences in dynamic regions, which limits reconstruction performance in such scenarios. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying POintmap MAtching with Temporal mOtion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in 3D reconstruction tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of our proposed POMATO by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Yan Zhang
Meshcapade
Yao Feng
Meshcapade
Alpár Cseke
Meshcapade
Nitin Saini
Meshcapade
Nathan Bajandas
Meshcapade
Nicolas Heron
Meshcapade
Michael J. Black
Max Planck Institute for Intelligent Systems, Tübingen
Abstract
We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
Jinhua Zhang
University of Electronic Science and Technology of China
Hualian Sheng
Independent Researcher
Sijia Cai
Independent Researcher
Bing Deng
Independent Researcher
Qiao Liang
Independent Researcher
Wen Li
University of Electronic Science and Technology of China
Ying Fu
Beijing Institute of Technology
Jieping Ye
Independent Researcher
Shuhang Gu
University of Electronic Science and Technology of China
Abstract
Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the innovative integration of controlling information and introduce PerLDiff (Perspective-Layout Diffusion Models), a novel method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results show that our PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.
PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
Hao Zhang
University of Illinois Urbana Champaign
Haolan Xu
University of Illinois Urbana Champaign
Chun Feng
University of Illinois Urbana Champaign
Varun Jampani
Stability AI
Narendra Ahuja
University of Illinois Urbana Champaign
Abstract
Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse [5], The Amazing Animals Zoo [35], and MixaMo [1], covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling. This project is available at https://physrig.github.io/.
PlaneRAS: Learning Planar Primitives for 3D Plane Recovery
Fang Zhang
Beijing University of Posts and Telecommunications, China
Wenzhao Zheng
Tsinghua University, China
Linqing Zhao
Tsinghua University, China
Zelan Zhu
Beijing University of Posts and Telecommunications, China
Jiwen Lu
Tsinghua University, China
Xiuzhuang Zhou
Beijing University of Posts and Telecommunications, China
Abstract
3D plane recovery from monocular images constitutes a fundamental task in indoor scene understanding. Recent methods formulate this problem as 2D pixel-level segmentation through convolutional networks or query-based architectures, which purely rely on 2D pixel features while neglecting the inherent 3D spatial nature of planar surfaces. To address this limitation, we propose an end-to-end Plane Reconstruction, Aggregation, and Splatting (PlaneRAS) framework that explicitly leverages 3D geometric reasoning combined with online planar primitive reconstruction. Our framework introduces two core components: 1) a reconstruction module utilizing customized planar primitives to compactly represent the 3D scene, and 2) a recovery module that aggregates local primitives to derive globally consistent plane instances. The proposed 3D-aware representation enables direct integration of pretrained geometric priors, significantly enhancing performance beyond conventional 2D-centric approaches. Extensive experiments on ScanNet and NYUv2 datasets demonstrate state-of-the-art results across various evaluation metrics, resulting from our explicit 3D geometric modeling and effective fusion of cross-dimensional features.
Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives
Ziyu Zhang
CASIA
Binbin Huang
The University of Hong Kong
Hanqing Jiang
SenseTime Research
Liyang Zhou
SenseTime Research
Xiaojun Xiang
SenseTime Research
Shuhan Shen
CASIA
Abstract
We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipses, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling, a metric misaligned with surface geometry under deformation, QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining efficient rendering via fast ray-quadric intersection. Experiments on DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.
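To make the geodesic-distance density idea concrete, the sketch below illustrates it on a 1D parabolic cross-section y = k x^2, whose arc length has a closed form; the full QGS formulation on general quadric surfaces is more involved, so this is only an assumption-level analogy, not the paper's implementation.

```python
# Minimal sketch under stated assumptions (not the QGS implementation):
# density weighting by geodesic rather than Euclidean distance, illustrated on
# a 1D parabolic section y = k * x**2, whose arc length has a closed form.
import numpy as np

def parabola_arc_length(x, k):
    """Closed-form arc length of y = k x^2 from the vertex to x."""
    if k == 0:
        return np.abs(x)
    u = 2.0 * k * x
    return (x * np.sqrt(1.0 + u * u)) / 2.0 + np.arcsinh(u) / (4.0 * k)

def splat_density(x, k, sigma):
    """Gaussian falloff in geodesic (surface) distance instead of |x|."""
    d_geo = parabola_arc_length(x, k)
    return np.exp(-0.5 * (d_geo / sigma) ** 2)

# For a flat primitive (k = 0) this reduces to the usual planar-surfel falloff.
```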
Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Shi-Chen Zhang
VCIP, CS, Nankai University
Yunheng Li
VCIP, CS, Nankai University
Yu-Huan Wu
IHPC, A*STAR, Singapore
Qibin Hou
Nankai University
Ming-Ming Cheng
Nankai University
Abstract
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters.
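A minimal sketch of the coupled dual-branch offset idea, with hypothetical module names rather than the OffSeg release: one branch offsets the spatial features, the other offsets the class embeddings conditioned on a global image summary, and the refined pair is classified by dot products.

```python
# Minimal sketch (hypothetical module, not the OffSeg code) of dual-branch
# offset learning: refine spatial features and class embeddings, then classify.
import torch
import torch.nn as nn

class OffsetSegHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.class_embed = nn.Parameter(torch.randn(num_classes, channels))
        self.feat_offset = nn.Conv2d(channels, channels, kernel_size=1)
        # Class offsets are conditioned on a global summary of the image.
        self.class_offset = nn.Linear(channels, num_classes * channels)
        self.num_classes = num_classes

    def forward(self, feats):                      # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        refined_feats = feats + self.feat_offset(feats)
        global_ctx = feats.mean(dim=(2, 3))        # (B, C)
        cls_off = self.class_offset(global_ctx).view(b, self.num_classes, c)
        refined_cls = self.class_embed.unsqueeze(0) + cls_off      # (B, K, C)
        # Per-pixel logits as dot products between refined classes and features.
        return torch.einsum("bkc,bchw->bkhw", refined_cls, refined_feats)
```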
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Kaidong Zhang
Sun Yat-sen University
Rongtao Xu
MBZUAI
Pengzhen Ren
Peng Cheng Laboratory
Junfan Lin
Peng Cheng Laboratory
Hefeng Wu
Sun Yat-sen University
Liang Lin
Sun Yat-sen University
Xiaodan Liang
Sun Yat-sen University
Abstract
Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a guided embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
Qi Zhang
College of Intelligence and Computing, Tianjin University, China
Chi Huang
College of Intelligence and Computing, Tianjin University, China
Qian Zhang
College of Intelligence and Computing, Tianjin University, China
Nan Li
College of Intelligence and Computing, Tianjin University, China
Wei Feng
College of Intelligence and Computing, Tianjin University, China
Abstract
The latest advancements in scene relighting have been predominantly driven by inverse rendering with 3D Gaussian Splatting (3DGS). However, existing methods remain overly reliant on densely sampled images under static illumination conditions, which is prohibitively expensive and even impractical in real-world scenarios. In this paper, we propose SU-RGS, a novel relightable 3D Gaussian Splatting framework that learns from Sparse views under Unconstrained illuminations, to address this challenge by jointly optimizing 3DGS representations, surface materials, and environment illuminations (i.e., unknown and various lighting conditions in training) using only sparse input views. Firstly, SU-RGS presents a varying appearance rendering strategy, enabling each 3D Gaussian to exhibit inconsistent colors under various lightings. Next, SU-RGS establishes multi-view semantic consistency by constructing hierarchical semantic pseudo-labels across inter-views, providing extra supervision and facilitating sparse inverse rendering under unconstrained illuminations. Additionally, we introduce an adaptive transient object perception component that integrates the scene geometry and semantics in a fine-grained manner to quantify and eliminate the uncertainty of the foreground. Extensive experiments on both synthetic and real-world challenging datasets demonstrate the effectiveness of SU-RGS, achieving state-of-the-art performance for scene inverse rendering by learning 3DGS from only sparse views under unconstrained illuminations.
Semantic-guided Camera Ray Regression for Visual Localization
Yesheng Zhang
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
Xu Zhao
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
Abstract
This work presents a novel framework for Visual Localization (VL), that is, regressing camera rays from query images to derive camera poses. As an overparameterized representation of the camera pose, camera rays possess superior robustness in optimization. Of particular importance, Camera Ray Regression (CRR) is privacy-preserving, rendering it a viable VL approach for real-world applications. Thus, we introduce DINO-based Multi-Mappers, coined DIMM, to achieve VL by CRR. DIMM utilizes DINO as a scene-agnostic encoder to obtain powerful features from images. To mitigate ambiguity, the features integrate both local and global perception, as well as potential geometric constraints. Then, a scene-specific mapper head regresses camera rays from these features. It incorporates a semantic attention module for soft fusion of multiple mappers, utilizing the rich semantic information in DINO features. In extensive experiments on both indoor and outdoor datasets, our method showcases impressive performance, revealing a promising direction for advancements in VL.
SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
Yingying Zhang
Ant Group
Lixiang Ru
Ant Group
Kang Wu
Wuhan University
Lei Yu
Ant Group
Lei Liang
Ant Group
Yansheng Li
Wuhan University
Jingdong Chen
Ant Group
Abstract
The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches generally require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.
SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations
Songchun Zhang
ZJU
Huiyao Xu
ZJU
Sitong Guo
ZJU
Zhongwei Xie
HKUST
Hujun Bao
ZJU
Weiwei Xu
ZJU
Changqing Zou
ZJU
Abstract
Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, despite recent progress, rely on dense multi-view observations, restricting their application. We tackle the task of reconstructing photorealistic 3D scenes from only one or a few input views. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes. Project page: https://franklinz233.github.io/projects/spatialcrafter/.
StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth
Zheng Zhang
The University of Hong Kong
Lihe Yang
The University of Hong Kong
Tianyu Yang
DAMO Academy, Alibaba Group
Chaohui Yu
DAMO Academy, Alibaba Group
Xiaoyang Guo
The Chinese University of Hong Kong
Yixing Lao
The University of Hong Kong
Hengshuang Zhao
The University of Hong Kong
Abstract
Recent advances in monocular depth estimation significantly improve robustness and accuracy. However, relative depth models exhibit flickering and 3D inconsistency in video data, limiting 3D reconstruction applications. We introduce StableDepth, a scene-consistent and scale-invariant depth estimation method achieving scene-level 3D consistency. Our dual-decoder architecture learns from large-scale unlabeled video data, enhancing generalization and reducing flickering. Unlike previous methods requiring full video sequences, StableDepth enables online inference at 13x faster speed, achieving significant improvements across benchmarks with comparable temporal consistency to video diffusion-based estimators.
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
Xuying Zhang
VCIP, CS, Nankai University
Yutong Liu
USTC
Yangguang Li
CUHK MMLab
Renrui Zhang
CUHK MMLab
Yufei Liu
Shanghai AI Lab
Kai Wang
VCIP, CS, Nankai University
Wanli Ouyang
CUHK MMLab
Zhiwei Xiong
USTC
Peng Gao
Shanghai AI Lab
Qibin Hou
VCIP, CS, Nankai University
Ming-Ming Cheng
VCIP, CS, Nankai University
Abstract
We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on several public 3D datasets demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
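The sketch below illustrates the next-part (next-token) prediction setup in a generic decoder-only transformer over VQ codebook indices with prefilled condition tokens; the positional embedding here is a plain learned table standing in for TriPE, and nothing else is taken from the TAR3D code.

```python
# Minimal sketch, assuming a generic decoder-only transformer over VQ codebook
# indices (an illustration of next-token training, not the TAR3D code).
import torch
import torch.nn as nn

class NextPartGPT(nn.Module):
    def __init__(self, vocab_size, dim=512, depth=6, heads=8, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)      # plain stand-in for TriPE
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx, cond):                  # idx: (B, T), cond: (B, Tc, dim)
        t = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(t, device=idx.device))
        x = torch.cat([cond, x], dim=1)            # prefill condition tokens
        s = x.shape[1]
        # Causal mask so each token only attends to earlier positions.
        mask = torch.triu(torch.full((s, s), float("-inf"), device=x.device), 1)
        x = self.blocks(x, mask=mask)
        return self.head(x[:, cond.shape[1]:])     # logits for shape tokens only

# Training: shift targets by one and apply cross-entropy on the shape tokens.
```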
Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Xiangdong Zhang
School of AI, Shanghai Jiao Tong University
Shaofeng Zhang
School of AI, Shanghai Jiao Tong University
Junchi Yan
School of AI, Shanghai Jiao Tong University
Abstract
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Recognizing that a two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% on three variants of ScanObjectNN with the MLP-LINEAR evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.
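A hypothetical sketch of the cross-reconstruction setup, following the abstract rather than the released code: two decoupled crops are generated, and each view is reconstructed from the other; the relative-position encoding and masking details are omitted for brevity.

```python
# Minimal sketch (hypothetical, not the Point-PQAE release): generate two
# decoupled crops of a point cloud and form the cross-reconstruction objective.
import torch

def random_crop(points, ratio=0.5):
    """points: (N, 3). Keep the `ratio` fraction nearest to a random center."""
    n = points.shape[0]
    center = points[torch.randint(n, (1,))]
    dist = (points - center).norm(dim=-1)
    keep = dist.argsort()[: int(n * ratio)]
    return points[keep]

def cross_reconstruction_step(points, encoder, decoder, chamfer):
    """Reconstruct view B from view A and vice versa (callables are user-supplied)."""
    view_a, view_b = random_crop(points), random_crop(points)
    # The relative position between the two crops would be encoded here (omitted).
    pred_b = decoder(encoder(view_a))
    pred_a = decoder(encoder(view_b))
    return chamfer(pred_b, view_b) + chamfer(pred_a, view_a)
```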
Tracking Tiny Drones against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
Jiahao Zhang
College of Computer Science, Beijing University of Technology
Zongli Jiang
College of Computer Science, Beijing University of Technology
Jinli Zhang
College of Computer Science, Beijing University of Technology
Yixin Wei
College of Computer Science, Beijing University of Technology
Liang Li
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Yizheng Wang
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Gang Wang
NAIVE Lab, Brain Research Center, Beijing Institute of Basic Medical Sciences
Abstract
Tracking flying drones in infrared videos is a crucial yet challenging task. Existing drone trackers and datasets have limitations in dealing with and characterizing tiny targets (≤20x20 pixels) against highly complex backgrounds. To tackle this issue, we have developed a large-scale benchmark for tiny drone tracking in infrared videos (TDTIV), which comprises 290k frames and 280k manually annotated bounding boxes. Unlike traditional trackers that primarily rely on appearance matching, we introduce a novel method called Motion-Centric Adaptive Tracking (MCATrack), which initially employs a magnocell-inspired motion response to enhance the local signal-to-noise ratio of tiny target regions while suppressing complex clutter. Moreover, we design a Dynamic Cross-Guided module that integrates both initial and updated target features to address pose variations in long-term tracking. This module captures the latest target information to generate highly relevant candidate regions and refines them through precise optimization to achieve more accurate tracking results. Extensive experiments performed on the TDTIV and the well-recognized Anti-UAV 410 datasets have demonstrated the superiority of MCATrack over state-of-the-art competing trackers. Code and dataset are available at https://github.com/zhangjiahao02/MCATrack.
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Xiao Zhang
Dalian University of Technology
Fei Wei
AMAP, Alibaba Group
Yong Wang
AMAP, Alibaba Group
Wenda Zhao
Dalian University of Technology
Feiyi Li
Dalian University of Technology
Xiangxiang Chu
AMAP, Alibaba Group
Abstract
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
Shiduo Zhang
Fudan University
Zhe Xu
Fudan University
Peiju Liu
Fudan University
Xiaopeng Yu
Fudan University
Yuan Li
Fudan University
Qinghui Gao
Fudan University
Zhaoye Fei
Fudan University
Zhangyue Yin
Fudan University
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Fudan University
Xipeng Qiu
Fudan University
Abstract
General-purpose embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream fine-tuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks. Codes and more videos are available at https://vlabench.github.io/.
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang
The Chinese University of Hong Kong, Shenzhen
Wei Zhang
Baidu Inc.
Xiao Tan
Baidu Inc.
Sibei Yang
Sun Yat-sen University
Xiang Wan
Shenzhen Research Institute of Big Data
Xiaonan Luo
Guilin University of Electronic Technology
Guanbin Li
Sun Yat-sen University
Abstract
Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
VertexRegen: Mesh Generation with Continuous Level of Detail
Xiang Zhang
UC San Diego
Yawar Siddiqui
Meta Reality Labs Research
Armen Avetisyan
Meta Reality Labs Research
Chris Xie
Meta Reality Labs Research
Jakob Engel
Meta Reality Labs Research
Henry Howard-Jenkins
Meta Reality Labs Research
Abstract
We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e. vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
Yafei Zhang
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Lingqi Kong
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Huafeng Li
Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Jie Wen
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Abstract
To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model's ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method. Code is available at https://github.com/KongLingqi2333/WSL-VIReID.
DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion
Qingcheng Zhao
ShanghaiTech University
Xiang Zhang
UC San Diego
Haiyang Xu
UC San Diego
Zeyuan Chen
UC San Diego
Jianwen Xie
Lambda, Inc.
Yuan Gao
Stanford University
Zhuowen Tu
UC San Diego
Abstract
We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. Instead of reconstructing the entire scene holistically, DepR generates individual objects and subsequently composes them into a coherent 3D layout. Unlike previous methods that use depth solely for object layout estimation during inference and therefore fail to fully exploit its rich geometric information, DepR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into diffusion models. During inference, depth further guides DDIM sampling and layout optimization, enhancing alignment between the reconstruction and the input image. Despite being trained on limited synthetic data, DepR achieves state-of-the-art performance and demonstrates strong generalization in single-view scene reconstruction, as shown through evaluations on both synthetic and real-world datasets.
HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
Jiahe Zhao
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Ruibing Hou
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Zejie Tian
Communication University of China
Hong Chang
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Shiguang Shan
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
Abstract
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models. Codes and data will be available at https://github.com/ZJHTerry18/HumanInScene.
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Guosheng Zhao
Institute of Automation, Chinese Academy of Sciences
Xiaofeng Wang
Institute of Automation, Chinese Academy of Sciences
Chaojun Ni
GigaAI
Zheng Zhu
GigaAI
Wenkang Qin
GigaAI
Guan Huang
GigaAI
Xingang Wang
Institute of Automation, Chinese Academy of Sciences
Abstract
Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
Tianyi Zhao
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Boyang Liu
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yanglei Gao
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yiming Sun
Southeast University
Maoxun Yuan
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xingxing Wei
Institute of Artificial Intelligence, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, it neglects the mono-modality insufficient-learning problem, which arises from the decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M2D-LIF, which consists of the Mono-Modality Distillation (M2D) method and the Local Illumination-aware Fusion (LIF) module. The M2D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M2D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.
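A minimal sketch, under assumptions, of the two ingredients named above: a mono-modality distillation loss against frozen single-modality teachers, and a toy local-illumination gate for fusion weights. The function names and the gating rule are illustrative, not the released M2D-LIF code.

```python
# Minimal sketch (hypothetical losses named after the abstract, not the
# released M2D-LIF code): distill each frozen mono-modality teacher into the
# corresponding branch of the jointly trained multi-modal detector.
import torch
import torch.nn.functional as F

def mono_modality_distillation(rgb_feat, ir_feat, rgb_teacher_feat, ir_teacher_feat):
    """All tensors: (B, C, H, W) backbone features; teachers are frozen."""
    loss_rgb = F.mse_loss(rgb_feat, rgb_teacher_feat.detach())
    loss_ir = F.mse_loss(ir_feat, ir_teacher_feat.detach())
    return loss_rgb + loss_ir

def illumination_weights(rgb_patch_luma):
    """Toy local-illumination gate: brighter regions lean on RGB, darker on IR."""
    w_rgb = torch.sigmoid((rgb_patch_luma - 0.5) * 10.0)
    return w_rgb, 1.0 - w_rgb
```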
Toward Material-Agnostic System Identification from Videos
Yizhou Zhao
Carnegie Mellon University
Haoyu Chen
Carnegie Mellon University
Chunjiang Liu
Carnegie Mellon University
Zhenyang Li
University of Alabama at Birmingham
Charles Herrmann
Google
Junhwa Hur
Google
Yinxiao Li
Google
Ming-Hsuan Yang
Google
Bhiksha Raj
Carnegie Mellon University
Min Xu
Carnegie Mellon University
Abstract
System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
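As an illustration of a learnable neural constitutive model, the sketch below maps a per-particle deformation gradient to a predicted stress tensor with a small MLP; the architecture, activation choices, and how such a module would be embedded in MASIV's differentiable simulator are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NeuralConstitutiveModel(nn.Module):
    """Hypothetical learnable constitutive law: maps a particle's deformation
    gradient (3x3) to a predicted stress tensor (3x3), replacing a hand-crafted
    material model inside a differentiable simulator."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 9),
        )

    def forward(self, defgrad):                 # defgrad: (N, 3, 3)
        stress = self.net(defgrad.flatten(1))   # (N, 9)
        return stress.view(-1, 3, 3)

model = NeuralConstitutiveModel()
F_rest = torch.eye(3).repeat(1024, 1, 1)        # particles at rest configuration
P = model(F_rest)                               # predicted per-particle stress
```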
Learning 4D Embodied World Models
Haoyu Zhen
UMass Amherst
Qiao Sun
UMass Amherst
Hongxin Zhang
UMass Amherst
Junyan Li
UMass Amherst
Siyuan Zhou
HKUST
Yilun Du
Harvard University
Chuang Gan
UMass Amherst
Abstract
This paper presents an effective approach for learning novel 4D embodied world models, TesserAct, which predicts the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamics models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that yields policies significantly outperforming those derived from prior video-based world models.
Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
Bozhong Zheng
ShanghaiTech University
Jinye Gan
ShanghaiTech University
Xiaohao Xu
University of Michigan, Ann Arbor
Xintao Chen
ShanghaiTech University
Wenqiao Li
ShanghaiTech University
Xiaonan Huang
University of Michigan, Ann Arbor
Na Ni
ShanghaiTech University
Yingna Wu
ShanghaiTech University
Abstract
3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and an SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. Our code will be released to drive further research.
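A hedged sketch of SDF-based anomaly scoring in the spirit of PASDF: after pose alignment, observed points are evaluated against a continuous SDF template and |SDF| serves as the per-point anomaly score. A unit-sphere SDF stands in for the learned pose-aware network, and the threshold value is illustrative.

```python
import numpy as np

def sdf_unit_sphere(points):
    """Stand-in for a learned, pose-aware SDF network: signed distance to a
    unit sphere (negative inside, positive outside)."""
    return np.linalg.norm(points, axis=1) - 1.0

def anomaly_scores(points, sdf_fn):
    """Per-point anomaly score: deviation of observed points from the learned
    zero level set; large |SDF| flags geometry that differs from the template."""
    return np.abs(sdf_fn(points))

pts = np.random.randn(2048, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # points on the sphere surface
pts[:100] *= 1.2                                    # simulate a bump anomaly
scores = anomaly_scores(pts, sdf_unit_sphere)
mask = scores > 0.05                                # point-level localization
```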
Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
Yuanhong Zheng
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai
Ruixuan Yu
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai
Jian Sun
School of Mathematics and Statistics, Xi'an Jiaotong University
Abstract
3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate a spatial inter-person distance embedding. With the above efficient temporal and spatial designs, we achieve state-of-the-art performance on multiple metrics on the standard CMU-Mocap, MuPoTS-3D, and 3DPW datasets, while significantly reducing the computational cost. Code is available at https://github.com/Yuanhong-Zheng/EMPMP.
Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Minghang Zheng
Wangxuan Institute of Computer Technology, Peking University
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Benyuan Sun
Central Media Technology Institute, Huawei
Yi Yang
Central Media Technology Institute, Huawei
Yang Liu
Wangxuan Institute of Computer Technology, Peking University
Abstract
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. Existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. Specifically, we design an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur in the near future and further regresses its start time. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
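One plausible (hypothetical) realization of a hierarchical event memory: recent event features are kept at full granularity, while older ones are folded into a coarser long-term tier so the memory stays bounded. The merge-by-averaging rule and all sizes are assumptions, not the paper's design.

```python
from collections import deque
import numpy as np

class HierarchicalEventMemory:
    """Illustrative two-tier memory: a fine-grained recent tier plus a coarse
    long-term tier built by averaging groups of old event features."""
    def __init__(self, recent_size=32, longterm_size=64, merge=4):
        self.recent = deque()
        self.longterm = deque(maxlen=longterm_size)
        self.recent_size, self.merge = recent_size, merge

    def add(self, event_feat):
        self.recent.append(event_feat)
        if len(self.recent) > self.recent_size:
            chunk = [self.recent.popleft() for _ in range(self.merge)]
            self.longterm.append(np.mean(chunk, axis=0))   # coarse summary

    def read(self):
        return list(self.longterm) + list(self.recent)

mem = HierarchicalEventMemory()
for t in range(200):
    mem.add(np.random.randn(256))       # per-proposal event feature
context = mem.read()                    # recent + long-term history
```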
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
University of Washington
Jieyu Zhang
University of Washington
Mohammadreza Salehi
University of Washington
Ziqi Gao
University of Washington
Vishnu Iyengar
University of Washington
Norimasa Kobori
Woven by Toyota, Inc
Quan Kong
Woven by Toyota, Inc
Ranjay Krishna
University of Washington
Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% in average top-5 recall on video-text retrieval with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
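A hedged sketch of turning trajectories into tokens: given per-frame features and one binary mask tube per sub-object trajectory, masked average pooling yields one token per trajectory. TrajViT's actual trajectory extraction and token encoder are more involved; shapes here are illustrative.

```python
import torch

def trajectory_tokens(frame_feats, traj_masks, eps=1e-6):
    """One token per panoptic sub-object trajectory via masked average pooling.
    frame_feats: (T, C, H, W) per-frame features; traj_masks: (N, T, H, W) in {0,1}."""
    summed = torch.einsum('tchw,nthw->nc', frame_feats, traj_masks)   # (N, C)
    area = traj_masks.sum(dim=(1, 2, 3)).clamp_min(eps).unsqueeze(1)  # (N, 1)
    return summed / area                                              # (N, C) tokens

tokens = trajectory_tokens(torch.randn(8, 192, 16, 16),
                           (torch.rand(20, 8, 16, 16) > 0.9).float())
```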
RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
Chengyu Zheng
Nanjing University of Aeronautics and Astronautics
Jin Huang
Nanjing University of Aeronautics and Astronautics
Honghua Chen
Lingnan University
Mingqiang Wei
Nanjing University of Aeronautics and Astronautics
Abstract
Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.
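The first step the abstract describes, projecting the point cloud into depth maps from multiple perspectives, can be sketched as a simple z-buffer projection; the camera intrinsics and extrinsics below are illustrative, and the diffusion-feature extraction and correspondence fusion that follow are omitted.

```python
import numpy as np

def render_depth(points, K, R, t, hw=(256, 256)):
    """Z-buffer projection of a point cloud into a depth map from one viewpoint.
    points: (N, 3) world coordinates; K: (3, 3) intrinsics; R, t: extrinsics."""
    H, W = hw
    cam = points @ R.T + t                      # world -> camera frame
    z = cam[:, 2]
    valid = z > 1e-3                            # keep points in front of the camera
    uvw = cam[valid] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.full((H, W), np.inf)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    np.minimum.at(depth, (v[inside], u[inside]), z[valid][inside])
    depth[np.isinf(depth)] = 0.0
    return depth

K = np.array([[200., 0., 128.], [0., 200., 128.], [0., 0., 1.]])
depth = render_depth(np.random.randn(5000, 3) + [0, 0, 4], K, np.eye(3), np.zeros(3))
```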
Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
Junhao Zheng
Xi'an Jiaotong University
Jiahao Sun
Xi'an Jiaotong University
Chenhao Lin
Xi'an Jiaotong University
Zhengyu Zhao
Xi'an Jiaotong University
Chen Ma
Xi'an Jiaotong University
Chong Zhang
Xi'an Jiaotong University
Cong Wang
City University of Hong Kong
Qian Wang
Wuhan University
Chao Shen
Xi'an Jiaotong University
Abstract
Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This yields a large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.
S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
Guangting Zheng
University of Science and Technology of China
Jiajun Deng
The University of Adelaide
Xiaomeng Chu
University of Science and Technology of China
Yu Yuan
University of Science and Technology of China
Houqiang Li
University of Science and Technology of China
Yanyong Zhang
University of Science and Technology of China
Abstract
Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM while reducing reconstruction time to below 50%, and even 20%, of that of competing methods. Code is available at https://github.com/Tom-zgt/S3RGS.
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Yupeng Zheng
Institute of Automation, Chinese Academy of Sciences
Pengxuan Yang
School of Artificial Intelligence, UCAS
Zebin Xing
School of Artificial Intelligence, UCAS
Qichao Zhang
Institute of Automation, Chinese Academy of Sciences
Yuhang Zheng
National University of Singapore
Yinfeng Gao
Institute of Automation, Chinese Academy of Sciences
Pengfei Li
Tsinghua University
Teng Zhang
School of Artificial Intelligence, UCAS
Zhongpu Xia
Institute of Automation, Chinese Academy of Sciences
Peng Jia
School of Artificial Intelligence, UCAS
XianPeng Lang
School of Artificial Intelligence, UCAS
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Abstract
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.0% relative reduction in L2 error, 46.7% lower collision rate, and 3.75x faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.
iManip: Skill-Incremental Learning for Robotic Manipulation
Zexin Zheng
Sun Yat-sen University
Jia-Feng Cai
Sun Yat-sen University
Xiao-Ming Wu
Sun Yat-sen University
Yi-Lin Wei
Sun Yat-sen University
Yu-Ming Tang
Sun Yat-sen University
Ancong Wu
Sun Yat-sen University
Wei-Shi Zheng
Sun Yat-sen University
Abstract
The development of a generalist agent with multiple adaptive manipulation skills has been a long-standing goal in the robotics community. In this paper, we explore a crucial task, skill-incremental learning, in robotic manipulation, which is to endow robots with the ability to learn new manipulation skills based on previously learned knowledge without re-training. First, we build a skill-incremental environment based on the RLBench benchmark and explore how traditional incremental methods perform in this setting. We find that they suffer from severe catastrophic forgetting because these methods, designed for classification, overlook the temporality and action complexity inherent to robotic manipulation tasks. To this end, we propose an incremental Manipulation framework, termed iManip, to mitigate the above issues. We first design a temporal replay strategy to maintain the integrity of old skills when learning new ones. Moreover, we propose the Extendable PerceiverIO, consisting of an action prompt with extendable weights to adapt to new action primitives in new skills. Extensive experiments show that our framework performs well in skill-incremental learning.
CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
Hanzhi Zhong
Zhejiang University
Zhiyu Xiang
Zhejiang University
Ruoyu Xu
Zhejiang University
Jingyun Fu
Zhejiang University
Peng Xu
Zhejiang University
Shaohong Wang
Zhejiang University
Zhihao Yang
Zhejiang University
Tianyu Pu
Zhejiang University
Eryun Liu
Zhejiang University
Abstract
4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Due to the sparse points and noisy measurements of 4D radar, most research performs 3D object detection by integrating camera images and fusing the modalities in BEV space. However, the potential of the radar and of the fusion mechanism is still largely unexplored, hindering performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar-guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views, including points, image, and BEV, for each proposal. These comprehensive instance-level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10% and 3.68% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be available at https://github.com/zhzhzhzhzhz/CVFusion.
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Jiaru Zhong
Tsinghua University
Jiahao Wang
Tsinghua University
Jiahui Xu
The University of Hong Kong
Xiaofan Li
Baidu Inc.
Zaiqing Nie
Tsinghua University
Haibao Yu
The University of Hong Kong
Abstract
Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.
OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Ding Zhong
AI Thrust, HKUST(GZ)
Xu Zheng
INSAIT, Sofia University
Chenfei Liao
AI Thrust, HKUST(GZ)
Yuanhuiyi Lyu
AI Thrust, HKUST(GZ)
Jialei Chen
Nagoya University
Shengyang Wu
UMich
Linfeng Zhang
SJTU
Xuming Hu
AI Thrust, HKUST(GZ)
Abstract
Segment Anything Model 2 (SAM2) has emerged as a strong base model for various pinhole imaging segmentation tasks. However, when applying it to the 360° domain, the significant field-of-view (FoV) gap between pinhole (70°×70°) and panoramic images (180°×360°) poses unique challenges. Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding, which the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a similar manner as in video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization across different sizes of source models. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods by large margins, e.g., 79.06% (10.22%↑) on SPin8-to-SPan8 and 62.46% (6.58%↑) on CS13-to-DP13.
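A minimal sketch of dividing a panorama into an ordered patch sequence, assuming a sliding window over the longitude axis with wrap-around; OmniSAM's actual FoV sampling and the SAM2 memory interaction are not reproduced, and the window and stride sizes are illustrative.

```python
import numpy as np

def panorama_to_patch_sequence(pano, fov_w=512, stride=256):
    """Slide a pinhole-FoV-sized window around the panorama's longitude axis
    (with 360° wrap-around) to obtain an ordered patch sequence that a
    video-style memory can process. pano: (H, W, 3) equirectangular image."""
    H, W, _ = pano.shape
    wrapped = np.concatenate([pano, pano[:, :fov_w]], axis=1)   # handle the wrap
    return [wrapped[:, s:s + fov_w] for s in range(0, W, stride)]

pano = np.zeros((1024, 2048, 3), dtype=np.uint8)
patches = panorama_to_patch_sequence(pano)       # ordered like video frames
```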
RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
Yufeng Zhong
Meituan
Chengjian Feng
Meituan
Feng Yan
Meituan
Fanfan Liu
Meituan
Liming Zheng
Meituan
Lin Ma
Meituan
Abstract
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents should possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we propose RoboTron-Nav, a unified framework that integrates perception, planning, and prediction capabilities through multi-task collaboration on navigation and embodied question answering tasks, thereby enhancing navigation performance. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling strategy to effectively and efficiently utilize historical observations. By leveraging a large language model, RoboTron-Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the CHORES-S benchmark, setting a new state of the art.
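The adaptive 3D-aware history sampling strategy is not specified in detail in the abstract; one plausible stand-in is farthest-point sampling over past camera positions, so that revisited areas contribute fewer redundant frames. The sketch below assumes that interpretation.

```python
import numpy as np

def sample_history_3d(positions, k):
    """Pick k past observations whose camera positions are spatially diverse
    (farthest-point sampling). positions: (T, 3) camera positions over time."""
    T = len(positions)
    chosen = [T - 1]                               # always keep the latest frame
    dists = np.linalg.norm(positions - positions[chosen[0]], axis=1)
    for _ in range(min(k, T) - 1):
        idx = int(np.argmax(dists))                # farthest from current set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(positions - positions[idx], axis=1))
    return sorted(chosen)

trajectory = np.cumsum(np.random.randn(200, 3) * 0.1, axis=0)  # dummy camera path
keep = sample_history_3d(trajectory, k=16)                     # indices of kept frames
```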
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
Fangwei Zhong
Beijing Normal University
Kui Wu
Beihang University
Churan Wang
Peking University
Hao Chen
City University of Macau
Hai Ci
National University of Singapore
Zhoujun Li
Peking University
Yizhou Wang
Peking University
Abstract
We introduce UnrealZoo, a collection of over 100 photorealistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles, for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These optimizations significantly improve the efficiency of rendering and communication, enabling advanced applications such as multi-agent interaction. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open-world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in closed-loop control systems when interacting with highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real-world deployment.
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
Xiaoyu Zhou
Peking University
Jingqi Wang
Peking University
Yongtao Wang
Peking University
Yufei Wei
Chongqing Changan Automobile Co., Ltd
Nan Dong
Peking University
Ming-Hsuan Yang
University of California, Merced
Abstract
Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic 3D occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios.
Event-based Visual Vibrometry
Xinyu Zhou
Peking University
Peiqi Duan
Peking University
Yeliduosi Xiaokaiti
Peking University
Chao Xu
Peking University
Boxin Shi
Peking University
Abstract
Visual vibrometry has emerged as a powerful technique for remote acquisition of audio and the physical properties of materials. To capture high-frequency vibrations, frame-based approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. By leveraging the high temporal resolution and low bandwidth characteristics of event cameras, event-based visual vibrometry enables high-speed vibration sensing under ambient lighting conditions with improved data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach based on the event generation model and a motion refinement network. We demonstrate our method by capturing vibration caused by audio sources and estimating material properties for various objects.
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Xin Zhou
Huazhong University of Science and Technology
Dingkang Liang
Huazhong University of Science and Technology
Sifan Tu
Huazhong University of Science and Technology
Xiwu Chen
The University of Hong Kong
Yikang Ding
MEGVII Technology
Dingyuan Zhang
Huazhong University of Science and Technology
Feiyang Tan
The University of Hong Kong
Hengshuang Zhao
The University of Hong Kong
Abstract
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou
University of Chinese Academy of Sciences
Zhanning Gao
DeepRoute.AI
Zhili Chen
The Hong Kong University of Science and Technology
Maosheng Ye
DeepRoute.AI
Qifeng Chen
The Hong Kong University of Science and Technology
Tongyi Cao
DeepRoute.AI
Honggang Qi
University of Chinese Academy of Sciences
Abstract
In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training
Zhenghong Zhou
University of Rochester
Jie An
University of Rochester
Jiebo Luo
University of Rochester
Abstract
Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and may disrupt the model's distribution learned from the training data. We introduce Latent-Reframe, which enables camera control in a pretrained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the distribution learned during pretraining. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model's latent space, ensuring high-quality video generation. Latent-Reframe can be applied to both DiT- and UNet-based video diffusion models. Experimental results demonstrate that Latent-Reframe can achieve comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.
MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Qingyuan Zhou
Fudan University
Yuehu Gong
Fudan University
Weidong Yang
Fudan University
Jiaze Li
Nanyang Technological University
Yeqi Luo
Fudan University
Baixin Xu
Nanyang Technological University
Shuhao Li
Fudan University
Ben Fei
The Chinese University of Hong Kong
Ying He
Nanyang Technological University
Abstract
Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3DGS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian Splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches: one based on 2DGS and the other on 3DGS. The 2DGS branch excels in surface reconstruction, providing precise geometry information to the 3DGS branch. Leveraging this geometry, the 3DGS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2DGS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2DGS and 3DGS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction. Code is available at https://github.com/TsingyuanChou/MGSR.
MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
Hongyi Zhou
National University of Defense Technology
Yulan Guo
Sun Yat-sen University
Xiaogang Wang
Southwest University
Kai Xu
National University of Defense Technology
Abstract
Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes using only a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes by combining depth estimation, optical flow analysis, and point cloud registration, and then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle 'rotation', 'translation', and even complex movements ('rotation+translation'), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence applications. The project page is at: https://monomobility.github.io/MonoMobility.
OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance
Mingquan Zhou
Chinese Academy of Sciences
Chen He
Chinese Academy of Sciences
Ruiping Wang
Chinese Academy of Sciences
Xilin Chen
Chinese Academy of Sciences
Abstract
Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, integrating a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. The representations are processed using Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Notably, our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation. Our project page is at: https://viplvsu.github.io/OV3D-CG/.
Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
Zhangjun Zhou
Huazhong University of Science and Technology
Yiping Li
Huazhong University of Science and Technology
Chunlin Zhong
Huazhong University of Science and Technology
Jianuo Huang
Huazhong University of Science and Technology
Jialun Pei
The Chinese University of Hong Kong
Hua Li
Hainan University
He Tang
Huazhong University of Science and Technology
Abstract
While the human visual system employs distinct mechanisms to perceive salient and camouflaged objects, existing models struggle to disentangle these tasks. Specifically, salient object detection (SOD) models frequently misclassify camouflaged objects as salient, while camouflaged object detection (COD) models conversely misinterpret salient objects as camouflaged. We hypothesize that this can be attributed to two factors: (i) the specific annotation paradigm of current SOD and COD datasets, and (ii) the lack of explicit aspect relationship modeling in current models. Prevalent SOD/COD datasets enforce a mutual exclusivity constraint, assuming scenes contain either salient or camouflaged objects, which poorly aligns with the real world. Furthermore, current SOD/COD methods are primarily designed for these highly constrained datasets and lack explicit modeling of the relationship between salient and camouflaged objects. In this paper, to promote the development of unconstrained salient and camouflaged object detection, we construct a large-scale dataset, USC12K, which features comprehensive labels and four different scenes that cover all possible logical existence scenarios of both salient and camouflaged objects. To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample aspect relationships. Additionally, we design CSCS to evaluate the model's ability to distinguish salient and camouflaged objects. Our method achieves SOTA performance across all scenes. Code and dataset: GitHub.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou
The University of Adelaide
Yicong Hong
Adobe Research
Zun Wang
UNC, Chapel Hill
Chongyang Zhao
UNSW Sydney
Mohit Bansal
UNC, Chapel Hill
Qi Wu
The University of Adelaide
Abstract
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework: we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously, achieving highly comparable performance to task-specific agents.
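A hedged sketch of a state-adaptive mixture of experts: a router conditioned on a pooled state vector (e.g., combined language and observation features) weights several expert feed-forward networks. Dimensions, expert count, and the soft-routing choice are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StateAdaptiveMoE(nn.Module):
    """Route tokens through expert FFNs with weights predicted from a state vector."""
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x, state):                    # x: (B, N, D); state: (B, D)
        weights = torch.softmax(self.router(state), dim=-1)         # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)    # (B, N, D, E)
        return (outs * weights[:, None, None, :]).sum(-1)           # (B, N, D)

moe = StateAdaptiveMoE()
y = moe(torch.randn(2, 100, 512), torch.randn(2, 512))
```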
STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene
Hanyu Zhou
Huazhong University of Science and Technology
Haonan Wang
Huazhong University of Science and Technology
Haoyue Liu
Huazhong University of Science and Technology
Yuxing Duan
Huazhong University of Science and Technology
Luxin Yan
Huazhong University of Science and Technology
Gim Hee Lee
National University of Singapore
Abstract
High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of the dynamic scene from a frame camera. However, this unified paradigm fails to capture the potentially discontinuous temporal features of objects caused by frame imaging and the heterogeneous spatial features between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we observe that background and objects exhibit an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features of background and objects via clustering. For dynamic objects, we discover that Gaussian representations and event data share a consistent spatiotemporal characteristic, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement improves the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments are performed to verify the superiority of our method.
TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
Jiale Zhou
Zhejiang University
Wenhan Wang
Beihang University
Shikun Li
Westlake University
Xiaolei Qu
Beihang University
Xin Guo
Beihang University
Yizhong Liu
Beihang University
Wenzhong Tang
Beihang University
Xun Lin
Beihang University
Yefeng Zheng
Westlake University
Abstract
Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-breaks. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.
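The clDice metric cited in the abstract is a standard topology-aware score computed from skeletons of the prediction and the ground truth; a 2D version is sketched below, with a toy broken-tube example.

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred, gt, eps=1e-8):
    """centerline Dice (clDice) for 2D binary masks: topology precision is the
    fraction of the predicted skeleton covered by the ground truth, topology
    sensitivity the fraction of the ground-truth skeleton covered by the
    prediction; clDice is their harmonic mean."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    s_pred, s_gt = skeletonize(pred), skeletonize(gt)
    tprec = (s_pred & gt).sum() / (s_pred.sum() + eps)
    tsens = (s_gt & pred).sum() / (s_gt.sum() + eps)
    return 2 * tprec * tsens / (tprec + tsens + eps)

gt = np.zeros((128, 128), dtype=bool); gt[60:68, :] = True    # a full tube
pred = np.zeros_like(gt); pred[60:68, :100] = True            # a broken tube
print(cl_dice(pred, gt))
```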
TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
Zewei Zhou
University of California, Los Angeles
Seth Z. Zhao
University of California, Los Angeles
Tianhui Cai
University of California, Los Angeles
Zhiyu Huang
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Jiaqi Ma
University of California, Los Angeles
Abstract
End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
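The exact gradient conflict suppression rule is not given in the abstract; a PCGrad-style projection, which removes the conflicting component between two task gradients before they are combined, is a common stand-in and is sketched here under that assumption.

```python
import torch

def suppress_conflict(g1, g2):
    """If two flattened task gradients conflict (negative dot product), project
    each onto the normal plane of the other (PCGrad-style)."""
    def project(a, b):
        dot = torch.dot(a, b)
        if dot < 0:
            a = a - dot / (b.norm() ** 2 + 1e-12) * b
        return a
    return project(g1.clone(), g2), project(g2.clone(), g1)

g_det = torch.randn(1000)        # flattened detection-task gradient
g_pred = torch.randn(1000)       # flattened prediction-task gradient
g_det, g_pred = suppress_conflict(g_det, g_pred)
combined = g_det + g_pred        # gradient applied to shared parameters
```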
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
Zewei Zhou
University of California, Los Angeles
Hao Xiang
University of California, Los Angeles
Zhaoliang Zheng
University of California, Los Angeles
Seth Z. Zhao
University of California, Los Angeles
Mingyue Lei
University of California, Los Angeles
Yun Zhang
University of California, Los Angeles
Tianhui Cai
University of California, Los Angeles
Xinyi Liu
University of California, Los Angeles
Johnson Liu
University of California, Los Angeles
Maheswari Bajji
University of California, Los Angeles
Xin Xia
University of California, Los Angeles
Zhiyu Huang
University of California, Los Angeles
Bolei Zhou
University of California, Los Angeles
Jiaqi Ma
University of California, Los Angeles
Abstract
Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in both perception and prediction tasks.
When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection
Hongliang Zhou
National University of Defense Technology, China
Yongxiang Liu
National University of Defense Technology, China
Canyu Mo
National University of Defense Technology, China
Weijie Li
National University of Defense Technology, China
Bowen Peng
National University of Defense Technology, China
Li Liu
National University of Defense Technology, China
Abstract
Few-shot object detection aims to detect novel classes with limited samples. Recent methods have leveraged the rich semantic representations of pretrained vision transformers (ViTs) to overcome the limitations of model fine-tuning, thereby improving performance on novel classes. However, existing pretrained ViT schemes only perform transformer encoding in the feature dimension, ignoring pixel-wise differences in low-level features and multiscale variations. The current challenges lie in: (i) extracted features suffer from blurred boundaries and a smooth transition from center to boundary, leading to insufficient distinction between objects and backgrounds, and (ii) balancing the extraction of local details and global contour features in multiscale scenarios. To address these challenges, the Pixel Difference Vision Transformer (PiDiViT) is proposed. Its innovations include: (i) a difference convolution fusion module (DCFM), which enhances feature differences from object centers to boundaries and effectively preserves global information by fusing pixel-wise central difference features with original features through an attention mechanism, and (ii) a multiscale feature fusion module (MFFM), which adaptively fuses features extracted by convolutional kernels of five different scales using a scale attention mechanism to generate attention weights, achieving an optimal balance between local detail and global semantic information extraction. PiDiViT achieves SOTA on the COCO benchmark: surpassing the few-shot detection SOTA by 2.7 nAP50 (10-shot) and 4.0 nAP50 (30-shot) for novel classes, exceeding the one-shot detection SOTA by 4.4 nAP50 and the open-vocabulary detection SOTA by 3.7 nAP50. The code is available at https://github.com/Seaz9/PiDiViT.
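The pixel-wise central difference features used by the DCFM build on the standard central difference convolution; a plain CDC layer is sketched below for reference (the attention-based fusion inside DCFM is not included, and theta is a blending hyperparameter).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Central difference convolution: blends a vanilla 3x3 convolution with a
    term that responds to differences between neighborhood pixels and the
    center pixel, sharpening boundary cues."""
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        # difference term: the kernel's spatial sum applied to the center pixel
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (O, I, 1, 1)
        out_center = F.conv2d(x, kernel_sum, padding=0)
        return out - self.theta * out_center

layer = CentralDifferenceConv2d(64, 64)
y = layer(torch.randn(1, 64, 32, 32))
```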
Where, What, Why: Towards Explainable Driver Attention Prediction
Yuchen Zhou
Sun Yat-sen University
Jiayu Tang
Sun Yat-sen University
Xiaoyan Xiao
Sun Yat-sen University
Yueyao Lin
Sun Yat-sen University
Linkai Liu
Sun Yat-sen University
Zipeng Guo
Sun Yat-sen University
Hao Fei
National University of Singapore
Xiaobo Xia
National University of Singapore
Chao Gou
Sun Yat-sen University
Abstract
Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W³DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
King Abdullah University of Science and Technology
Bing Li
King Abdullah University of Science and Technology
Cheng Zheng
King Abdullah University of Science and Technology
Jinjie Mai
King Abdullah University of Science and Technology
Jun Chen
King Abdullah University of Science and Technology
Letian Jiang
King Abdullah University of Science and Technology
Abdullah Hamdi
University of Oxford
Sara Rojas Martinez
King Abdullah University of Science and Technology
Chia-Wen Lin
National Tsing Hua University
Mohamed Elhoseiny
King Abdullah University of Science and Technology
Bernard Ghanem
King Abdullah University of Science and Technology
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs. Project page: https://4dbench.github.io/
A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
Jie Zhu
Michigan State University
Yiyang Su
Michigan State University
Minchul Kim
Michigan State University
Anil Jain
Michigan State University
Xiaoming Liu
Michigan State University
Abstract
Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed to improve whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality. Code is available at the Project Link.
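To make the score-fusion idea concrete, the sketch below shows one way a quality-guided gate could weight per-modality similarity matrices. The class name QualityGatedScoreFusion, the gate architecture, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of quality-guided score fusion (names and shapes assumed).
import torch
import torch.nn as nn


class QualityGatedScoreFusion(nn.Module):
    """Fuse per-modality similarity matrices with weights predicted
    from per-probe quality scores (a stand-in for the paper's MoE gating)."""

    def __init__(self, num_modalities: int, num_experts: int = 4):
        super().__init__()
        # Gate maps per-probe modality quality scores -> mixture weights over experts.
        self.gate = nn.Sequential(
            nn.Linear(num_modalities, 32), nn.ReLU(),
            nn.Linear(32, num_experts), nn.Softmax(dim=-1),
        )
        # Each expert holds one weight per modality (softmaxed at fusion time).
        self.expert_logits = nn.Parameter(torch.zeros(num_experts, num_modalities))

    def forward(self, sims: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
        # sims:    (M, P, G) similarity matrices, one per modality
        # quality: (P, M) estimated quality of each probe under each modality
        expert_w = torch.softmax(self.expert_logits, dim=-1)   # (E, M)
        mix = self.gate(quality)                                # (P, E)
        w = mix @ expert_w                                      # (P, M) per-probe modality weights
        # Weighted sum over modalities -> fused (P, G) similarity matrix.
        return torch.einsum("pm,mpg->pg", w, sims)


if __name__ == "__main__":
    M, P, G = 3, 8, 100   # modalities (face/gait/body), probes, gallery size
    fusion = QualityGatedScoreFusion(num_modalities=M)
    fused = fusion(torch.randn(M, P, G), torch.rand(P, M))
    print(fused.shape)  # torch.Size([8, 100])
```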
Aether: Geometric-Aware Unified World Modeling
Haoyi Zhu
USTC
Yifan Wang
Shanghai AI Lab
Jianjun Zhou
SII
Wenzheng Chang
SJTU
Yang Zhou
ZJU
Zizun Li
FDU
Junyi Chen
FDU
Chunhua Shen
unknown
Jiangmiao Pang
unknown
Tong He
unknown
Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes AETHER, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, AETHER achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, AETHER employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
Beyond Pixel Uncertainty: Bounding the OoD Objects in Road Scenes
Huachao Zhu
Wuhan University
Zelong Liu
Wuhan University
Zhichao Sun
Wuhan University
Yuda Zou
Wuhan University
Gui-Song Xia
Wuhan University
Yongchao Xu
Wuhan University
Abstract
Recognizing out-of-distribution (OoD) objects on roads is crucial for safe driving. Most existing methods rely on segmentation models' uncertainty as anomaly scores, often resulting in false positives, especially in ambiguous regions such as boundaries, where segmentation models inherently exhibit high uncertainty. Additionally, it is challenging to define a suitable threshold for generating anomaly masks, especially given the inconsistency of predictions across consecutive frames. We propose DetSeg, a novel paradigm that incorporates object-level understanding. DetSeg first detects all objects in the open world and then suppresses in-distribution (ID) bounding boxes, leaving only OoD proposals. These proposals can either help previous methods eliminate false positives (DetSeg-R) or generate binary anomaly masks without a complex threshold search when combined with a box-prompted segmentation module (DetSeg-S). Additionally, we introduce vanishing point guided Hungarian matching (VPHM) to smooth the prediction results within a video clip, mitigating abrupt variations in predictions between consecutive frames. Comprehensive experiments on various benchmarks demonstrate that DetSeg significantly improves performance, reducing the FPR95 of previous methods by up to 37.45%, offering a more robust and practical solution. Code: https://github.com/huachao0124/DetSeg-official.
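As a rough illustration of the proposal-filtering step, the snippet below keeps open-world detections that do not confidently match an in-distribution class and uses them to restrict a pixel anomaly map, in the spirit of DetSeg-R; the ID class list, threshold, and function names are hypothetical.

```python
# Minimal sketch (assumed interfaces): keep only detections that do not match
# any in-distribution (ID) class, then restrict a pixel anomaly map to those boxes.
import numpy as np

ID_CLASSES = {"car", "person", "bicycle", "traffic sign"}  # illustrative ID set


def ood_proposals(boxes, labels, scores, id_score_thresh=0.5):
    """boxes: (N, 4) xyxy; labels: N class names from an open-world detector;
    scores: (N,) confidence of the ID label. Returns boxes kept as OoD proposals."""
    keep = [i for i, (lbl, s) in enumerate(zip(labels, scores))
            if lbl not in ID_CLASSES or s < id_score_thresh]
    return boxes[keep]


def refine_anomaly_map(anomaly, proposals):
    """Zero out anomaly scores outside OoD proposals (DetSeg-R-style refinement)."""
    mask = np.zeros_like(anomaly, dtype=bool)
    for x1, y1, x2, y2 in proposals.astype(int):
        mask[y1:y2, x1:x2] = True
    return np.where(mask, anomaly, 0.0)


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 50], [60, 20, 90, 80]], dtype=float)
    props = ood_proposals(boxes, ["car", "unknown"], np.array([0.9, 0.3]))
    refined = refine_anomaly_map(np.random.rand(100, 100), props)
    print(props.shape, refined.shape)
```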
ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis
Benjin ZHU
CUHK MMLab
Xiaogang WANG
CUHK MMLab
Hongsheng LI
CUHK MMLab
Abstract
Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and closed-loop validation. Current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, our SF-DiT generates temporally consistent 3D occupancy, which provides guidance for controlled image and video diffusion for scene synthesis. To address temporal consistency, SF-DiT enhances standard DiT blocks with temporal semantic modeling through two designs: (1) a Semantic Flow Estimation module capturing scene motions (flow, uncertainty, and classification) from sequential BEV semantic maps, and (2) a Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. This integration of semantic flow modeling into DiT enables a consistent understanding of scene evolution. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance with FID 8.3 and FVD 73.6, and superior temporal occupancy generation results on the nuCraft and OpenOccupancy benchmarks. Code is available.
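A hedged sketch of how flow information might modulate cross-attention is given below: flow-derived features scale and shift the occupancy-token queries before attending to BEV-semantic tokens. The module name, gating form, and dimensions are assumptions; the paper's SF-DiT block is not reproduced here.

```python
# Hedged sketch of flow-modulated cross-attention (dimensions and gating form assumed).
import torch
import torch.nn as nn


class FlowModulatedCrossAttention(nn.Module):
    """Cross-attention whose queries are scaled/shifted by features derived
    from an estimated semantic flow, loosely following the abstract's description."""

    def __init__(self, dim: int, flow_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_mlp = nn.Sequential(nn.Linear(flow_dim, dim), nn.SiLU(),
                                      nn.Linear(dim, 2 * dim))

    def forward(self, x, context, flow_feat):
        # x: (B, N, dim) occupancy tokens; context: (B, M, dim) BEV-semantic tokens
        # flow_feat: (B, N, flow_dim) per-token flow/uncertainty features
        scale, shift = self.flow_mlp(flow_feat).chunk(2, dim=-1)
        q = x * (1 + scale) + shift            # modulate queries by flow features
        out, _ = self.attn(q, context, context)
        return x + out                         # residual connection


if __name__ == "__main__":
    m = FlowModulatedCrossAttention(dim=64, flow_dim=8)
    y = m(torch.randn(2, 16, 64), torch.randn(2, 32, 64), torch.randn(2, 16, 8))
    print(y.shape)  # torch.Size([2, 16, 64])
```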
Depth Any Event Stream: Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation
Jinjing Zhu
HKUST(GZ)
Tianbo Pan
HKUST
Zidong Cao
HKUST
Yexin Liu
HKUST
James T. Kwok
HKUST
Hui Xiong
HKUST
Abstract
With the superior sensitivity of event cameras to high-speed motion and extreme lighting conditions, event-based monocular depth estimation has gained popularity for predicting structural information about surrounding scenes in challenging environments. However, the scarcity of labeled event data constrains prior supervised learning methods. To unleash the promising potential of the existing RGB-based depth foundation model DAM [41], we propose Depth Any Event stream (EventDAM) to achieve high-performance event-based monocular depth estimation in an annotation-free manner. EventDAM effectively combines paired dense RGB images with sparse event data by incorporating three key cross-modality components: Sparsity-aware Feature Mixture (SFM), Sparsity-aware Feature Distillation (SFD), and Sparsity-invariant Consistency Module (SCM). With the proposed sparsity metric, SFM mixes features from RGB images and event data to generate auxiliary depth predictions, while SFD facilitates adaptive feature distillation. Furthermore, SCM ensures output consistency across varying sparsity levels in event data, thereby endowing EventDAM with zero-shot capabilities across diverse scenes. Extensive experiments on a variety of benchmark datasets, in comparison with approaches using diverse input modalities, robustly substantiate the generalization and zero-shot capabilities of EventDAM.
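The snippet below illustrates the general idea of a sparsity metric and a sparsity-weighted feature mixture; the specific metric (fraction of near-zero activations) and the linear blend are placeholders, not the paper's SFM formulation.

```python
# Illustrative sketch only: a sparsity metric and a feature mixture weighted by it
# (module names follow the abstract; the exact formulas are assumptions).
import torch


def sparsity_metric(event_feat: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Fraction of near-zero activations per sample, in [0, 1]."""
    b = event_feat.shape[0]
    return (event_feat.abs() < eps).float().view(b, -1).mean(dim=1)


def sparsity_aware_mixture(rgb_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
    """Blend dense RGB features and sparse event features: the sparser the event
    stream, the more the mixture leans on the RGB branch."""
    s = sparsity_metric(event_feat).view(-1, 1, 1, 1)   # (B, 1, 1, 1)
    return s * rgb_feat + (1.0 - s) * event_feat


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 32, 32)
    evt = torch.randn(2, 64, 32, 32) * (torch.rand(2, 64, 32, 32) > 0.8).float()
    print(sparsity_aware_mixture(rgb, evt).shape)  # torch.Size([2, 64, 32, 32])
```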
EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
Yufei Zhu
ShanghaiTech University
Yiming Zhong
ShanghaiTech University
Zemin Yang
ShanghaiTech University
Peishan Cong
ShanghaiTech University
Jingyi Yu
ShanghaiTech University
Xinge Zhu
The Chinese University of Hong Kong
Yuexin Ma
ShanghaiTech University
Abstract
Dexterous robotic hands often struggle to generalize effectively in complex environments due to models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve by learning from both failures and successes. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference fine-tuning, and ensures physical plausibility throughout the process. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulated and real scenarios.
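HPO is described as preference alignment from positive and negative feedback; as a generic stand-in, the sketch below implements a DPO-style pairwise preference loss over preferred and rejected grasps. The exact objective, the use of a frozen reference model, and the value of beta are assumptions, not the paper's HPO.

```python
# A generic DPO-style pairwise preference loss as a stand-in for Handpose-wise
# Preference Optimization (the paper's exact objective may differ).
import torch
import torch.nn.functional as F


def preference_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """logp_*: log-likelihood of preferred / rejected grasps under the current model;
    ref_logp_*: the same quantities under a frozen reference model."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    lp, ln = torch.randn(8), torch.randn(8)   # current-model log-likelihoods
    rp, rn = torch.randn(8), torch.randn(8)   # reference-model log-likelihoods
    print(preference_loss(lp, ln, rp, rn).item())
```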
IRASim: A Fine-Grained World Model for Robot Manipulation
Fangqi Zhu
Hong Kong University of Science and Technology
Hongtao Wu
ByteDance Seed
Song Guo
Hong Kong University of Science and Technology
Yuxiao Liu
ByteDance Seed
Chilam Cheang
ByteDance Seed
Tao Kong
ByteDance Seed
Abstract
World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interactions in the visual space with existing methods, which overlook the precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses all baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller. Video and code are available at https://gen-irasim.github.io/.
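One plausible form of frame-level action conditioning is to modulate each frame's tokens with that frame's action embedding, AdaLN-style, as sketched below; the module name, shapes, and modulation scheme are illustrative assumptions rather than the paper's exact block design.

```python
# Hedged sketch of frame-level action conditioning: each frame's tokens are
# modulated by that frame's action embedding (AdaLN-style; details assumed).
import torch
import torch.nn as nn


class FrameActionConditioning(nn.Module):
    def __init__(self, dim: int, action_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(action_dim, 2 * dim))

    def forward(self, tokens: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, dim) video tokens per frame; actions: (B, T, action_dim)
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)   # (B, T, dim) each
        return self.norm(tokens) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)


if __name__ == "__main__":
    cond = FrameActionConditioning(dim=64, action_dim=7)
    out = cond(torch.randn(2, 16, 49, 64), torch.randn(2, 16, 7))
    print(out.shape)  # torch.Size([2, 16, 49, 64])
```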
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu
The University of Hong Kong
Tai Wang
Shanghai AI Laboratory
Wenwei Zhang
Shanghai AI Laboratory
Jiangmiao Pang
Shanghai AI Laboratory
Xihui Liu
The University of Hong Kong
Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize 3D position embeddings to enhance the 2D CLIP patch features with 3D spatial context and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D visual understanding and vision-language conversation capabilities comparable to LLaVA.
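The sketch below shows the basic idea of turning 2D patch features into 3D patches by adding an MLP encoding of each patch's back-projected 3D position; the module name and dimensions are assumptions, and the actual LLaVA-3D pipeline may differ in detail.

```python
# Minimal sketch (assumed shapes): lift 2D patch features into "3D patches" by
# adding an MLP encoding of each patch's back-projected 3D position.
import torch
import torch.nn as nn


class Patch3DPositionEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, patch_feats: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) 2D CLIP patch features
        # patch_xyz:   (B, N, 3) 3D coordinates of each patch (e.g. from depth + pose)
        return patch_feats + self.pos_mlp(patch_xyz)


if __name__ == "__main__":
    embed = Patch3DPositionEmbedding(dim=1024)
    feats3d = embed(torch.randn(1, 576, 1024), torch.rand(1, 576, 3))
    print(feats3d.shape)  # torch.Size([1, 576, 1024])
```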
LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
Juelin Zhu
National University of Defense Technology
Shuaibang Peng
National University of Defense Technology
Long Wang
Westlake University
Hanlin Tan
National University of Defense Technology
Yu Liu
National University of Defense Technology
Maojun Zhang
National University of Defense Technology
Shen Yan
National University of Defense Technology
Abstract
We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc [99] has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, whereas the majority of available models, and those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address this issue, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to extract building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost in the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with the maximum value as the coarse pose. In the fine pose estimation stage, a particle-filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimate. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors. The project is available at https://github.com/VictorZoo/LoD-Loc-v2.
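For the coarse stage, one can picture each entry of the pose cost volume as an IoU score between the silhouette projected under a sampled pose and the predicted silhouette, as in the toy example below; the rendering of projected silhouettes is assumed to happen elsewhere, and the function names are hypothetical.

```python
# Coarse-stage sketch only (assumed inputs): score sampled pose hypotheses by the
# IoU between each hypothesis' projected building silhouette and the predicted one.
import numpy as np


def silhouette_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0


def select_coarse_pose(pose_hypotheses, projected_masks, predicted_mask):
    """pose_hypotheses: list of 6-DoF poses sampled around the prior;
    projected_masks: binary LoD-model silhouettes rendered under each hypothesis;
    predicted_mask: binary building silhouette from the segmentation network."""
    costs = np.array([silhouette_iou(m, predicted_mask) for m in projected_masks])
    return pose_hypotheses[int(costs.argmax())], costs


if __name__ == "__main__":
    pred = np.zeros((64, 64), bool); pred[16:48, 16:48] = True
    masks = [np.roll(pred, shift, axis=1) for shift in (-8, 0, 8)]
    poses = ["pose_left", "pose_center", "pose_right"]
    best, costs = select_coarse_pose(poses, masks, pred)
    print(best, costs.round(2))  # pose_center is the best-aligned hypothesis
```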
MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
Yaoye Zhu
Institute for AI Industry Research (AIR), Tsinghua University
Zhe Wang
Institute for AI Industry Research (AIR), Tsinghua University
Yan Wang
Institute for AI Industry Research (AIR), Tsinghua University
Abstract
As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera calibration methods designed for a single vehicle, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu
Tsinghua University
Xilin Wang
Beihang University
Yixuan Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Zhuofan Zhang
Tsinghua University
Xiaojian Ma
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Yixin Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Baoxiong Jia
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Wei Liang
Beijing Institute of Technology
Qian Yu
Beihang University
Zhidong Deng
Tsinghua University
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI, China
Abstract
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. The deployment on a real robot demonstrates MTU3D's effectiveness in handling real-world data. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
Ruijie Zhu
University of Science and Technology of China
Mulin Yu
Shanghai Artificial Intelligence Laboratory
Linning Xu
The Chinese University of Hong Kong
Lihan Jiang
University of Science and Technology of China
Yixuan Li
The Chinese University of Hong Kong
Tianzhu Zhang
unknown
Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory
Bo Dai
The University of Hong Kong
Abstract
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_
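The classification constraint can be pictured as each Gaussian carrying object-ID logits supervised with cross-entropy against the object it is anchored to, as in the minimal sketch below; the rendering and aggregation steps of the actual method are omitted and the shapes are assumptions.

```python
# Hedged sketch: per-Gaussian object-ID logits supervised with a classification
# loss (the splatting/rendering step of the actual method is not modeled here).
import torch
import torch.nn.functional as F


def object_id_loss(gaussian_id_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """gaussian_id_logits: (G, K) one logit per object ID for each Gaussian;
    target_ids: (G,) object ID each Gaussian is anchored to."""
    return F.cross_entropy(gaussian_id_logits, target_ids)


if __name__ == "__main__":
    G, K = 1000, 12                       # Gaussians, object IDs in the scene
    logits = torch.randn(G, K, requires_grad=True)
    targets = torch.randint(0, K, (G,))
    loss = object_id_loss(logits, targets)
    loss.backward()
    print(loss.item())
```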
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
Zhihao Zhu
Shanghai Jiao Tong University
Yifan Zheng
Shanghai Jiao Tong University
Siyu Pan
Shanghai Jiao Tong University
Yaohui Jin
Shanghai Jiao Tong University
Yao Mu
Shanghai Jiao Tong University
Abstract
The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and the reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these issues, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant descriptions; and (3) a spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding
Yihang Zhu
Xidian University
Jinhao Zhang
Xidian University
Yuxuan Wang
Xidian University
Aming Wu
Xidian University
Cheng Deng
Xidian University
Abstract
As an important direction of embodied intelligence, 3D Visual Grounding has attracted much attention, aiming to identify 3D objects matching a given language description. Most existing methods follow a two-stage process, i.e., first detecting proposal objects and then identifying the right ones based on their relevance to the given query. However, when the query is complex, it is difficult to leverage an abstract language representation to lock onto the corresponding objects accurately, which affects grounding performance. In general, given a specific object, humans usually rely on two clues to ground it, i.e., attribute and location clues. To this end, we explore a new mechanism, attribute-to-location clue reasoning, to conduct accurate grounding. In particular, we propose a VGMamba network that consists of an SVD-based attribute mamba, a location mamba, and a multi-modal fusion mamba. Taking a 3D point-cloud scene and a language query as input, we first exploit SVD to decompose the extracted features. Then, a sliding-window operation is conducted to capture attribute characteristics. Next, a location mamba is presented to obtain the corresponding location information. Finally, by means of multi-modal mamba fusion, the model can effectively localize the object that matches the given query. Our method is evaluated on four datasets. Extensive experimental results demonstrate the superiority of our method.
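As a loose illustration of the SVD-based attribute branch, the sketch below extracts a low-rank component of the features and scans it with overlapping windows; the rank, window size, and function names are assumptions, and the Mamba blocks themselves are not modeled.

```python
# Illustrative sketch (rank and window size are assumptions): an SVD-based
# low-rank "attribute" component followed by a sliding-window scan over tokens.
import torch


def svd_attribute_component(feats: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """feats: (N, D) point/token features; return a rank-r reconstruction."""
    U, S, Vh = torch.linalg.svd(feats, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]


def sliding_windows(feats: torch.Tensor, window: int = 16, stride: int = 8) -> torch.Tensor:
    """Split (N, D) features into overlapping windows of shape (num_windows, window, D)."""
    return feats.unfold(0, window, stride).transpose(1, 2)


if __name__ == "__main__":
    feats = torch.randn(256, 64)
    attr = svd_attribute_component(feats)   # (256, 64) low-rank component
    wins = sliding_windows(attr)            # (31, 16, 64) overlapping windows
    print(attr.shape, wins.shape)
```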
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
Haodong Zhu
Beihang University
Wenhao Dong
Beihang University
Linlin Yang
Communication University of China
Hong Li
Beihang University
Yuguang Yang
Beihang University
Yangyang Ren
Beihang University
Qingcheng Zhu
Beihang University
Zichao Feng
Beihang University
Changbai Li
Beihang University
Shaohui Lin
East China Normal University
Runqi Wang
Beijing Jiaotong University
Xiaoyan Luo
Beihang University
Baochang Zhang
Beihang University
Abstract
Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an 'absolute maximum' fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
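The frequency-domain fusion can be illustrated with a hand-rolled single-level Haar DWT: low-frequency sub-bands are averaged here as a placeholder for the Mamba-based LMFB, while high-frequency sub-bands follow the 'absolute maximum' rule described in the abstract. Everything below is a simplified single-channel sketch.

```python
# Self-contained sketch with a single-level Haar DWT (numpy only): low-frequency
# sub-bands are averaged (placeholder for the Mamba-based LMFB), high-frequency
# sub-bands are fused with the 'absolute maximum' rule.
import numpy as np


def haar_dwt2(x):
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll, lh = (a + b + c + d) / 2, (a - b + c - d) / 2
    hl, hh = (a + b - c - d) / 2, (a - b - c + d) / 2
    return ll, (lh, hl, hh)


def haar_idwt2(ll, highs):
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


def fuse_rgb_ir(rgb, ir):
    ll_r, highs_r = haar_dwt2(rgb)
    ll_i, highs_i = haar_dwt2(ir)
    ll_fused = 0.5 * (ll_r + ll_i)                            # placeholder for LMFB
    highs_fused = tuple(np.where(np.abs(hr) >= np.abs(hi), hr, hi)
                        for hr, hi in zip(highs_r, highs_i))  # absolute-maximum rule
    return haar_idwt2(ll_fused, highs_fused)


if __name__ == "__main__":
    rgb, ir = np.random.rand(64, 64), np.random.rand(64, 64)
    print(fuse_rgb_ir(rgb, ir).shape)  # (64, 64)
```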
OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
Yiming Zuo
Princeton University
Willow Yang
Princeton University
Zeyu Ma
Princeton University
Jia Deng
Princeton University
Abstract
Depth completion (DC) aims to predict a dense depth map from an RGB image and a sparse depth map. Existing DC methods generalize poorly to new datasets or unseen sparse depth patterns, limiting their real-world applications. We propose OMNI-DC, a highly robust DC model that generalizes well zero-shot to various datasets. The key design is a novel Multi-resolution Depth Integrator, allowing our model to deal with very sparse depth inputs. We also introduce a novel Laplacian loss to model the ambiguity in the training process. Moreover, we train OMNI-DC on a mixture of high-quality datasets with a scale normalization technique and synthetic depth patterns. Extensive experiments on 7 datasets show consistent improvements over baselines, reducing errors by as much as 43%. Codes and checkpoints are available at https://github.com/princeton-vl/OMNI-DC.
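The abstract does not spell out the Laplacian loss; one plausible reading is a Laplacian negative log-likelihood with a predicted per-pixel scale, as sketched below. The exact formulation used by OMNI-DC may differ.

```python
# One plausible form of a Laplacian loss for modeling depth ambiguity (assumed):
# a Laplacian negative log-likelihood with a predicted per-pixel scale b.
import torch


def laplacian_nll(pred_depth, pred_log_b, gt_depth, valid_mask):
    """pred_depth, gt_depth: (B, 1, H, W); pred_log_b: predicted log-scale;
    valid_mask: bool mask of pixels that have ground-truth depth."""
    b = pred_log_b.exp()
    nll = (pred_depth - gt_depth).abs() / b + pred_log_b + torch.log(torch.tensor(2.0))
    return nll[valid_mask].mean()


if __name__ == "__main__":
    B, H, W = 2, 32, 32
    pred = torch.rand(B, 1, H, W, requires_grad=True)
    logb = torch.zeros(B, 1, H, W, requires_grad=True)
    gt = torch.rand(B, 1, H, W)
    mask = torch.rand(B, 1, H, W) > 0.5
    loss = laplacian_nll(pred, logb, gt, mask)
    loss.backward()
    print(loss.item())
```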
PanSt3R: Multi-view Consistent Panoptic Segmentation
Lojze Žust
Naver Labs Europe
Yohann Cabon
Naver Labs Europe
Juliette Marrie
Naver Labs Europe
Leonid Antsfeld
Naver Labs Europe
Boris Chidlovskii
Naver Labs Europe
Jérôme Revaud
Naver Labs Europe
Gabriela Csurka
Naver Labs Europe
Abstract
Panoptic segmentation in 3D is a fundamental problem in scene understanding. Existing approaches typically rely on costly test-time optimizations (often based on NeRF) to consolidate 2D predictions of off-the-shelf panoptic segmentation methods into 3D. Instead, in this work, we propose a unified and integrated approach, PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view-consistent panoptic segmentation in a single forward pass. Our approach harnesses the 3D representations of MUSt3R, a recent scalable multi-view version of DUSt3R, and the 2D representations of DINOv2, then performs joint multi-view panoptic prediction via a mask transformer architecture. We additionally revisit the standard post-processing mask-merging procedure and introduce a more principled approach to multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple yet fast and scalable, and achieves state-of-the-art performance on several benchmarks while being orders of magnitude faster. More information and examples are available on our project page.
