About Me

Hello! I am a research scientist at NVIDIA, working on physical AI and autonomous vehicles with Marco Pavone at Stanford University. I obtained my PhD from New York University (NYU) in 2025, working on robot perception with Chen Feng. I am also fortunate to collaborate with Saining Xie on visual-spatial intelligence. During my PhD, I completed three internships at NVIDIA, working on autonomous perception and neural simulation with Anima Anandkumar (Caltech), Sanja Fidler (UofT), Jose M. Alvarez, Zhiding Yu, Chaowei Xiao (JHU), Zan Gojcic, and Yue Wang (USC). I also spent time at Tsinghua IIIS with Hang Zhao and at Shanghai Jiao Tong University (SJTU) with Siheng Chen.

My research has been cited nearly 4,000 times (as of Nov 2025), and I am honored to be a recipient of the NVIDIA Fellowship (2024-2025), the NYU Dean's PhD Fellowship, and the NYU Outstanding Dissertation Award (Finalist).

I will join Tsinghua University as an Assistant Professor in the College of AI (led by Prof. Andrew Chi-Chih Yao) in 2026.

🎓 Recruiting: I am looking for postdocs, PhD students, undergraduates, interns, and visiting scholars to join my lab. Please feel free to reach out!
⏳ Deadline: International PhD applications for Fall 2026 are due by January 15, 2026. Urgently seeking qualified candidates!

Research Lab - Spatial Intelligence

Our mission is to scale AI across space, time, and embodiments to address real-world challenges. Towards this end, we are pushing the frontiers of spatial intelligence through the convergence of vision, learning, and robotics. Our research agenda centers on three thrusts:

🧠 Research Thrust 1 - Spatial Cognition: How can embodied agents perceive, represent, and reason about space and time like humans do?

Sensing and Perception, Spatial Representation, Spatial Reasoning, Spatial Memory, Spatial World Model, Cognitive Mapping, Mental Manipulation

🌏 Research Thrust 2 - Spatial Computing: How can we create realistic digital twins from multimodal sensory streams without human supervision?

3D Reconstruction, Neural Radiance Fields, Gaussian Splatting, Physics-based Simulation, Generative Modeling, Real2Sim2Real, Edge-Computing

🦾 Research Thrust 3 - Spatial Robotics: How can we ground cognitive intelligence in real-world robots across different morphologies?

Autonomous Navigation, Humanoid Robotics, Multi-Robot Systems, Field Robotics, Bio-Inspired Robotics, Robotic Design and Its Automation

💡 We believe these three thrusts are deeply synergistic: cognitive models trained in simulation are grounded in physical robots, whose real-world experiences in turn improve both the cognitive models and the digital worlds. Our long-term vision is to build a self-sustaining, self-evolving spatial AI ecosystem in which machines autonomously perceive, reason about, and transform the physical world, enabling AI Designs AI and Robots Build Robots through the seamless convergence of embodied cognition, digital universes, and physical embodiments.

💪 Towards this long-term vision, we are currently pushing the following research directions:

  • Prototyping spatially-grounded foundation models
  • Exploring human-like efficient spatial representations
  • Building open-world embodied AI simulators
  • Grounding visual-spatial intelligence in humanoid robots
  • Constructing large-scale datasets and benchmarks to broaden the practical applications of spatial AI
  • Designing specialized physical embodiments to diversify problem domains of spatial AI

📣 To advance our mission, we are building an interdisciplinary team and welcome researchers with diverse expertise, including but not limited to: (1) AI Computing (e.g., MLLMs, UMMs, VFMs, world model, generative model), (2) 3D Vision (e.g., 3DGS, NeRF, SLAM), and (3) Robotics (e.g., sim2real, humanoid robotics, field robotics, robotic design). Please send me an email, and I will get back to you if there is a good fit!

Selected Publications

(* indicates equal contribution/advising)

For a full publication list, please refer to my Google Scholar page.

🔥 Thinking in 360°: Humanoid Visual Search in the Wild
Technical Report, arXiv 2025
Heyang Yu*, Yinan Han*, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng*, Saining Xie*, Yiming Li*
🔥 Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Technical Report, arXiv 2025
Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng
🔥 Flex: Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
Technical Report, arXiv 2025
Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang
🔥 CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting
ASPLOS 2026
Hexu Zhao, Xiwen Min, Xiaoteng Liu, Moonjun Gong, Yiming Li, Ang Li, Saining Xie, Jinyang Li, Aurojit Panda
Adversarial Exploitation of Data Diversity Improves Visual Localization
ICCV 2025
Sihang Li*, Siqi Tan*, Bowen Chang, Jing Zhang, Chen Feng*, Yiming Li*
Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
ICCV 2025
Junhao Ge, Zuhong Liu, Longteng Fan, Yifan Jiang, Jiaqi Su, Yiming Li, Zhejun Zhang, Siheng Chen
Extrapolated Urban View Synthesis Benchmark
ICCV 2025
Xiangyu Han*, Zhen Jia*, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng, Yiming Li
Memorize What Matters: Emergent Scene Decomposition from Multitraverse
NeurIPS 2024
Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M Alvarez
Spotlight Presentation
RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar
NeurIPS 2024
Fangqiang Ding*, Xiangyu Wen*, Yunzhou Zhu, Yiming Li, Chris Xiaoxuan Lu
Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
CVPR 2024
Yiming Li, Zhiheng Li, Nuo Chen, Moonjun Gong, Zonglin Lyu, Zehong Wang, Peili Jiang, Chen Feng
Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction
ICRA 2024
Irving Fang*, Yuzhong Chen*, Yifan Wang*, Jianghan Zhang, Qiushi Zhang, Jiali Xu, Xibo He, Weibo Gao, Hao Su, Yiming Li, Chen Feng
SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving
IROS 2024
Yiming Li*, Sihang Li*, Xinhao Liu*, Moonjun Gong*, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, Yue Wang, Hang Zhao, Zhiding Yu, Chen Feng
Among Us: Adversarially Robust Collaborative Perception by Consensus
ICCV 2023
Yiming Li*, Qi Fang*, Jiamu Bai, Siheng Chen, Felix Juefei-Xu, Chen Feng
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion
CVPR 2023
Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, Anima Anandkumar
Highlight Presentation
DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization
CVPR 2023
Chao Chen*, Xinhao Liu*, Yiming Li, Li Ding, Chen Feng
Metric-Free Exploration for Topological Mapping by Task and Motion Imitation in Feature Space
RSS 2023
Yuhang He*, Irving Fang*, Yiming Li, Rushi Bhavesh Shah, Chen Feng
Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception
CoRL 2022
Yiming Li*, Juexiao Zhang*, Dekun Ma, Yue Wang, Chen Feng
Egocentric Prediction of Action Target in 3D
CVPR 2022
Yiming Li*, Ziang Cao*, Andrew Liang, Benjamin Liang, Luoyao Chen, Hang Zhao, Chen Feng
Learning Distilled Collaboration Graph for Multi-Agent Perception
NeurIPS 2021
Yiming Li, Shunli Ren, Pengxiang Wu, Siheng Chen, Chen Feng, Wenjun Zhang
Fooling LiDAR Perception via Adversarial Trajectory Perturbation
ICCV 2021
Yiming Li*, Congcong Wen*, Felix Juefei-Xu, Chen Feng
Oral Presentation