YIMING LI

Haleakala, Maui, 2021

I am a final-year PhD student at the NYU AI4CE Lab, led by Chen Feng. I am also an NVIDIA Graduate Fellow in the Autonomous Vehicle Research Group, working closely with Marco Pavone, Jose M. Alvarez, and Sanja Fidler. Before that, I was an intern at NVIDIA AI Research advised by Anima Anandkumar in 2022, and a research assistant at Shanghai Jiao Tong University (SJTU) advised by Siheng Chen in 2021.

I will graduate in Fall 2024 and am actively seeking a postdoctoral or industrial position, starting in Spring 2025.

My research vision is to enable collaborative autonomous intelligence by empowering robots with human-level spatial, social, and self-awareness, allowing them to actively perceive and plan in unstructured environments, interact effectively with humans or other robots, and leverage as well as augment the associated memory. To this end, I draw from vision, learning, robotics, graphics, language, sensing, data science, and cognitive science. My work includes developing robust, efficient, and scalable computational models for 3D scene parsing and decision-making from high-dimensional sensory input, as well as curating large-scale datasets to effectively train and verify these models for self-driving and robotics.

I am looking for UG/MS students to work on cutting-edge research projects with me and my collaborators at NYU/NVIDIA/USC/Stanford/Tsinghua. Please send me an email if you are interested!

Neural Representations for Dynamic Scenes (NeRF/3DGS)
Vision-Language Models for Spatial Robotics
Embodied and Cognitive AI for Robotics
Generative Models for Robotic Perception and Planning
Dataset Curation and Autolabeling for Spatial Robotics

news

Jun 1, 2024	We are organizing the 2nd Vision-Centric Autonomous Driving (VCAD) Workshop at ECCV 2024. We invite you to attend our workshop and submit your papers!
May 20, 2024	I served as an Associate Editor for IROS 2024.
Dec 10, 2023	I received the NVIDIA Graduate Fellowship (2024-2025) (<2.0% acceptance rate). Thank you, NV!
Aug 25, 2023	NVIDIA featured VoxFormer together with FB-OCC! Here is the YouTube video: Taking Autonomous Vehicle Occupancy Prediction into the Third Dimension - NVIDIA DRIVE Labs Ep. 30.
Jul 14, 2023	Among Us and PVT++ are accepted by ICCV 2023. See you in Paris!
Jun 19, 2023	I am hosting the Vision-Centric Autonomous Driving (VCAD) Workshop at CVPR 2023 in Vancouver, together with Vitor Guizilini, Yue Wang, and Hang Zhao!
Jun 18, 2023	I gave an invited talk about VoxFormer at C3DV: 1st Workshop on Compositional 3D Vision@CVPR2023.
Jun 2, 2023	Our NYU team is organizing the Collaborative Perception and Learning (CoPerception) Workshop at ICRA 2023 in London, together with the UCLA Mobility Lab and the SJTU MediaBrain Group.
Apr 23, 2023	DeepExplorer is accepted at RSS 2023. See you in Daegu!
Mar 21, 2023	VoxFormer was selected as a highlight at CVPR 2023. Specifically, CVPR 2023 received 9155 submissions, accepted 2360 papers, and selected 235 highlights (10% of accepted papers, 2.5% of submissions).
Jun 20, 2022	I gave an invited talk about egocentric 3D target prediction at EPIC Workshop@CVPR2022.
Jun 5, 2022	I gave an invited talk about collaborative and adversarial 3D perception at 3D-DLAD Workshop@IV2022.
Jul 23, 2021	FLAT is accepted at ICCV 2021 as an oral presentation. ICCV 2021 received a record number of 6236 submissions and accepted 1617 papers. ACs recommended the selection of 210 oral papers. These are 3% of all submissions and 13% of all papers.

selected publications

2024

Preprint

Memorize What Matters: Emergent Scene Decomposition from Multitraverse

Yiming Li, Zehong Wang, Yue Wang, and 5 more authors

arXiv preprint arXiv:2405.17187, 2024

Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation. Our key observation is that the environment remains consistent across traversals, while objects frequently change. This allows us to exploit self-supervision from repeated traversals to achieve environment-object decomposition. More specifically, 3DGM formulates multitraverse environmental mapping as a robust differentiable rendering problem, treating pixels of the environment and objects as inliers and outliers, respectively. Using robust feature distillation, feature residuals mining, and robust optimization, 3DGM jointly performs 3D mapping and 2D segmentation without human intervention. We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering. Extensive results verify the effectiveness and potential of our method for self-driving and robotics.
Preprint

SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving

Yiming Li, Sihang Li, Xinhao Liu, and 8 more authors

arXiv preprint arXiv:2306.09001, 2024

Semantic scene completion (SSC) is crucial for holistic 3D scene understanding by jointly estimating semantics and geometry from sparse observations. However, progress in SSC, particularly in autonomous driving scenarios, is hindered by the scarcity of high-quality datasets. To overcome this challenge, we introduce SSCBench, a comprehensive benchmark that integrates scenes from widely-used automotive datasets (e.g., KITTI-360, nuScenes, and Waymo). SSCBench follows an established setup and format in the community, facilitating the easy exploration of the camera- and LiDAR-based SSC across various real-world scenarios. We present quantitative and qualitative evaluations of state-of-the-art algorithms on SSCBench and commit to continuously incorporating novel automotive datasets and SSC algorithms to drive further advancements in this field.
CVPR

Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

Yiming Li, Zhiheng Li, Nuo Chen, and 5 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Large-scale datasets have fueled recent advancements in AI-based autonomous vehicle research. However, these datasets are usually collected from a single vehicle’s one-time pass of a certain location, lacking multiagent interactions or repeated traversals of the same place. Such information could lead to transformative enhancements in autonomous vehicles’ perception, prediction, and planning capabilities. To bridge this gap, in collaboration with the self-driving company May Mobility, we present the MARS dataset, which unifies scenarios that enable MultiAgent, multitraveRSal, and multimodal autonomous vehicle research. More specifically, MARS is collected with a fleet of autonomous vehicles driving within a certain geographical area. Each vehicle has its own route, and different vehicles may appear at nearby locations. Each vehicle is equipped with a LiDAR and surround-view RGB cameras. We curate two subsets in MARS: one facilitates collaborative driving with multiple vehicles simultaneously present at the same location, and the other enables memory retrospection through asynchronous traversals of the same location by multiple vehicles. We conduct experiments in place recognition and neural reconstruction. More importantly, MARS introduces new research opportunities and challenges such as multitraversal 3D reconstruction, multiagent perception, and unsupervised object discovery. Our data and code can be found at https://ai4ce.github.io/MARS/.

2023

ICCV

Among Us: Adversarially Robust Collaborative Perception by Consensus

Yiming Li, Qi Fang, Jiamu Bai, and 3 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Multiple robots can perceive a scene (e.g., detect objects) collaboratively better than individuals, but they easily suffer from adversarial attacks when using deep learning. This could be addressed by adversarial defense, but its training requires the often-unknown attacking mechanism. In contrast, we propose ROBOSAC, a novel sampling-based defense strategy generalizable to unseen attackers. Our key idea is that collaborative perception should lead to consensus rather than dissensus in results compared to individual perception. This leads to our hypothesize-and-verify framework: perception results with and without collaboration from a random subset of teammates are compared until reaching a consensus. In such a framework, more teammates in the sampled subset often entail better perception performance but require longer sampling time to reject potential attackers. Thus, we derive how many sampling trials are needed to ensure the desired size of an attacker-free subset, or equivalently, the maximum size of such a subset that we can successfully sample within a given number of trials. We validate our method on the task of collaborative 3D object detection in autonomous driving scenarios.
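The trade-off between subset size and number of sampling trials described in the abstract can be illustrated with a RANSAC-style calculation. This is only a sketch under assumptions that the abstract does not state: subsets are drawn uniformly at random and independently across trials, the number of attackers is known, and the function names and the default 99% success probability are hypothetical. The paper's own derivation may differ.

```python
import math


def attacker_free_prob(num_teammates, num_attackers, subset_size):
    """Probability that a uniformly random subset contains no attacker."""
    benign = num_teammates - num_attackers
    if subset_size > benign:
        return 0.0
    return math.comb(benign, subset_size) / math.comb(num_teammates, subset_size)


def trials_needed(num_teammates, num_attackers, subset_size, success_prob=0.99):
    """Independent trials needed so that at least one sampled subset is
    attacker-free with probability >= success_prob (assumed sampling model)."""
    q = attacker_free_prob(num_teammates, num_attackers, subset_size)
    if q == 0.0:
        return math.inf  # no attacker-free subset of this size exists
    if q == 1.0:
        return 1  # every subset is attacker-free
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - q))


def max_subset_size(num_teammates, num_attackers, max_trials, success_prob=0.99):
    """Largest subset size whose required trial count fits the trial budget."""
    best = 0
    for size in range(1, num_teammates - num_attackers + 1):
        if trials_needed(num_teammates, num_attackers, size, success_prob) <= max_trials:
            best = size
    return best
```

Under these assumptions, with 5 teammates of which 1 is an attacker, sampling subsets of size 2 needs 6 trials for 99% confidence, while subsets of size 4 need 21, which shows why larger subsets cost more sampling time.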
RSS

Metric-Free Exploration for Topological Mapping by Task and Motion Imitation in Feature Space

Yuhang He, Irving Fang, Yiming Li, and 2 more authors

In Proceedings of Robotics: Science and Systems, 2023

We propose DeepExplorer, a simple and lightweight metric-free exploration method for topological mapping of unknown environments. It performs task and motion planning (TAMP) entirely in image feature space. The task planner is a recurrent network using the latest image observation sequence to hallucinate a feature as the next-best exploration goal. The motion planner then utilizes the current and the hallucinated features to generate an action taking the agent towards that goal. The two planners are jointly trained via deeply-supervised imitation learning from expert demonstrations. During exploration, we iteratively call the two planners to predict the next action, and the topological map is built by constantly appending the latest image observation and action to the map and using visual place recognition (VPR) for loop closing. The resulting topological map efficiently represents an environment’s connectivity and traversability, so it can be used for tasks such as visual navigation. We show DeepExplorer’s exploration efficiency and strong sim2sim generalization capability on large-scale simulation datasets like Gibson and MP3D. Its effectiveness is further validated via the image-goal navigation performance on the resulting topological map. We further show its strong zero-shot sim2real generalization capability in real-world experiments.
CVPR

VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

Yiming Li, Zhiding Yu, Christopher Choy, and 5 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (highlight, top 2.5%), 2023

Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer.
CVPR

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

Chao Chen, Xinhao Liu, Yiming Li, and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

LiDAR mapping is important yet challenging in self-driving and mobile robotics. To tackle such a global point cloud registration problem, DeepMapping [1] converts the complex map estimation into a self-supervised training of simple deep networks. Despite its broad convergence range on small datasets, DeepMapping still cannot produce satisfactory results on large-scale datasets with thousands of frames. This is due to the lack of loop closures and exact cross-frame point correspondences, and the slow convergence of its global localization network. We propose DeepMapping2 by adding two novel techniques to address these issues: (1) organization of training batch based on map topology from loop closing, and (2) self-supervised local-to-global point consistency loss leveraging pairwise registration. Our experiments and ablation studies on public datasets such as KITTI, NCLT, and Nebula demonstrate the effectiveness of our method.

2022

CVPR

Egocentric Prediction of Action Target in 3D

Yiming Li, Ziang Cao, Andrew Liang, and 4 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2022

We are interested in anticipating as early as possible the target location of a person’s object manipulation action in a 3D workspace from egocentric vision. It is important in fields like human-robot collaboration, but has not yet received enough attention from vision and learning communities. To stimulate more research on this challenging egocentric vision task, we propose a large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, and provide evaluation metrics based on our high-quality 2D and 3D labels from semi-automatic annotation. Meanwhile, we design baseline methods using recurrent neural networks (RNNs) and conduct various ablation studies to validate their effectiveness. Our results demonstrate that this new task is worthy of further study by researchers in robotics, vision, and learning communities.

2021

NeurIPS

Learning Distilled Collaboration Graph for Multi-Agent Perception

Yiming Li, Shunli Ren, Pengxiang Wu, and 3 more authors

In Advances in Neural Information Processing Systems, Jun 2021

To promote a better performance-bandwidth trade-off for multi-agent perception, we propose a novel distilled collaboration graph (DiscoGraph) to model trainable, pose-aware, and adaptive collaboration among agents. Our key novelties lie in two aspects. First, we propose a teacher-student framework to train DiscoGraph via knowledge distillation. The teacher model employs an early collaboration with holistic-view inputs; the student model is based on intermediate collaboration with single-view inputs. Our framework trains DiscoGraph by constraining post-collaboration feature maps in the student model to match the correspondences in the teacher model. Second, we propose a matrix-valued edge weight in DiscoGraph. In such a matrix, each element reflects the inter-agent attention at a specific spatial region, allowing an agent to adaptively highlight the informative regions. During inference, we only need to use the student model, named the distilled collaboration network (DiscoNet). Owing to the teacher-student framework, multiple agents sharing the same DiscoNet can collaboratively approach the performance of a hypothetical teacher model with a holistic view. Our approach is validated on V2X-Sim 1.0, a large-scale multi-agent perception dataset that we synthesized using CARLA and SUMO co-simulation. Our quantitative and qualitative experiments in multi-agent 3D object detection show that DiscoNet not only achieves a better performance-bandwidth trade-off than the state-of-the-art collaborative perception methods, but also offers a more straightforward design rationale.
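The idea of a matrix-valued edge weight, where each spatial cell carries its own inter-agent attention, can be sketched as a per-cell weighted fusion of feature maps. This is a toy illustration rather than the DiscoNet implementation: the softmax normalization across agents, the single-channel features, and the function name are all assumptions made here for clarity.

```python
import math


def fuse_with_matrix_weights(feature_maps, score_maps):
    """Fuse per-agent feature maps using per-cell attention weights.

    feature_maps[a][r][c]: scalar feature of agent a at grid cell (r, c).
    score_maps[a][r][c]:   unnormalized attention score for the same cell.
    Scores are softmax-normalized across agents independently at each cell
    (an assumed normalization), so every cell has its own weighting.
    """
    num_agents = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            scores = [score_maps[a][r][c] for a in range(num_agents)]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            fused[r][c] = sum(
                (e / total) * feature_maps[a][r][c] for a, e in enumerate(exps)
            )
    return fused
```

With equal scores at a cell the agents are averaged, while a much larger score for one agent makes the fused value follow that agent at that cell only, which is the "adaptively highlight the informative regions" behavior in miniature.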
ICCV

Fooling LiDAR Perception via Adversarial Trajectory Perturbation

Yiming Li, Congcong Wen, Felix Juefei-Xu, and 1 more author

In Proceedings of the IEEE/CVF International Conference on Computer Vision (oral, top 3.0%), Jun 2021

LiDAR point clouds collected from a moving vehicle are functions of its trajectories, because the sensor motion needs to be compensated to avoid distortions. When autonomous vehicles are sending LiDAR point clouds to deep networks for perception and planning, could the motion compensation consequently become a wide-open backdoor in those networks, due to both the adversarial vulnerability of deep learning and GPS-based vehicle trajectory estimation that is susceptible to wireless spoofing? We demonstrate such possibilities for the first time: instead of directly attacking point cloud coordinates which requires tampering with the raw LiDAR readings, only adversarial spoofing of a self-driving car’s trajectory with small perturbations is enough to make safety-critical objects undetectable or detected with incorrect positions. Moreover, polynomial trajectory perturbation is developed to achieve a temporally-smooth and highly-imperceptible attack. Extensive experiments on 3D object detection have shown that such attacks not only lower the performance of the state-of-the-art detectors effectively, but also transfer to other detectors, raising a red flag for the community.
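To make the notion of a temporally smooth polynomial trajectory perturbation concrete, the sketch below evaluates a low-order polynomial offset over normalized time and adds it to 2D waypoints. It only illustrates the parameterization: the coefficients are hand-picked rather than optimized against a detector, and the function names, the 2D setting, and the time normalization are assumptions not taken from the paper.

```python
def polynomial_offsets(timestamps, coeffs_x, coeffs_y):
    """Evaluate polynomial offsets (dx, dy) at each timestamp.

    Time is normalized to [0, 1] over the trajectory; coeffs[k] multiplies
    tau**k. Low-order polynomials vary smoothly from one pose to the next.
    """
    t0 = timestamps[0]
    span = (timestamps[-1] - t0) or 1.0  # avoid division by zero
    offsets = []
    for t in timestamps:
        tau = (t - t0) / span
        dx = sum(c * tau**k for k, c in enumerate(coeffs_x))
        dy = sum(c * tau**k for k, c in enumerate(coeffs_y))
        offsets.append((dx, dy))
    return offsets


def perturb_trajectory(trajectory, offsets):
    """Add the per-timestamp offsets to the (x, y) waypoints."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(trajectory, offsets)]
```

Because the offset is a smooth function of time with only a few coefficients, consecutive poses shift by similar amounts, in contrast to perturbing each pose independently.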