About Me#

I am a senior student in the School of Artificial Intelligence / Qian Xuesen Honors College at Xi’an Jiaotong University. I am an incoming Ph.D. student in the joint program between Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory, starting September 2026 under the supervision of Research Scientist Jiangmiao Pang.

My research philosophy is simple: I build things because it’s fun.

Currently, I co-lead StarVLA, an open-source Lego-like VLA framework adopted across the community, and lead the evaluation platform of InternVerse. I am a core contributor to InternVLA-M1 and to the InternData-M1 / InternData-A1 data initiatives, and the first author of GenManip (CVPR 2025). The pursuit of scaling laws and the deep connection between data and models keep pushing me further into VLA, simulation infrastructure, and generalist robot policies. You can find more of my insights on Embodied AI on my blog.

Here is my academic CV; feel free to download it.

Download CV

Research Interests#

Robotic Manipulation

Grasping, Dexterous Control, Object Interaction

Vision-Language-Action (VLA) Models

Multi-modal Learning, Embodied AI

Simulation Platforms

Virtual Environments, Physics Simulation, Training Platforms

Selected Publications#

2026

StarVLA-α: Reducing Complexity in Vision-Language-Action Systems

Preprint
Jinhui Ye*, Ning Gao*, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, Jiaya Jia
*Equal contribution
arXiv Preprint 2026

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents, yet the VLA landscape remains highly fragmented in architectures, training data, embodiment configurations, and benchmark-specific engineering. We introduce StarVLA-α, a simple yet strong baseline that deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis of design choices including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive — our single generalist model outperforms π₀.₅ by 20% on the public real-world RoboChallenge benchmark, suggesting a strong VLM backbone with minimal design is already sufficient without additional architectural complexity.

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Preprint
StarVLA Community §
§Core contributor
Technical Report 2026

StarVLA is an open-source codebase that addresses the fragmentation of VLA research across incompatible architectures, codebases, and evaluation protocols. It provides a modular backbone–action-head architecture supporting both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms (FAST, OFT, flow-matching π, GR00T-style dual-system); reusable training strategies including cross-embodiment learning and multimodal co-training; and a unified evaluation interface across LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K supporting both simulation and real-robot deployment. As a core contributor, I help maintain training infrastructure, benchmark integration, and the community release pipeline.
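To make the backbone–action-head decoupling concrete, here is a minimal PyTorch sketch of how such a modular policy can be composed. All class and function names (VisionLanguageBackbone, MLPActionHead, VLAPolicy) are illustrative stand-ins, not StarVLA's actual API.

```python
# Minimal sketch of a backbone / action-head split, assuming PyTorch.
# All names here are hypothetical illustrations, not StarVLA's real modules.
import torch
import torch.nn as nn


class VisionLanguageBackbone(nn.Module):
    """Stand-in for a VLM backbone mapping (image, text) to token features."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.vision_proj = nn.Linear(3 * 224 * 224, hidden_dim)
        self.text_proj = nn.Embedding(1000, hidden_dim)

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis = self.vision_proj(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt = self.text_proj(text_ids)                          # (B, T, D)
        return torch.cat([vis, txt], dim=1)                     # (B, 1+T, D)


class MLPActionHead(nn.Module):
    """Stand-in for an interchangeable action decoder (regression, FAST, flow, ...)."""

    def __init__(self, hidden_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                                 nn.Linear(hidden_dim, action_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.net(tokens.mean(dim=1))                     # pool, then decode actions


class VLAPolicy(nn.Module):
    """Composes any backbone with any action head behind one interface."""

    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone, self.head = backbone, head

    def forward(self, image, text_ids):
        return self.head(self.backbone(image, text_ids))


if __name__ == "__main__":
    policy = VLAPolicy(VisionLanguageBackbone(), MLPActionHead())
    action = policy(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 8)))
    print(action.shape)  # torch.Size([2, 7])
```

The point of the sketch is the last class: because backbone and head only meet through tensors, either side can be swapped without touching the other, which is the "Lego-like" property described above.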

ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Conference Accepted
Jinhui Ye*, Fangjing Wang*, Ning Gao*, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang
*Equal contribution · Corresponding author
International Conference on Learning Representations (ICLR) 2026

ST4VLA establishes spatially guided training as a unifying principle for scalable, instruction-following generalist robots. Through a two-stage pipeline — large-scale spatial grounding pre-training on 2.3M spatial reasoning samples to determine "where to act", followed by spatially guided action post-training that produces embodiment-aware actions via plug-and-play spatial prompting — the recipe yields consistent gains across SimplerEnv Google Robot (+14.6%), WidowX (+17%), and LIBERO Franka (+4.3%) while strengthening box, point, and trace prediction. With 244K simulation pick-and-place episodes for co-training, it achieves +20.6% on unseen objects and surpasses prior work by over 10% on long-horizon reasoning-intensive tasks.

SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking

Conference Accepted
Weiguang Zhao, Haoran Xu, Xingyu Miao, Qin Zhao, Rui Zhang, Kaizhu Huang, Ning Gao, Peizhou Cao, Mingze Sun, Mulin Yu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang
Project Leader · Corresponding author
ACM SIGGRAPH 2026

SynthVerse is a large-scale, diverse synthetic dataset specifically designed for point tracking. It introduces several new domains and object types missing from prior synthetic datasets — animated-film-style content, embodied manipulation, scene navigation, and articulated objects — substantially expanding diversity while providing high-quality dynamic motions and interactions. We further establish a highly diverse point tracking benchmark for evaluating state-of-the-art methods under broader domain shifts. Extensive experiments show that training with SynthVerse yields consistent improvements in generalization and exposes limitations of existing trackers under diverse settings.

Nimbus: A Unified Embodied Synthetic Data Generation Framework

Preprint
Zeyu He, Yuchang Zhang, Yuanzhen Zhou, Miao Tao, Hengjie Li, Hui Wang, Yang Tian, Jia Zeng, Tai Wang, Wenzhe Cai, Yilun Chen, Ning Gao, Jiangmiao Pang
Corresponding author
Technical Report 2026

Nimbus is a unified synthetic data generation framework integrating heterogeneous navigation and manipulation pipelines for embodied intelligence. It introduces a modular four-layer architecture with a decoupled execution model that separates trajectory planning, rendering, and storage into asynchronous stages. Through dynamic pipeline scheduling, global load balancing, distributed fault tolerance, and backend-specific rendering optimizations, Nimbus achieves a 2–3× improvement in end-to-end throughput over unoptimized baselines and serves as the production backbone for the InternData suite.
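As a rough illustration of the decoupled execution model described above, the sketch below wires three asynchronous stages (plan, render, store) through bounded queues so that a slow stage applies backpressure instead of stalling the whole pipeline. It assumes Python's asyncio and is not Nimbus's actual scheduler; stage names, timings, and queue sizes are illustrative.

```python
# Minimal sketch of a decoupled, asynchronous three-stage data pipeline
# (plan -> render -> store). Not Nimbus's real scheduler; purely illustrative.
import asyncio


async def plan_stage(out_q: asyncio.Queue, n_episodes: int) -> None:
    """Produce trajectory plans without waiting for rendering to finish."""
    for i in range(n_episodes):
        await asyncio.sleep(0.01)                  # stand-in for motion planning
        await out_q.put({"episode": i, "plan": f"trajectory-{i}"})
    await out_q.put(None)                          # signal end of stream


async def render_stage(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Consume plans and render observations; runs concurrently with planning."""
    while (item := await in_q.get()) is not None:
        await asyncio.sleep(0.02)                  # stand-in for GPU rendering
        item["frames"] = f"rendered-{item['episode']}"
        await out_q.put(item)
    await out_q.put(None)


async def store_stage(in_q: asyncio.Queue) -> None:
    """Persist finished episodes as they arrive."""
    while (item := await in_q.get()) is not None:
        print("stored episode", item["episode"])


async def main() -> None:
    # Bounded queues give backpressure: a slow renderer eventually pauses planning.
    plan_q, render_q = asyncio.Queue(maxsize=8), asyncio.Queue(maxsize=8)
    await asyncio.gather(
        plan_stage(plan_q, n_episodes=5),
        render_stage(plan_q, render_q),
        store_stage(render_q),
    )


if __name__ == "__main__":
    asyncio.run(main())
```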

InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

Conference Accepted
Yang Tian*, Yuyin Yang*, Yiman Xie*, Zetao Cai*, Xu Shi*, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang
*Equal contribution · Corresponding author
Conference on Computer Vision and Pattern Recognition (CVPR) 2026

InternData-A1 is a large-scale synthetic robotic dataset (630k trajectories, 7,433 hours across 4 embodiments and 70 tasks) generated through a fully autonomous and compositional simulation pipeline. Using the same architecture as π₀, we show—for the first time—that a VLA model trained entirely on synthetic data can match the strongest real-robot datasets, achieving comparable performance across 49 simulation tasks, 5 real-world tasks, and long-horizon dexterous manipulation. The model also demonstrates zero-shot sim-to-real transfer, highlighting the substantial value of scalable simulation for embodied AI. The dataset and generation pipeline are released to enable broader access to large-scale robotic data creation.

2025

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Preprint Published
Yilun Chen§, Ning Gao§, Jiangmiao Pang§, Bolun Wang§, Fangjing Wang§, Jinhui Ye§, Junqiu Yu§, Jinyu Zhang§, Yangkun Zhu§, Xinyi Chen, Weiyang Jin, Hao Li, Yu Qiao, Yang Tian, Bin Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jia Zeng, Jingjing Zhang, Shi Zhang, Bowen Zhou
§Core contributor · Authors are listed in alphabetical order
Technical Report 2025

InternVLA-M1 is a unified framework for spatial grounding and robot control that advances instruction-following robots toward general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine “where to act” by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide “how to act” by generating embodiment-aware actions through plug-and-play spatial prompting.
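A rough sketch of that two-stage recipe, assuming a PyTorch-style setup: stage one supervises a grounding head on "where to act", and stage two conditions the action head on the (detached) grounding output as a spatial prompt. Module and loss choices here are illustrative, not InternVLA-M1's actual implementation.

```python
# Rough sketch of a two-stage spatially guided recipe: (i) spatial grounding
# pre-training, then (ii) action post-training that consumes the grounding
# output as a "spatial prompt". Names and losses are hypothetical.
import torch
import torch.nn as nn

backbone = nn.Linear(512, 256)        # stand-in for a VLM backbone
grounding_head = nn.Linear(256, 2)    # predicts "where to act" (e.g., a 2D point)
action_head = nn.Linear(256 + 2, 7)   # decodes actions conditioned on that point


def stage1_grounding(features: torch.Tensor, target_points: torch.Tensor) -> torch.Tensor:
    """Stage 1: align instructions with embodiment-agnostic positions."""
    points = grounding_head(backbone(features))
    return nn.functional.mse_loss(points, target_points)


def stage2_action(features: torch.Tensor, target_actions: torch.Tensor) -> torch.Tensor:
    """Stage 2: generate embodiment-aware actions guided by the spatial prompt."""
    hidden = backbone(features)
    spatial_prompt = grounding_head(hidden).detach()   # plug-and-play guidance
    actions = action_head(torch.cat([hidden, spatial_prompt], dim=-1))
    return nn.functional.mse_loss(actions, target_actions)


if __name__ == "__main__":
    feats = torch.randn(4, 512)
    print(stage1_grounding(feats, torch.randn(4, 2)).item())
    print(stage2_action(feats, torch.randn(4, 7)).item())
```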

GenManip: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

Conference Published
Ning Gao*, Yilun Chen*, Shuai Yang*, Xinyi Chen*, Yang Tian, Hao Li, Haifeng Huang, Hanqing Wang, Tai Wang, Jiangmiao Pang
*Equal contribution · Project Leader · Corresponding author
Conference on Computer Vision and Pattern Recognition (CVPR) 2025

GenManip is an LLM-driven, Isaac Sim–based embodied manipulation benchmark with automatic demonstration and layout generation and closed-loop evaluation. Designed to systematically test generalizable instruction-following manipulation in the era of MLLMs, it has served as core infrastructure supporting subsequent VLA research at Shanghai AI Laboratory.
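The closed-loop evaluation pattern such benchmarks rely on is simple to state in code: the policy is queried on the latest observation, and its action is executed before the next observation is drawn, so errors compound as they would on a real robot. Below is a minimal sketch with a dummy environment; all names are hypothetical stand-ins for the simulator and the model under test, not GenManip's actual interface.

```python
# Minimal sketch of closed-loop policy evaluation. DummyEnv and the lambda
# policy are placeholders; a real benchmark would swap in the simulator and model.
import random


class DummyEnv:
    """Stand-in for a simulated manipulation task."""

    def reset(self, instruction: str) -> dict:
        self.steps, self.instruction = 0, instruction
        return {"image": None, "instruction": instruction}

    def step(self, action) -> tuple[dict, bool, bool]:
        self.steps += 1
        success = random.random() < 0.05            # stand-in for a task-success checker
        done = success or self.steps >= 200
        return {"image": None, "instruction": self.instruction}, success, done


def evaluate(policy, env: DummyEnv, instructions: list[str]) -> float:
    successes = 0
    for instr in instructions:
        obs, success, done = env.reset(instr), False, False
        while not done:
            action = policy(obs)                    # closed loop: act on the latest observation
            obs, success, done = env.step(action)
        successes += int(success)
    return successes / len(instructions)


if __name__ == "__main__":
    rate = evaluate(lambda obs: [0.0] * 7, DummyEnv(), ["pick the red block"] * 10)
    print(f"success rate: {rate:.2f}")
```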

2024

PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation

Conference Published
Ning Gao, Sanping Zhou, Le Wang, Nanning Zheng
Corresponding author
European Conference on Computer Vision (ECCV) 2024

Proposed a semi-supervised learning framework for medical image segmentation that uses progressive Mean Teacher models and temporal consistency to enhance model diversity and regularization. Achieved state-of-the-art results and demonstrated generalization across datasets.
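For context, the base Mean Teacher mechanism that this line of work builds on keeps a teacher model as an exponential moving average (EMA) of the student's weights. The sketch below shows only that standard update, not PMT's progressive or temporal-consistency specifics.

```python
# Minimal sketch of the standard Mean Teacher EMA update that semi-supervised
# frameworks like PMT build on. PMT's own progressive schedule is not shown.
import copy
import torch
import torch.nn as nn


def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.99) -> None:
    """Move each teacher parameter toward the corresponding student parameter."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


if __name__ == "__main__":
    student = nn.Linear(16, 2)
    teacher = copy.deepcopy(student)        # teacher starts as a copy of the student
    # ... one (fake) optimization step on the student ...
    for p in student.parameters():
        p.data.add_(torch.randn_like(p) * 0.01)
    ema_update(teacher, student)
```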

Open-source Projects#

StarVLA

VLA Framework · Core Maintainer
Active

A Lego-like codebase for Vision-Language-Action model development with 2.1k+ GitHub stars. Each component (model, data, trainer, configuration, evaluation) is designed for high cohesion and low coupling, supporting multiple action-decoding paradigms (FAST, OFT, π, GR00T) with unified evaluation across LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K.

InternVLA-M1

VLA Framework · Core Contributor
Active

A spatially guided Vision-Language-Action framework for generalist robot policy. Two-stage pipeline combining spatial grounding pre-training and spatially guided action post-training, with consistent gains across SimplerEnv, WidowX, and LIBERO Franka. 405+ GitHub stars.

InternData-M1

Robotics Dataset · Core Contributor
Active

A comprehensive embodied robotics dataset containing ~250,000 simulation demonstrations with rich per-frame annotations, including 2D/3D boxes, trajectories, grasp points, and semantic masks.
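For illustration, a per-frame record of that kind might be organized as below; the field names and shapes are assumptions made for this sketch, not the released InternData-M1 schema.

```python
# Illustrative per-frame annotation schema (2D/3D boxes, trajectories, grasp
# points, masks). Fields and conventions are assumed, not the actual format.
from dataclasses import dataclass, field


@dataclass
class FrameAnnotation:
    frame_index: int
    boxes_2d: list[list[float]] = field(default_factory=list)    # [x1, y1, x2, y2] per object
    boxes_3d: list[list[float]] = field(default_factory=list)    # [x, y, z, dx, dy, dz, yaw]
    trajectory: list[list[float]] = field(default_factory=list)  # end-effector waypoints
    grasp_points: list[list[float]] = field(default_factory=list)
    mask_path: str | None = None                                  # semantic mask stored on disk


if __name__ == "__main__":
    frame = FrameAnnotation(frame_index=0, boxes_2d=[[10.0, 20.0, 110.0, 140.0]])
    print(frame)
```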

InternData-A1

Robotics Dataset
Active

A hybrid synthetic-real manipulation dataset containing over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. Accepted at CVPR 2026.