Role overview
We are actively seeking a full-time Research Scientist Intern to drive the modeling and algorithmic development of XPENG’s next-generation Vision-Language-Action (VLA) Foundation Model, the core brain that powers our end-to-end autonomous driving systems.
You will work closely with world-class researchers, perception and planning engineers, and infrastructure experts to design, train, and deploy large-scale multi-modal models that unify vision, language, and control. Your work will directly shape the intelligence that enables XPENG’s future L3/L4 autonomous driving products.
What you'll work on
- Research, design, and implement large-scale multi-modal architectures (e.g., vision–language–action transformers) for end-to-end autonomous driving.
- Design and integrate cross-modal alignment techniques (e.g., visual grounding, temporal reasoning, policy distillation, imitation and reinforcement learning) to improve model interpretability and action quality.
- Collaborate closely with researchers and engineers across the modeling and infrastructure teams.
- Contribute publications to top-tier AI/CV/ML conferences and present research findings.
What we're looking for
- Publication record in top-tier AI conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, etc.).
- Prior experience building foundation models, end-to-end driving models, or LLM/VLM architectures (e.g., ViT, Flamingo, BEVFormer, RT-2, or GRPO-style policies).
- Knowledge of RLHF/DPO/GRPO, trajectory prediction, or policy learning for control tasks.
- Familiarity with distributed training (DDP, FSDP) and large-batch optimization.