Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

1Stanford University, 2NYU Shanghai, 3UNC Chapel Hill

Abstract

TL;DR: Generated Reality turns tracked human motion into autoregressively generated video, enabling interactive human-centric experiences with dexterous hand-object interactions.

Extended reality (XR) demands generative models that respond to users’ tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model conditioned on both the tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion model conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.


Toggle to visualize the 2D hand-pose conditioning signal; the 3D hand and head pose conditioning is not shown.



How it works

Pipeline

We track the user’s head and hand poses with a commercial VR headset. Hand motion is represented using the UmeTrack parameterization, which provides the wrist pose and 20 joint angles per hand. Our conditioning uses a hybrid 2D–3D scheme, combining a 2D rendering of the hand skeleton with the 3D hand-pose parameters. Features extracted from these modules are combined with the head pose features via token addition and fed into the DiT. The model then autoregressively generates new frames.
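The fusion step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the token count, embedding dimension, and the stand-in linear encoders (`encode_skeleton_image`, `encode_pose_params`) are all assumptions made for clarity. Only the overall structure is taken from the text: a 2D skeleton rendering and 3D pose parameters are encoded separately, then merged with head-pose features by token-wise addition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not state these.
D = 64          # token embedding dimension
N_TOKENS = 16   # number of conditioning tokens per frame

def encode_skeleton_image(image: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Encode a rendered 2D hand-skeleton image into conditioning tokens
    (a linear stand-in for a learned image encoder)."""
    patches = image.reshape(N_TOKENS, -1)   # split image into patch rows
    return patches @ w                      # project each patch to token dim

def encode_pose_params(params: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Project pose parameters to a single token and broadcast it across
    all token positions (a linear stand-in for a learned MLP encoder)."""
    token = params @ w                      # (D,)
    return np.tile(token, (N_TOKENS, 1))    # (N_TOKENS, D)

# Per-hand 3D parameters: 6-DoF wrist pose + 20 joint angles (UmeTrack).
hand_params = rng.standard_normal(2 * (6 + 20))    # both hands
head_pose   = rng.standard_normal(6)               # 6-DoF head pose
skel_image  = rng.standard_normal((N_TOKENS, 32))  # rasterized 2D skeleton

w_img  = rng.standard_normal((32, D)) * 0.02
w_hand = rng.standard_normal((hand_params.size, D)) * 0.02
w_head = rng.standard_normal((head_pose.size, D)) * 0.02

# Hybrid 2D-3D conditioning: token-wise addition of the three feature
# streams, yielding the conditioning tokens fed into the DiT.
cond_tokens = (encode_skeleton_image(skel_image, w_img)
               + encode_pose_params(hand_params, w_hand)
               + encode_pose_params(head_pose, w_head))

print(cond_tokens.shape)   # (16, 64)
```

Token addition (rather than concatenation) keeps the sequence length fixed regardless of how many control signals are fused, which is one plausible reason for this design choice.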



Hand Motion Conditioning

We conduct a comprehensive ablation of hand-pose conditioning strategies and propose a 2D–3D hybrid approach that captures hand motion more reliably. Our method achieves the highest hand-pose accuracy among all evaluated baselines.


Joint Hand–Camera Conditioning

We further extend this framework to jointly condition on both hand and camera poses. Our joint conditioning model disambiguates hand and head motion, enabling accurate object interactions.




The Generated Reality System



We develop our Generated Reality system by distilling the bidirectional model into an autoregressive variant that runs at 11 FPS on a VR headset. Using live-tracked hand and head poses as controls, the system streams generated video directly to the headset.
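The interaction loop described above can be sketched as a simple real-time pipeline. Everything below is a hypothetical stand-in: `read_tracked_poses` and `generate_next_frame` are placeholder functions (the real system runs a distilled causal video model on live headset tracking), and the context length is assumed. Only the structure is from the text: poses are tracked live, each frame is generated causally from previous frames, and output is paced to the reported 11 FPS.

```python
import time
from collections import deque

def read_tracked_poses():
    """Stand-in for headset tracking: returns (head_pose, hand_poses)."""
    return (0.0,) * 6, (0.0,) * 52

def generate_next_frame(context, head_pose, hand_poses):
    """Stand-in for one step of the distilled causal video model."""
    return f"frame_{len(context)}"

CONTEXT_LEN = 8                   # causal context window (assumed)
TARGET_FPS = 11                   # throughput reported for the system
frame_interval = 1.0 / TARGET_FPS

context = deque(maxlen=CONTEXT_LEN)   # rolling autoregressive context
frames_out = []
for _ in range(5):                # render a few frames for illustration
    t0 = time.perf_counter()
    head, hands = read_tracked_poses()            # live control signals
    frame = generate_next_frame(context, head, hands)
    context.append(frame)                         # autoregressive feedback
    frames_out.append(frame)                      # stream to the headset
    # Sleep off any remaining budget to hold the target frame rate.
    time.sleep(max(0.0, frame_interval - (time.perf_counter() - t0)))

print(frames_out)
```

The key property this loop illustrates is causality: unlike the bidirectional teacher, the distilled model conditions only on past frames, which is what makes closed-loop interaction with live tracking possible.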


User Study Results

In our user studies, participants achieved higher task success rates (left) and reported higher levels of perceived control (right) compared with the baseline.

BibTeX

@article{xie2026generatedrealityhumancentricworld,
  author    = {Xie, Linxi and Sun, Lisong C. and Neall, Ashley and Wu, Tong and Cai, Shengqu and Wetzstein, Gordon},
  title     = {Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control},
  journal   = {arXiv preprint arXiv:2602.18422},
  year      = {2026},
}