Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

1Stanford University, 2NYU Shanghai, 3UNC Chapel Hill

Abstract

TL;DR: Generated Reality turns tracked human motion into autoregressively generated video, enabling interactive human-centric experiences with dexterous hand-object interactions.

Extended reality (XR) demands generative models that respond to users’ tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model conditioned on both the tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion model conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.


Toggle to visualize the 2D hand-pose conditioning signal; the 3D hand and head pose conditioning is not shown.



How it works

Pipeline

We track the user’s head and hand poses with a commercial VR headset. Hand motion is represented using the UmeTrack parameterization, which provides the wrist pose and 20 joint angles per hand. Our conditioning uses a hybrid 2D–3D scheme, combining a 2D rendering of the hand skeleton with the 3D hand-pose parameters. Features extracted from these modules are combined with the head pose features via token addition and fed into the DiT. The model then autoregressively generates new frames.
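The fusion step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the token count, embedding dimension, and the stand-in linear encoders (`encode_skeleton_image`, `encode_pose_params`) are all assumptions made for clarity. Only the overall structure is taken from the text: a 2D skeleton rendering and 3D pose parameters are encoded separately, then merged with head-pose features by token-wise addition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not state these.
D = 64          # token embedding dimension
N_TOKENS = 16   # number of conditioning tokens per frame

def encode_skeleton_image(image: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Encode a rendered 2D hand-skeleton image into conditioning tokens
    (a linear stand-in for a learned image encoder)."""
    patches = image.reshape(N_TOKENS, -1)   # split image into patch rows
    return patches @ w                      # project each patch to token dim

def encode_pose_params(params: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Project pose parameters to a single token and broadcast it across
    all token positions (a linear stand-in for a learned MLP encoder)."""
    token = params @ w                      # (D,)
    return np.tile(token, (N_TOKENS, 1))    # (N_TOKENS, D)

# Per-hand 3D parameters: 6-DoF wrist pose + 20 joint angles (UmeTrack).
hand_params = rng.standard_normal(2 * (6 + 20))    # both hands
head_pose   = rng.standard_normal(6)               # 6-DoF head pose
skel_image  = rng.standard_normal((N_TOKENS, 32))  # rasterized 2D skeleton

w_img  = rng.standard_normal((32, D)) * 0.02
w_hand = rng.standard_normal((hand_params.size, D)) * 0.02
w_head = rng.standard_normal((head_pose.size, D)) * 0.02

# Hybrid 2D-3D conditioning: token-wise addition of the three feature
# streams, yielding the conditioning tokens fed into the DiT.
cond_tokens = (encode_skeleton_image(skel_image, w_img)
               + encode_pose_params(hand_params, w_hand)
               + encode_pose_params(head_pose, w_head))

print(cond_tokens.shape)   # (16, 64)
```

Token addition (rather than concatenation) keeps the sequence length fixed regardless of how many control signals are fused, which is one plausible reason for this design choice.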



Hand Motion Conditioning

We conduct a comprehensive ablation of hand-pose conditioning strategies and propose a 2D–3D hybrid approach that captures hand motion more reliably. Our method achieves the highest hand-pose accuracy among all evaluated baselines.


Joint Hand–Camera Conditioning

We further extend this framework to jointly condition on both hand and camera poses. Our joint conditioning model disambiguates hand and head motion, enabling accurate object interactions.




The Generated Reality System



We develop our Generated Reality system by distilling the bidirectional model into an autoregressive variant that runs at 11 FPS on a VR headset. Using live-tracked hand and head poses as controls, the system streams generated video directly to the headset.
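The interaction loop described above can be sketched as a simple real-time pipeline. Everything below is a hypothetical stand-in: `read_tracked_poses` and `generate_next_frame` are placeholder functions (the real system runs a distilled causal video model on live headset tracking), and the context length is assumed. Only the structure is from the text: poses are tracked live, each frame is generated causally from previous frames, and output is paced to the reported 11 FPS.

```python
import time
from collections import deque

def read_tracked_poses():
    """Stand-in for headset tracking: returns (head_pose, hand_poses)."""
    return (0.0,) * 6, (0.0,) * 52

def generate_next_frame(context, head_pose, hand_poses):
    """Stand-in for one step of the distilled causal video model."""
    return f"frame_{len(context)}"

CONTEXT_LEN = 8                   # causal context window (assumed)
TARGET_FPS = 11                   # throughput reported for the system
frame_interval = 1.0 / TARGET_FPS

context = deque(maxlen=CONTEXT_LEN)   # rolling autoregressive context
frames_out = []
for _ in range(5):                # render a few frames for illustration
    t0 = time.perf_counter()
    head, hands = read_tracked_poses()            # live control signals
    frame = generate_next_frame(context, head, hands)
    context.append(frame)                         # autoregressive feedback
    frames_out.append(frame)                      # stream to the headset
    # Sleep off any remaining budget to hold the target frame rate.
    time.sleep(max(0.0, frame_interval - (time.perf_counter() - t0)))

print(frames_out)
```

The key property this loop illustrates is causality: unlike the bidirectional teacher, the distilled model conditions only on past frames, which is what makes closed-loop interaction with live tracking possible.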


User Study Results

In our user studies, participants achieved higher task success rates (left) and reported higher levels of perceived control (right) compared with the baseline.

BibTeX

@article{xie2026generatedrealityhumancentricworld,
  author    = {Xie, Linxi and Sun, Lisong C. and Neall, Ashley and Wu, Tong and Cai, Shengqu and Wetzstein, Gordon},
  title     = {Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control},
  journal   = {arXiv preprint arXiv:2602.18422},
  year      = {2026},
}