Reisom: Zero-shot Reconstruction of In-Scene Object Manipulation from Video

1University of Pennsylvania, 2University of Oxford

We present a zero-shot system that reconstructs in-scene object manipulation motion from everyday videos.

Abstract

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. The task is challenging due to ill-posed scene reconstruction, ambiguous hand–object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, which hinders metric accuracy and practical use. Our method first uses data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand–object motion, from grasping to interaction, that remains consistent with the scene information observed in the input video.

Pipeline


Results

We show several example sequences. The carousel below visualizes our reconstructed interactions across a variety of objects.