IIIT Hyderabad Publications
Neural Fields for Hand-object Interactions

Author: Chandradeep Pokhariya (2021701040)
Date: 2024-06-12
Report no: IIIT/TH/2024/77
Advisors: Avinash Sharma, Srinath Sridhar
Centre for Visual Information Technology
Full thesis: pdf

Abstract

The hand is the most commonly used body part for interacting with our three-dimensional world. While it may seem ordinary, replicating hand movements with robots or in virtual/augmented reality is highly complex. Research on how hands interact with objects is crucial for advancing robotics, virtual reality, and human-computer interaction. Understanding hand movements and manipulation is key to creating more intuitive and responsive technologies, which can significantly improve accuracy, efficiency, and scalability across industries. Despite extensive research, programming robots to mimic human hand interactions remains a challenging goal. One of the biggest challenges is collecting accurate 3D data for hand-object grasping. This is difficult because of the hand's flexibility and the mutual occlusion between hand and object in grasping poses, and collecting such data often requires expensive and sophisticated capture setups. Recently, however, neural fields [1] have emerged that can model 3D scenes using only multi-view images or videos. Neural fields represent a 3D scene with a continuous neural function and require no 3D ground truth, relying instead on differentiable rendering and a multi-view photometric loss. With growing interest, these methods are becoming faster, more efficient, and better at modeling complex scenes.

This thesis explores how neural fields can address two specific subproblems in hand-object interaction research. The first problem is generating novel grasps: predicting the final grasp pose of a hand given its initial position and the object's shape and location. The challenge is to build a generative model that predicts accurate grasp poses from multi-view videos alone, without 3D ground truth. To solve this, we developed RealGrasper, a generative model that learns to predict grasp poses from multi-view data using photometric loss and other regularizations. The second problem is accurately capturing grasp poses and extracting contact points from multi-view videos. Current methods use the MANO model [2], which approximates hand shape but lacks the detail needed for precise contacts, and there is no easy way to obtain ground truth for evaluating contact quality. To address this, we propose MANUS, a method for markerless grasp capture using articulated 3D Gaussians that reconstructs high-fidelity hand models from multi-view videos. We also created a large dataset, MANUS-Grasps, which includes multi-view videos of three subjects grasping over 30 objects. Furthermore, we developed a new way to capture and evaluate contacts, providing a contact metric for better assessment. We thoroughly evaluated our methods through detailed experiments, ablations, and comparisons, demonstrating that our approach outperforms existing state-of-the-art methods. We also summarize our contributions and discuss potential future directions in this field. We hope this thesis will help the research community advance further in this area.
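As a rough illustration of the training signal described in the abstract, a multi-view photometric loss can be written in the following generic form; this is a standard formulation in neural-field optimization, not necessarily the exact objective used in RealGrasper or MANUS:

\[
\mathcal{L}_{\text{photo}} \;=\; \sum_{v=1}^{V} \; \sum_{p \in \mathcal{P}_v} \left\| \hat{C}_v(p) - C_v(p) \right\|_2^2
\]

where V is the number of calibrated camera views, \(\mathcal{P}_v\) is the set of pixels in view v, \(C_v(p)\) is the observed pixel color, and \(\hat{C}_v(p)\) is the color obtained by differentiably rendering the neural field into view v. Because every term is differentiable with respect to the parameters of the continuous scene function, minimizing this loss over all views optimizes the 3D representation directly from images, without any 3D ground truth.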