OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

1Centre for Artificial Intelligence and Robotics, HKISI, CAS 2Institute of Automation, Chinese Academy of Sciences 3University of Chinese Academy of Sciences 4Shandong University 5University of Science and Technology Beijing
Corresponding author.
[Teaser figure]

OOD-HOI, a novel text-driven 3D human-object interaction generation method for out-of-domain (OOD) scenarios.

Abstract

Generating realistic 3D human-object interactions (HOIs) from text descriptions is an active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus on either the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model that synthesizes initial interaction poses, a contact-guided interaction refiner that improves physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism that includes semantic adjustment and geometry deformation to improve robustness. Experimental results demonstrate that OOD-HOI generates more realistic and physically plausible 3D interaction poses in OOD scenarios than existing methods.

Overview

Our approach decomposes the generation process into three modules: (1) a dual-branch reciprocal diffusion model that exchanges information between the human and the object to generate an initial interaction pose, (2) a contact-guided interaction refiner that revises the initial human-object pose with additional inference-time guidance, and (3) a dynamic adaptation module designed for out-of-domain (OOD) generation, ensuring more realistic and physically plausible results.

[Pipeline overview figure]
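
To make the reciprocal design concrete, below is a minimal sketch (not the released implementation) of how two denoising branches can exchange state at every reverse-diffusion step. The branch architectures, feature dimensions, noise schedule, and text-conditioning scheme are all illustrative assumptions.

import torch
import torch.nn as nn

class Branch(nn.Module):
    """One denoising branch, conditioned on the other branch's state + text."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, cond):
        # Predict the noise component of x given the conditioning signal.
        return self.net(torch.cat([x, cond], dim=-1))

class ReciprocalDiffusion(nn.Module):
    def __init__(self, human_dim=159, obj_dim=9, text_dim=512, steps=1000):
        super().__init__()
        self.human_dim, self.obj_dim = human_dim, obj_dim
        # Each branch is conditioned on the other branch's current estimate
        # plus the text embedding -- the "reciprocal" information exchange.
        self.human_branch = Branch(human_dim, obj_dim + text_dim)
        self.obj_branch = Branch(obj_dim, human_dim + text_dim)
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cum", torch.cumprod(1 - betas, dim=0))

    @torch.no_grad()
    def sample(self, text_emb):
        batch = text_emb.shape[0]
        h = torch.randn(batch, self.human_dim)  # noisy whole-body pose params
        o = torch.randn(batch, self.obj_dim)    # noisy object 6-DoF pose
        for t in reversed(range(len(self.betas))):
            eps_h = self.human_branch(h, torch.cat([o, text_emb], dim=-1))
            eps_o = self.obj_branch(o, torch.cat([h, text_emb], dim=-1))
            # Standard DDPM posterior mean; fresh noise is added except at t=0.
            coef = self.betas[t] / (1 - self.alphas_cum[t]).sqrt()
            h = (h - coef * eps_h) / (1 - self.betas[t]).sqrt()
            o = (o - coef * eps_o) / (1 - self.betas[t]).sqrt()
            if t > 0:
                h = h + self.betas[t].sqrt() * torch.randn_like(h)
                o = o + self.betas[t].sqrt() * torch.randn_like(o)
        return h, o

model = ReciprocalDiffusion()
human_pose, obj_pose = model.sample(text_emb=torch.randn(1, 512))

The essential property is that each branch's noise prediction sees the other branch's current estimate at every step, so the human pose and the object pose are denoised jointly rather than independently.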

Contact-Guided Interaction Refiner

The refiner module takes the text prompt, the initial hand pose, and the object geometry as input, predicts the contact area between the hand and the object, and then corrects floating objects and interpenetration by optimizing against the predicted contact area.

[Refiner module figure]
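
As a rough illustration of contact-guided refinement (a sketch under assumed inputs, not the paper's exact objective), the loop below optimizes the object's translation so that vertices with high predicted contact probability touch the hand while the remaining vertices keep a small clearance. The names hand_verts, obj_verts, and contact_prob are hypothetical inputs.

import torch

def refine_object(hand_verts, obj_verts, contact_prob, iters=200, lr=1e-2):
    # hand_verts: (H, 3), obj_verts: (O, 3), contact_prob: (O,) in [0, 1]
    trans = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([trans], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        verts = obj_verts + trans
        # Distance from every object vertex to its nearest hand vertex.
        d = torch.cdist(verts, hand_verts).min(dim=1).values  # (O,)
        # Attraction: vertices predicted to be in contact should touch,
        # which pulls a floating object onto the hand.
        loss_contact = (contact_prob * d).mean()
        # Crude anti-penetration proxy: keep non-contact vertices at
        # least `margin` meters away from the hand surface.
        margin = 0.005
        loss_pen = ((1 - contact_prob) * torch.relu(margin - d)).mean()
        loss = loss_contact + 10.0 * loss_pen
        loss.backward()
        opt.step()
    return (obj_verts + trans).detach()

A production refiner would use a proper signed-distance test against the hand mesh for the penetration term and would also update the hand pose parameters; the sketch only shows how a predicted contact map can steer the optimization.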

Geometry deformation results

[Gallery: Airplane, Alarmclock, and Binoculars, each shown as the original mesh (Origin) alongside three geometry deformations (Deformation 1-3).]
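
Deformations of the kind shown above can be approximated by simple mesh-level operations. The snippet below is an illustrative assumption, not the paper's actual deformation procedure: it produces OOD object variants via random per-axis scaling plus a smooth low-frequency warp.

import numpy as np

def deform_mesh(verts, seed=0):
    # verts: (N, 3) float array of object vertices.
    rng = np.random.default_rng(seed)
    scale = rng.uniform(0.8, 1.2, size=3)  # random per-axis stretch
    warped = verts * scale
    # Low-frequency sinusoidal warp along one random axis, driven by
    # the coordinate of a neighboring axis.
    axis = rng.integers(0, 3)
    other = (axis + 1) % 3
    amp = 0.05 * (verts.max(0) - verts.min(0)).max()
    span = np.ptp(warped[:, other]) + 1e-8
    warped[:, axis] += amp * np.sin(2 * np.pi * warped[:, other] / span)
    return warped

deformed = deform_mesh(np.random.rand(1000, 3))

Any deformation that preserves the object's overall topology serves the same purpose here: pushing test objects away from the training geometry to probe OOD generalization.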

In-domain results

Hand flashlight with left hand.

Pass apple with right hand.

Toast mug with right hand.

Take picture camera with both hands.

Text-OOD results

Deliver duck with right hand.

Elevate toothpaste with right hand.

Object-OOD results

Pick scissors with right hand.

Pick scissors with right hand.

Pick scissors with right hand.

Lift bottle with right hand.

BibTeX

@article{zhang2024oodhoi,
  author    = {Zhang, Yixuan and Yang, Hui and Luo, Chuanchen and Peng, Junran and Wang, Yuxi and Zhang, Zhaoxiang},
  title     = {OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains},
  journal   = {arXiv preprint arXiv:xxxx},
  year      = {2024},
}