EquiContact

Hierarchical SE(3) Vision-to-Force Equivariant Policy for Spatially Generalizable Contact-Rich Tasks

Joohwan Seo, Arvind Kruthiventy, Soomi Lee, Megan Teng, Seoyeon Choi, Xiang Zhang,
Jongeun Choi, and Roberto Horowitz

UC Berkeley, Yonsei University

Main manuscript Appendix arXiv Code* (coming)

* New version of code coming soon! In the meantime, you can check out our old simulation and experiment code linked at the bottom of the page.

TL;DR

We present EquiContact, a hierarchical SE(3)-equivariant vision-to-force policy for contact-rich manipulation that achieves spatial generalization. The policy is trained only on a fixed task configuration, yet generalizes to unseen configurations under arbitrary SE(3) transformations.

Full EquiContact Pipeline on Peg-in-Hole + Extreme Transformations

The full EquiContact pipeline on the peg-in-hole (PiH) task, with spatial generalization to unseen configurations. The video demonstrates (1) generalization to translational transformations (flat platform), (2) generalization to combined rotational and translational transformations (tilted platform), and (3) robustness to extreme transformations.

Abstract

This paper presents a framework for learning vision-based robotic policies for contact-rich manipulation tasks that generalize spatially across task configurations. We focus on achieving robust spatial generalization of the policy for the contact-rich tasks trained from a small number of demonstrations. We propose EquiContact, a hierarchical policy composed of a high-level vision planner (Diffusion Equivariant Descriptor Field, Diff-EDF) and a novel low-level compliant visuomotor policy (Geometric Compliant Action Chunking Transformers, G-CompACT). G-CompACT operates using only localized observations (geometrically consistent error vectors (GCEV), force-torque readings, and wrist-mounted RGB images) and produces actions defined in the end-effector frame. Through these design choices, we show that the entire EquiContact pipeline is SE(3)-equivariant, from perception to force control. We also outline three key components for spatially generalizable contact-rich policies: compliance, localized policies, and induced equivariance. Real-world experiments on peg-in-hole (PiH), screwing, and surface wiping tasks demonstrate a near-perfect success rate and robust generalization to unseen spatial configurations, validating the proposed framework and principles.

Method


Figure: Overview of EquiContact.

We propose EquiContact, a hierarchical, provably SE(3)-equivariant vision-to-force policy for spatially generalizable contact-rich tasks.

(Left) The proposed EquiContact consists of a Diffusion-Equivariant Descriptor Field (Diff-EDF), a Geometric Compliant Action Chunking Transformer (G-CompACT), and a Geometric Admittance Controller (GAC).

(Right) G-CompACT is trained only on a fixed task configuration, but it generalizes to task configurations that undergo arbitrary SE(3) transformations, given the reference frames.

G-CompACT as a Localized Policy

G-CompACT Architecture

G-CompACT is a low-level visuomotor policy that takes localized observations and outputs actions defined in the end-effector frame. The policy inputs are (i) geometrically consistent error vectors (GCEV) between the current and target end-effector poses, (ii) wrist RGB images, and (iii) force-torque readings in the end-effector frame. The policy outputs relative poses and admittance gains for compliant control. Moreover, applying language guidance to the wrist camera input makes the policy approximately left-invariant to SE(3) task transformations. G-CompACT is therefore a left-invariant localized policy, which is a key component for spatial generalization and SE(3)-equivariance.
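To make the left-invariance of the error input concrete, here is a minimal NumPy sketch. It assumes one form of the GCEV used in the geometric impedance control literature, e_G = [R^T (p - p_d); (R_d^T R - R^T R_d)^vee]; the exact definition used in the paper may differ. The helper names (`hat`, `vee`, `rot`, `gcev`, `left_transform`) and the numeric poses are illustrative assumptions, not taken from the EquiContact code.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def vee(S):
    """Inverse of hat: extract the 3-vector from a skew-symmetric matrix."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def rot(axis, angle):
    """Rotation matrix from axis-angle via Rodrigues' formula."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = hat(k)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * K @ K

def gcev(p, R, p_d, R_d):
    """Assumed GCEV form: pose error expressed in the end-effector frame."""
    e_p = R.T @ (p - p_d)                # translation error, body frame
    e_R = vee(R_d.T @ R - R.T @ R_d)     # rotation error
    return np.concatenate([e_p, e_R])

def left_transform(p_l, R_l, p, R):
    """Apply a world-frame SE(3) transform (R_l, p_l) to the pose (R, p)."""
    return R_l @ p + p_l, R_l @ R

# Illustrative current pose, target pose, and task transformation.
p, R = np.array([0.4, 0.1, 0.3]), rot([0, 0, 1], 0.2)
p_d, R_d = np.array([0.45, 0.12, 0.25]), rot([0, 1, 0], 0.1)
p_l, R_l = np.array([0.3, -0.2, 0.1]), rot([1, 1, 0], 0.7)

e_before = gcev(p, R, p_d, R_d)

# Move both the end-effector and the target by the same transformation.
p2, R2 = left_transform(p_l, R_l, p, R)
p_d2, R_d2 = left_transform(p_l, R_l, p_d, R_d)
e_after = gcev(p2, R2, p_d2, R_d2)

print(np.allclose(e_before, e_after))        # body-frame error is unchanged
print(np.allclose(p - p_d, p2 - p_d2))       # world-frame error is not
```

Under this assumed form, the policy input is identical before and after the scene is moved, whereas a world-frame position error rotates with the scene. This is the sense in which a localized observation is left-invariant.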

Main Takeaway: Principles to achieve Spatially Generalizable Policy for Contact-rich Tasks

We outline three key components for spatially generalizable contact-rich policies, all of which are embodied in EquiContact: compliance, localized policies, and induced equivariance.

Together, they can be summarized in the following punchline:

Anchoring a localized policy on a globally estimated reference frame for spatially generalizable manipulation.
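The punchline can be illustrated with a toy sketch of induced equivariance. Here `g_ref` stands for the reference frame that a high-level planner such as Diff-EDF would estimate, and `local_policy` is a hypothetical stand-in for a learned low-level policy (it is not G-CompACT). All function names and numeric values are assumptions made for illustration only. The point is that a policy which only sees the reference frame expressed in the end-effector frame, and acts in the end-effector frame, produces world-frame commands that move along with the scene.

```python
import numpy as np

def se3(R, p):
    """Build a 4x4 homogeneous transform from a rotation and a translation."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def local_policy(ref_in_ee):
    """Hypothetical localized policy: sees only the reference frame expressed
    in the end-effector frame and returns a relative motion in that same
    frame (here, 10% of the remaining translation)."""
    step = np.eye(4)
    step[:3, 3] = 0.1 * ref_in_ee[:3, 3]
    return step

def hierarchical_step(g_ee, g_ref):
    """Anchor the local policy on the reference frame, then map its
    end-effector-frame action back to a world-frame pose command."""
    ref_in_ee = np.linalg.inv(g_ee) @ g_ref   # left-invariant observation
    return g_ee @ local_policy(ref_in_ee)     # world-frame command

# Illustrative end-effector pose, reference frame, and task transformation.
g_ee = se3(rot_z(0.3), [0.4, 0.0, 0.3])
g_ref = se3(rot_x(-0.2), [0.5, 0.1, 0.1])
g_l = se3(rot_z(1.1) @ rot_x(0.6), [0.2, -0.3, 0.15])

cmd = hierarchical_step(g_ee, g_ref)
cmd_moved = hierarchical_step(g_l @ g_ee, g_l @ g_ref)

# Equivariance: transforming the scene transforms the command the same way.
print(np.allclose(cmd_moved, g_l @ cmd))
```

Because `g_l` cancels inside `ref_in_ee`, the local policy never sees the transformation; equivariance of the overall command then follows from the left multiplication by `g_ee`. In the full pipeline the same argument relies on the reference frame itself being estimated equivariantly, which this sketch simply assumes.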

We also have a proof for full SE(3)-equivariance from vision input to force-control output in our paper. If you are interested, please check out the paper!

Results (Videos)

Surface Wiping & Screwing Tasks

Surface Wiping Task

Task: Erase the marker on the flat/tilted whiteboard

Screwing Task

Task: Screw-lock the peg into the threaded flat/tilted platform

Videos demonstrating EquiContact's spatial generalization on surface wiping and screwing tasks. The policy is trained only on a fixed configuration but successfully generalizes to unseen configurations on flat and tilted platforms without much degradation.

Benchmark Experiments

ACT w/o GAC (In-dist)

Executed without GAC. Without a compliant controller, the robot exerts excessive interaction forces, which triggers the safety stop.

ACT + GAC, w/o Gain Scheduling (In-dist)

Executed with GAC, but with fixed admittance gains. Without real-time gain modulation, the robot often exerts excessive forces or becomes too stiff.
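Why the choice of admittance gains matters can be seen in a deliberately simplified 1-DoF toy model. This is not the Geometric Admittance Controller, which acts on SE(3). The sketch assumes a scalar admittance law M*a + D*v + K*(x - x_d) = f_ext pressing against a linear-spring wall at x = 0 with stiffness `k_env`; all names and parameter values are hypothetical. For this model the steady-state contact force is k_env * K * x_d / (K + k_env), so a stiffer gain K presses harder for the same position error.

```python
def simulate_contact(K, k_env=1.0e4, M=1.0, x_d=0.01, dt=1.0e-3, steps=1000):
    """Toy 1-DoF admittance law pressing on a spring-like wall at x = 0.

    Integrates M*a + D*v + K*(x - x_d) = f_ext with semi-implicit Euler and
    returns the contact force magnitude at the end of the rollout.
    """
    D = 2.0 * (M * (K + k_env)) ** 0.5    # critical damping while in contact
    x, v = 0.0, 0.0                       # start at rest, touching the wall
    for _ in range(steps):
        f_ext = -k_env * max(x, 0.0)      # wall pushes back only when penetrated
        a = (f_ext - D * v - K * (x - x_d)) / M
        v += dt * a
        x += dt * v
    return k_env * max(x, 0.0)

def steady_state_force(K, k_env=1.0e4, x_d=0.01):
    """Closed-form steady-state contact force for the same toy model."""
    return k_env * K * x_d / (K + k_env)

force_stiff = simulate_contact(K=2000.0)  # fixed, stiff gain
force_soft = simulate_contact(K=200.0)    # compliant gain

print(force_stiff, steady_state_force(2000.0))
print(force_soft, steady_state_force(200.0))
```

In this toy setting a single fixed gain has to trade off contact force against positioning stiffness, which is one way to read the motivation for letting the policy output admittance gains in real time.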

CompACT: ACT+GAC+Gain Scheduling (In-dist & OOD)

CompACT successfully finishes the task by leveraging compliant control and real-time gain scheduling. However, it fails under out-of-distribution (OOD) task transformations, even for a small translational displacement.

List of Full Videos

Here are the extended versions of the videos.

Links