Robot Synesthesia:
In-Hand Manipulation with Visuotactile Sensing

1UC San Diego,  2Tsinghua University,  3University of Illinois Urbana-Champaign,
4UC Berkeley,  5Dongguk University

*Equal Contribution

We propose Robot Synesthesia, a novel visuotactile approach to in-hand object rotation that fuses the visual and tactile modalities. We train our policy in simulation to rotate single or multiple objects around a specified axis, then transfer it to the real robot hand without any real-world data.



Visualization of Point Clouds during Testing

We visualize the real-world point cloud observations used during testing, including camera point clouds and augmented point clouds.

Abstract

Executing contact-rich manipulation tasks necessitates the fusion of tactile and visual feedback. However, the distinct nature of these modalities poses significant challenges. In this paper, we introduce a system that leverages visual and tactile sensory inputs to enable dexterous in-hand manipulation. Specifically, we propose Robot Synesthesia, a novel point cloud-based tactile representation inspired by human tactile-visual synesthesia. This approach allows for the simultaneous and seamless integration of both sensory inputs, offering richer spatial information and facilitating better reasoning about robot actions. The method, trained in a simulated environment and then deployed to a real robot, is applicable to various in-hand object rotation tasks. We also perform comprehensive ablations on how the integration of vision and touch improves reinforcement learning and Sim2Real performance.
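
To make the representation concrete, the following is a minimal sketch of how binary contact signals might be rendered as a tactile point cloud and fused with camera points. It assumes fingertip contact sensors whose 3D positions are known (e.g., from forward kinematics); the function names, the per-contact scatter radius, and the point-origin channel are illustrative assumptions, not the released implementation.

```python
import numpy as np

def tactile_point_cloud(contact_signals, sensor_positions,
                        points_per_contact=8, radius=0.005):
    """Render binary contact signals as small clusters of 3D points at the
    activated sensor locations, so touch shares the visual representation.

    contact_signals:  (N,) array of 0/1 contact readings, one per sensor.
    sensor_positions: (N, 3) sensor positions in the camera/world frame.
    """
    clusters = []
    for active, pos in zip(contact_signals, sensor_positions):
        if active:
            # Scatter a few points around the contact location so the
            # point cloud encoder perceives touch as local geometry.
            offsets = np.random.uniform(-radius, radius,
                                        size=(points_per_contact, 3))
            clusters.append(pos + offsets)
    return np.concatenate(clusters, axis=0) if clusters else np.empty((0, 3))

def fused_observation(camera_cloud, contact_signals, sensor_positions):
    """Stack camera and tactile points into one cloud; a fourth channel
    marks each point's origin (0 = vision, 1 = touch)."""
    touch_cloud = tactile_point_cloud(contact_signals, sensor_positions)
    cam = np.hstack([camera_cloud, np.zeros((len(camera_cloud), 1))])
    tac = np.hstack([touch_cloud, np.ones((len(touch_cloud), 1))])
    return np.vstack([cam, tac])
```

Because touch is expressed as geometry in the same frame as vision, a single point cloud encoder can consume both modalities without a separate fusion module.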

Video

Visualization in the Real World

Tests on Wheel-Wrench Rotation



Tests on Double-Ball Rotation

Our double-ball rotation policy generalizes to real-world balls of various shapes, sizes, and densities.

Two Golf Balls

Two Tomatoes

Two Potatoes


Tomato-Potato

Tomato-Golf Ball

Potato-Golf Ball



Tests on Three-Axis Rotation

Our three-axis rotation policies generalize to various real-world objects.

x-axis

y-axis

z-axis

x-axis

y-axis

z-axis



Comparison with Baselines

We compare our implementation with four baselines. Non-Visual RL is a policy that observes only robot proprioception and contact signals, without any vision. In the labels below, Touch denotes binary contact signals, Cam camera point clouds, Aug augmented point clouds, and Syn the proposed tactile (synesthesia) point clouds. Our visuotactile synesthesia approach generally offers a benefit across these configurations.
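
As a reference for how these configurations differ, here is a hypothetical sketch of assembling each ablation's observation; the stream names and the sensors interface are assumptions for illustration, not the paper's code.

```python
# Observation streams enabled for each configuration (labels as below).
ABLATIONS = {
    "Non-Visual RL":     ("proprio", "touch"),
    "Touch+Cam+Aug":     ("proprio", "touch", "cam", "aug"),
    "Touch+Cam+Aug+Syn": ("proprio", "touch", "cam", "aug", "syn"),  # ours
}

def build_observation(sensors, variant):
    """Gather only the sensor streams enabled for the chosen variant.

    sensors: dict mapping stream name -> zero-argument callable that
             returns the latest reading for that stream.
    """
    return {name: sensors[name]() for name in ABLATIONS[variant]}
```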

Touch+Cam+Aug+Syn (Ours)

Non-Visual RL

Touch+Cam+Aug



Visualization in Simulation