Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting

Stanford University, University of Pennsylvania

Method Image.

Abstract

We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it can represent scenes in a manner that is both photorealistic and geometrically accurate. However, in real-world online robotic settings where efficiency requirements limit the number of views, random view selection for 3DGS becomes impractical, as views are often overlapping and redundant. We address this issue with an end-to-end online training and active view selection pipeline that enhances the performance of 3DGS in few-view robotics settings. We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2), which we supplement with Pearson depth and normal losses to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes, extend depth-based FisherRF to them, and demonstrate both qualitative and quantitative improvements on challenging robot scenes.

Next Best Sense draws upon state-of-the-art vision models to train few-shot Gaussian Splatting scenes, and builds on recent next best view selection methods to guide a robotic manipulator's next best view and touch selection in the wild.

Method

Our method leverages state-of-the-art monocular depth estimation and semantic visual foundation models to train challenging, few-shot Gaussian Splatting scenes. We focus on geometrically precise reconstructions with a novel semantic depth alignment method and a relative depth loss. We then build on a recent next best view selection method, FisherRF, and extend it to encode depth uncertainty for both vision and touch. Finally, we perform online view selection on a real robot system during live 3DGS training, and demonstrate the effectiveness of uncertainty-guided touch on a challenging mirror scene.

SAM2 Depth Alignment

Gaussian Splatting is known to be highly sensitive to initialization, requiring metrically accurate Gaussians to seed a scene. We propose a novel semantic depth alignment that uses Segment Anything 2 to align monocular depths. Concretely, we feed a color image into a monocular depth estimator such as DepthAnythingV2. We then run the SAM2 automatic mask generator on the RGB image, which captures objects and backgrounds in the scene. This step does not require any depth completion networks with extensive fine-tuning, and is computationally cheap. Using the sensor depth image, which is often noisy and incorrect under challenging lighting conditions, we perform a mask-aware alignment of the monocular depth to output a SAM2-aligned depth, retaining the best of both worlds: a depth that is metrically accurate and geometrically precise.

The alignment is used as initialization for the scene. SAM2-aligned depths (right) improve significantly upon least-squares-aligned monocular depths (left).
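Below is a minimal sketch of the mask-aware alignment described above, assuming NumPy arrays for the monocular depth, the (noisy) metric sensor depth, and a list of boolean SAM2 masks; the fitting details in our actual pipeline may differ.

import numpy as np

def sam2_aligned_depth(mono_depth, sensor_depth, masks, min_valid=50):
    # Align relative monocular depth to metric sensor depth one SAM2 mask at a time,
    # fitting a per-mask scale and shift by least squares on valid sensor pixels.
    aligned = mono_depth.astype(np.float64).copy()
    for mask in masks:
        valid = mask & (sensor_depth > 0)   # ignore missing / zero sensor depth
        if valid.sum() < min_valid:
            continue                        # too few valid pixels to fit this mask
        x = mono_depth[valid].astype(np.float64)
        y = sensor_depth[valid].astype(np.float64)
        A = np.stack([x, np.ones_like(x)], axis=1)
        (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
        aligned[mask] = scale * mono_depth[mask] + shift
    return aligned

Fitting scale and shift per mask, rather than globally, lets each object and background region keep its own metric scale even when the sensor depth is only partially valid.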

Geometric Guidance

We utilize a Pearson depth loss and normal supervision to guide the splat toward a realistic scene.
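For reference, here is a minimal PyTorch sketch of a Pearson depth loss, which penalizes the rendered depth when it fails to correlate with a monocular depth prior; because Pearson correlation is invariant to scale and shift, a relative monocular depth can supervise metric rendered depth. Tensor names are illustrative.

import torch

def pearson_depth_loss(rendered_depth, mono_depth, eps=1e-8):
    # 1 - Pearson correlation over all pixels; invariant to scale and shift
    # of either depth map, so a relative prior can still constrain geometry.
    x = rendered_depth.flatten() - rendered_depth.mean()
    y = mono_depth.flatten() - mono_depth.mean()
    return 1.0 - (x * y).sum() / (x.norm() * y.norm() + eps)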

Few-Shot Gaussian Splatting Results

We show our method for few-shot GS in a variety of scenarios. Below is an RGB and depth comparison between lifted least-squares-aligned depths and our semantically aligned depths with the Pearson depth loss. Our full method is on the right side of the slider.

Bunny Blender Scene (6 input views)
Bunny Real Scene (8 input views)

Even on a challenging prism object, our SAM2-based method is able to capture the scene densely compared to prior works that use only geometric depth guidance.

Prism Real Scene (single training view)

An ablation of our method shows that SAM2 depth alignment constructs a better scene than prior works. We incrementally build up our method with pairwise sliders (our method on the right, others on the left).

No Depth vs Our Method
Realsense Depth vs Our Method
Pearson Loss Depth vs Our Method
MSE Supervised and Lifted Depth vs Our Method
Lifted Depth with Pearson vs Our Method

Next Best View for Robotics

We turn to FisherRF, a state-of-the-art next best view selection method for 3DGS, and extend it to encode depth uncertainty for vision and touch. This proves to be more impactful than measuring uncertainty from color alone, as depth is a more reliable measure of scene geometry. Then, in an online manner, we can select views based on depth uncertainty to progressively refine a scene.
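As a rough sketch, a depth-based view score under a diagonal (squared-gradient) approximation of the Fisher information is shown below; render_depth_fn and gaussian_params are placeholders for a differentiable 3DGS depth renderer and its parameter tensors, and the full FisherRF objective additionally discounts information already contributed by the training views.

import torch

def depth_fisher_score(render_depth_fn, gaussian_params, pose):
    # Diagonal Fisher approximation: squared gradients of the rendered depth
    # with respect to the Gaussian parameters, summed into a scalar score.
    depth = render_depth_fn(pose, gaussian_params)
    grads = torch.autograd.grad(depth.sum(), gaussian_params, allow_unused=True)
    return sum((g ** 2).sum().item() for g in grads if g is not None)

def next_best_view(render_depth_fn, gaussian_params, candidate_poses):
    # Pick the candidate pose whose rendered depth is most informative.
    scores = [depth_fisher_score(render_depth_fn, gaussian_params, p)
              for p in candidate_poses]
    return max(range(len(scores)), key=scores.__getitem__)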

Offline FisherRF

We compare view selection performed randomly, with traditional FisherRF, and with our depth-based modification of FisherRF on top of our improved few-shot GS.

View Selection Methods

Random View
Fisher RGB
Fisher Depth

Real World Random vs FisherRF Depth

Online FisherRF

We demonstrate our method operating in an end-to-end fashion on a real robot system.
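Schematically, the online loop interleaves 3DGS optimization with acquisition; robot and trainer below are hypothetical interfaces standing in for our real system, and next_best_view is the depth-based scorer sketched above.

def online_next_best_view_loop(robot, trainer, candidate_poses,
                               n_views=10, steps_per_view=500):
    # Alternate between optimizing the splat and acquiring the most informative view.
    for _ in range(n_views):
        trainer.train(steps_per_view)                 # continue live 3DGS training
        idx = next_best_view(trainer.render_depth,
                             trainer.gaussian_params,
                             candidate_poses)
        pose = candidate_poses.pop(idx)
        rgb, depth = robot.move_and_capture(pose)     # drive the arm, capture RGB-D
        trainer.add_view(pose, rgb, depth)            # fold the new view into training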

Interactive Visualization

FisherRF RGB (baseline) vs FisherRF Depth

Use the slider to compare the baseline FisherRF RGB view selection with FisherRF depth selection. FisherRF depth improves the background and removes erroneous floaters.



SAM2 3D Object Mask

We use SAM2 to generate object masks for the object to be touched. Each per-image mask is distilled into the 3D splat.
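One simple way to realize this distillation is sketched below, assuming Gaussian centers, camera intrinsics/extrinsics, and boolean masks are available as NumPy arrays: a Gaussian is labeled as part of the object if it projects inside the mask in most views where it is visible. The distillation in our system may differ in detail.

import numpy as np

def distill_masks_to_gaussians(means_world, cameras, masks, thresh=0.5):
    # means_world: (N, 3) Gaussian centers; cameras: list of (K, w2c) with
    # K a 3x3 intrinsic matrix and w2c a 4x4 world-to-camera transform;
    # masks: list of boolean (H, W) object masks, one per camera.
    votes = np.zeros(len(means_world))
    counts = np.zeros(len(means_world))
    homog = np.concatenate([means_world, np.ones((len(means_world), 1))], axis=1)
    for (K, w2c), mask in zip(cameras, masks):
        cam_pts = (w2c @ homog.T).T[:, :3]
        in_front = cam_pts[:, 2] > 0
        uvw = (K @ cam_pts.T).T
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
        H, W = mask.shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        counts[visible] += 1
        votes[visible] += mask[v[visible], u[visible]]
    # A Gaussian belongs to the object if it lands inside the mask often enough.
    return (votes / np.maximum(counts, 1)) >= thresh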

Next Best Touch

We extend our method to touch: we employ FisherRF depth to select touch poses on the segmented object, add Gaussians at each touch location, and perform local smoothing to refine the surrounding region of the object. We show renderings of a mirror after only 10 random touches versus 10 FisherRF-guided touches, where FisherRF fills in the upper-right corner of the mirror.
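A toy sketch of the touch update follows, assuming a measured contact point and surface normal from the touch sensor: new Gaussian centers are seeded on a small disk around the contact, and nearby existing centers are pulled toward the touched plane as a simple form of local smoothing. The counts, radii, and smoothing weight below are placeholders, not the values used in our system.

import numpy as np

def apply_touch(means, touch_point, touch_normal,
                n_new=32, disk_radius=0.01, smooth_radius=0.03):
    # Build a tangent frame (t, b) around the measured surface normal n.
    n = touch_normal / np.linalg.norm(touch_normal)
    t = np.cross(n, [1.0, 0.0, 0.0])
    if np.linalg.norm(t) < 1e-6:
        t = np.cross(n, [0.0, 1.0, 0.0])
    t /= np.linalg.norm(t)
    b = np.cross(n, t)

    # Seed new Gaussian centers on a small disk tangent to the touched surface.
    angles = np.random.uniform(0.0, 2.0 * np.pi, n_new)
    radii = disk_radius * np.sqrt(np.random.uniform(0.0, 1.0, n_new))
    new_means = touch_point + radii[:, None] * (np.cos(angles)[:, None] * t
                                                + np.sin(angles)[:, None] * b)

    # Local smoothing: pull nearby existing centers halfway onto the touched plane.
    near = np.linalg.norm(means - touch_point, axis=1) < smooth_radius
    offsets = (means[near] - touch_point) @ n
    means[near] = means[near] - 0.5 * offsets[:, None] * n

    return np.concatenate([means, new_means], axis=0)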

Random Touches

FisherRF Guided Touches

BibTeX

@article{strong2024nextbestsense,
  author    = {Matthew Strong and Boshu Lei and Aiden Swann and Wen Jiang and Kostas Daniilidis and Monroe Kennedy III},
  title     = {Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting},
  journal   = {arXiv},
  year      = {2024},
}