Navigating unfamiliar environments with a drone swarm and accurate localization
| Resource | Link |
|---|---|
| Project Slides | google slides |
This project was built for the Project Eagle Hackathon, hosted by Eric Schmidt’s defense tech startup in SF, where our team placed 3rd out of 20+ teams. I had a lot of fun working with my insanely cracked teammates Pranav, Ali, and Jing. Fun fact: Pranav slept a total of 12 hours the week before the hackathon to finish a conference paper, and still managed to stay up for almost the entirety of the hackathon (s/o Pranav). During our final pitch, we had the chance to meet Eric Schmidt and Sebastian Thrun and got some great advice from Eric on the “to PhD or not to PhD” dilemma :)
Our goal for this 24-hour hackathon was to build a system for collaborative drone exploration. We wanted a swarm of drones to enter an unknown space, build a map, and find their own positions within it, all at the same time. Classical CV folks will recognize this as SLAM, with an exploration component layered on top. While the final project wasn’t complete, we got a proof of concept working.
The core challenge is a chicken-and-egg problem: to build a good map, you need to know where you are; but to know where you are, you usually need a map. It’s technically SLAM, but harder: we were working with a drone swarm, so multiple drones had to solve this together with no GPS and no prior knowledge, and we wanted exploration and mapping integrated into a single loop. Classic SLAM handles only mapping and localization, leaving exploration to a separate planner.
Most methods assume they start with some map knowledge. We wanted to see if we could bootstrap everything from scratch using just cameras.
Our solution centers on VGGT (whose code, funnily enough, was released a day before the hackathon), which does something pretty remarkable: it takes camera images and directly spits out camera poses, depth maps, and 3D point clouds all at once. No traditional computer vision pipeline needed.
VGGT works by patchifying input images into tokens (using DINO), adding special camera tokens, and then running everything through alternating frame-wise and global attention layers. A camera head predicts where each camera sits in 3D space, and a DPT head generates dense outputs like depth maps. We clocked it: a full 3D point cloud in under 0.5 seconds.
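To give a sense of how little glue code this takes, here is a minimal inference sketch. The import paths, the `facebook/VGGT-1B` checkpoint name, and the output keys follow the public VGGT repository's example as best I recall; treat them as assumptions rather than our exact hackathon code.

```python
# Minimal VGGT inference sketch. Import paths, checkpoint name, and output keys
# are assumptions based on the public facebookresearch/vggt README, not our
# exact hackathon code.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# A handful of frames collected by the drones (paths are placeholders).
image_names = ["frame_000.png", "frame_001.png", "frame_002.png"]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    # One forward pass returns everything at once: no feature matching,
    # no bundle adjustment, no separate depth network.
    predictions = model(images)

pose_encodings = predictions["pose_enc"]   # per-frame camera pose encodings
depth_maps = predictions["depth"]          # dense depth map per frame
point_map = predictions["world_points"]    # per-pixel 3D points in a shared frame
```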
The system works in a continuous feedback loop. Multiple drones feed their camera images and rough pose estimates into VGGT, which reconstructs the 3D scene and gives back corrected poses for each drone. These updated poses flow into a planning module that generates new trajectories, which get executed through low-level control.
The feedback loop was the most interesting part of the project: as drones move around and gather more visual data, VGGT gets better at understanding the space, which makes the pose estimates more accurate, which makes the planning better, which leads to more informative exploration.
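In code, the loop we were aiming for looks roughly like the sketch below. Every name in it (the drone object, `estimate_poses_with_vggt`, `plan_trajectories`) is a hypothetical placeholder for a component we only had partially working, not a real API.

```python
# Hypothetical sketch of the exploration feedback loop. The drone interface,
# pose estimator, and planner are placeholders, not real APIs.
def exploration_loop(drones, estimate_poses_with_vggt, plan_trajectories, max_steps=100):
    """Run the collect -> reconstruct -> plan -> fly loop for a swarm of drones."""
    keyframes = []  # growing set of frames the swarm has collected so far

    for _ in range(max_steps):
        # 1. Each drone contributes its latest camera frame.
        keyframes.extend(drone.capture_frame() for drone in drones)

        # 2. VGGT reconstructs the scene from the accumulated frames and
        #    returns corrected poses plus a dense point cloud.
        poses, point_cloud = estimate_poses_with_vggt(keyframes)

        # 3. The planner picks new viewpoints expected to reveal unexplored space.
        trajectories = plan_trajectories(poses, point_cloud, len(drones))

        # 4. Low-level control flies each drone along its assigned trajectory.
        for drone, trajectory in zip(drones, trajectories):
            drone.execute(trajectory)
```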
We added VGG19 with Laplacian filtering to sharpen the visual data before feeding it to VGGT - blurry images make for bad 3D reconstruction. The system progressively builds better maps as the drone swarm explores together.
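As an illustration of the Laplacian part, the standard variance-of-Laplacian blur check in OpenCV looks like the sketch below; the threshold is arbitrary, and this deliberately omits the VGG19 feature step we layered on top.

```python
# Variance-of-Laplacian blur check (standard OpenCV recipe). The threshold is
# arbitrary, and this omits the VGG19 step mentioned above.
import cv2

def is_sharp_enough(image_path, threshold=100.0):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # The Laplacian responds strongly to edges; a blurry frame has low variance.
    blur_score = cv2.Laplacian(gray, cv2.CV_64F).var()
    return blur_score >= threshold

# Only sharp frames get forwarded to VGGT.
frames = ["frame_000.png", "frame_001.png", "frame_002.png"]
sharp_frames = [f for f in frames if is_sharp_enough(f)]
```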
Since this was a hackathon, most of our time was spent wrestling with hardware. Controlling multiple drones from one computer turns out to be surprisingly tricky. We spent hours debugging ROS architecture issues, IP address conflicts, and driver incompatibilities on Ubuntu.
We only got pieces working in isolation. The next step would be integrating everything into a fully autonomous system that actually works reliably in the real world.
Here’s a video of our system in action:
Here are our slides from the hackathon: