Over a weekend, this Chinese student built a teleoperation station that mirrors his movements onto a robotic arm with 80 milliseconds of latency.
He sits at a regular desk with a webcam on a tripod, holds his palm in front of the lens, and a black mechanical copy on a tripod next to him mirrors every joint of his fingers in real time.
The entire system runs on a laptop over the local network, with no calls to the cloud.
2 days, zero robotics experience, and nothing but open-source tools.
The full stack (a minimal sender sketch follows the list):
> MediaPipe Hands, an open model from Google that detects 21 landmarks on the hand (every joint of every finger) from ordinary video in real time
> OpenCV for camera capture at 60 frames per second
> A UDP socket, a lightweight network channel on port 9000 that streams the hand coordinates to the robot with minimal delay
> Arduino, a matchbox-sized controller board that drives the robot's 6 servo motors from those coordinates
> InMoov robotic arm, an open-source project whose parts can be downloaded for free and printed on a 3D printer; the finished build comes to about $400
> A regular USB webcam for $20
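Nothing in the post shows his actual code, but a minimal Python sketch of that pipeline could look like this, assuming the sender simply streams all 21 landmarks as JSON over UDP (the robot's LAN address is made up for the example):

import json
import socket

import cv2
import mediapipe as mp

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ROBOT_ADDR = ("192.168.1.50", 9000)   # hypothetical LAN address; port 9000 as in the post

cap = cv2.VideoCapture(0)             # the $20 USB webcam
cap.set(cv2.CAP_PROP_FPS, 60)         # ask OpenCV for 60 fps capture

# MediaPipe Hands: 21 landmarks per detected hand, straight from ordinary video
with mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB, OpenCV delivers BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            packet = json.dumps([[p.x, p.y, p.z] for p in lm]).encode()
            sock.sendto(packet, ROBOT_ADDR)   # fire-and-forget UDP: no handshake, no retransmits

UDP fits here because a stale hand position is worthless: better to drop a packet than to wait for a retransmission.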
Head-pose tracking runs in the same session: the camera also determines which way the head is turned, and he combined that data with the hand position so the robotic arm rotates to the same angle as his neck and works wherever he is looking.
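One way to do that head-to-base mapping, assuming the tracker reports orientation as an (x, y, z, w) quaternion like the Rot line in the session log below and that y is the vertical axis (both are guesses, since his conventions aren't shown):

from scipy.spatial.transform import Rotation

def head_yaw_to_base_servo(qx, qy, qz, qw, center_deg=90.0):
    # Yaw = rotation about the (assumed) vertical y axis of the head pose
    yaw_deg = Rotation.from_quat([qx, qy, qz, qw]).as_euler("yxz", degrees=True)[0]
    # Center the servo at 90 degrees and clamp to its 0-180 mechanical range
    return max(0.0, min(180.0, center_deg + yaw_deg))

# With the head rotation printed in the session log below
print(head_yaw_to_base_servo(0.197, -0.055, -0.024, 0.979))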
His process was more about calibration than code:
> Launch MediaPipe and get the base landmarks
> Capture hand coordinates in a neutral position for calibration
> Map the 21 points of a human hand onto 6 servo angles (the robot's fingers have fewer joints than a human's)
> Add an 80 ms buffer for smooth movement (see the sketch after this list)
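A sketch of steps 2-4, assuming the 6 servos split into 5 finger servos plus 1 wrist servo and that finger curl is estimated from fingertip-to-wrist distance; neither detail is confirmed in the post, it is just one plausible mapping:

from collections import deque

import numpy as np

# MediaPipe Hands landmark indices: wrist plus the five fingertips
WRIST, TIPS = 0, [4, 8, 12, 16, 20]   # thumb, index, middle, ring, pinky

class HandToServoMapper:
    def __init__(self, smooth_frames=5):   # ~80 ms of history at 60 fps
        self.neutral = None
        self.history = deque(maxlen=smooth_frames)

    def _distances(self, landmarks):
        pts = np.asarray(landmarks)         # shape (21, 3), normalized coordinates
        return np.linalg.norm(pts[TIPS] - pts[WRIST], axis=1)

    def calibrate(self, landmarks):
        # Record fingertip-to-wrist distances for the open, neutral hand
        self.neutral = self._distances(landmarks)

    def map(self, landmarks, wrist_angle_deg=90.0):
        # Curl ratio: 1.0 = fully open (as in the neutral pose), smaller = finger curled
        curl = np.clip(self._distances(landmarks) / self.neutral, 0.0, 1.0)
        finger_angles = 180.0 * (1.0 - curl)        # 0 deg open, 180 deg closed
        angles = np.append(finger_angles, wrist_angle_deg)
        # Moving-average buffer: trades ~80 ms of latency for smooth, jitter-free motion
        self.history.append(angles)
        return np.mean(self.history, axis=0)

At 60 fps, a 5-frame moving average is roughly where the deliberate 80 ms of latency comes from: a small, constant delay in exchange for motion the servos can actually follow.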
And here is what his server prints during a live session:
"Stream started
Port: 9000
Protocol: UDP
Hands: Right Hand"
"[Right] Wrist Pos: (-0.043, 0.763, 0.256)"
"[Right] Landmarks
Thumb: (-0.053, -0.068, 0.102)
Index: (-0.018, -0.049, 0.148)
Mid: (0.017, -0.028, 0.180)
Ring: (0.041, -0.022, 0.172)
Pinky: (0.068, -0.045, 0.134)
[Head] Pose
Pos: (0.032, 1.023, 0.006)
Rot: (0.197, -0.055, -0.024, 0.979)"
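The receiving side that turns those packets into servo commands is not shown either; here is a sketch of it, reusing the HandToServoMapper above and assuming the JSON landmark format from the first sketch, pyserial, and a made-up one-byte-per-servo protocol to the Arduino:

import json
import socket

import serial  # pyserial

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))                     # the port the server log reports
arduino = serial.Serial("/dev/ttyUSB0", 115200)  # assumed serial port and baud rate

mapper = HandToServoMapper()                     # from the mapping sketch above
calibrated = False

while True:
    data, _ = sock.recvfrom(4096)
    landmarks = json.loads(data)                 # 21 [x, y, z] triples
    if not calibrated:
        mapper.calibrate(landmarks)              # treat the first frame as the neutral pose
        calibrated = True
    angles = mapper.map(landmarks)
    # One byte per servo, 0-180; the Arduino side just calls servo.write() on each value
    arduino.write(bytes(int(round(a)) for a in angles))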
This entire station works without gloves, without sensors, and without a motion capture suit. All it takes is 1 camera that sees the palm and 1 laptop that turns that view into commands for servo motors.
The system knows where in space every joint of the human hand is located. It knows which way the head is turned. It knows which movements need to be replicated in the next 80 milliseconds.
He himself says that human gestures are the cleanest source of training data for embodied AI, meaning neural networks that learn to control physical robots by imitating a person.
The demo works right now in his room with no visible delay.
A year ago this would have required a team of Stanford graduates, a $500,000 grant, and a $30,000 motion capture suit, the same kind actors wear for CGI scenes in blockbusters.
The motion capture suit just had its Linux moment. In 2023 it cost $30,000. In 2026, $20 for a USB webcam and a couple of lines of Python.
$400 for a robotic arm, $20 for a camera, $0 for the software, and the exact same data collection principle used to train Tesla Optimus, Figure 02, and 1X Neo.
In 18 months this will be in every university lab.
In 36, in the hands of every student who took a basic Python course.
Humanoid robotics is going through its vibe-coding moment right now.
Readers added context they thought people might want to know
The creator of the video has publicly corrected the claims made in this post:
- The creator is Japanese, and is neither Chinese nor a student.
- The setup simply uses a Meta Quest, without any tripod or unusual equipment.
Sources:
x.com/i/status/20440…
x.com/i/status/20507…