Hand detection in 3D space
Integrating Google Mediapipe with Intel Realsense Depth Cameras for 3D Hand Detection on NVIDIA Jetson AGX Xavier
Hand detection is a critical technology in multi-sensor robotic systems, as it figures into everything from virtual object manipulation to hand gesture control. Numerous solutions exist right now that aim to solve this problem, though most rely on 2D information, and for good reason: using 3D data is computationally intensive, which gives 2D data a big advantage in speed and cost.
Most common 2D hand tracking methods currently in use rely on RGB cameras to capture images; Google’s Mediapipe is a popular example. Mediapipe is a hand and finger tracking library that uses machine learning to infer “landmarks” on the hand and determine its location.
After running a series of tests, our team found that Mediapipe appears to work primarily in the 2D plane, despite claiming to use 3D landmarks (see here for more information). Each landmark’s 3D information is relative to the other landmarks and doesn’t accurately describe its location in real space. To obtain true 3D information, we’d have to step away from relying exclusively on traditional 2D RGB cameras.
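To illustrate what we mean, here is a minimal sketch (not from our project code, and using a hypothetical still image) that prints a landmark from Mediapipe’s Python API. The x and y values are normalized to the image width and height, and z is depth relative to the wrist landmark, so none of them place the hand in real-world space:
# Minimal sketch with a hypothetical test image "hand.jpg": Mediapipe landmark
# values are unitless. x/y are normalized to image dimensions; z is depth
# relative to the wrist landmark, not a real-world distance.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True)
image = cv2.imread("hand.jpg")
results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
if results.multi_hand_landmarks:
    index_tip = results.multi_hand_landmarks[0].landmark[8]  # index fingertip
    print(index_tip.x, index_tip.y, index_tip.z)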
Intel’s Realsense depth cameras provide both an RGB image and depth data. Our team used a D435 camera, which contains an RGB camera, two stereo cameras for depth, and an IR sensor. The Realsense library contains functions that align the two frames of data into one, providing depth data for every pixel in the RGB images. This creates an opportunity to integrate Mediapipe with Realsense color images, and reference the depth information from the depth camera, allowing us to locate a hand in 3D space, using far less computational power (and incurring less expense) than other 3D solutions.
We decided to build this 3D solution on the NVIDIA Jetson AGX Xavier platform, which brings a few benefits and limitations that I’ll describe later in the post.
RGB and Depth Alignment
Making this solution work depends on being able to align the RGB and depth images. Mediapipe needs the color image information to make its inferences, and we need accurate pixel-by-pixel depth data in order to determine the depth of each landmark on the hand. This is complicated by the fact that the two sensors have different fields of view: 69°×42° for the RGB camera, and 87°×58° for the depth sensor (nearly twice the area).
Fortunately, Intel’s Realsense library has a built-in ‘align’ function that performs the necessary calculations to overlay these images despite their FOV differences. While experimenting with this approach, though, we discovered that the align function isn’t optimized for non-Intel CPUs, which posed a problem for the ARM processor in our NVIDIA Jetson AGX Xavier. While aligning the depth and color frames during testing, our processing framerate dropped from the expected 30 FPS to just 2 to 4 FPS. Human arms tend to move fairly quickly, making framerate a critical issue in most hand tracking applications.
Fortunately, the Jetson also comes with a 512-core Volta GPU, so we were able to utilize the onboard CUDA cores while building the Realsense library. If you are using an Intel CPU, or otherwise do not need to optimize your Realsense library, feel free to skip this section and move on to the more in-depth examples below.
The first issue with using the camera with the NVIDIA board is that there are no pre-built binary distributions of the RealSense SDK. The Jetson board is based on an ARM chipset, so the SDK provided by Intel has to be built from source on the device itself.
Here’s how you do it.
First, make sure libssl-dev is installed; if it isn’t, install it with:
$ sudo apt install libssl-dev
Next, we download and unzip the Realsense source archive, and create the build directory:
$ wget https://github.com/IntelRealSense/librealsense/archive/refs/tags/v2.48.0.zip
$ unzip v2.48.0.zip
$ cd librealsense-2.48.0
$ mkdir build && cd build
Then we need to ensure the PATH environment variables are set so the build script knows where CUDA and Python are located on the machine. Open .bashrc in your favorite text editor, and add the following lines to the bottom of the file:
$ export PYTHONPATH=$PYTHONPATH:/usr/local/lib
$ export CUDA_HOME=/usr/local/cuda
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
$ export PATH=$PATH:$CUDA_HOME/bin
$ export CMAKE_CUDA_COMPILER=/usr/local/cuda-10.2/bin/nvcc
Reload the terminal to load those changes, then do a test build to see if the full build is likely to succeed:
$ cmake ../ -DFORCE_RSUSB_BACKEND=ON -DBUILD_PYTHON_BINDINGS:bool=true -DPYTHON_EXECUTABLE=/usr/bin/python3
If that was successful, run the full configuration, which enables the example programs and optimizations like CUDA. This process will take significant time:
$ cmake ../ -DFORCE_RSUSB_BACKEND=ON -DBUILD_PYTHON_BINDINGS:bool=true -DPYTHON_EXECUTABLE=/home/[username]/archiconda3/envs/[conda env name]/bin/python -DCMAKE_BUILD_TYPE=release -DBUILD_EXAMPLES=true -DBUILD_GRAPHICAL_EXAMPLES=true -DBUILD_WITH_CUDA:bool=true
Once this configuration succeeds, compile and install to finish the process (this will also take a while):
$ make -j8 # Set this number based on the number of CPU cores (our Jetson AGX has 8 cores)
$ sudo make install
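Before moving on, it can be worth sanity-checking the install. Assuming the tools were built alongside the examples and your PYTHONPATH points at the install location (as set above), something like the following should work:
$ rs-enumerate-devices   # should list your connected Realsense camera(s)
$ python3 -c "import pyrealsense2 as rs; print('pyrealsense2 imported OK')"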
Now that we have a CUDA version of the Intel Realsense library, we can move on to the Mediapipe integration. In this example we are working in Python, and will therefore use the Python version of the Realsense and Mediapipe functions. We’re only working with a single camera in this example, but this same process can be applied to multiple cameras for greater coverage.
Code sample
First, we need to import all the relevant libraries:
Note: See GitHub for sample code file
import pyrealsense2 as rs
import mediapipe as mp
import cv2
import numpy as np
import datetime as dt
Then we set up the formatting for our OpenCV window that displays the output of our Hand Detection and Tracking:
font = cv2.FONT_HERSHEY_SIMPLEX
org = (20, 100)
fontScale = .5
color = (0,150,255)
thickness = 1
Next, we need to configure the Realsense environment, detect the cameras, and set up the Realsense variables that will be used later in the example:
# ====== Realsense ======
realsense_ctx = rs.context()
connected_devices = [] # List of serial numbers for present cameras
for i in range(len(realsense_ctx.devices)):
    detected_camera = realsense_ctx.devices[i].get_info(rs.camera_info.serial_number)
    print(f"{detected_camera}")
    connected_devices.append(detected_camera)
device = connected_devices[0] # In this example we are only using one camera
pipeline = rs.pipeline()
config = rs.config()
background_removed_color = 153 # Grey
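As mentioned earlier, this same process can be applied to multiple cameras. A hedged sketch (not part of the original example) of how the device loop above could be extended:
# Sketch (assumption, not in the original example): to track hands with several
# cameras, keep one pipeline/config pair per serial number instead of using only
# connected_devices[0].
pipelines = {}
configs = {}
for serial in connected_devices:
    cfg = rs.config()
    cfg.enable_device(serial)      # bind this config to one specific camera
    pipelines[serial] = rs.pipeline()
    configs[serial] = cfg
# Each pipeline would then be configured, started, and aligned exactly as in the
# single-camera code below, with the main loop iterating over all of them.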
The final setup step is to define variables for Mediapipe:
# ====== Mediapipe ======
mpHands = mp.solutions.hands
hands = mpHands.Hands()
mpDraw = mp.solutions.drawing_utils
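Optionally, Hands() also accepts a few tuning parameters (parameter names from Mediapipe’s Python solution API; the values shown here are illustrative), which you could use in place of the default constructor above:
# Optional tuning when constructing the Hands object; max_num_hands raises the
# two-hand limit mentioned later in this post.
hands = mpHands.Hands(
    max_num_hands=2,               # maximum number of hands to detect
    min_detection_confidence=0.5,  # palm-detection confidence threshold
    min_tracking_confidence=0.5    # landmark-tracking confidence threshold
)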
Next, we need to enable the Realsense streams and configure the streams for alignment:
# ====== Enable Streams ======
config.enable_device(device)

# For worse FPS, but better resolution:
# stream_res_x = 1280
# stream_res_y = 720
# For better FPS, but worse resolution:
stream_res_x = 640
stream_res_y = 480
stream_fps = 30

config.enable_stream(rs.stream.depth, stream_res_x, stream_res_y, rs.format.z16, stream_fps)
config.enable_stream(rs.stream.color, stream_res_x, stream_res_y, rs.format.bgr8, stream_fps)
profile = pipeline.start(config)

align_to = rs.stream.color
align = rs.align(align_to)
The difference in FOV between the two sensors means that, when we align the images, there’s a large region external to the color region that contains only depth data. Mediapipe was built using color images and performs poorly on depth data alone, so in practice, we’re limited to the color sensor FOV.
Now that the streams have been enabled, we can extract the depth scale from the depth camera and use it to set a clipping distance. In our example, I am setting the clipping distance to 2 meters. In other words, we’ll only look at images where the depth data is within 2 meters, and discard everything else.
# ====== Get depth Scale ======
depth_sensor = profile.get_device().first_depth_sensor()
depth_scale = depth_sensor.get_depth_scale()
print(f"\tDepth Scale for Camera SN {device} is: {depth_scale}")

# ====== Set clipping distance ======
clipping_distance_in_meters = 2
clipping_distance = clipping_distance_in_meters / depth_scale
print(f"\tConfiguration Successful for SN {device}")
From here, we enter a loop that acquires the frames, processes the information, and displays the final images. More detailed sections follow for each step described:
while True:
    start_time = dt.datetime.today().timestamp() # Necessary for FPS calculations

    # Get and align frames
    # Process images
    # Process hands
    # Display FPS
    # Display images
Get and align frames:
This is where we actually gather the frames and perform the alignment.
# Get and align frames
frames = pipeline.wait_for_frames()
aligned_frames = align.process(frames)
aligned_depth_frame = aligned_frames.get_depth_frame()
color_frame = aligned_frames.get_color_frame()

if not aligned_depth_frame or not color_frame:
    continue
Process images:
The first step to processing the images is to convert the depth and color images into numpy arrays for later manipulation, and flip them horizontally. The images need to be flipped for Mediapipe to correctly determine the handedness of the detected hands.
# Process images
depth_image = np.asanyarray(aligned_depth_frame.get_data())
depth_image_flipped = cv2.flip(depth_image,1)
color_image = np.asanyarray(color_frame.get_data())
Next, since the depth image is only a single channel and the color image has three, we need to stack the depth image on itself three times.
depth_image_3d = np.dstack((depth_image,depth_image,depth_image))
background_removed = np.where((depth_image_3d > clipping_distance) | (depth_image_3d <= 0), background_removed_color, color_image)
Finally, we format the images both for easier human comprehension and for Mediapipe.
depth_colormap = cv2.applyColorMap(cv2.convertScaleAbs(depth_image, alpha=0.03), cv2.COLORMAP_JET)
images = cv2.flip(background_removed, 1)
color_image = cv2.flip(color_image, 1)
color_images_rgb = cv2.cvtColor(color_image, cv2.COLOR_BGR2RGB)
Process hands:
Now that we have the depth and RGB images correctly formatted and aligned, we begin processing with Mediapipe.
# Process hands
results = hands.process(color_images_rgb)
if results.multi_hand_landmarks:
    number_of_hands = len(results.multi_hand_landmarks)
    i = 0
If any hands are detected in the image, we need to cycle through them for further processing. One important note: Mediapipe detects a maximum of two hands by default, but this limit can be raised with the max_num_hands parameter when constructing the Hands object (see the optional parameters shown earlier).
for handLms in results.multi_hand_landmarks:
    mpDraw.draw_landmarks(images, handLms, mpHands.HAND_CONNECTIONS)
    org2 = (20, org[1] + (20 * (i + 1)))
    hand_side_classification_list = results.multi_handedness[i]
    hand_side = hand_side_classification_list.classification[0].label
With the hand landmarks drawn and hand side identified, we need to pick a point to determine the depth information for. In our example here, I have simply chosen the middle finger knuckle as an approximation of the center of the hand. From there we extract the x and y coordinates of the middle knuckle:
middle_finger_knuckle = results.multi_hand_landmarks[i].landmark[9]
x = int(middle_finger_knuckle.x*len(depth_image_flipped[0]))
y = int(middle_finger_knuckle.y*len(depth_image_flipped))
if x >= len(depth_image_flipped[0]):
    x = len(depth_image_flipped[0]) - 1
if y >= len(depth_image_flipped):
    y = len(depth_image_flipped) - 1
From there we use the x and y coordinates and reference the depth image to extract the depth of that single point. We translate into feet for this example because it is easier for me to conceptualize during the demonstration, but if you are more familiar with meters, remove the conversion.
mfk_distance = depth_image_flipped[y,x] * depth_scale # meters
mfk_distance_feet = mfk_distance * 3.281 # feet
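If you need a full 3D position rather than just a distance, the Realsense library can deproject a pixel plus its depth into camera-space coordinates. A hedged sketch of such an extension (not part of the original example; note that x was computed on the horizontally flipped image, so we map it back first):
# Extension sketch (assumption, not in the original example): deproject the pixel
# and its depth into a 3D point, in meters, in the color camera's coordinate frame.
color_intrinsics = color_frame.profile.as_video_stream_profile().get_intrinsics()
x_unflipped = depth_image_flipped.shape[1] - 1 - x  # undo the horizontal flip
point_3d = rs.rs2_deproject_pixel_to_point(color_intrinsics, [x_unflipped, y], mfk_distance)
# point_3d = [X, Y, Z] in meters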
Finally, we display this information on the image directly with OpenCV and loop back to check for another hand. We also include logic/display text for the case of no hands and for a title line.
        images = cv2.putText(images, f"{hand_side} Hand Distance: {mfk_distance_feet:0.3} feet ({mfk_distance:0.3} m) away", org2, font, fontScale, color, thickness, cv2.LINE_AA)
        i += 1
    # After the loop over detected hands, still inside the if block:
    images = cv2.putText(images, f"Hands: {number_of_hands}", org, font, fontScale, color, thickness, cv2.LINE_AA)
else:
    images = cv2.putText(images, "No Hands", org, font, fontScale, color, thickness, cv2.LINE_AA)
Display FPS:
# Display FPS
time_diff = dt.datetime.today().timestamp() - start_time
fps = int(1 / time_diff)
org3 = (20, org[1] + 60)
images = cv2.putText(images, f"FPS: {fps}", org3, font, fontScale, color, thickness, cv2.LINE_AA)
name_of_window = 'SN: ' + str(device)
Display images:
# Display images
cv2.namedWindow(name_of_window, cv2.WINDOW_AUTOSIZE)
cv2.imshow(name_of_window, images)
key = cv2.waitKey(1)
# Press esc or 'q' to close the image window
if key & 0xFF == ord('q') or key == 27:
    print(f"User pressed break key for SN: {device}")
    break
Finally, once the user presses the exit button to break us out of the loop, we need to close the pipeline connecting the Realsense camera:
print(f"Exiting loop for SN: {device}")
print(f"Application Closing.")
pipeline.stop()
print(f"Application Closed.")
About Bryce Copenhaver
Bryce Copenhaver is a Systems Engineer with a strong passion for the intersection of electrical, mechanical, and software engineering. He brings expertise in building systems that integrate sensors and hardware with custom software to translate the world around him into data that can be processed by computers. Bryce has worked across numerous engineering industries, including autonomous vehicles and defense, and holds a Robotics Engineering BSE from Arizona State University. He loves building robots to take care of tasks around the house and skating around the city with friends.