Creating a 3D model from a video

November 12, 2023

In this blog post we present a technique called Structure from Motion (SfM), which recovers the 3D structure of a scene from 2D images. SfM is used in many applications, such as 3D scanning, augmented reality, and visual simultaneous localization and mapping (vSLAM). Several implementations of and approaches to SfM exist; we selected COLMAP for this demonstration.

Humans perceive a great deal of information about the three-dimensional structures in their environment by moving around in it. When the observer moves, the objects around them appear to shift by different amounts depending on their distance: nearby objects shift more than distant ones. This is known as motion parallax. From this depth cue the observer can estimate an accurate 3D representation of the world. Structure from Motion (SfM) tries to reproduce this ability in computer vision.
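The parallax effect can be made concrete with a small calculation. The sketch below assumes an idealized pinhole camera that moves sideways (parallel to the image plane) by a baseline b; a point at depth Z then shifts in the image by f * b / Z pixels, where f is the focal length in pixels. The function name and the example numbers are hypothetical and chosen for illustration only.

```python
def parallax_shift_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Image shift (in pixels) of a point at depth_m when an idealized
    pinhole camera moves sideways by baseline_m."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical numbers: 1500 px focal length, camera moved 0.10 m sideways.
for depth in (1.0, 2.0, 10.0):
    shift = parallax_shift_px(1500.0, 0.10, depth)
    print(f"depth {depth:5.1f} m -> shift {shift:6.1f} px")
```

The closer the point, the larger the shift. SfM exploits this relationship in the opposite direction: it observes the shifts and solves for the depths and the camera motion.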

SfM can be computed in many ways. The right approach depends on factors such as the number and type of cameras used and whether the images are ordered or not. Let us look at one such approach in detail through COLMAP by creating a 3D reconstruction of our colleague’s motorbike from a video. For this purpose, we used 25.5 seconds of video footage captured with an iPhone, which was split into 764 individual images (roughly 30 frames per second). COLMAP first runs feature extraction on these images. Image features are distinctive local structures in an image that can be found again when the same part of the scene is captured from a different viewpoint; they are the basis for reconstructing the 3D structure of a scene or object.
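Any tool can be used to split a video into frames. As a minimal sketch, assuming ffmpeg is used for the frame extraction and COLMAP's command-line interface for the feature extraction, the two steps could be assembled as follows. The file and folder names (motorbike.mov, images, database.db) are hypothetical, and the commands are only printed here; they can be passed to subprocess.run once both tools are installed.

```python
import shlex


def frame_extraction_cmd(video: str, image_dir: str) -> list[str]:
    # ffmpeg writes the frames of the video as numbered JPEG files.
    # The image_dir folder must already exist.
    return ["ffmpeg", "-i", video, "-qscale:v", "2",
            f"{image_dir}/frame_%04d.jpg"]


def feature_extraction_cmd(database: str, image_dir: str) -> list[str]:
    # COLMAP stores the extracted features in a database file.
    return ["colmap", "feature_extractor",
            "--database_path", database,
            "--image_path", image_dir]


# Hypothetical file names, for illustration only.
commands = [
    frame_extraction_cmd("motorbike.mov", "images"),
    feature_extraction_cmd("database.db", "images"),
]
for cmd in commands:
    print(shlex.join(cmd))

# Average frame rate implied by 764 frames in 25.5 seconds.
print(f"{764 / 25.5:.2f} frames per second")
```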

In the image below (frame #484) the extracted features are highlighted with red dots.

Next, we need to match the extracted features across the images in order to estimate the 3D coordinates of the extracted points and the image poses (the 3D position and orientation of the camera for each image). Of the available matching strategies, we chose sequential matching, because consecutive frames of the input video overlap heavily and there was no need to match all image pairs exhaustively.
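To see why sequential matching pays off, it helps to count image pairs. Exhaustive matching compares every image with every other image, while sequential matching only compares each frame with a fixed number of the frames that follow it. The sketch below does the arithmetic for 764 frames. The window of 10 frames is an assumption made for illustration, and the count ignores extras such as loop detection.

```python
def exhaustive_pairs(n_images: int) -> int:
    # Every image is paired with every other image once.
    return n_images * (n_images - 1) // 2


def sequential_pairs(n_images: int, window: int) -> int:
    # Each image is paired with up to `window` images that follow it.
    return sum(min(window, n_images - 1 - i) for i in range(n_images))


n = 764
window = 10  # assumed overlap window, for illustration only

full = exhaustive_pairs(n)
seq = sequential_pairs(n, window)
print(f"exhaustive: {full} pairs")
print(f"sequential: {seq} pairs")
print(f"ratio: {full / seq:.1f}x fewer pairs to match")
```

The number of exhaustive pairs grows quadratically with the number of images, whereas the number of sequential pairs grows only linearly.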

Note that – in our experience – matching can take a significant amount of time depending on the number of images, the number of features per image, and the chosen matching mode. Expected times for exhaustive matching scale from a few minutes for tens of images to a few hours for hundreds of images to days or weeks for thousands of images.

The image below shows an example of matched features, drawn in green, between images #484 and #420.

Once the relationships between the images and their features are known, the reconstruction of the 3D scene can begin. COLMAP incrementally extends the 3D point cloud of the scene by registering new images and triangulating new points.
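Triangulation is the step that turns matched 2D features into 3D points: each matched feature defines a viewing ray from its camera, and the 3D point lies where the rays (nearly) meet. The sketch below uses the simple midpoint method for two rays to illustrate the idea. It is not COLMAP's implementation, which is more robust and works with many views; the camera positions and the test point are made-up values.

```python
Vec = tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(o1: Vec, d1: Vec, o2: Vec, d2: Vec) -> Vec:
    """Midpoint of the shortest segment between the rays o1 + s*d1 and o2 + t*d2."""
    w0 = (o1[0] - o2[0], o1[1] - o2[1], o1[2] - o2[2])
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, the point cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(o1[i] + s * d1[i] for i in range(3))
    p2 = tuple(o2[i] + t * d2[i] for i in range(3))
    return tuple((p1[i] + p2[i]) / 2 for i in range(3))


# Two made-up camera centres 1 m apart, both observing the same point.
point = (0.5, 0.2, 4.0)
cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
ray1 = tuple(point[i] - cam1[i] for i in range(3))
ray2 = tuple(point[i] - cam2[i] for i in range(3))
print(triangulate_midpoint(cam1, ray1, cam2, ray2))
```

With noisy real-world matches the two rays do not intersect exactly, which is why the midpoint of the shortest connecting segment is used as the estimate here.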

Note that there are cases where COLMAP splits the reconstruction into separate sub-models. This happens when there is not enough overlap between images. In this case more images need to be added to fill in the gaps, so that COLMAP can correlate everything into one single model.
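A quick way to notice such a split is to look at the output folder of the reconstruction. COLMAP's mapper writes each reconstructed model into its own numbered subfolder (0, 1, ...) of the output path, so more than one subfolder means the scene was split. The helper below is a small sketch with a hypothetical folder name (sparse); it only lists the subfolders and does not read the models themselves.

```python
from pathlib import Path


def list_submodels(output_path: str) -> list[str]:
    """Return the numbered model subfolders found in a mapper output folder."""
    root = Path(output_path)
    if not root.is_dir():
        return []
    names = [p.name for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(names, key=int)


# "sparse" is a hypothetical output folder name.
models = list_submodels("sparse")
if len(models) > 1:
    print(f"Reconstruction was split into {len(models)} sub-models: {models}")
elif models:
    print("All registered images are in a single model.")
else:
    print("No models found.")
```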

In our case all 764 images were registered in one single model. The point cloud consists of 228,770 vertices. The registered images are shown in red in 3D space; together they also trace the trajectory of the reconstructed camera movement.

In our experience, following the guidelines below is very important for a successful reconstruction.

  • Capture images with good texture. Texture is needed for the feature detection that connects the images to one another.
  • Capture images under similar illumination conditions. Feature matching will fail between two points that should be correlated if they are lit differently.
  • Capture images with high visual overlap. Otherwise the reconstruction will break apart into separate sub-models. For example, when we modelled a sewer pipe, more pictures were needed at the turns of the pipe to maintain cohesion between the images.
  • Capture images from different viewpoints. This helps to estimate depth information better, so the result keeps the same proportions as in the real world.
