Spatiotemporal Background Subtraction and Video Segmentation

Video segmentation and background subtraction are fundamental tasks in many computer vision and video processing applications, where they typically serve as the first step for tracking, recognition, and classification, among others. In many existing methods, the background is modeled as a set of independent pixel processes, and statistical decision algorithms combined with ad hoc methods are applied to extract the background. The major advantage of these methods is their low computational complexity, which allows them to run in real time, a crucial property in applications such as tracking. However, the assumption of independence between pixels is very strong, and it leads to excessive noise in the results.
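To make the independent-pixel idea concrete, the following is a minimal sketch of one common variant, a running Gaussian per pixel. It is an illustration of this class of methods, not the algorithm proposed here; the function names and the parameters `alpha`, `k`, and `init_var` are assumptions chosen for the example.

```python
def update_pixel_model(mean, var, value, alpha=0.05):
    """Exponentially update one pixel's running mean and variance."""
    diff = value - mean
    new_mean = mean + alpha * diff
    new_var = (1.0 - alpha) * (var + alpha * diff * diff)
    return new_mean, new_var


def is_foreground(mean, var, value, k=2.5):
    """Flag a pixel whose value lies more than k standard deviations from its mean."""
    return (value - mean) ** 2 > k * k * var


def subtract_background(frames, alpha=0.05, k=2.5, init_var=25.0):
    """Return one boolean foreground mask per frame after the first.

    frames: list of 2D lists of grayscale values; the first frame
    initializes the model (an assumption made for this sketch).
    """
    means = [[float(v) for v in row] for row in frames[0]]
    variances = [[init_var for _ in row] for row in frames[0]]
    masks = []
    for frame in frames[1:]:
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                fg = is_foreground(means[y][x], variances[y][x], value, k)
                mask_row.append(fg)
                if not fg:
                    # Only background observations adapt the model.
                    means[y][x], variances[y][x] = update_pixel_model(
                        means[y][x], variances[y][x], value, alpha
                    )
            mask.append(mask_row)
        masks.append(mask)
    return masks
```

Each pixel is decided without looking at its neighbors, which keeps the cost per frame linear in the number of pixels but also means that isolated noisy pixels are flagged individually.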

Another class of algorithms incorporates a full probabilistic model of the video frames and attempts to obtain a full segmentation of the video into objects. Background classification can then be performed using the extracted objects. Although these algorithms lead to more accurate results than those described above, they are impractical in many applications because of their very high computational complexity.

In our work, we attempt to combine the advantages of both classes of algorithms. Our method uses segmentation for background modeling to incorporate a better understanding of the semantic content of the background. Starting from a segmentation of the background scene, we calculate local spatial statistics to model the background as a combination of local spatiotemporal probability density functions (pdfs).
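As a rough illustration of computing local spatial statistics from a segmentation, the sketch below builds one normalized intensity histogram per segment. It assumes a label map is already available from some segmentation step and uses histograms as the density estimate; both choices, along with the names `region_histograms`, `bins`, and `max_value`, are assumptions for this example rather than details taken from the cited papers.

```python
def region_histograms(image, labels, bins=8, max_value=256):
    """Estimate one normalized intensity histogram per segment.

    image:  2D list of grayscale values in [0, max_value).
    labels: 2D list of the same shape giving each pixel's segment id.
    Returns a dict mapping segment id to a list of `bins` probabilities.
    """
    counts = {}
    for image_row, label_row in zip(image, labels):
        for value, label in zip(image_row, label_row):
            hist = counts.setdefault(label, [0] * bins)
            index = min(int(value) * bins // max_value, bins - 1)
            hist[index] += 1
    return {
        label: [c / sum(hist) for c in hist]
        for label, hist in counts.items()
    }
```

Because each segment keeps a full histogram instead of a single mean and variance, a region containing two distinct intensities (for example, a textured surface) is represented by two separate peaks.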

This extension of the probabilistic model to the spatial domain, combined with temporal adaptation of the distribution functions, provides a powerful characterization of foreground and background regions. The proposed algorithm provides efficient background subtraction while preserving multimodality both spatially and temporally.
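One simple way to picture temporal adaptation of a region's distribution is to blend its histogram toward the histogram of newly observed background values, then classify a pixel by the probability its value receives. The sketch below is an assumption-laden illustration: the blending rule, the `rate` and `threshold` parameters, and all function names are hypothetical and are not claimed to match the published method.

```python
def bin_index(value, bins, max_value=256):
    """Map a grayscale value to its histogram bin."""
    return min(int(value) * bins // max_value, bins - 1)


def adapt_pdf(pdf, observed_values, rate=0.1, max_value=256):
    """Blend a region's pdf toward the histogram of new observations."""
    if not observed_values:
        return list(pdf)
    bins = len(pdf)
    observed = [0.0] * bins
    for value in observed_values:
        observed[bin_index(value, bins, max_value)] += 1.0 / len(observed_values)
    return [(1.0 - rate) * p + rate * o for p, o in zip(pdf, observed)]


def classify_pixel(pdf, value, threshold=0.05, max_value=256):
    """Return True (foreground) if the region's pdf gives the value low probability."""
    return pdf[bin_index(value, len(pdf), max_value)] < threshold


# A bimodal background region: dark and bright values are both expected.
pdf = [0.5, 0.0, 0.0, 0.5]
dark_is_fg = classify_pixel(pdf, 10)      # falls in the first mode
bright_is_fg = classify_pixel(pdf, 250)   # falls in the second mode
mid_is_fg = classify_pixel(pdf, 100)      # falls between the modes
adapted = adapt_pdf(pdf, [100, 100], rate=0.1)
```

Since the histogram is non-parametric, both modes survive the update, and a value that keeps appearing in the region gradually gains enough probability mass to be treated as background.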

Our method also compares favorably in terms of computational complexity and memory requirements. With temporal constraints only, processing reaches real-time speeds of 10-30 fps; with spatial constraints included for smoother results, 5-10 fps can be achieved. The initialization phase requires only 1-2 seconds, which is acceptable in most applications.

For technical details and analysis of the algorithm, see Babacan:2006:120 and Babacan:2007:119.