This project explores a simple idea: not all parts of a video frame are equally important. In many scenes, the viewer's attention is drawn to moving subjects while the background remains relatively static.
In this system, video frames are analyzed to detect motion and separate foreground regions from the background. These segmented regions can then be treated differently during compression, preserving more detail in moving areas while applying stronger compression to static regions.
The project demonstrates the full pipeline from raw frame processing to synchronized video playback, along with visualizations that show how motion segmentation influences the compression process.
Demo
The demo video below shows the final reconstructed video along with visualizations of the segmentation process used to identify moving regions in each frame.
The first portion of the video shows the reconstructed playback with synchronized audio.
Later in the video, segmentation visualizations illustrate how the system identifies moving regions and foreground blocks during processing.
Source Code
The implementation for this project, including the compression pipeline and visualization scripts, is available on GitHub.
The compression system processes raw video frames, identifies moving regions, and applies transform-based compression before reconstructing the final video.
01 – Forward Transform
The 2D Discrete Cosine Transform converts an 8×8 block of spatial pixel
values into the frequency domain:

$$F(u, v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y)\, \cos\!\frac{(2x+1)u\pi}{16}\, \cos\!\frac{(2y+1)v\pi}{16}$$

The DC coefficient at $(u, v) = (0, 0)$ captures average intensity; higher-index
AC coefficients encode increasingly fine spatial detail.
- $f(x, y)$: Pixel intensity at position $(x, y)$ in the 8×8 block
- $F(u, v)$: DCT coefficient at frequency index $(u, v)$
- $C(k)$: Normalization: $\frac{1}{\sqrt{2}}$ when $k = 0$, else $1$
- $u, v \in [0, 7]$: Frequency indices (low = coarse structure, high = fine detail)
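The forward transform above can be sketched directly from the definitions. This is a didactic, unoptimized O(N⁴) implementation (real codecs use fast separable DCTs); the function name is illustrative, not taken from the project's code.

```python
import math

def dct2_8x8(block):
    """2D DCT of an 8x8 block, following the JPEG-style definition:
    F(u, v) = 1/4 * C(u) * C(v) * sum_x sum_y f(x, y)
              * cos((2x+1)u*pi/16) * cos((2y+1)v*pi/16)."""
    def C(k):
        # normalization: 1/sqrt(2) for the zero-frequency index, else 1
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * C(u) * C(v) * s
    return out

# A flat (constant) block concentrates all energy in the DC coefficient:
flat = [[128] * 8 for _ in range(8)]
coeffs = dct2_8x8(flat)   # coeffs[0][0] = 1024, all AC coefficients ~ 0
```

For a constant block of 128, the DC term is $0.25 \cdot \frac{1}{2} \cdot 64 \cdot 128 = 1024$ and every AC term vanishes, matching the claim that the DC coefficient captures average intensity.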
02 – Quantization
Each coefficient is divided by a uniform step size $2^n$ and rounded:
$F_q(u, v) = \operatorname{round}\bigl(F(u, v) / 2^n\bigr)$. Small high-frequency
coefficients round to zero and are discarded; this is where compression loss occurs.
The exponent $n$ is set independently per region via
$n_1$ (foreground) and $n_2$ (background).
Smaller $n$ preserves more detail; larger $n$ zeroes out more coefficients.
Typically $n_1 < n_2$, so moving subjects retain crispness while the background is compressed more aggressively.
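The per-region step size can be sketched in a few lines. The coefficient values below are made up for illustration; the point is that the coarser background step ($n_2$) zeroes out more coefficients than the finer foreground step ($n_1$).

```python
def quantize(coeffs, n):
    """Uniform quantization: divide each DCT coefficient by 2**n and round.

    Small high-frequency coefficients round to zero; this is the lossy
    step. Smaller n keeps more detail, larger n zeroes more coefficients.
    """
    return [[round(c / (1 << n)) for c in row] for row in coeffs]

# hypothetical coefficients: a large DC term plus decaying AC terms
coeffs = [
    [1024.0, 36.0, -12.0, 3.0],
    [-48.0, 18.0, 5.0, -2.0],
]
fg = quantize(coeffs, 2)   # foreground: n1 = 2, fine step, detail kept
bg = quantize(coeffs, 5)   # background: n2 = 5, coarse step, more zeros
```

With $n_1 = 2$ only the smallest coefficient rounds to zero, while $n_2 = 5$ discards every low-magnitude AC term: exactly the foreground/background asymmetry the pipeline relies on.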
03 – Inverse Transform
The decoder multiplies each coefficient by $2^n$ to dequantize, then applies the
2D Inverse DCT to reconstruct approximate pixel values. The degree
of error depends directly on how aggressively the block was quantized.
The human visual system is more sensitive to low-frequency luminance changes than to
fine high-frequency detail. By concentrating energy in the lower DCT coefficients,
the transform makes it straightforward to identify what can be discarded with minimal
perceptual impact. Background regions tolerate a larger $n_2$, while foreground
subjects under $n_1$ stay crisp where the eye is actually focused.
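The decoder side can be sketched the same way as the forward transform: multiply each quantized coefficient by $2^n$, then apply the 2D inverse DCT. This is again a didactic O(N⁴) sketch with an illustrative function name, not the project's actual decoder.

```python
import math

def idct2_8x8(q_coeffs, n):
    """Dequantize (multiply by 2**n) then apply the 2D inverse DCT:
    f(x, y) = 1/4 * sum_u sum_v C(u) * C(v) * F(u, v)
              * cos((2x+1)u*pi/16) * cos((2y+1)v*pi/16).
    Coefficients rounded away during quantization cannot be recovered,
    so reconstruction error grows with n."""
    def C(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    F = [[q * (1 << n) for q in row] for row in q_coeffs]  # dequantize
    out = [[0.0] * 8 for _ in range(8)]
    for x in range(8):
        for y in range(8):
            s = 0.0
            for u in range(8):
                for v in range(8):
                    s += (C(u) * C(v) * F[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[x][y] = 0.25 * s
    return out

# A block quantized with n = 3 where only the DC survived (1024 / 8 = 128):
q = [[128 if (u, v) == (0, 0) else 0 for v in range(8)] for u in range(8)]
pixels = idct2_8x8(q, 3)   # reconstructs a flat block of ~128
```

When only the DC coefficient survives, the reconstruction is a flat block at the original average intensity: the discarded AC detail is gone, which is the "error depends on how aggressively the block was quantized" behavior described above.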
01 – Motion Vector Estimation
For each 16×16 macroblock in frame $t$, the encoder searches frame $t-1$ for the
best-matching block using Three-Step Search (TSS), with match quality measured by
Mean Absolute Difference (MAD).
$$\mathrm{MAD}(d_x, d_y) = \frac{1}{N^2} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \bigl| f_t(x, y) - f_{t-1}(x + d_x,\, y + d_y) \bigr|$$

where $f_t$ is the current frame, $f_{t-1}$ the reference frame, $(d_x, d_y)$ is the displacement candidate, and $N = 16$.
The displacement minimizing MAD becomes the block's motion vector $\mathbf{v} = (d_x^*, d_y^*)$.
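The search itself can be sketched as follows. Three-Step Search evaluates 9 candidates around the current best vector and halves the step (4 → 2 → 1), costing roughly 25 MAD evaluations instead of 225 for an exhaustive ±7 search. This minimal sketch omits frame-boundary checks, so the block origin must leave a 7-pixel margin inside both frames.

```python
def mad(cur, ref, bx, by, dx, dy, n=16):
    """Mean Absolute Difference between the current n x n block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for x in range(n):
        for y in range(n):
            total += abs(cur[bx + x][by + y] - ref[bx + x + dx][by + y + dy])
    return total / (n * n)

def three_step_search(cur, ref, bx, by, n=16):
    """Greedy TSS: at each step, test the 3x3 neighborhood of the current
    best displacement, then halve the step size."""
    best_dx, best_dy = 0, 0
    best = mad(cur, ref, bx, by, 0, 0, n)
    for step in (4, 2, 1):
        cx, cy = best_dx, best_dy          # center of this step's 3x3 grid
        for ddx in (-step, 0, step):
            for ddy in (-step, 0, step):
                dx, dy = cx + ddx, cy + ddy
                cost = mad(cur, ref, bx, by, dx, dy, n)
                if cost < best:
                    best, best_dx, best_dy = cost, dx, dy
    return (best_dx, best_dy), best

# Synthetic check: the current frame is the reference shifted by (4, 4),
# so the minimizing displacement (the motion vector) should be (4, 4).
size = 48
ref = [[(x * 31 + y * 17) % 97 for y in range(size)] for x in range(size)]
cur = [[ref[(x + 4) % size][(y + 4) % size] for y in range(size)] for x in range(size)]
vec, cost = three_step_search(cur, ref, 16, 16)   # vec == (4, 4), cost == 0.0
```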
02 – Foreground / Background Classification
Blocks are classified using magnitude and directional consistency of motion vectors
relative to the global scene motion.
Background
Near-zero magnitude (static camera) or matches the global dominant vector (moving camera).
Foreground
Consistent magnitude and direction within a contiguous region that differs from global background motion. Isolated outlier blocks are not treated as foreground.
Vector magnitude is $\|\mathbf{v}\| = \sqrt{d_x^2 + d_y^2}$.
Blocks exceeding threshold $\tau$ with directional consistency among neighbors are flagged as foreground.
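For the static-camera case (global motion ≈ zero), the classification rule above can be sketched as a magnitude threshold plus a neighbor-consistency check. The thresholds `tau` and `min_neighbors`, and the simple angle comparison, are illustrative assumptions, not the project's exact parameters.

```python
import math

def classify_blocks(vectors, tau=2.0, min_neighbors=2):
    """Flag a macroblock as foreground if its motion-vector magnitude
    exceeds tau AND at least `min_neighbors` 4-connected neighbors also
    exceed tau with a similar direction. Isolated outlier blocks stay
    background. Assumes a static camera (global motion ~ zero).

    vectors: 2D grid of (dx, dy) motion vectors, one per macroblock.
    """
    rows, cols = len(vectors), len(vectors[0])

    def mag(v):
        return math.hypot(v[0], v[1])

    fg = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            v = vectors[i][j]
            if mag(v) <= tau:
                continue
            support = 0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    w = vectors[ni][nj]
                    # naive angle difference; a real implementation would
                    # wrap the comparison around +/- pi
                    similar = abs(math.atan2(v[1], v[0])
                                  - math.atan2(w[1], w[0])) < math.pi / 4
                    if mag(w) > tau and similar:
                        support += 1
            fg[i][j] = support >= min_neighbors
    return fg

# A contiguous 2x2 patch of motion is foreground; a lone moving block is not.
Z, M = (0, 0), (5, 0)
grid = [[Z] * 5 for _ in range(5)]
for i, j in ((1, 1), (1, 2), (2, 1), (2, 2), (4, 4)):
    grid[i][j] = M
fg = classify_blocks(grid)   # patch flagged, isolated block at (4, 4) not
```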
03 – Extended: Detectron2 + Optical Flow
Detectron2 instance segmentation was combined with
dense optical flow (Farneback) to produce semantically-aware masks.
Per-pixel flow is compared against the global motion vector; objects whose local flow
deviates significantly in magnitude or direction are retained as foreground.
$\bar{m}$ and $\bar{\theta}$ are the mean magnitude and angle within a detected
object's mask. $\tau_m$ and $\tau_\theta$ are tunable thresholds. This reduces
false positives from camera-induced motion that TSS alone can misclassify.
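The deviation test can be sketched as follows, given the per-pixel flow vectors inside a detected object's mask. The helper name, its default thresholds, and the use of a circular mean for $\bar{\theta}$ are illustrative choices, not the project's exact implementation.

```python
import math

def keep_object(flow_vectors, global_vec, tau_m=1.5, tau_theta=math.pi / 6):
    """Keep a detected object as foreground if its mean flow magnitude
    (m_bar) or mean flow angle (theta_bar) inside the mask deviates from
    the global motion vector by more than tau_m / tau_theta.

    flow_vectors: list of (fx, fy) flow samples within the object's mask.
    global_vec:   the global (camera-induced) motion vector.
    """
    mags = [math.hypot(fx, fy) for fx, fy in flow_vectors]
    m_bar = sum(mags) / len(mags)
    # circular mean of angles avoids the +pi / -pi wrap problem
    theta_bar = math.atan2(sum(math.sin(math.atan2(fy, fx)) for fx, fy in flow_vectors),
                           sum(math.cos(math.atan2(fy, fx)) for fx, fy in flow_vectors))

    g_mag = math.hypot(global_vec[0], global_vec[1])
    g_theta = math.atan2(global_vec[1], global_vec[0])
    # wrapped angular difference in [0, pi]
    d_theta = abs(math.atan2(math.sin(theta_bar - g_theta),
                             math.cos(theta_bar - g_theta)))
    return abs(m_bar - g_mag) > tau_m or d_theta > tau_theta
```

An object moving faster than the camera, or in a different direction, is kept; an object whose flow matches the global motion is treated as camera-induced and dropped, which is how this step suppresses the false positives TSS alone can produce.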
Foreground Segmentation with Detectron2
The segmentation step was extended using Detectron2, Facebook AI Research's framework for object detection and instance segmentation. Rather than relying solely on motion differencing, Detectron2 identifies semantic objects in the scene, making it possible to isolate foreground subjects like the tennis player more reliably.
Classical motion analysis and deep learning segmentation are combined so the pipeline can better determine which regions deserve higher visual quality during compression.
Segmentation Visualization
The video below shows the segmentation stage in action. Detected foreground regions are highlighted to illustrate how object-level masks guide compression decisions across the frame.
Technical Highlights
Built a full video compression pipeline over raw RGB frames with synchronized audio playback.
Used Three-Step Search motion estimation to classify macroblocks as foreground or background per frame.
Applied 2D DCT with separate quantization parameters for foreground and background regions, then reconstructed frames via IDCT.
Extended segmentation with Detectron2 and dense optical flow to produce semantically-aware foreground masks.
Built a four-panel visualization showing original frames, Detectron2 segmentation, optical flow, and final foreground block classification side by side.
Team & Technical Scope
Pranav Rathod
Video compression pipeline, playback system, and visualizations
James Kasaba
Segmentation pipeline and integration
Ruiqi Zhang
Detectron2 integration and segmentation experimentation