This project explores a simple idea: not all parts of a video frame are equally important. In many scenes, the viewer's attention is drawn to moving subjects while the background remains relatively static.
In this system, video frames are analyzed to detect motion and separate foreground regions from the background. These segmented regions can then be treated differently during compression, preserving more detail in moving areas while applying stronger compression to static regions.
The project demonstrates the full pipeline from raw frame processing to synchronized video playback, along with visualizations that show how motion segmentation influences the compression process.
Demo
The demo video below shows the final reconstructed video along with visualizations of the segmentation process used to identify moving regions in each frame.
The first portion of the video shows the reconstructed playback with synchronized audio.
Later in the video, segmentation visualizations illustrate how the system identifies moving regions and foreground blocks during processing.
Source Code
The implementation for this project, including the compression pipeline and visualization scripts, is available on GitHub.
The compression system processes raw video frames, identifies moving regions, and applies transform-based compression before reconstructing the final video.
01 – Forward Transform
The 2D Discrete Cosine Transform converts an 8×8 block of spatial pixel
values into the frequency domain:

$$F(u, v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y)\, \cos\!\frac{(2x+1)u\pi}{16}\, \cos\!\frac{(2y+1)v\pi}{16}$$

The DC coefficient at $(u, v) = (0, 0)$ captures average intensity; higher-index
AC coefficients encode increasingly fine spatial detail.
- $f(x, y)$: Pixel intensity at position $(x, y)$ in the 8×8 block
- $F(u, v)$: DCT coefficient at frequency index $(u, v)$
- $C(k)$: Normalization: $\frac{1}{\sqrt{2}}$ when $k = 0$, else $1$
- $u, v \in [0, 7]$: Frequency indices (low = coarse structure, high = fine detail)
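The forward transform above can be sketched directly from the definitions. This is a didactic, unoptimized O(N⁴) implementation (real codecs use fast separable DCTs); the function name is illustrative, not taken from the project's code.

```python
import math

def dct2_8x8(block):
    """2D DCT of an 8x8 block, following the JPEG-style definition:
    F(u, v) = 1/4 * C(u) * C(v) * sum_x sum_y f(x, y)
              * cos((2x+1)u*pi/16) * cos((2y+1)v*pi/16)."""
    def C(k):
        # normalization: 1/sqrt(2) for the zero-frequency index, else 1
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * C(u) * C(v) * s
    return out

# A flat (constant) block concentrates all energy in the DC coefficient:
flat = [[128] * 8 for _ in range(8)]
coeffs = dct2_8x8(flat)   # coeffs[0][0] = 1024, all AC coefficients ~ 0
```

For a constant block of 128, the DC term is $0.25 \cdot \frac{1}{2} \cdot 64 \cdot 128 = 1024$ and every AC term vanishes, matching the claim that the DC coefficient captures average intensity.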
02 – Quantization
Each coefficient is divided by a uniform step size $2^n$ and rounded:
$F_q(u, v) = \operatorname{round}\bigl(F(u, v) / 2^n\bigr)$. Small high-frequency
coefficients round to zero and are discarded; this is where compression loss occurs.
The exponent $n$ is set independently per region via
$n_1$ (foreground) and $n_2$ (background).
Smaller $n$ preserves more detail; larger $n$ zeroes out more coefficients.
Typically $n_1 < n_2$, so moving subjects retain crispness while the background is compressed more aggressively.
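The per-region step size can be sketched in a few lines. The coefficient values below are made up for illustration; the point is that the coarser background step ($n_2$) zeroes out more coefficients than the finer foreground step ($n_1$).

```python
def quantize(coeffs, n):
    """Uniform quantization: divide each DCT coefficient by 2**n and round.

    Small high-frequency coefficients round to zero; this is the lossy
    step. Smaller n keeps more detail, larger n zeroes more coefficients.
    """
    return [[round(c / (1 << n)) for c in row] for row in coeffs]

# hypothetical coefficients: a large DC term plus decaying AC terms
coeffs = [
    [1024.0, 36.0, -12.0, 3.0],
    [-48.0, 18.0, 5.0, -2.0],
]
fg = quantize(coeffs, 2)   # foreground: n1 = 2, fine step, detail kept
bg = quantize(coeffs, 5)   # background: n2 = 5, coarse step, more zeros
```

With $n_1 = 2$ only the smallest coefficient rounds to zero, while $n_2 = 5$ discards every low-magnitude AC term: exactly the foreground/background asymmetry the pipeline relies on.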
03 – Inverse Transform
The decoder multiplies each coefficient by $2^n$ to dequantize, then applies the
2D Inverse DCT to reconstruct approximate pixel values. The degree
of error depends directly on how aggressively the block was quantized.
The human visual system is more sensitive to low-frequency luminance changes than to
fine high-frequency detail. By concentrating energy in the lower DCT coefficients,
the transform makes it straightforward to identify what can be discarded with minimal
perceptual impact. Background regions tolerate a larger $n_2$, while foreground
subjects under $n_1$ stay crisp where the eye is actually focused.
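The decoder side can be sketched the same way as the forward transform: multiply each quantized coefficient by $2^n$, then apply the 2D inverse DCT. This is again a didactic O(N⁴) sketch with an illustrative function name, not the project's actual decoder.

```python
import math

def idct2_8x8(q_coeffs, n):
    """Dequantize (multiply by 2**n) then apply the 2D inverse DCT:
    f(x, y) = 1/4 * sum_u sum_v C(u) * C(v) * F(u, v)
              * cos((2x+1)u*pi/16) * cos((2y+1)v*pi/16).
    Coefficients rounded away during quantization cannot be recovered,
    so reconstruction error grows with n."""
    def C(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    F = [[q * (1 << n) for q in row] for row in q_coeffs]  # dequantize
    out = [[0.0] * 8 for _ in range(8)]
    for x in range(8):
        for y in range(8):
            s = 0.0
            for u in range(8):
                for v in range(8):
                    s += (C(u) * C(v) * F[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[x][y] = 0.25 * s
    return out

# A block quantized with n = 3 where only the DC survived (1024 / 8 = 128):
q = [[128 if (u, v) == (0, 0) else 0 for v in range(8)] for u in range(8)]
pixels = idct2_8x8(q, 3)   # reconstructs a flat block of ~128
```

When only the DC coefficient survives, the reconstruction is a flat block at the original average intensity: the discarded AC detail is gone, which is the "error depends on how aggressively the block was quantized" behavior described above.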
01 – Motion Vector Estimation
For each 16×16 macroblock in frame $t$, the encoder searches frame $t-1$ for the
best-matching block using Three-Step Search (TSS), with match quality measured by
Mean Absolute Difference (MAD).
$$\mathrm{MAD}(d_x, d_y) = \frac{1}{N^2} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \bigl| f_t(x, y) - f_{t-1}(x + d_x,\, y + d_y) \bigr|$$

where $f_t$ is the current frame, $f_{t-1}$ the reference frame, $(d_x, d_y)$ is the displacement candidate, and $N = 16$.
The displacement minimizing MAD becomes the block's motion vector $\mathbf{v} = (d_x^*, d_y^*)$.
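The search itself can be sketched as follows. Three-Step Search evaluates 9 candidates around the current best vector and halves the step (4 → 2 → 1), costing roughly 25 MAD evaluations instead of 225 for an exhaustive ±7 search. This minimal sketch omits frame-boundary checks, so the block origin must leave a 7-pixel margin inside both frames.

```python
def mad(cur, ref, bx, by, dx, dy, n=16):
    """Mean Absolute Difference between the current n x n block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for x in range(n):
        for y in range(n):
            total += abs(cur[bx + x][by + y] - ref[bx + x + dx][by + y + dy])
    return total / (n * n)

def three_step_search(cur, ref, bx, by, n=16):
    """Greedy TSS: at each step, test the 3x3 neighborhood of the current
    best displacement, then halve the step size."""
    best_dx, best_dy = 0, 0
    best = mad(cur, ref, bx, by, 0, 0, n)
    for step in (4, 2, 1):
        cx, cy = best_dx, best_dy          # center of this step's 3x3 grid
        for ddx in (-step, 0, step):
            for ddy in (-step, 0, step):
                dx, dy = cx + ddx, cy + ddy
                cost = mad(cur, ref, bx, by, dx, dy, n)
                if cost < best:
                    best, best_dx, best_dy = cost, dx, dy
    return (best_dx, best_dy), best

# Synthetic check: the current frame is the reference shifted by (4, 4),
# so the minimizing displacement (the motion vector) should be (4, 4).
size = 48
ref = [[(x * 31 + y * 17) % 97 for y in range(size)] for x in range(size)]
cur = [[ref[(x + 4) % size][(y + 4) % size] for y in range(size)] for x in range(size)]
vec, cost = three_step_search(cur, ref, 16, 16)   # vec == (4, 4), cost == 0.0
```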
02 – Foreground / Background Classification
Blocks are classified using magnitude and directional consistency of motion vectors
relative to the global scene motion.
Background
Near-zero magnitude (static camera) or matches the global dominant vector (moving camera).
Foreground
Consistent magnitude and direction within a contiguous region that differs from global background motion. Isolated outlier blocks are not treated as foreground.
Vector magnitude is $\|\mathbf{v}\| = \sqrt{d_x^2 + d_y^2}$.
Blocks exceeding threshold $\tau$ with directional consistency among neighbors are flagged as foreground.
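For the static-camera case (global motion ≈ zero), the classification rule above can be sketched as a magnitude threshold plus a neighbor-consistency check. The thresholds `tau` and `min_neighbors`, and the simple angle comparison, are illustrative assumptions, not the project's exact parameters.

```python
import math

def classify_blocks(vectors, tau=2.0, min_neighbors=2):
    """Flag a macroblock as foreground if its motion-vector magnitude
    exceeds tau AND at least `min_neighbors` 4-connected neighbors also
    exceed tau with a similar direction. Isolated outlier blocks stay
    background. Assumes a static camera (global motion ~ zero).

    vectors: 2D grid of (dx, dy) motion vectors, one per macroblock.
    """
    rows, cols = len(vectors), len(vectors[0])

    def mag(v):
        return math.hypot(v[0], v[1])

    fg = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            v = vectors[i][j]
            if mag(v) <= tau:
                continue
            support = 0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    w = vectors[ni][nj]
                    # naive angle difference; a real implementation would
                    # wrap the comparison around +/- pi
                    similar = abs(math.atan2(v[1], v[0])
                                  - math.atan2(w[1], w[0])) < math.pi / 4
                    if mag(w) > tau and similar:
                        support += 1
            fg[i][j] = support >= min_neighbors
    return fg

# A contiguous 2x2 patch of motion is foreground; a lone moving block is not.
Z, M = (0, 0), (5, 0)
grid = [[Z] * 5 for _ in range(5)]
for i, j in ((1, 1), (1, 2), (2, 1), (2, 2), (4, 4)):
    grid[i][j] = M
fg = classify_blocks(grid)   # patch flagged, isolated block at (4, 4) not
```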
03 – Extended: Detectron2 + Optical Flow
Detectron2 instance segmentation was combined with
dense optical flow (Farneback) to produce semantically-aware masks.
Per-pixel flow is compared against the global motion vector; objects whose local flow
deviates significantly in magnitude or direction are retained as foreground.
$\bar{m}$ and $\bar{\theta}$ are the mean magnitude and angle within a detected
object's mask. $\tau_m$ and $\tau_\theta$ are tunable thresholds. This reduces
false positives from camera-induced motion that TSS alone can misclassify.
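The deviation test can be sketched as follows, given the per-pixel flow vectors inside a detected object's mask. The helper name, its default thresholds, and the use of a circular mean for $\bar{\theta}$ are illustrative choices, not the project's exact implementation.

```python
import math

def keep_object(flow_vectors, global_vec, tau_m=1.5, tau_theta=math.pi / 6):
    """Keep a detected object as foreground if its mean flow magnitude
    (m_bar) or mean flow angle (theta_bar) inside the mask deviates from
    the global motion vector by more than tau_m / tau_theta.

    flow_vectors: list of (fx, fy) flow samples within the object's mask.
    global_vec:   the global (camera-induced) motion vector.
    """
    mags = [math.hypot(fx, fy) for fx, fy in flow_vectors]
    m_bar = sum(mags) / len(mags)
    # circular mean of angles avoids the +pi / -pi wrap problem
    theta_bar = math.atan2(sum(math.sin(math.atan2(fy, fx)) for fx, fy in flow_vectors),
                           sum(math.cos(math.atan2(fy, fx)) for fx, fy in flow_vectors))

    g_mag = math.hypot(global_vec[0], global_vec[1])
    g_theta = math.atan2(global_vec[1], global_vec[0])
    # wrapped angular difference in [0, pi]
    d_theta = abs(math.atan2(math.sin(theta_bar - g_theta),
                             math.cos(theta_bar - g_theta)))
    return abs(m_bar - g_mag) > tau_m or d_theta > tau_theta
```

An object moving faster than the camera, or in a different direction, is kept; an object whose flow matches the global motion is treated as camera-induced and dropped, which is how this step suppresses the false positives TSS alone can produce.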
Foreground Segmentation with Detectron2
The segmentation step was extended using Detectron2, Facebook AI Research's framework for object detection and instance segmentation. Rather than relying solely on motion differencing, Detectron2 identifies semantic objects in the scene, making it possible to isolate foreground subjects like the tennis player more reliably.
Classical motion analysis and deep learning segmentation are combined so the pipeline can better determine which regions deserve higher visual quality during compression.
Segmentation Visualization
The video below shows the segmentation stage in action. Detected foreground regions are highlighted to illustrate how object-level masks guide compression decisions across the frame.
Technical Highlights
Built a full video compression pipeline over raw RGB frames with synchronized audio playback.
Used Three-Step Search motion estimation to classify macroblocks as foreground or background per frame.
Applied 2D DCT with separate quantization parameters for foreground and background regions, then reconstructed frames via IDCT.
Extended segmentation with Detectron2 and dense optical flow to produce semantically-aware foreground masks.
Built a four-panel visualization showing original frames, Detectron2 segmentation, optical flow, and final foreground block classification side by side.
Team & Technical Scope
Pranav Rathod
Video compression pipeline, playback system, and visualizations
James Kasaba
Segmentation pipeline and integration
Ruiqi Zhang
Detectron2 integration and segmentation experimentation