Panorama

Ziteng (Ender) Ji

Introduction

This project is about turning multiple photos of the same scene into coherent, perspective-correct results using projective geometry and image warping. We estimate planar homographies from point correspondences to relate views, then use them to map pixels between images so that surfaces align in a common frame. With these tools, we rectify slanted objects to a fronto-parallel view and stitch overlapping photographs into seamless mosaics that widen the field of view beyond a single shot. Simple feathering or multi-scale blending reduces visible seams and exposure differences so the composites look natural. Overall, the focus is on understanding how camera motion induces projective transformations and using that insight to align, warp, and blend images into clean visual outcomes.

Manual

Images used in this project

Recover Homographies

I estimate a planar homography $H$ (8 DoF, with $H_{33} = 1$) from point correspondences $p_i \leftrightarrow p_i'$ using two implementations: a direct $Ah = b$ least-squares solver and a normalized DLT variant. Given $n \ge 4$ pairs $(x_i, y_i) \to (x_i', y_i')$, I build the linear system with one row per coordinate,

$$
\begin{aligned}
\bigl[\,x_i \;\; y_i \;\; 1 \;\; 0 \;\; 0 \;\; 0 \;\; -x_i x_i' \;\; -y_i x_i'\,\bigr]\cdot\mathbf h &= x_i' \\[4pt]
\bigl[\,0 \;\; 0 \;\; 0 \;\; x_i \;\; y_i \;\; 1 \;\; -x_i y_i' \;\; -y_i y_i'\,\bigr]\cdot\mathbf h &= y_i'
\end{aligned}
$$

then solve for $\mathbf h \in \mathbb{R}^8$ with np.linalg.lstsq, assemble $H$, and scale so that $H_{33} = 1$. So we will have

$$
H=\begin{bmatrix} h_{11} & h_{12} & h_{13}\\ h_{21} & h_{22} & h_{23}\\ h_{31} & h_{32} & 1 \end{bmatrix}, \quad \text{rescaled so } H_{33} = 1
$$

This $3\times 3$ matrix maps a homogeneous source point $p$ to $p'$. Geometrically, the top two rows encode the affine effects (rotation/scale/shear plus translation), while $[h_{31}, h_{32}]$ introduces perspective foreshortening; if $h_{31} = h_{32} = 0$, $H$ reduces to a purely affine transform. To improve numerical stability beyond the minimal $n = 4$ (which is noise-sensitive), I collect more than 4 correspondences (via a Matplotlib ginput UI or by loading a CSV), making the system overdetermined and solving it in least squares. I also provide a normalized DLT: each point set is centered and isotropically scaled to mean distance $\sqrt{2}$ via similarity transforms $T_1, T_2$; I then build the $2n \times 9$ DLT matrix, take the last right-singular vector from the SVD as $\tilde{H}$, and denormalize with $H = T_2^{-1}\tilde{H}T_1$ before fixing the scale. For deliverables, my script visualizes the clicked correspondences side by side with indices, prints and saves the first rows of $Ah = b$ (and the full matrices), and outputs the recovered $3\times 3$ $H$. Images are read as 8-bit via PIL (with HEIF support), and points are stored/loaded from CSV so you don't have to re-click them every time.
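As a minimal sketch of the direct solver (assuming NumPy; the function and variable names here are illustrative rather than my exact code), the $Ah = b$ system can be assembled and solved like this:

```python
import numpy as np

def compute_homography(src_pts, dst_pts):
    """Estimate H (3x3, H[2,2] = 1) from n >= 4 correspondences src -> dst
    by solving the Ah = b system in least squares."""
    src = np.asarray(src_pts, dtype=np.float64)  # (n, 2) array of (x, y)
    dst = np.asarray(dst_pts, dtype=np.float64)  # (n, 2) array of (x', y')
    n = len(src)
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    for i, ((x, y), (xp, yp)) in enumerate(zip(src, dst)):
        A[2 * i]     = [x, y, 1, 0, 0, 0, -x * xp, -y * xp]
        A[2 * i + 1] = [0, 0, 0, x, y, 1, -x * yp, -y * yp]
        b[2 * i], b[2 * i + 1] = xp, yp
    h, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares over 8 unknowns
    return np.append(h, 1.0).reshape(3, 3)     # append h33 = 1, reshape to 3x3
```

With more than four correspondences the system is overdetermined, and np.linalg.lstsq returns the least-squares estimate of the 8 unknowns.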

Doe Library & Campanile

computed system of equations (saved as a .txt file by the code)

computed H matrix (saved as a .txt file by the code)

North Reading Room (Doe Library)

computed system of equations (saved as a .txt file by the code)

Wheeler Hall & Campanile

computed H matrix (saved as a .txt file by the code)

computed system of equations (saved as a .txt file by the code)

computed H matrix (saved as a .txt file by the code)

Warp the Image & Rectification

For this section, I implement inverse warping with two from-scratch interpolators and an explicit alpha mask to avoid holes. For a given homography $H$ (source → target), I first predict the output canvas by mapping the four source corners $[0,0], [W-1,0], [W-1,H-1], [0,H-1]$ through $H$ and taking the min/max to form an integer bounding box, then create a regular integer grid of output pixel centers $(x, y)$ (integer coordinates are treated as pixel centers). Each output location is back-projected with $H^{-1}$ to continuous source coordinates $(s_x, s_y)$. For nearest neighbor, I round $(s_x, s_y)$ to the nearest integer pixel, copy that pixel if it lies in bounds, and set alpha = 1 there (else 0). For bilinear, I take the four neighbors $(x_0, y_0), (x_1, y_0), (x_0, y_1), (x_1, y_1)$ with $x_0 = \lfloor s_x \rfloor$, $x_1 = x_0 + 1$, $y_0 = \lfloor s_y \rfloor$, $y_1 = y_0 + 1$, compute weights $w_x = s_x - x_0$, $w_y = s_y - y_0$, and form the weighted sum $(1-w_x)(1-w_y)I_{00} + w_x(1-w_y)I_{10} + (1-w_x)w_y I_{01} + w_x w_y I_{11}$ channel-wise, marking alpha = 1 only where all four neighbors are valid; outputs are clipped to $[0, 255]$ and cast back to the input dtype. Both warpers return (image, alpha, meta), where meta records the output bounding box and the saved alpha visualizes coverage; I also report the valid-pixel fraction to compare hole behavior. I provide a driver that accepts $H$ from disk, builds $H$ from a CSV of correspondences (so there is no need to click all the points again), or runs a rectification mode: the user clicks 4+ points on a planar object and I map them to a user-specified rectangle of size (rect_w, rect_h) (for $n = 4$ I use the rectangle's four corners; for $n > 4$ I distribute targets around the perimeter). In terms of speed, nearest neighbor is faster than bilinear, but the two produce similar quality.
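For concreteness, here is a condensed sketch of the bilinear inverse warp (assuming a 3-channel uint8 image and illustrative names; my actual code also implements the nearest-neighbor path and returns a fuller meta record):

```python
import numpy as np

def warp_bilinear(img, H):
    """Inverse-warp img with homography H (source -> target); returns the
    warped image, an alpha coverage mask, and the output bounding box."""
    h, w = img.shape[:2]
    # Predict the output canvas from the four warped source corners.
    corners = np.array([[0, 0, 1], [w - 1, 0, 1],
                        [w - 1, h - 1, 1], [0, h - 1, 1]], float).T
    wc = H @ corners
    wc = wc[:2] / wc[2]
    x0, y0 = np.floor(wc.min(axis=1)).astype(int)
    x1, y1 = np.ceil(wc.max(axis=1)).astype(int)

    # Back-project every output pixel center to continuous source coords.
    xs, ys = np.meshgrid(np.arange(x0, x1 + 1), np.arange(y0, y1 + 1))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    sx, sy = src[0] / src[2], src[1] / src[2]

    # Four integer neighbors and bilinear weights.
    fx, fy = np.floor(sx).astype(int), np.floor(sy).astype(int)
    valid = (fx >= 0) & (fy >= 0) & (fx <= w - 2) & (fy <= h - 2)
    fx, fy = np.clip(fx, 0, w - 2), np.clip(fy, 0, h - 2)
    wx, wy = sx - fx, sy - fy
    out = (((1 - wx) * (1 - wy))[:, None] * img[fy, fx]
           + (wx * (1 - wy))[:, None] * img[fy, fx + 1]
           + ((1 - wx) * wy)[:, None] * img[fy + 1, fx]
           + (wx * wy)[:, None] * img[fy + 1, fx + 1])
    out[~valid] = 0  # alpha = 1 only where all four neighbors are in bounds
    warped = np.clip(out, 0, 255).astype(img.dtype).reshape(xs.shape + (3,))
    alpha = valid.reshape(xs.shape).astype(np.float64)
    return warped, alpha, (x0, y0, x1, y1)
```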

Blend the Images into a Mosaic

For this section, I build mosaics by first choosing a reference image and estimating homographies $H_{src \to ref}$ for each non-reference view (loaded from text files or computed from CSV correspondences via my $Ah = b$ solver), then predicting a global canvas by projecting all source corner points through their $H$'s, taking the min/max to get $[x_{min}, y_{min}, x_{max}, y_{max}]$, and applying a translation offset $T_{offset}$ so every warp lands in positive pixel coordinates. Each image is then inverse-warped into this common canvas using either my nearest-neighbor or bilinear routine (the same code as A.3), producing a warped RGB image and a binary valid mask. To reduce edge seams, I assign each original image a feather alpha map that is 1 at the center and falls off linearly toward the borders (computed from the minimum distance to the image edges and normalized by half the shorter side); this soft alpha is warped to the canvas (bilinear) and multiplied by the valid mask to form the final per-image weights. For blending, I do simple weighted averaging (feathering): I accumulate $\sum_i \alpha_i I_i$ and $\sum_i \alpha_i$ across all warped images (channel-wise in float64), then divide with an $\varepsilon$ to avoid zero division and clip/cast to uint8. The script supports one-shot stacking (all images warped into the same canvas at once), saves all warped images and their alphas, and outputs the final mosaic plus a panel figure that shows every warped layer and the final result; I also save a contact sheet of the source photos for each mosaic. This feathered weighted averaging removes hard seams and most exposure steps while remaining fast and robust. For even stronger suppression of high-frequency "ghosting," a small Laplacian-pyramid blend could be substituted, but the feathered approach is sufficient for the results I show with their corresponding source images and documented homographies.
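A sketch of the feather map and the weighted-average blend (illustrative names; the alphas passed to the blender are assumed to be the warped feather maps already multiplied by the valid masks):

```python
import numpy as np

def feather_alpha(h, w):
    """Feather map: 1 at the image center, falling off linearly toward the
    borders; distances are normalized by half the shorter side."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.minimum.reduce([xs, ys, w - 1 - xs, h - 1 - ys]).astype(np.float64)
    return np.clip(d / (min(h, w) / 2.0), 0.0, 1.0)

def blend_feathered(warped_imgs, warped_alphas, eps=1e-8):
    """Weighted average: sum_i alpha_i * I_i divided by sum_i alpha_i."""
    num = np.zeros(warped_imgs[0].shape, dtype=np.float64)
    den = np.zeros(warped_imgs[0].shape[:2], dtype=np.float64)
    for img, a in zip(warped_imgs, warped_alphas):
        num += a[..., None] * img.astype(np.float64)
        den += a
    mosaic = num / (den[..., None] + eps)  # eps avoids division by zero
    return np.clip(mosaic, 0, 255).astype(np.uint8)
```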

Doe Library & Campanile

alpha 0

alpha 1

alpha 2

Heyns Reading Room (Doe Library)

alpha 0

alpha 1

alpha 2

Wheeler Hall & Campanile

alpha 0

alpha 1

alpha 2

Automatic

Harris Corner Detection

I detect corners with a single-scale Harris detector and then thin them with Adaptive Non-Maximal Suppression (ANMS). From a grayscale image normalized to $[0, 1]$, I compute Sobel gradients $I_x$, $I_y$, form the second-moment terms $I_x^2$, $I_y^2$, $I_x I_y$, smooth each with a Gaussian ($\sigma =$ window_sigma, default 1.5) to obtain $S_{xx}, S_{yy}, S_{xy}$, and evaluate the Harris response $R = \det M - \kappa\,(\operatorname{tr} M)^2$ with

$$
M=\begin{bmatrix} S_{xx} & S_{xy} \\ S_{xy} & S_{yy} \end{bmatrix}
$$

and $\kappa =$ kappa (default 0.04). I also normalize $R$ to $[0, 1]$, then perform a $3\times 3$ non-max suppression (NMS), keeping pixels above a quantile threshold (harris_quantile, default 0.995) and selecting local maxima; I cap this pre-set to the max_candidates strongest responses. For ANMS, I follow the standard radius rule: for each candidate $i$ with response $r_i$, I compute the suppression radius $r_i^\star = \min_j \lVert x_i - x_j \rVert$ over corners $j$ whose responses satisfy $r_j > c\, r_i$ (robustness $c =$ anms_robust, default 0.9); if no such corner exists, I fall back to the nearest neighbor. I then keep the anms_keep (default 1000) points with the largest radii to ensure spatial diversity. Here I show the denser, clumped detections before ANMS versus the well-distributed set after ANMS.
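A sketch of both steps, using scipy.ndimage for the filtering (the simple $O(n^2)$ ANMS loop and the parameter names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(gray, window_sigma=1.5, kappa=0.04):
    """Harris response R = det(M) - kappa * tr(M)^2 on a [0, 1] grayscale
    image, normalized to [0, 1]."""
    Ix, Iy = sobel(gray, axis=1), sobel(gray, axis=0)
    Sxx = gaussian_filter(Ix * Ix, window_sigma)
    Syy = gaussian_filter(Iy * Iy, window_sigma)
    Sxy = gaussian_filter(Ix * Iy, window_sigma)
    R = (Sxx * Syy - Sxy ** 2) - kappa * (Sxx + Syy) ** 2
    return (R - R.min()) / (R.max() - R.min() + 1e-12)

def anms(points, responses, keep=1000, c=0.9):
    """Keep the `keep` points with the largest suppression radii
    r_i* = min_j ||x_i - x_j|| over corners j with r_j > c * r_i."""
    pts = np.asarray(points, dtype=np.float64)
    resp = np.asarray(responses, dtype=np.float64)
    radii = np.full(len(pts), np.inf)  # the global maximum keeps radius inf
    for i in range(len(pts)):
        stronger = resp > c * resp[i]
        stronger[i] = False
        if stronger.any():
            radii[i] = np.linalg.norm(pts[stronger] - pts[i], axis=1).min()
    order = np.argsort(-radii)  # largest radii first -> spatial diversity
    return pts[order[:keep]], radii[order[:keep]]
```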

Feature Descriptor Extraction

For each keypoint from B.1, I build a blurred base image by converting to grayscale in $[0, 1]$ and applying a separable Gaussian (default $\sigma = 2.0$). Around each keypoint $(x, y)$ I sample a $40\times 40$ window using bilinear interpolation on a sub-pixel grid centered at the keypoint (half-pixel offsets), then downsample $40 \to 8$ by average pooling over non-overlapping $5\times 5$ cells to form an axis-aligned $8\times 8$ patch. I then bias/gain normalize each descriptor by flattening it to 64-D, subtracting the mean, and dividing by the standard deviation (with an $\varepsilon$ for stability), yielding zero-mean/unit-std descriptors. To avoid boundary artifacts, keypoints within 20 pixels of the border (more generally, < win/2) are discarded so the full $40\times 40$ window stays in-bounds; I also cap the processed set to --keep keypoints for efficiency. For the image in the section above, I get the features shown below.
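A simplified sketch of the descriptor pipeline (for brevity it samples integer-aligned windows rather than the sub-pixel bilinear grid my code uses; names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(gray, keypoints, win=40, cell=5, sigma=2.0, eps=1e-8):
    """40x40 windows on a blurred [0, 1] grayscale image, average-pooled by
    5x5 cells into 8x8 patches, flattened and bias/gain normalized to 64-D."""
    base = gaussian_filter(gray, sigma)  # blur before coarse sampling
    h, w = base.shape
    half = win // 2
    descs, kept = [], []
    for x, y in keypoints:
        xi, yi = int(round(x)), int(round(y))
        # Discard keypoints whose full window would leave the image.
        if xi - half < 0 or yi - half < 0 or xi + half > w or yi + half > h:
            continue
        patch = base[yi - half:yi + half, xi - half:xi + half]  # 40x40 window
        pooled = patch.reshape(win // cell, cell,
                               win // cell, cell).mean(axis=(1, 3))  # 8x8
        v = pooled.ravel()
        descs.append((v - v.mean()) / (v.std() + eps))  # zero mean / unit std
        kept.append((x, y))
    return np.array(descs), np.array(kept)
```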

Feature Matching

I match the B.2 descriptors by computing all pairwise squared Euclidean distances between the zero-mean/unit-std 64-D vectors from the two images. For each descriptor in image 1, I find the nearest ($d_1$) and second-nearest ($d_2$) neighbors in image 2 using a fast partial sort, and apply the ratio test $d_1 / d_2 < \tau$ with a tunable threshold (--ratio, default 0.8). I optionally enforce mutual (cross-check) consistency (--mutual) by requiring that image 2's nearest neighbor of the chosen match point back to the same image 1 feature. As implementation details, distances use $\lVert a - b \rVert_2^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2 a^\top b$ for speed, I guard against small denominators in the ratio, and I trim descriptor/point lists to a common length to keep indices aligned. In the end, I write a B3_matches.csv table with $(i_1, x_1, y_1, i_2, x_2, y_2, dist, ratio)$ per match, and produce a visualization (shown below) that stacks the two images side by side, plots the matched keypoints, and draws connecting lines (capped by --max_plot, default 300).
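A sketch of the matcher (np.argpartition provides the fast partial sort; the output tuples mirror the CSV columns, with file I/O omitted):

```python
import numpy as np

def match_descriptors(D1, D2, ratio=0.8, mutual=True, eps=1e-12):
    """Ratio-test matching of (n1, 64) vs (n2, 64) descriptor arrays using
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b for all pairwise distances."""
    d2 = (np.sum(D1 ** 2, axis=1)[:, None]
          + np.sum(D2 ** 2, axis=1)[None, :] - 2.0 * D1 @ D2.T)
    nn = np.argpartition(d2, 1, axis=1)[:, :2]  # two smallest per row
    matches = []
    for i, (j0, j1) in enumerate(nn):
        if d2[i, j0] > d2[i, j1]:  # order nearest / second-nearest
            j0, j1 = j1, j0
        d1 = np.sqrt(max(d2[i, j0], 0.0))           # guard tiny negatives
        r = d1 / (np.sqrt(max(d2[i, j1], 0.0)) + eps)
        if r < ratio and (not mutual or np.argmin(d2[:, j0]) == i):
            matches.append((i, j0, d1, r))          # (idx1, idx2, dist, ratio)
    return matches
```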

RANSAC for Robust Homography

From the B.3 matches $(P_i \leftrightarrow Q_i)$, I run 4-point RANSAC to estimate a homography robustly: each iteration randomly samples 4 correspondences (skipping degenerate quads with near-zero area), fits $H$ (both normalized DLT and $Ah = b$ are tried; the best is kept), and scores all matches with a reprojection error (by default the symmetric transfer error $\lVert HP - Q \rVert^2 + \lVert H^{-1}Q - P \rVert^2$) against an inlier threshold of 3 px. I keep the model with the largest inlier count, then re-estimate $H$ on all inliers using the chosen estimator. The implementation exposes --iters (default 3000), --thresh (px), --method (normalized/Ahb), and --one_sided (use $\lVert HP - Q \rVert$ only) so you can experiment further on your own. It also logs inlier stats and saves per-pair inlier visualizations overlaying green (inlier) vs. red (outlier) matches. For mosaicing, I choose a reference image, set $H_{ref} = I$, and compose each non-reference $H$ into the global canvas (computed by warping all image corners, then applying a translation offset). Each image is inverse-warped (bilinear by default; --nn optional) and blended with feathered alpha averaging (alpha falls off toward the edges) to produce an automatic mosaic. Below I show the comparison between stitching manually and automatically.
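A sketch of the RANSAC loop (fit_h stands in for either estimator, e.g. the $Ah = b$ solver sketched earlier; the degeneracy test here is a simple collinearity check rather than a full quad-area test):

```python
import numpy as np

def ransac_homography(P, Q, fit_h, iters=3000, thresh=3.0, seed=None):
    """4-point RANSAC on (n, 2) point arrays P -> Q: sample minimal sets,
    score by symmetric transfer error, keep the largest inlier set, refit."""
    rng = np.random.default_rng(seed)
    P1 = np.column_stack([P, np.ones(len(P))])  # homogeneous coordinates
    Q1 = np.column_stack([Q, np.ones(len(Q))])

    def sym_error(H):
        fwd = (H @ P1.T).T
        bwd = (np.linalg.inv(H) @ Q1.T).T
        e1 = np.sum((fwd[:, :2] / fwd[:, 2:] - Q) ** 2, axis=1)
        e2 = np.sum((bwd[:, :2] / bwd[:, 2:] - P) ** 2, axis=1)
        return e1 + e2  # ||HP - Q||^2 + ||H^-1 Q - P||^2

    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), 4, replace=False)
        p0, p1, p2 = P[idx[0]], P[idx[1]], P[idx[2]]
        area2 = abs((p1[0] - p0[0]) * (p2[1] - p0[1])
                    - (p1[1] - p0[1]) * (p2[0] - p0[0]))
        if area2 < 1e-6:  # crude degeneracy check (collinear sample)
            continue
        H = fit_h(P[idx], Q[idx])
        inliers = sym_error(H) < thresh ** 2  # squared-pixel threshold
        if inliers.sum() > best.sum():
            best = inliers
    return fit_h(P[best], Q[best]), best  # re-estimate on all inliers
```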

manual

automatic

manual

automatic

manual

automatic