Panorama

Ziteng (Ender) Ji

Introduction

This project is about turning multiple photos of the same scene into coherent, perspective-correct results using projective geometry and image warping. We estimate planar homographies from point correspondences to relate views, then use them to map pixels between images so that surfaces align in a common frame. With these tools, we rectify slanted objects to a fronto-parallel view and stitch overlapping photographs into seamless mosaics that widen the field of view beyond a single shot. Simple feathering or multi-scale blending reduces visible seams and exposure differences so the composites look natural. Overall, the focus is on understanding how camera motion induces projective transformations and using that insight to align, warp, and blend images into clean visual outcomes.

Manual

Images used in this project

Recover Homographies

I estimate a planar homography $H$ (8 DoF, with $H_{33} = 1$) from point correspondences $p_i \leftrightarrow p_i'$ using two implementations: a direct $Ah = b$ least-squares solver and a normalized DLT variant. Given $n \ge 4$ pairs $(x_i, y_i) \to (x_i', y_i')$, I build the linear system with one row per coordinate,

$$
\begin{aligned}
\bigl[\,x_i \;\; y_i \;\; 1 \;\; 0 \;\; 0 \;\; 0 \;\; -x_i x_i' \;\; -y_i x_i'\,\bigr]\cdot\mathbf h &= x_i' \\[4pt]
\bigl[\,0 \;\; 0 \;\; 0 \;\; x_i \;\; y_i \;\; 1 \;\; -x_i y_i' \;\; -y_i y_i'\,\bigr]\cdot\mathbf h &= y_i'
\end{aligned}
$$

then solve for $\mathbf h \in \mathbb{R}^8$ with np.linalg.lstsq, assemble $H$, and scale so that $H_{33} = 1$. So we will have

$$
H=\begin{bmatrix} h_{11} & h_{12} & h_{13}\\ h_{21} & h_{22} & h_{23}\\ h_{31} & h_{32} & 1 \end{bmatrix}, \quad \text{rescaled so } H_{33} = 1
$$

This $3\times 3$ matrix maps a homogeneous source point $p$ to $p'$. Geometrically, the top two rows encode the affine effects (rotation/scale/shear plus translation), while $[h_{31}, h_{32}]$ introduces perspective foreshortening; if $h_{31} = h_{32} = 0$, $H$ reduces to a purely affine transform. To improve numerical stability beyond the minimal $n = 4$ (which is noise-sensitive), I collect more than 4 correspondences (via a Matplotlib ginput UI or by loading a CSV), making the system overdetermined and solving it in least squares. I also provide a normalized DLT: each point set is centered and isotropically scaled to mean distance $\sqrt{2}$ via similarity transforms $T_1, T_2$; I then build the $2n \times 9$ DLT matrix, take the last right-singular vector from the SVD as $\tilde{H}$, and denormalize with $H = T_2^{-1}\tilde{H}T_1$ before fixing the scale. For deliverables, my script visualizes the clicked correspondences side by side with indices, prints and saves the first rows of $Ah = b$ (and the full matrices), and outputs the recovered $3\times 3$ $H$. Images are read as 8-bit via PIL (with HEIF support), and points are stored/loaded from CSV so you don't have to re-click them every time.
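As a minimal sketch of the direct solver (assuming NumPy; the function and variable names here are illustrative rather than my exact code), the $Ah = b$ system can be assembled and solved like this:

```python
import numpy as np

def compute_homography(src_pts, dst_pts):
    """Estimate H (3x3, H[2,2] = 1) from n >= 4 correspondences src -> dst
    by solving the Ah = b system in least squares."""
    src = np.asarray(src_pts, dtype=np.float64)  # (n, 2) array of (x, y)
    dst = np.asarray(dst_pts, dtype=np.float64)  # (n, 2) array of (x', y')
    n = len(src)
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    for i, ((x, y), (xp, yp)) in enumerate(zip(src, dst)):
        A[2 * i]     = [x, y, 1, 0, 0, 0, -x * xp, -y * xp]
        A[2 * i + 1] = [0, 0, 0, x, y, 1, -x * yp, -y * yp]
        b[2 * i], b[2 * i + 1] = xp, yp
    h, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares over 8 unknowns
    return np.append(h, 1.0).reshape(3, 3)     # append h33 = 1, reshape to 3x3
```

With more than four correspondences the system is overdetermined, and np.linalg.lstsq returns the least-squares estimate of the 8 unknowns.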

Doe Library & Campanile

computed system of equations (saved as a .txt file by the code)

computed H matrix (saved as a .txt file by the code)

North Reading Room (Doe Library)

computed system of equations (saved as a .txt file by the code)

Wheeler Hall & Campanile

computed H matrix (saved as a .txt file by the code)

computed system of equations (saved as a .txt file by the code)

computed H matrix (saved as a .txt file by the code)

Warp the Image & Rectification

For this section, I implement inverse warping with two from-scratch interpolators and an explicit alpha mask to avoid holes. For a given homography $H$ (source → target), I first predict the output canvas by mapping the four source corners $[0,0], [W-1,0], [W-1,H-1], [0,H-1]$ through $H$ and taking the min/max to form an integer bounding box, then create a regular integer grid of output pixel centers $(x, y)$ (integer coordinates are treated as pixel centers). Each output location is back-projected with $H^{-1}$ to continuous source coordinates $(s_x, s_y)$. For nearest neighbor, I round $(s_x, s_y)$ to the nearest integer pixel, copy that pixel if it lies in bounds, and set alpha = 1 there (else 0). For bilinear, I take the four neighbors $(x_0, y_0), (x_1, y_0), (x_0, y_1), (x_1, y_1)$ with $x_0 = \lfloor s_x \rfloor$, $x_1 = x_0 + 1$, $y_0 = \lfloor s_y \rfloor$, $y_1 = y_0 + 1$, compute weights $w_x = s_x - x_0$, $w_y = s_y - y_0$, and form the weighted sum $(1-w_x)(1-w_y)I_{00} + w_x(1-w_y)I_{10} + (1-w_x)w_y I_{01} + w_x w_y I_{11}$ channel-wise, marking alpha = 1 only where all four neighbors are valid; outputs are clipped to $[0, 255]$ and cast back to the input dtype. Both warpers return (image, alpha, meta), where meta records the output bounding box and the saved alpha visualizes coverage; I also report the valid-pixel fraction to compare hole behavior. I provide a driver that accepts $H$ from disk, builds $H$ from a CSV of correspondences (so there is no need to click all the points again), or runs a rectification mode: the user clicks 4+ points on a planar object and I map them to a user-specified rectangle of size (rect_w, rect_h) (for $n = 4$ I use the rectangle's four corners; for $n > 4$ I distribute targets around the perimeter). In terms of speed, nearest neighbor is faster than bilinear, but the two produce similar quality.
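For concreteness, here is a condensed sketch of the bilinear inverse warp (assuming a 3-channel uint8 image and illustrative names; my actual code also implements the nearest-neighbor path and returns a fuller meta record):

```python
import numpy as np

def warp_bilinear(img, H):
    """Inverse-warp img with homography H (source -> target); returns the
    warped image, an alpha coverage mask, and the output bounding box."""
    h, w = img.shape[:2]
    # Predict the output canvas from the four warped source corners.
    corners = np.array([[0, 0, 1], [w - 1, 0, 1],
                        [w - 1, h - 1, 1], [0, h - 1, 1]], float).T
    wc = H @ corners
    wc = wc[:2] / wc[2]
    x0, y0 = np.floor(wc.min(axis=1)).astype(int)
    x1, y1 = np.ceil(wc.max(axis=1)).astype(int)

    # Back-project every output pixel center to continuous source coords.
    xs, ys = np.meshgrid(np.arange(x0, x1 + 1), np.arange(y0, y1 + 1))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    sx, sy = src[0] / src[2], src[1] / src[2]

    # Four integer neighbors and bilinear weights.
    fx, fy = np.floor(sx).astype(int), np.floor(sy).astype(int)
    valid = (fx >= 0) & (fy >= 0) & (fx <= w - 2) & (fy <= h - 2)
    fx, fy = np.clip(fx, 0, w - 2), np.clip(fy, 0, h - 2)
    wx, wy = sx - fx, sy - fy
    out = (((1 - wx) * (1 - wy))[:, None] * img[fy, fx]
           + (wx * (1 - wy))[:, None] * img[fy, fx + 1]
           + ((1 - wx) * wy)[:, None] * img[fy + 1, fx]
           + (wx * wy)[:, None] * img[fy + 1, fx + 1])
    out[~valid] = 0  # alpha = 1 only where all four neighbors are in bounds
    warped = np.clip(out, 0, 255).astype(img.dtype).reshape(xs.shape + (3,))
    alpha = valid.reshape(xs.shape).astype(np.float64)
    return warped, alpha, (x0, y0, x1, y1)
```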

Blend the Images into a Mosaic

For this section, I build mosaics by first choosing a reference image and estimating homographies $H_{src \to ref}$ for each non-reference view (loaded from text files or computed from CSV correspondences via my $Ah = b$ solver), then predicting a global canvas by projecting all source corner points through their $H$'s, taking the min/max to get $[x_{min}, y_{min}, x_{max}, y_{max}]$, and applying a translation offset $T_{offset}$ so every warp lands in positive pixel coordinates. Each image is then inverse-warped into this common canvas using either my nearest-neighbor or bilinear routine (the same code as A.3), producing a warped RGB image and a binary valid mask. To reduce edge seams, I assign each original image a feather alpha map that is 1 at the center and falls off linearly toward the borders (computed from the minimum distance to the image edges and normalized by half the shorter side); this soft alpha is warped to the canvas (bilinear) and multiplied by the valid mask to form the final per-image weights. For blending, I do simple weighted averaging (feathering): I accumulate $\sum_i \alpha_i I_i$ and $\sum_i \alpha_i$ across all warped images (channel-wise in float64), then divide with an $\varepsilon$ to avoid zero division and clip/cast to uint8. The script supports one-shot stacking (all images warped into the same canvas at once), saves all warped images and their alphas, and outputs the final mosaic plus a panel figure that shows every warped layer and the final result; I also save a contact sheet of the source photos for each mosaic. This feathered weighted averaging removes hard seams and most exposure steps while remaining fast and robust. For even stronger suppression of high-frequency "ghosting," a small Laplacian-pyramid blend could be substituted, but the feathered approach is sufficient for the results I show with their corresponding source images and documented homographies.
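A sketch of the feather map and the weighted-average blend (illustrative names; the alphas passed to the blender are assumed to be the warped feather maps already multiplied by the valid masks):

```python
import numpy as np

def feather_alpha(h, w):
    """Feather map: 1 at the image center, falling off linearly toward the
    borders; distances are normalized by half the shorter side."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.minimum.reduce([xs, ys, w - 1 - xs, h - 1 - ys]).astype(np.float64)
    return np.clip(d / (min(h, w) / 2.0), 0.0, 1.0)

def blend_feathered(warped_imgs, warped_alphas, eps=1e-8):
    """Weighted average: sum_i alpha_i * I_i divided by sum_i alpha_i."""
    num = np.zeros(warped_imgs[0].shape, dtype=np.float64)
    den = np.zeros(warped_imgs[0].shape[:2], dtype=np.float64)
    for img, a in zip(warped_imgs, warped_alphas):
        num += a[..., None] * img.astype(np.float64)
        den += a
    mosaic = num / (den[..., None] + eps)  # eps avoids division by zero
    return np.clip(mosaic, 0, 255).astype(np.uint8)
```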

Doe Library & Campanile

alpha 0

alpha 1

alpha 2

Heyns Reading Room (Doe Library)

alpha 0

alpha 1

alpha 2

Wheeler Hall & Campanile

alpha 0

alpha 1

alpha 2

Automatic

Harris Corner Detection

I detect corners with a single-scale Harris detector and then thin them with Adaptive Non-Maximal Suppression (ANMS). From a grayscale image normalized to $[0, 1]$, I compute Sobel gradients $I_x$, $I_y$, form the second-moment terms $I_x^2$, $I_y^2$, $I_x I_y$, smooth each with a Gaussian ($\sigma =$ window_sigma, default 1.5) to obtain $S_{xx}, S_{yy}, S_{xy}$, and evaluate the Harris response $R = \det M - \kappa\,(\operatorname{tr} M)^2$ with

$$
M=\begin{bmatrix} S_{xx} & S_{xy} \\ S_{xy} & S_{yy} \end{bmatrix}
$$

and $\kappa =$ kappa (default 0.04). I also normalize $R$ to $[0, 1]$, then perform a $3\times 3$ non-max suppression (NMS), keeping pixels above a quantile threshold (harris_quantile, default 0.995) and selecting local maxima; I cap this pre-set to the max_candidates strongest responses. For ANMS, I follow the standard radius rule: for each candidate $i$ with response $r_i$, I compute the suppression radius $r_i^\star = \min_j \lVert x_i - x_j \rVert$ over corners $j$ whose responses satisfy $r_j > c\, r_i$ (robustness $c =$ anms_robust, default 0.9); if no such corner exists, I fall back to the nearest neighbor. I then keep the anms_keep (default 1000) points with the largest radii to ensure spatial diversity. Here I show the denser, clumped detections before ANMS versus the well-distributed set after ANMS.
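A sketch of both steps, using scipy.ndimage for the filtering (the simple $O(n^2)$ ANMS loop and the parameter names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(gray, window_sigma=1.5, kappa=0.04):
    """Harris response R = det(M) - kappa * tr(M)^2 on a [0, 1] grayscale
    image, normalized to [0, 1]."""
    Ix, Iy = sobel(gray, axis=1), sobel(gray, axis=0)
    Sxx = gaussian_filter(Ix * Ix, window_sigma)
    Syy = gaussian_filter(Iy * Iy, window_sigma)
    Sxy = gaussian_filter(Ix * Iy, window_sigma)
    R = (Sxx * Syy - Sxy ** 2) - kappa * (Sxx + Syy) ** 2
    return (R - R.min()) / (R.max() - R.min() + 1e-12)

def anms(points, responses, keep=1000, c=0.9):
    """Keep the `keep` points with the largest suppression radii
    r_i* = min_j ||x_i - x_j|| over corners j with r_j > c * r_i."""
    pts = np.asarray(points, dtype=np.float64)
    resp = np.asarray(responses, dtype=np.float64)
    radii = np.full(len(pts), np.inf)  # the global maximum keeps radius inf
    for i in range(len(pts)):
        stronger = resp > c * resp[i]
        stronger[i] = False
        if stronger.any():
            radii[i] = np.linalg.norm(pts[stronger] - pts[i], axis=1).min()
    order = np.argsort(-radii)  # largest radii first -> spatial diversity
    return pts[order[:keep]], radii[order[:keep]]
```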

Feature Descriptor Extraction

For each keypoint from B.1, I build a blurred base image by converting to grayscale in $[0, 1]$ and applying a separable Gaussian (default $\sigma = 2.0$). Around each keypoint $(x, y)$ I sample a $40\times 40$ window using bilinear interpolation on a sub-pixel grid centered at the keypoint (half-pixel offsets), then downsample $40 \to 8$ by average pooling over non-overlapping $5\times 5$ cells to form an axis-aligned $8\times 8$ patch. I then bias/gain normalize each descriptor by flattening it to 64-D, subtracting the mean, and dividing by the standard deviation (with an $\varepsilon$ for stability), yielding zero-mean/unit-std descriptors. To avoid boundary artifacts, keypoints within 20 pixels of the border (more generally, < win/2) are discarded so the full $40\times 40$ window stays in-bounds; I also cap the processed set to --keep keypoints for efficiency. For the image in the section above, I get the features shown below.
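A simplified sketch of the descriptor pipeline (for brevity it samples integer-aligned windows rather than the sub-pixel bilinear grid my code uses; names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(gray, keypoints, win=40, cell=5, sigma=2.0, eps=1e-8):
    """40x40 windows on a blurred [0, 1] grayscale image, average-pooled by
    5x5 cells into 8x8 patches, flattened and bias/gain normalized to 64-D."""
    base = gaussian_filter(gray, sigma)  # blur before coarse sampling
    h, w = base.shape
    half = win // 2
    descs, kept = [], []
    for x, y in keypoints:
        xi, yi = int(round(x)), int(round(y))
        # Discard keypoints whose full window would leave the image.
        if xi - half < 0 or yi - half < 0 or xi + half > w or yi + half > h:
            continue
        patch = base[yi - half:yi + half, xi - half:xi + half]  # 40x40 window
        pooled = patch.reshape(win // cell, cell,
                               win // cell, cell).mean(axis=(1, 3))  # 8x8
        v = pooled.ravel()
        descs.append((v - v.mean()) / (v.std() + eps))  # zero mean / unit std
        kept.append((x, y))
    return np.array(descs), np.array(kept)
```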

Feature Matching

I match the B.2 descriptors by computing all pairwise squared Euclidean distances between the zero-mean/unit-std 64-D vectors from the two images. For each descriptor in image 1, I find the nearest ($d_1$) and second-nearest ($d_2$) neighbors in image 2 using a fast partial sort, and apply the ratio test $d_1 / d_2 < \tau$ with a tunable threshold (--ratio, default 0.8). I optionally enforce mutual (cross-check) consistency (--mutual) by requiring that image 2's nearest neighbor of the chosen match point back to the same image 1 feature. As implementation details, distances use $\lVert a - b \rVert_2^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2 a^\top b$ for speed, I guard against small denominators in the ratio, and I trim descriptor/point lists to a common length to keep indices aligned. In the end, I write a B3_matches.csv table with $(i_1, x_1, y_1, i_2, x_2, y_2, dist, ratio)$ per match, and produce a visualization (shown below) that stacks the two images side by side, plots the matched keypoints, and draws connecting lines (capped by --max_plot, default 300).
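A sketch of the matcher (np.argpartition provides the fast partial sort; the output tuples mirror the CSV columns, with file I/O omitted):

```python
import numpy as np

def match_descriptors(D1, D2, ratio=0.8, mutual=True, eps=1e-12):
    """Ratio-test matching of (n1, 64) vs (n2, 64) descriptor arrays using
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b for all pairwise distances."""
    d2 = (np.sum(D1 ** 2, axis=1)[:, None]
          + np.sum(D2 ** 2, axis=1)[None, :] - 2.0 * D1 @ D2.T)
    nn = np.argpartition(d2, 1, axis=1)[:, :2]  # two smallest per row
    matches = []
    for i, (j0, j1) in enumerate(nn):
        if d2[i, j0] > d2[i, j1]:  # order nearest / second-nearest
            j0, j1 = j1, j0
        d1 = np.sqrt(max(d2[i, j0], 0.0))           # guard tiny negatives
        r = d1 / (np.sqrt(max(d2[i, j1], 0.0)) + eps)
        if r < ratio and (not mutual or np.argmin(d2[:, j0]) == i):
            matches.append((i, j0, d1, r))          # (idx1, idx2, dist, ratio)
    return matches
```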

RANSAC for Robust Homography

From the B.3 matches $(P_i \leftrightarrow Q_i)$, I run 4-point RANSAC to estimate a homography robustly: each iteration randomly samples 4 correspondences (skipping degenerate quads with near-zero area), fits $H$ (both normalized DLT and $Ah = b$ are tried; the best is kept), and scores all matches with a reprojection error (by default the symmetric transfer error $\lVert HP - Q \rVert^2 + \lVert H^{-1}Q - P \rVert^2$) against an inlier threshold of 3 px. I keep the model with the largest inlier count, then re-estimate $H$ on all inliers using the chosen estimator. The implementation exposes --iters (default 3000), --thresh (px), --method (normalized/Ahb), and --one_sided (use $\lVert HP - Q \rVert$ only) so you can experiment further on your own. It also logs inlier stats and saves per-pair inlier visualizations overlaying green (inlier) vs. red (outlier) matches. For mosaicing, I choose a reference image, set $H_{ref} = I$, and compose each non-reference $H$ into the global canvas (computed by warping all image corners, then applying a translation offset). Each image is inverse-warped (bilinear by default; --nn optional) and blended with feathered alpha averaging (alpha falls off toward the edges) to produce an automatic mosaic. Below I show the comparison between stitching manually and automatically.
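A sketch of the RANSAC loop (fit_h stands in for either estimator, e.g. the $Ah = b$ solver sketched earlier; the degeneracy test here is a simple collinearity check rather than a full quad-area test):

```python
import numpy as np

def ransac_homography(P, Q, fit_h, iters=3000, thresh=3.0, seed=None):
    """4-point RANSAC on (n, 2) point arrays P -> Q: sample minimal sets,
    score by symmetric transfer error, keep the largest inlier set, refit."""
    rng = np.random.default_rng(seed)
    P1 = np.column_stack([P, np.ones(len(P))])  # homogeneous coordinates
    Q1 = np.column_stack([Q, np.ones(len(Q))])

    def sym_error(H):
        fwd = (H @ P1.T).T
        bwd = (np.linalg.inv(H) @ Q1.T).T
        e1 = np.sum((fwd[:, :2] / fwd[:, 2:] - Q) ** 2, axis=1)
        e2 = np.sum((bwd[:, :2] / bwd[:, 2:] - P) ** 2, axis=1)
        return e1 + e2  # ||HP - Q||^2 + ||H^-1 Q - P||^2

    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), 4, replace=False)
        p0, p1, p2 = P[idx[0]], P[idx[1]], P[idx[2]]
        area2 = abs((p1[0] - p0[0]) * (p2[1] - p0[1])
                    - (p1[1] - p0[1]) * (p2[0] - p0[0]))
        if area2 < 1e-6:  # crude degeneracy check (collinear sample)
            continue
        H = fit_h(P[idx], Q[idx])
        inliers = sym_error(H) < thresh ** 2  # squared-pixel threshold
        if inliers.sum() > best.sum():
            best = inliers
    return fit_h(P[best], Q[best]), best  # re-estimate on all inliers
```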

manual

automatic

manual

automatic

manual

automatic