NeRF: Neural Radiance Field

Ziteng (Ender) Ji

Introduction

This project implements a neural radiance field (NeRF) pipeline end to end for novel view synthesis. I begin with camera calibration and pose recovery from ArUco tags, undistort the images, and package the intrinsics and camera-to-world transforms. From these, I cast rays per pixel and uniformly sample 3D points along each ray. A multilayer perceptron with sinusoidal positional encodings maps sample locations and view directions to color and volume density, and I compose per-ray colors via discrete volume rendering with transmittance weighting. I first validate the system on the synthetic Lego scene, reporting training curves and PSNR while visualizing rays, sample distributions, and spherical renderings. I then adapt the hyperparameters to my own captured dataset and produce orbit videos and intermediate reconstructions. The result is a compact, reproducible NeRF that ties together geometry, sampling, neural function approximation, and physically motivated rendering in a single coherent pipeline.

3D Scan & Dataset

Camera Calibration

For camera calibration, I followed the required ArUco-based pipeline end to end and made it robust to real-data quirks. After capturing ~30–50 photos at a fixed zoom, the script scans a folder of images, auto-filters to a single dominant resolution, and then detects 4×4 ArUco tags with OpenCV’s aruco module. For each image with a detection, I extract the four tag corners, refine them to sub-pixel accuracy, and pair them with their known 3D world coordinates defined from the physical tag size (e.g., for --tag_size_m=s the square’s corners are (0,0,0), (s,0,0), (s,s,0), (0,s,0)). I support using all markers or just the largest marker per image; if no tag is found, the frame is safely skipped (preventing crashes) and counted. With all (objpoints, imgpoints) collected, I run cv2.calibrateCamera() to estimate the intrinsics K and distortion coefficients, then report quality via the global RMS and per-view reprojection errors, plus sanity checks on f_x vs. f_y, the principal point vs. the image center, and the distortion magnitudes. Optional overlays/visualizations can also be saved for quick inspection. Finally, the code writes calibration_cam.npz containing K, distCoeffs, the image size, the RMS, and the per-view errors.
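A minimal sketch of this detection-and-calibration loop is shown below. It assumes OpenCV ≥ 4.7 (the ArucoDetector class API), a calib/ image folder, and a 0.06 m tag; these names and defaults are illustrative, not the exact flags of my script.

```python
import glob
import cv2
import numpy as np

TAG_SIZE = 0.06  # assumed printed tag size in meters
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

# 3D corners of one tag in its own plane, matching the (0,0,0),(s,0,0),(s,s,0),(0,s,0) convention.
obj_square = np.array([[0, 0, 0], [TAG_SIZE, 0, 0],
                       [TAG_SIZE, TAG_SIZE, 0], [0, TAG_SIZE, 0]], dtype=np.float32)

objpoints, imgpoints, image_size = [], [], None
for path in sorted(glob.glob("calib/*.jpg")):
    img = cv2.imread(path)
    if img is None:
        continue  # unreadable file: skip instead of crashing
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    image_size = gray.shape[::-1]  # (W, H)
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        continue  # no tag in this frame: skip and count
    for c in corners:  # one (1, 4, 2) array per detected marker
        pts = c.reshape(-1, 1, 2).astype(np.float32)
        cv2.cornerSubPix(gray, pts, (5, 5), (-1, -1),
                         (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        objpoints.append(obj_square)
        imgpoints.append(pts)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, image_size, None, None)
print("RMS reprojection error:", rms)
np.savez("calibration_cam.npz", K=K, distCoeffs=dist, image_size=image_size, rms=rms)
```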

Example images used in this project (6 shown; more were captured from different angles)

Capturing 3D Scan

For Part 0, I captured 30 ArUco-tag images at a fixed zoom while varying viewpoint and distance, then calibrated the camera and recovered per-image poses. In the calibration step (01.py) I scan the folder, auto-filter to a single resolution, detect ArUco tags with OpenCV (DICT_4X4_50), refine the four corners to sub-pixel accuracy, and pair them with metrically correct 3D square corners (0,0,0), (s,0,0), (s,s,0), (0,s,0), where s is the printed tag size (0.06 m). I aggregate all detections across images and run cv2.calibrateCamera to estimate K and distCoeffs, report the RMS and per-view reprojection errors, and save calibration_cam.npz; images without detections are safely skipped to avoid crashes. I then captured 50+ images with the ArUco tag and a model car. In 0.3 (03.py) I load these intrinsics, detect the single largest tag per object image, refine its corners, and solve PnP using a best-of-8 corner-ordering/flip search in both the distorted and undistorted domains to minimize reprojection error. From solvePnP’s world-to-camera pose (R_wc, t_wc) I compute the camera-to-world pose T_c2w = [R_wc^⊤ | −R_wc^⊤ t_wc] for NeRF; failures (no tag or unstable PnP) are skipped. I also export detection overlays and Viser frustum visualizations to verify the geometry.
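The pose-recovery step reduces to the sketch below. It assumes the calibration file from above and a single tag whose refined corners are already ordered to match the world corners; the best-of-8 ordering/flip search is omitted for brevity.

```python
import cv2
import numpy as np

def pose_from_tag(tag_corners_2d, K, dist, s=0.06):
    """Recover a camera-to-world pose from one detected tag's refined (4, 2) pixel corners."""
    obj_square = np.array([[0, 0, 0], [s, 0, 0], [s, s, 0], [0, s, 0]], dtype=np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_square, tag_corners_2d.astype(np.float32), K, dist)
    if not ok:
        return None                      # unstable PnP: caller skips this frame
    R_wc, _ = cv2.Rodrigues(rvec)        # world-to-camera rotation
    t_wc = tvec.reshape(3)               # world-to-camera translation
    c2w = np.eye(4)
    c2w[:3, :3] = R_wc.T                 # camera-to-world rotation
    c2w[:3, 3] = -R_wc.T @ t_wc          # camera center in world coordinates
    return c2w

# Usage sketch: load intrinsics saved by the calibration step.
calib = np.load("calibration_cam.npz")
K, dist = calib["K"], calib["distCoeffs"]
```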

Creating Dataset

For 0.4 (undistortion + dataset packaging) I load the calibrated intrinsics/distortion and each pose, and undistort every RGB image with cv2.undistort using new intrinsics from cv2.getOptimalNewCameraMatrix(K, dist, (W, H), alpha=0.0), which chooses a pinhole-equivalent K′ that minimizes black borders; because I don’t crop with the ROI, no principal-point shift is needed beyond what K′ encodes. Each undistorted frame is resized back to (H, W) if needed and converted BGR→RGB, while missing/unreadable files are safely skipped. I then split 90/10 into train/val (shuffling with a fixed seed) and mirror the val poses as test for novel-view rendering. The saved NPZ strictly follows the expected schema: images_train/images_val (uint8, N×H×W×3), c2ws_train/c2ws_val/c2ws_test (N×4×4 camera-to-world), and a scalar focal computed from the optimized intrinsics as (f′_x + f′_y)/2 (assuming f′_x ≈ f′_y). This yields a distortion-free, pinhole-consistent dataset that drops cleanly into the later parts’ loaders and training code.
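A condensed sketch of the undistort-and-package step follows; the function names and the packaging helper are mine, and the resize/skip logic of the real script is omitted.

```python
import cv2
import numpy as np

def undistort_frame(img_bgr, K, dist):
    """Undistort one frame with a pinhole-equivalent K' (alpha=0.0 minimizes black borders)."""
    H, W = img_bgr.shape[:2]
    K_new, _ = cv2.getOptimalNewCameraMatrix(K, dist, (W, H), alpha=0.0)
    und = cv2.undistort(img_bgr, K, dist, None, K_new)
    return cv2.cvtColor(und, cv2.COLOR_BGR2RGB), K_new

def package_dataset(images_rgb, c2ws, K_new, out_path="my_data.npz", seed=0):
    """Shuffle, split 90/10, mirror val poses as test, and save the NPZ schema used later."""
    images = np.stack(images_rgb).astype(np.uint8)        # (N, H, W, 3)
    c2ws = np.stack(c2ws).astype(np.float32)              # (N, 4, 4)
    idx = np.random.default_rng(seed).permutation(len(images))
    n_train = int(0.9 * len(images))
    tr, va = idx[:n_train], idx[n_train:]
    focal = 0.5 * (K_new[0, 0] + K_new[1, 1])             # (f'_x + f'_y) / 2
    np.savez(out_path,
             images_train=images[tr], images_val=images[va],
             c2ws_train=c2ws[tr], c2ws_val=c2ws[va], c2ws_test=c2ws[va],
             focal=focal)
```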

Fit a Neural Field to a 2D Image

ground truth image

progress 100

progress 300

progress 1000

progress 3000

PSNR Curve

ground truth image

progress 100

progress 1000

progress 2000

progress 3000

PSNR Curve

For Part 1, I fit a 2D neural field to images with an MLP plus sinusoidal positional encoding and report both training progression and PSNR as required. Concretely, I normalize pixel coordinates to [0, 1]^2 using pixel centers, (u + 0.5, v + 0.5)/(W, H), and normalize RGB to [0, 1]; at each step I randomly sample N pixels (default N = 10,000) to form a lightweight dataloader. I then apply a positional encoding with L frequencies while also concatenating the original (x, y) (input dim = 2 + 4L) and pass the result through an MLP (3 hidden layers, width 256 by default, batch size 10,000 pixels/step) with ReLU activations and a final sigmoid to predict RGB. I optimize with Adam (learning rate = 1e-2) and an MSE loss for 1k–3k iterations, logging a full-resolution render at milestones and computing PSNR = 10 log_10(1/MSE) as the metric. I save a PSNR curve for the provided image and a sequence of progress images for both the fox image and my own image. To study capacity vs. frequency, I run an automatic 2×2 grid sweep over two L values (2 and 10) and two widths (64 and 256); the resulting images are provided below. Reproducibility is ensured via a fixed seed; the script auto-selects CUDA when available and falls back to CPU.
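A minimal PyTorch sketch of the encoding and network shape described above; the class names and defaults here are illustrative, not taken from my script.

```python
import torch
import torch.nn as nn

class PositionalEncoding2D(nn.Module):
    """Sinusoidal PE: keeps the raw (x, y) and appends sin/cos of 2^k * pi * (x, y)."""
    def __init__(self, L=10):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(L).float() * torch.pi)  # (L,)

    def forward(self, xy):                                        # xy: (N, 2) in [0, 1]
        scaled = xy[..., None, :] * self.freqs[:, None]           # (N, L, 2)
        enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)  # (N, L, 4)
        return torch.cat([xy, enc.flatten(-2)], dim=-1)           # (N, 2 + 4L)

class NeuralField2D(nn.Module):
    """MLP with 3 hidden layers mapping encoded (x, y) to RGB in [0, 1]."""
    def __init__(self, L=10, width=256):
        super().__init__()
        self.pe = PositionalEncoding2D(L)
        layers, in_dim = [], 2 + 4 * L
        for _ in range(3):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers += [nn.Linear(width, 3), nn.Sigmoid()]
        self.mlp = nn.Sequential(*layers)

    def forward(self, xy):
        return self.mlp(self.pe(xy))
```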

Above I provide the 2×2 image grid: top left is L = 2, W = 64; top right is L = 2, W = 256; bottom left is L = 10, W = 64; and bottom right is L = 10, W = 256. Below I also provide more detailed results with the training progress for each combination of L and W.

L2, W64, progress 300

L2, W256, progress 300

L10, W64, progress 300

L10, W256, progress 300

L2, W64, progress 1500

L2, W256, progress 1500

L10, W64, progress 1500

L10, W256, progress 1500

L2, W64, progress 3000

L2, W256, progress 3000

L10, W64, progress 3000

L10, W256, progress 3000

Fit a Neural Radiance Field from Multi-view Images

Create Rays from Cameras

For Part 2.1, I implement a clean pixel→ray pipeline that uses the calibrated intrinsics K and each image’s camera-to-world pose c2w to produce, for any pixel (u, v), a ray origin r_o and a normalized direction r_d. I first reconstruct a camera-space point at unit depth (default s = 1) via the analytic inverse pinhole map

\mathbf{x}_{\mathrm{c}}=\left(\frac{(u-c_x)\,s}{f_x},\ \frac{(v-c_y)\,s}{f_y},\ s\right)

optionally centering pixels with (u + 0.5, v + 0.5). Using the generic homogeneous transform, I send x_c to world space, x_w = transform(c2w, x_c); the ray origin is the camera position, r_o = c2w[:3, 3] (broadcast to match the batch shape), and the ray direction is the normalized vector \mathbf{r}_d=\dfrac{\mathbf{x}_w-\mathbf{r}_o}{\lVert \mathbf{x}_w-\mathbf{r}_o\rVert_{2}}.
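In code, the pixel→ray map looks roughly like the following NumPy sketch; the function name and batching convention are assumptions.

```python
import numpy as np

def pixels_to_rays(K, c2w, uv, s=1.0):
    """Convert pixel coordinates (u, v) to world-space ray origins and unit directions.

    uv: (N, 2) array of pixel coordinates (add 0.5 beforehand to use pixel centers).
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x_c = np.stack([(uv[:, 0] - cx) * s / fx,
                    (uv[:, 1] - cy) * s / fy,
                    np.full(len(uv), s)], axis=-1)        # camera-space points at depth s
    x_w = x_c @ c2w[:3, :3].T + c2w[:3, 3]                # transform to world space
    r_o = np.broadcast_to(c2w[:3, 3], x_w.shape)          # camera center, broadcast per ray
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)
    return r_o, r_d
```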

Sampling

For Part 2.2, I implemented both required layers of sampling: (1) ray sampling across multiple images and (2) point sampling along each ray. Given the focal-derived intrinsics

K=\begin{bmatrix} f & 0 & W/2\\ 0 & f & H/2\\ 0 & 0 & 1 \end{bmatrix},

I convert pixel coordinates to rays using pixel-center offsets (u + 0.5, v + 0.5), build camera-space points at unit depth, transform them with each image’s c2w, and normalize the directions; colors are pulled from the images after scaling from [0, 255] to [0, 1]. I provide two ray samplers: a global sampler that flattens all I × H × W pixels and draws N rays uniformly across the full set, and a per-image sampler that first chooses M images and then draws ⌊N/M⌋ rays from each. Both return batched tensors {r_o, r_d, rgb, uv, img_idx} and use a reproducible RNG seed. For point sampling, I discretize each ray with a uniform linspace t ∈ [near, far] (Lego defaults near = 2.0, far = 6.0) and support stratified jitter during training only; I perturb each bin’s start by U(−Δt/2, Δt/2) so the network eventually “touches” every location (I recommend n_samples ∈ {32, 64}). The function returns (t_vals, pts) with shapes [N, n_samples] and [N, n_samples, 3] and performs sanity checks (direction unit norms).
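A sketch of the point sampler under these conventions; the bin width used for the jitter is an assumption.

```python
import torch

def sample_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Uniformly discretize [near, far] per ray, with optional stratified jitter for training."""
    N = r_o.shape[0]
    t_vals = torch.linspace(near, far, n_samples, device=r_o.device)      # (n_samples,)
    t_vals = t_vals.expand(N, n_samples).clone()                          # (N, n_samples)
    if perturb:
        dt = (far - near) / n_samples
        t_vals = t_vals + (torch.rand_like(t_vals) - 0.5) * dt            # jitter within each bin
    pts = r_o[:, None, :] + t_vals[..., None] * r_d[:, None, :]           # (N, n_samples, 3)
    return t_vals, pts
```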

Visualization of rays and samples with cameras (up to 100 rays)

For Part 2.3, I packaged ray construction and sampling into a single, reusable dataloader that operates directly on the multiview image set. At initialization, RaysData builds the pinhole intrinsics

K=[f0W/20fH/2001],K=\begin{bmatrix} f & 0 & W/2\\ 0 & f & H/2\\ 0 & 0 & 1 \end{bmatrix}, 

enumerates all pixel centers (u + 0.5, v + 0.5), and for each camera converts them to camera-space unit-depth points, transforms them with that view’s c2w, and normalizes to obtain per-pixel ray origins and directions; ground-truth colors are normalized to [0, 1]. The dataloader then supports random ray sampling either globally across all views or restricted to one camera via one_cam, returning the required tuples (r_o, r_d, rgb, uv) for training. I also include two checks and a 3D preview: first, a UV indexing test that compares sampled pixels against the stored color buffer, and second, a direction-norm sanity check. For visualization, I use Viser to render camera frustums, sampled rays, and sampled 3D points, with a recommended mode that draws rays from a single camera to confirm they stay within that frustum. Below I provide two sets of images from different angles; the first row shows 100 rays sampled globally and the second row shows 100 rays from camera 0 only.
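Before the visualization images, here is a sketch of the UV indexing test mentioned above; it simply re-indexes the image buffer at the sampled pixel centers (argument names are my own).

```python
import numpy as np

def check_uv_indexing(images, uv, img_idx, rgb_sampled):
    """Sanity check: sampled colors must match the image buffer at the sampled pixel centers.

    images: (I, H, W, 3) float array in [0, 1]; uv holds pixel-center coords (u+0.5, v+0.5).
    """
    u = (uv[:, 0] - 0.5).astype(int)   # back to integer column index
    v = (uv[:, 1] - 0.5).astype(int)   # back to integer row index
    expected = images[img_idx, v, u]   # note (row, col) = (v, u) indexing
    assert np.allclose(expected, rgb_sampled), "UV indexing mismatch"
```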

NeRF and Volume Rendering

iter 50

iter 500

iter 1000

iter 1500

iter 2000

iter 2500

iter 3000

For 2.4 (Neural Radiance Field) I switch the network input from 2D pixels to 3D sample locations and view directions. Positions are encoded with a high-frequency positional encoding (L_xyz = 10) and directions with a lower one (L_dir = 4); PositionalEncoding keeps the raw input and appends sin/cos(2^k π ·). The MLP is deeper (width 256, depth 8) and includes a skip/concat at layer 4 that re-injects the encoded position to stabilize optimization on 3D structure. The head branches: one branch predicts density σ ≥ 0 via a Linear→softplus (bias initialized positive to avoid dead rays), and the other produces a feature vector that, concatenated with the direction PE, flows through two color layers to output RGB in [0, 1] via a sigmoid. Rays are built from K and c2w; per iteration I sample a batch of rays, draw n_s ∈ {32, 64} points per ray between near = 2.0 and far = 6.0 with stratified jitter, and feed (x, d) to the network.
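In code, the trunk, skip, and two-branch head look roughly like this PyTorch sketch; the module names and the positive-bias constant are illustrative, and the positional encodings are assumed to be applied before the forward call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeRFMLP(nn.Module):
    """Sketch: 8x256 trunk, skip/concat at layer 4, softplus density head, view-dependent color."""
    def __init__(self, pos_dim, dir_dim, width=256, depth=8, skip=4):
        super().__init__()
        self.skip = skip
        layers, in_dim = [], pos_dim
        for i in range(depth):
            layers.append(nn.Linear(in_dim, width))
            in_dim = width + (pos_dim if i + 1 == skip else 0)  # re-inject position PE at the skip
        self.trunk = nn.ModuleList(layers)
        self.sigma_fc = nn.Linear(width, 1)
        nn.init.constant_(self.sigma_fc.bias, 0.5)              # positive bias to avoid dead rays
        self.feature_fc = nn.Linear(width, width)
        self.color = nn.Sequential(nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
                                   nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, x_enc, d_enc):
        h = x_enc
        for i, layer in enumerate(self.trunk):
            h = torch.relu(layer(h))
            if i + 1 == self.skip:
                h = torch.cat([h, x_enc], dim=-1)               # skip: concat the encoded position
        sigma = F.softplus(self.sigma_fc(h))                    # density >= 0
        feat = self.feature_fc(h)
        rgb = self.color(torch.cat([feat, d_enc], dim=-1))
        return sigma, rgb
```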

For 2.5 (Volume Rendering) I implement the discrete alpha-compositing form of the rendering equation. Given per-sample densities σ_i, colors c_i, and bin sizes δ_i, I compute α_i = 1 − e^{−σ_i δ_i}, the accumulated transmittance T_i = exp(−∑_{j<i} σ_j δ_j) via a cumulative-sum trick, the weights w_i = T_i α_i, and the rendered color C = ∑_i w_i c_i (volrend). A self-check asserts the output against the provided reference tensor. Training uses Adam (lr = 5e-4) with an MSE loss between rendered colors and ground-truth pixels (10k rays/step by default); I log training PSNR, periodically render full images with chunked inference to manage memory, track a validation PSNR curve, and export a spherical novel-view GIF from the held-out camera poses.
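The compositing step in isolation, as a short PyTorch sketch; the tensor shapes are assumptions.

```python
import torch

def volrend(sigmas, rgbs, deltas):
    """Discrete volume rendering: alpha compositing with transmittance weighting.

    sigmas: (N, S, 1), rgbs: (N, S, 3), deltas: (N, S, 1) bin sizes along each ray.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)                       # (N, S, 1)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): exclusive cumulative sum along the ray
    accum = torch.cumsum(sigmas * deltas, dim=1)
    accum = torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=1)
    T = torch.exp(-accum)
    weights = T * alphas                                             # (N, S, 1)
    return (weights * rgbs).sum(dim=1)                               # (N, 3) rendered color
```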

Above I provide the predicted images across iterations and the PSNR curve; below I provide the spherical renders for iterations 1000 and 3000.

spherical render 1000 (refresh to replay)

spherical render 3000 (refresh to replay)

Training with my own data (Toyota AE86 model car)

ground truth

iter 100

iter 300

iter 500

iter 1000

For Part 2.6 I trained a full NeRF on my own dataset (my_data.npz) using the same pipeline as Parts 2.4–2.5, with a few real-data tweaks. The model is an 8-layer, width-256 MLP with a skip connection after layer 4 that concatenates the position PE back into the trunk; positions use a sinusoidal PE with L_x = 10 (raw input kept), and view directions use a lighter PE with L_d = 4. The head predicts density via softplus(sigma_fc) (bias initialized positive) and color via a feature branch concatenated with the encoded view directions, followed by two color layers and a final sigmoid. From the Part 0 calibration I rebuild K and convert randomly sampled pixels (centered at +0.5) and c2w into rays; per training step I draw 8k rays, sample 64 stratified points between near = 0.02 and far = 0.5, and render colors with the discrete volume-rendering weights w_i = T_i (1 − e^{−σ_i Δ_i}). I optimize MSE with Adam (lr = 5e-4), log PSNR, and save intermediate renders every save_every steps plus a loss curve (losses.npy and loss_curve.png). For speed and robustness I support chunked rendering for full images, fixed seeding, and an optional render-only path that reloads a checkpoint. Finally, I generate the required novel-view GIF by orbiting the camera around the object; this, together with the saved intermediate images and loss plot, completes the deliverables for this part. Below I provide the orbit GIF, but please note the placement of the car model with respect to the ArUco tag: since the tag sits directly in front of the car, no image could be taken from behind the car (the car would block the tag at that angle). As a result, the orbit viewpoints stay in front of the car.
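For the orbit GIF, the poses can be generated with a look-at construction like the sketch below; the radius, height, and arc are illustrative values rather than the ones from my script, and OpenCV-style camera axes (x right, y down, z forward) are assumed.

```python
import numpy as np

def orbit_c2w(theta, radius=0.3, height=0.15, target=np.zeros(3)):
    """Build one camera-to-world pose on a circular orbit, looking at `target`."""
    cam_pos = np.array([radius * np.cos(theta), radius * np.sin(theta), height])
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)                 # camera z axis (into the scene)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))        # camera x axis
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)                             # camera y axis (right-handed frame)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2] = right, down, forward
    c2w[:3, 3] = cam_pos
    return c2w

# Because the capture only covers the arc in front of the car, theta is swept over a
# partial arc rather than a full 360 degrees, e.g.:
poses = [orbit_c2w(t) for t in np.linspace(-0.75 * np.pi, 0.75 * np.pi, 60)]
```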