Diffusion Models
Ziteng (Ender) Ji
Introduction
This project implements modern generative image models from the ground up through two complementary lenses: diffusion and flow matching. In Part A, I work with a pretrained DeepFloyd IF model to reconstruct the forward and reverse diffusion processes, starting from simple noising and denoising of real images and progressing to full text-to-image sampling with classifier-free guidance. Building on this core sampler, I implement image-to-image translation via SDEdit, inpainting with spatial masks, and more creative applications such as visual anagrams and hybrid images that combine different prompts across frequency bands. In Part B, I step away from pretrained models and build my own UNet for denoising and flow matching on MNIST. I first train a single-step denoiser, then extend it with time and class conditioning to learn continuous flows from noise to digits, and finally apply classifier-free guidance in this setting as well. Together, these components form a coherent pipeline that connects the probabilistic foundations of diffusion, the architecture and conditioning of UNets, and the practical tricks needed to steer generative models toward controllable, high-quality images.
Setup
The set of prompts that I used in this project is
['a humanoid robot goalkeeper diving to save a soccer ball',
'a cyberpunk city at night with neon lights and rain',
'a photo of a sunset over the ocean',
'a photo of a busy city street at night',
'a photo of a woman',
'a photo of a cat sitting on a windowsill',
'an oil painting of a forest in autumn',
'an oil painting of a ship in a stormy sea',
'a lithograph of a mountain range',
'a lithograph of an ancient city',
'a man wearing glasses',
'a woman wearing a hat',
'a charcoal drawing of a face',
'a high quality photo',
'']
I used 'a humanoid robot goalkeeper diving to save a soccer ball', 'a cyberpunk city at night with neon lights and rain', and 'a man wearing glasses' to create the images below with num_inference_steps set to 10 and 500. Increasing the number of inference steps adds more detail to the images and makes them align more closely with the prompts. I used random seed 100.
num_inference_steps = 10
num_inference_steps = 500
Sampling Loops
Forward Process
To implement the forward diffusion process in Eq. (1–2), I wrote a forward(im, t) function that takes a clean Campanile image and a timestep t, then constructs the noisy sample x_t. Concretely, I index into the provided alphas_cumprod array to get ᾱ_t, move it to the same device/dtype as the image, and reshape it to (1, 1, 1, 1) so it can broadcast over the batch and spatial dimensions. I then compute sqrt_alpha_bar_t and sqrt_one_minus_alpha_bar_t, draw ε from torch.randn_like(im), and form the noisy image x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, all inside a torch.no_grad() block since this is a fixed forward process. To visualize the results, I run this function on the Campanile test image at timesteps t ∈ {250, 500, 750}, rescale each output back from [-1, 1] to [0, 1], clamp, permute to HWC, and display with media.show_image, also saving each as campanile_{t}.jpg. As expected, the images become progressively more corrupted as t increases: at t = 250 the Campanile is still recognizable with moderate noise, at t = 500 the structure is heavily degraded, and by t = 750 the image is dominated by nearly Gaussian noise, matching the behavior of the forward diffusion process.
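As a reference, here is a minimal sketch of the forward function described above (variable names are illustrative; alphas_cumprod is assumed to be the scheduler's cumulative alpha-product tensor):
import torch

def forward(im, t, alphas_cumprod):
    # Noise a clean image im (shape [1, 3, 64, 64], values in [-1, 1]) to timestep t:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I).
    with torch.no_grad():
        abar_t = alphas_cumprod[t].to(device=im.device, dtype=im.dtype).reshape(1, 1, 1, 1)
        eps = torch.randn_like(im)
        return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps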
Classical Denoising
For classical denoising, I reused the noisy Campanile images generated in 1.1 at timesteps t ∈ {250, 500, 750} and applied Gaussian blur to each one in an attempt to remove the noise. Using torchvision.transforms.functional.gaussian_blur, I chose progressively larger kernel sizes and sigmas for higher noise levels (a smaller 5×5 kernel for t = 250, up to a 9×9 kernel with a larger sigma for t = 750), reflecting the intuition that stronger corruption requires more aggressive smoothing. For each timestep, I kept both the original noisy image and its Gaussian-denoised version, converted them back to the [0, 1] range, and displayed them side by side in a 2-column grid, also saving a single stacked panel image for the report. Visually, this makes it clear that while Gaussian blur can slightly reduce high-frequency noise at lower timesteps, it also blurs edges and fine structure, and at t = 500 and especially t = 750 it fails to recover meaningful Campanile details, illustrating why simple low-pass filtering is inadequate for reversing the diffusion forward process. From top to bottom, we have t = 250, 500, and 750 for both the noisy and the Gaussian-denoised images.
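The Gaussian-blur baseline is just a call to torchvision's functional blur; the kernel/sigma pairs below are illustrative choices rather than the exact values I used at every timestep:
import torchvision.transforms.functional as TF

# Illustrative mapping from timestep to (kernel_size, sigma): noisier inputs get stronger blur.
blur_settings = {250: (5, 1.5), 500: (7, 2.0), 750: (9, 3.0)}

def classical_denoise(noisy_im, t):
    k, sigma = blur_settings[t]
    return TF.gaussian_blur(noisy_im, kernel_size=k, sigma=sigma)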
One-Step Denoising
For one-step denoising with a pretrained diffusion model, I first reuse my forward function to generate noisy Campanile images at timesteps t ∈ {250, 500, 750}, then pass each noisy image through the DeepFloyd Stage I UNet stage_1.unet together with the timestep and the provided "a high quality photo" prompt embedding. The UNet is run in torch.no_grad() on the GPU in half precision, and I take only the first three channels of its (1, 6, 64, 64) output as the predicted noise ε̂, ignoring the variance channels. Using Equation (2), I reconstruct an estimate of the clean image by inverting the forward process: given x_t, t, and ε̂, I compute x̂_0 = (x_t − √(1 − ᾱ_t)·ε̂) / √(ᾱ_t), then clamp the result back to the valid image range. For each timestep, I convert the original clean image, the noisy image, and the UNet-denoised estimate back to RGB and visualize them side by side in a 3-column layout, also saving a combined panel image for the report, with the left column being the original image, the middle column the noisy images from the previous part, and the right column the one-step denoising results.
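A sketch of the one-step estimate, assuming stage_1.unet is the diffusers UNet (called with encoder_hidden_states) and alphas_cumprod is the scheduler's cumulative product tensor:
import torch

@torch.no_grad()
def one_step_denoise(stage_1, x_t, t, prompt_embeds, alphas_cumprod):
    # Predict the noise with the Stage I UNet, then invert the forward process in one shot.
    t_tensor = torch.tensor([t], device=x_t.device)
    out = stage_1.unet(x_t, t_tensor, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]  # first 3 channels = noise estimate; remaining channels predict variance
    abar_t = alphas_cumprod[t].to(device=x_t.device, dtype=x_t.dtype)
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
    return x0_hat.clamp(-1, 1)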
Iterative Denoising
To implement iterative denoising, I first defined a strided timestep schedule strided_timesteps = [990, 960, …, 0] (stride 30) and registered it with stage_1.scheduler.set_timesteps, so that I can jump from very noisy to clean images in a small number of steps. Starting from a Campanile image noised to timestep strided_timesteps[10], I wrote an iterative_denoise function that repeatedly moves from t to the next, less noisy timestep t' using Eq. (3): at each step I run the UNet with the "a high quality photo" prompt to get both a noise estimate and a predicted variance, reconstruct an estimate x̂_0 of the clean image by inverting the forward process, then combine x̂_0 and the current x_t with the Eq. (3) coefficients before adding the learned variance via the provided add_variance helper (a sketch of this loop appears after the panels below). This loop marches through strided_timesteps until t = 0, and I visualize the intermediate results by displaying every 5th denoising step, showing the image gradually sharpening as noise is removed. Finally, I compare three "clean" outputs from the same starting noisy image: the result of full iterative denoising, a single-step denoising estimate computed as in 1.3, and a Gaussian-blurred version as in 1.2, and assemble them, along with the intermediate frames, into a panel image. This demonstrates that iterative denoising with a strided schedule yields a much better reconstruction than either a one-shot UNet pass or classical Gaussian filtering, while satisfying all required visualizations and comparisons.
From left to right: steps 0, 5, 10, 15, and 22.
From left to right: iterative denoising, one-step denoising, and Gaussian blur.
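Here is a condensed sketch of the iterative loop. The coefficient algebra is the standard DDPM posterior mean, and the add_variance call uses a hypothetical signature for the provided helper:
import torch

@torch.no_grad()
def iterative_denoise(stage_1, x, strided_timesteps, i_start, prompt_embeds,
                      alphas_cumprod, add_variance):
    # Denoise x (noised to strided_timesteps[i_start]) down to t = 0 along the strided schedule.
    alphas_cumprod = alphas_cumprod.to(device=x.device, dtype=x.dtype)
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        out = stage_1.unet(x, torch.tensor([t], device=x.device),
                           encoder_hidden_states=prompt_embeds).sample
        eps, var = out[:, :3], out[:, 3:]
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1 - alpha_t
        # Invert the forward process to estimate the clean image.
        x0_hat = ((x - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()).clamp(-1, 1)
        # Standard DDPM posterior mean: blend the clean estimate with the current noisy image.
        mean = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
               + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x
        x = add_variance(mean, t, var)  # hypothetical signature for the provided helper
    return x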
Diffusion Model Sampling
To turn the diffusion model into a pure sampler, I reused my iterative_denoise routine but started directly from Gaussian noise instead of a real image. For each of five samples, I drew an i.i.d. tensor noise ~ N(0, I) of shape (1, 3, 64, 64), moved it to the GPU in half precision, and passed it to iterative_denoise with i_start = 0 and the "a high quality photo" text embedding so that the UNet iteratively transports the noise along the reverse diffusion trajectory back to a clean image. The function returns a "clean" image in [-1, 1], which I rescaled to [0, 1], converted to HWC RGB, and displayed in a 5-column grid, labeling them as Samples 1–5.
Classifier-Free Guidance (CFG)
For classifier-free guidance, I modified my iterative sampler into an iterative_denoise_cfg function that runs the UNet twice at every timestep: once with the conditional prompt embedding for "a high quality photo" and once with the true null prompt "" to obtain an unconditional prediction. From these two outputs, I split off the noise estimates ε_cond and ε_uncond and form a guided noise prediction ε = ε_uncond + γ(ε_cond − ε_uncond) with a guidance scale γ = 7, while reusing only the conditional predicted variance in the add_variance update. As in 1.4, I then invert the forward process to get x̂_0 at each step and apply the DDPM update coefficients to move from t to t' along the strided timestep schedule until reaching t = 0. To sample images, I draw five independent Gaussian noise tensors, feed each through iterative_denoise_cfg with i_start = 0, rescale the resulting outputs from [-1, 1] to [0, 1], and visualize them in a 5-column grid, also saving individual JPEGs and a horizontal panel. These CFG samples, conditioned on "a high quality photo", are noticeably sharper and more coherent than the unguided samples from the previous part.
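The core of iterative_denoise_cfg is the guided noise estimate below; the surrounding update loop is the same as in the sketch above, and the default gamma of 7 matches the CFG scale used in this report:
import torch

@torch.no_grad()
def cfg_noise_estimate(stage_1, x, t_tensor, cond_embeds, uncond_embeds, gamma=7.0):
    # Run the UNet with and without the text condition, then extrapolate the noise estimate.
    cond_out = stage_1.unet(x, t_tensor, encoder_hidden_states=cond_embeds).sample
    uncond_out = stage_1.unet(x, t_tensor, encoder_hidden_states=uncond_embeds).sample
    eps_cond, var_cond = cond_out[:, :3], cond_out[:, 3:]
    eps_uncond = uncond_out[:, :3]
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)  # classifier-free guidance
    return eps, var_cond  # only the conditional variance is reused downstream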
Image-to-Image Translation
For image-to-image translation, I followed the SDEdit idea by starting from a real image, adding a controlled amount of diffusion noise, and then pushing it back onto the natural image manifold with my CFG sampler. I wrote a helper run_image_to_image_edits that, for a given input image, loops over starting indices i_start ∈ {1, 3, 5, 7, 10, 20} along the strided_timesteps schedule. For each i_start, I first apply the forward process to obtain a noisy version at timestep strided_timesteps[i_start], then run iterative_denoise_cfg from that index down to t = 0 using the conditional prompt "a high quality photo" together with the null prompt for unconditional guidance (CFG scale 7). This produces a family of "edits" that range from quite free reinterpretations when starting from very noisy states (small i_start) to subtle, almost identity-like reconstructions when starting from lightly noised images (larger i_start), illustrating how the noise level controls edit strength. I applied this pipeline to the Campanile, Palace of Fine Arts, and Golden Gate Bridge images, saving and displaying all six edited versions labeled by their i_start values and concatenating them into a horizontal panel; from left to right, the noise indices are 1, 3, 5, 7, 10, 20, and the rightmost image is the original input.
Editing Hand-Drawn and Web Images
To explore how SDEdit behaves on non-photographic inputs, I downloaded a stylized image from a URL, processed it, and then applied exactly the same editing pipeline as in 1.7: for each starting index i_start ∈ {1, 3, 5, 7, 10, 20}, I used the forward process to add diffusion noise at timestep strided_timesteps[i_start], then denoised back to t = 0 with iterative_denoise_cfg using the conditional prompt "a high quality photo" and the null prompt for CFG. I stored the resulting edits in a dictionary keyed by i_start, displayed them in a grid, and concatenated them horizontally into 1.7.1_web_edits_panel.jpg; the panel is ordered 1, 3, 5, 7, 10, 20 from left to right with the original input image at the far right, and it shows the edits moving from free, photo-like reinterpretations at high noise toward the original stylized image as the starting noise decreases. I then repeated this procedure for two hand-drawn inputs, a simple rocket sketch and a house drawing. Across all three sources, the panels illustrate how the diffusion model can project nonrealistic line drawings and stylized images onto the natural image manifold, with higher noise producing more drastic reinterpretations and lower noise preserving more of the original structure.
Inpainting
For inpainting, I first defined a binary mask over the Campanile image that selects the top portion of the tower as the region to be edited (mask = 1 inside the edit box, 0 elsewhere), and visualized the image, the mask, and the masked-out region to confirm it was correct. I then implemented an inpaint function that closely mirrors my CFG-based iterative sampler, but with an extra "projection" step at every diffusion iteration. Starting from pure Gaussian noise with the same shape as the original image, I ran the DDPM reverse update with classifier-free guidance: at each timestep I computed conditional and unconditional UNet outputs with the "a high quality photo" and null prompts, combined their noise estimates using ε = ε_uncond + γ(ε_cond − ε_uncond), reconstructed x̂_0, and formed the next state x_{t'} using the appropriate coefficients and learned variance via add_variance. To enforce inpainting rather than free generation, I then re-imposed the unmasked region from the original image by computing a noised version of the original at timestep t' with my forward function and blending as x_{t'} ← m ⊙ x_{t'} + (1 − m) ⊙ forward(x_orig, t'), so that only the masked area is synthesized while the rest matches the original up to the correct noise level. After iterating through all timesteps, I obtained a clean inpainted Campanile image, visualized it, and saved it. I then reused the same inpaint pipeline on two of my own test images with custom masks (Palace of Fine Arts and Golden Gate Bridge), demonstrating that the diffusion model can plausibly hallucinate new content inside the masks while preserving the untouched context.
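The projection step at the heart of inpaint is tiny; here is a minimal sketch, reusing the forward-noising formula from 1.1, with mask = 1 marking the region to synthesize:
import torch

def inpaint_project(x_t, x_orig, mask, t, alphas_cumprod):
    # After each CFG reverse-diffusion update, re-impose the known region by replacing it
    # with a forward-noised copy of the original image at the current timestep.
    abar_t = alphas_cumprod[t].to(device=x_t.device, dtype=x_t.dtype).reshape(1, 1, 1, 1)
    noised_orig = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * noised_orig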
Text-Conditional Image-to-Image Translation
For text-conditioned image-to-image translation, I reused my CFG sampler but swapped the neutral prompt for a more semantic one, using a different prompt for each image. I first grabbed the corresponding prompt embedding from prompt_embeds_dict (plus the empty-string embedding for the unconditional branch), then wrote a helper run_text_conditioned_edits that mirrors the SDEdit pipeline from 1.7: for each starting index i_start ∈ {1, 3, 5, 7, 10, 20}, I apply the forward process to get a noised version of the input at timestep strided_timesteps[i_start], and then call iterative_denoise_cfg with the input prompt and the null prompt to denoise back to t = 0. This produces a set of six edited images per input, which I rescale to [0, 1]. I first applied this procedure to the Campanile test image, and then to two of my own 64×64 test photos. For the Campanile, I used the prompt 'a humanoid robot goalkeeper diving to save a soccer ball'. For the Palace of Fine Arts, I used the prompt 'a cyberpunk city at night with neon lights and rain'. And for the Golden Gate Bridge, I used the prompt 'a photo of a sunset over the ocean'.
Visual Anagrams
For visual anagrams, I extended my CFG-based reverse sampler into a make_flip_illusion routine that enforces two different text prompts depending on whether the image is viewed upright or upside down. At each diffusion step t, I first run the UNet on the current image with the first prompt p1 (e.g., "a woman wearing a hat") and the null prompt to get conditional and unconditional noise estimates, combine them with CFG to obtain ε1, and keep the associated variance prediction. I then flip the current image vertically, run the UNet again with a second prompt p2 (e.g., "a charcoal drawing of a face") and the same null prompt, build a second guided noise estimate ε2, and flip it back. These two noise fields are averaged, ε = (ε1 + ε2)/2, and I plug this into the usual DDPM update: reconstruct x̂_0, combine it with the current x_t using the correct coefficients, and add the learned variance with add_variance to get x_{t'}. Starting from pure Gaussian noise and running this process across the full strided_timesteps schedule yields a single "illusion" image that I then rescale to [0, 1], visualize upright, and also show flipped vertically. Using this procedure, I generated two separate illusions: one that reads as a woman wearing a hat upright and hints at a charcoal drawing of a face when flipped, and another that transitions between "an oil painting of a forest in autumn" and "a lithograph of a mountain range" depending on orientation.
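The per-step noise combination for the flip illusion, built on the cfg_noise_estimate helper sketched earlier (so it inherits that helper's assumptions):
import torch

@torch.no_grad()
def anagram_noise_estimate(stage_1, x, t_tensor, embeds_1, embeds_2, null_embeds, gamma=7.0):
    # CFG estimate for prompt 1 on the upright image.
    eps_1, var_1 = cfg_noise_estimate(stage_1, x, t_tensor, embeds_1, null_embeds, gamma)
    # CFG estimate for prompt 2 on the vertically flipped image, flipped back afterwards.
    eps_2_flipped, _ = cfg_noise_estimate(stage_1, torch.flip(x, dims=[-2]), t_tensor,
                                          embeds_2, null_embeds, gamma)
    eps_2 = torch.flip(eps_2_flipped, dims=[-2])
    return (eps_1 + eps_2) / 2, var_1  # average of the two guided noise fields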
Hybrid Images
To create hybrid images with diffusion, I implemented a make_hybrids function that runs the standard DDPM reverse process but replaces the usual single noise estimate with a frequency-factorized combination of two prompts. At each timestep t, I run the UNet twice on the current latent image with the same unconditional embedding but two different conditional text prompts p1 and p2 (e.g., "a rocket ship" and "a pencil"), producing two conditional noise predictions. For each branch, I form a classifier-free guided noise estimate by blending the conditional and unconditional outputs, then apply a large Gaussian blur (kernel size 33) to obtain a low-pass-filtered version of the first guided noise and a corresponding blurred version of the second. I treat the blur of the first as the low-frequency component and construct the high-frequency component of the second as its guided noise minus its blur; the final hybrid noise is ε = blur(ε1) + (ε2 − blur(ε2)). This hybrid noise replaces ε in the usual DDPM update: I reconstruct x̂_0, combine it with x_t using the correct coefficients, add the learned variance via add_variance, and iterate over the full strided_timesteps schedule starting from pure Gaussian noise. After denoising, I rescale the final tensor back to [0, 1], visualize, and save the resulting hybrid image, which looks like one concept up close (dominated by the high frequencies from p2) and the other at a distance (dominated by the low frequencies from p1). I repeated this pipeline with two different pairs of prompts to produce two distinct hybrid illusions: the first image below uses the prompts 'an oil painting of a forest in autumn' and 'a lithograph of a mountain range', and the second uses 'a woman wearing a hat' and 'a charcoal drawing of a face'.
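And the corresponding frequency-factorized noise for hybrids, again reusing cfg_noise_estimate; the blur sigma below is an illustrative value since only the kernel size is stated above:
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(stage_1, x, t_tensor, embeds_low, embeds_high, null_embeds,
                          gamma=7.0, kernel_size=33, sigma=2.0):
    # Guided noise for the "low-frequency" prompt and the "high-frequency" prompt.
    eps_low, var = cfg_noise_estimate(stage_1, x, t_tensor, embeds_low, null_embeds, gamma)
    eps_high, _ = cfg_noise_estimate(stage_1, x, t_tensor, embeds_high, null_embeds, gamma)
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high, var  # low frequencies from one prompt, high frequencies from the other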
Flow Matching
Implementing the UNet
For this section, I implemented the unconditional denoiser exactly as the UNet in the spec, starting by coding the "simple ops" and then composing them into the full architecture. Conv, DownConv, and UpConv are all small Conv-BN-GELU modules: Conv keeps the spatial size fixed (3×3, stride 1), DownConv halves the resolution (3×3, stride 2), and UpConv upsamples by 2 using a transposed convolution (4×4, stride 2, padding 1). Flatten is an AvgPool2d with kernel size 7 that turns a 7×7 feature map into 1×1, and Unflatten is a ConvTranspose2d with kernel size 7 (stride 7, padding 0) that expands a 1×1 feature map back to 7×7. I then defined composed blocks that match the figure: ConvBlock stacks two Conv layers, DownBlock applies a DownConv followed by a ConvBlock, and UpBlock applies an UpConv followed by a ConvBlock. Using these pieces, the UnconditionalUNet takes a 1×28×28 MNIST input and runs it through an encoder consisting of a ConvBlock and two DownBlocks, producing feature maps at 28×28, 14×14, and 7×7 resolution. The 7×7 features go through the Flatten/Unflatten bottleneck (1×1 → 7×7) and into the decoder, where I concatenate the skip connections channel-wise and feed them into UpBlocks with input channels 4D and 3D so that the spatial resolutions go 7×7 → 14×14 → 28×28. Finally, I concatenate the last decoder features with the first encoder features, run a final ConvBlock, and project back to 1 channel with a 3×3 Conv2d. This implements a symmetric UNet with skip connections and hidden width D, satisfying the required structure and tensor shapes for the unconditional denoiser.
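A condensed sketch of these modules is below. The specific channel widths are an assumption chosen to be consistent with the 4D and 3D skip concatenations described above; the spec figure is the authority on the exact widths:
import torch
import torch.nn as nn

class Conv(nn.Module):
    # 3x3 conv, stride 1: keeps the spatial size (Conv-BN-GELU).
    def __init__(self, cin, cout):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.BatchNorm2d(cout), nn.GELU())
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    # 3x3 conv, stride 2: halves the resolution.
    def __init__(self, cin, cout):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.BatchNorm2d(cout), nn.GELU())
    def forward(self, x):
        return self.net(x)

class UpConv(nn.Module):
    # 4x4 transposed conv, stride 2, padding 1: doubles the resolution.
    def __init__(self, cin, cout):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.BatchNorm2d(cout), nn.GELU())
    def forward(self, x):
        return self.net(x)

def ConvBlock(cin, cout):
    return nn.Sequential(Conv(cin, cout), Conv(cout, cout))

def DownBlock(cin, cout):
    return nn.Sequential(DownConv(cin, cout), ConvBlock(cout, cout))

def UpBlock(cin, cout):
    return nn.Sequential(UpConv(cin, cout), ConvBlock(cout, cout))

class UnconditionalUNet(nn.Module):
    def __init__(self, in_channels=1, D=128):
        super().__init__()
        self.start = ConvBlock(in_channels, D)            # 28x28, D channels
        self.down1 = DownBlock(D, D)                      # 14x14, D channels
        self.down2 = DownBlock(D, 2 * D)                  # 7x7, 2D channels
        self.flatten = nn.AvgPool2d(7)                    # 7x7 -> 1x1
        self.unflatten = nn.ConvTranspose2d(2 * D, 2 * D, 7, 7, 0)  # 1x1 -> 7x7
        self.up1 = UpBlock(4 * D, 2 * D)                  # cat(bottleneck, down2) -> 14x14
        self.up2 = UpBlock(3 * D, D)                      # cat(up1, down1) -> 28x28
        self.out = nn.Sequential(ConvBlock(2 * D, D), nn.Conv2d(D, in_channels, 3, 1, 1))

    def forward(self, x):
        e0 = self.start(x)
        e1 = self.down1(e0)
        e2 = self.down2(e1)
        b = self.unflatten(self.flatten(e2))
        u1 = self.up1(torch.cat([b, e2], dim=1))
        u2 = self.up2(torch.cat([u1, e1], dim=1))
        return self.out(torch.cat([u2, e0], dim=1))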
Using the UNet to Train a Denoiser
For this section, I implemented the simple Gaussian noising process z = x + σε, with ε ~ N(0, I), to visualize how corruption strength varies with σ. I loaded a single normalized MNIST training image x, then iterated over a range of σ values, sampling a fresh noise tensor for each value and forming a noisy image z = x + σε. After clamping the result back into [0, 1], I converted each noisy tensor to a NumPy array and collected them into a list. Using mediapy.show_images and a Matplotlib figure, I displayed all seven images side by side with titles indicating the corresponding σ, and I also saved the panel. This clearly shows the progression from a clean digit at σ = 0 to a nearly pure noise image at the largest σ.
Training
I trained the UnconditionalUNet from 1.1 as a denoiser on MNIST digits corrupted with Gaussian noise of a fixed standard deviation σ. I first set the hyperparameters to match the spec: batch size 256, hidden dimension D = 128, the Adam optimizer, and 5 training epochs. Using torchvision.datasets.MNIST, I built train and test loaders (training only on the train split) and, for each mini-batch, sampled fresh Gaussian noise on the fly, forming noisy inputs z = x + σε. The UNet takes these noisy images and predicts a clean reconstruction, and I optimize an MSE loss between the network output and the original clean images, backpropagating and updating with Adam each iteration (a sketch of this loop appears after the epoch panels below). Throughout training I record the per-iteration loss and at the end I plot it as a training loss curve over all iterations. After epochs 1 and 5 I switch the model to eval mode, take a batch from the test set, corrupt it with noise, run the denoiser, and visualize eight digits in three aligned rows (clean, noisy, and denoised) while also saving these panels to disk. These visualizations show the qualitative improvement in denoising from epoch 1 to epoch 5, and together with the loss curve they meet all the requirements for the training section.
epoch 1
epoch 5
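A minimal sketch of this training loop; sigma and the learning rate below are illustrative placeholders for the values prescribed in the spec:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, batch_size=256, lr=1e-4, device="cuda"):
    # Train the UNet to map z = x + sigma * eps back to the clean digit x with an MSE loss.
    loader = DataLoader(datasets.MNIST("./data", train=True, download=True,
                                       transform=transforms.ToTensor()),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    model.to(device).train()
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # fresh noise every iteration
            loss = nn.functional.mse_loss(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses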
Out-of-Distribution Testing
I evaluated the trained UNet denoiser out of distribution by fixing a single MNIST test digit and varying the corruption level σ across all seven values from 1.2.1, even though the model was trained at a single fixed σ. I switched the model to eval mode, took one image from the test loader, and for each σ sampled fresh Gaussian noise to form a noisy input z = x + σε. I then passed each noisy version through the UNet to obtain a denoised output, storing both the noisy and reconstructed images along with their corresponding labels. For visualization, I first showed the original clean digit, then displayed a row of noisy images and a matching row of denoised images across all seven σ values, and finally saved this grid as a panel figure. This setup keeps the underlying digit fixed while changing only the noise level, clearly illustrating how the denoiser performs best near its training noise level and degrades as σ moves farther away.
Denoising Pure Noise
For this section I trained a second UNet specifically to "denoise" pure Gaussian noise into clean MNIST digits by reusing the 1.2.1 setup but changing the input–target pairs from (noisy image, clean image) to (pure noise, clean image). I instantiated a fresh UnconditionalUNet with the same hidden dimension (D = 128), optimized it with Adam, and trained for 5 epochs over the MNIST training set. In each iteration, I sampled z ~ N(0, I) with the same shape as a batch of images, fed this pure noise as the network input, and kept the clean digits as the regression targets under an MSE loss, accumulating train_losses_pure to plot a training loss curve over all iterations. To visualize what the model learned as a generative denoiser, after the 1st and 5th epochs I switched to eval mode, sampled 8 fresh noise images, passed them through the trained network, and displayed/saved the resulting outputs as grids. Early in training the outputs are still noisy blobs, but by epoch 5 they clearly resemble MNIST-like digits (0–9) with recognizable strokes, though often somewhat blurry or averaged-looking; I believe this happens because the MSE objective encourages the network to map arbitrary noise toward the mean of the training distribution, so it learns to output prototypical digit shapes even when the input contains no structure.
epoch 1
epoch 5
Adding Time Conditioning to UNet
For this section, I extended my original MNIST UNet into a time-conditional model by introducing a small fully connected conditioning network and wiring it into the bottleneck and decoder, exactly as in the diagram below. I first implemented an FCBlock that maps a scalar time input of size 1 through a two-layer MLP with GELU, producing a feature vector of size 2D; this block is instantiated twice (fc1_t and fc2_t) to modulate the Unflatten output and the first upsampling block. In TimeConditionalUNet, I reused the encoder/decoder from the unconditional UNet (the same downsampling stages, Flatten/Unflatten bottleneck, two up blocks, and an output conv), then added the two FCBlocks and, in the forward pass, took the normalized times t ∈ [0, 1], reshaped them to shape (N, 1), and passed them through fc1_t and fc2_t. I reshaped these outputs to (N, C, 1, 1) and multiplied them element-wise into the Unflatten output and the first up block output, respectively, thereby injecting the time conditioning as a learned, per-channel scaling while preserving the spatial structure. On top of this architecture, I implemented the time flow-matching training objective time_fm_forward, which samples a random pair (noise x_0 and data x_1), forms the linear interpolation x_t = (1 − t) x_0 + t x_1, defines the target "velocity" x_1 − x_0, and trains the UNet via an MSE loss to predict this velocity from (x_t, t). Finally, I wrote a sampling routine time_fm_sample and a wrapper TimeConditionalFM that integrate the learned velocity field forward in time using an explicit Euler scheme over a discretized grid of num_ts timesteps, starting from Gaussian noise and updating x ← x + (1/num_ts)·u_θ(x, t) to synthesize new images.
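A sketch of the conditioning block and the flow-matching objective, assuming the UNet's forward takes (x_t, t) with t as an (N, 1) tensor of normalized times:
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    # Two-layer MLP with GELU that maps a scalar condition to a per-channel scaling vector.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))
    def forward(self, x):
        return self.net(x)

def time_fm_forward(unet, x1):
    # Flow-matching step: interpolate linearly from noise (t = 0) to data (t = 1)
    # and regress the constant velocity x1 - x0 with an MSE loss.
    n = x1.shape[0]
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(n, 1, 1, 1, device=x1.device)  # t ~ U[0, 1]
    x_t = (1 - t) * x0 + t * x1                   # linear interpolation
    pred = unet(x_t, t.reshape(n, 1))             # predicted velocity u_theta(x_t, t)
    return nn.functional.mse_loss(pred, x1 - x0)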
Training the UNet
For this section, I trained the time-conditioned UNet using the flow-matching objective from Algorithm B.1 on the MNIST training set and recorded the loss over the whole optimization. I first set up a DataLoader on torchvision.datasets.MNIST with batch_size = 64, shuffling only the training split and applying a simple ToTensor transform. The model is my TimeConditionalUNet with hidden width D, wrapped in the TimeConditionalFM module so that each forward pass samples a random timestep t ~ U[0, 1], draws a Gaussian noise image x_0, forms the interpolated noisy image x_t = (1 − t) x_0 + t x_1, and computes the flow-matching MSE loss between the predicted velocity and x_1 − x_0. In the training loop, for each batch I move the images to the GPU, call fm_model(images) to get this loss (ensuring noise is re-sampled on the fly as batches are fetched), backpropagate, and update the parameters with Adam. I also attach an exponential learning-rate scheduler with decay factor 0.1^(1/num_epochs) and call scheduler.step() once per epoch so that the LR decays smoothly to 0.1× its initial value by the end of 20 epochs. I accumulate the per-iteration losses in train_losses_fm, periodically save model checkpoints at epochs 1, 5, and 10, and finally plot loss versus iteration.
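The scheduler setup is just a few lines; the initial learning rate below is an illustrative placeholder and the model is a stand-in:
import torch
import torch.nn as nn

model = nn.Linear(1, 1)            # stand-in for the TimeConditionalUNet
initial_lr, num_epochs = 1e-3, 20  # illustrative initial LR; 20 epochs as in the report
optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
# Decay factor chosen so the LR reaches 0.1x its initial value after num_epochs epochs.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1 ** (1.0 / num_epochs))
for epoch in range(num_epochs):
    # ... one pass over the MNIST training set, stepping the optimizer per batch ...
    scheduler.step()  # decay once per epoch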
Sampling from the UNet
I used my trained time-conditioned UNet checkpoints to visualize how sampling quality improves over training. I first defined a helper load_fm_model that reconstructs a TimeConditionalUNet with hidden dimension D, wraps it in the TimeConditionalFM flow-matching wrapper with num_ts = 50 timesteps, loads the saved weights from fm_epoch_1.pt, fm_epoch_5.pt, and fm_epoch_10.pt, and switches the model to eval mode. For each of these epochs, I called fm_model_epoch.sample(img_wh=(28, 28), seed=0) to run Algorithm B.2: starting from Gaussian noise and repeatedly adding (1/num_ts)·u_θ(x, t) over 50 Euler steps to produce a batch of synthesized digits. I clamped the resulting samples to [0, 1], extracted the grayscale channel, and arranged them in a Matplotlib grid with 16 columns and one row per epoch. This gives clear side-by-side sampling results from the time-conditioned UNet after 1, 5, and 10 epochs and shows that the digits become progressively more legible as training proceeds.
First row: 1 epoch.
Second row: 5 epochs.
Third row: 10 epochs.
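A sketch of the Euler sampler described above, assuming the time-conditioned UNet takes (x, t) with t of shape (N, 1):
import torch

@torch.no_grad()
def time_fm_sample(unet, num_samples=16, num_ts=50, img_wh=(28, 28), device="cuda", seed=0):
    # Integrate the learned velocity field from t = 0 (noise) to t = 1 (data) with explicit Euler.
    torch.manual_seed(seed)
    x = torch.randn(num_samples, 1, *img_wh, device=device)
    for i in range(num_ts):
        t = torch.full((num_samples, 1), i / num_ts, device=device)
        x = x + (1.0 / num_ts) * unet(x, t)  # x <- x + (1/T) * u_theta(x, t)
    return x.clamp(0, 1)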
Adding Class-Conditioning to UNet
I extended the time-conditioned UNet to be class-conditioned on MNIST digit labels 0–9 by adding two additional FCBlocks for the class vector and combining them with the existing time FCBlocks. In ClassConditionalUNet, I keep the same encoder–decoder UNet backbone as before, then embed the scalar time with fc1_t/fc2_t and the one-hot digit class with fc1_c/fc2_c. After reshaping these embeddings to (N, C, 1, 1), I inject them at the two modulation points shown in the handout: first I compute the bottleneck feature u = unflatten(b) and modulate it as u = c1 * u + t1, and later I modulate the first upsampling feature as u1 = c2 * u1 + t2, so the network is conditioned jointly on time and class. To implement classifier-free guidance during training, class_fm_forward samples a random interpolation timestep, builds a flow-matching pair (x_t, x_1 − x_0), and then applies a Bernoulli mask with probability p_uncond = 0.1 that zeros out the class one-hot vector, forcing the model to sometimes learn an unconditional flow. At sampling time, class_fm_sample runs the same flow-matching ODE but, at each step, evaluates the UNet twice: once with an all-zero class vector (u_uncond) and once with full conditioning (u_cond), then combines them with the classifier-free guidance rule u = u_uncond + γ(u_cond − u_uncond) using a configurable guidance_scale γ. Finally, I wrap everything in ClassConditionalFM, which exposes both a training forward method and a sample method that returns class-conditioned samples (and an animation cache).
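A sketch of the class-conditional sampler with classifier-free guidance, assuming the UNet takes (x, t, c) with c a one-hot class matrix (all zeros = unconditional); the default guidance scale is illustrative:
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_fm_sample(unet, labels, num_ts=50, guidance_scale=5.0, img_wh=(28, 28),
                    num_classes=10, device="cuda", seed=0):
    torch.manual_seed(seed)
    n = labels.shape[0]
    c = F.one_hot(labels.to(device), num_classes).float()
    x = torch.randn(n, 1, *img_wh, device=device)
    for i in range(num_ts):
        t = torch.full((n, 1), i / num_ts, device=device)
        u_cond = unet(x, t, c)                               # conditional velocity
        u_uncond = unet(x, t, torch.zeros_like(c))           # unconditional velocity (class dropped)
        u = u_uncond + guidance_scale * (u_cond - u_uncond)  # CFG combination
        x = x + (1.0 / num_ts) * u                           # Euler step
    return x.clamp(0, 1)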
Training the UNet
For this section, I trained the class-conditioned UNet almost identically to the time-only model, but now conditioning on both the timestep and the MNIST digit label and using classifier-free dropout. I loaded the MNIST training set with a batch size of 64 and constructed a ClassConditionalUNet with hidden width D, then wrapped it in the ClassConditionalFM flow-matching module with num_ts integration steps and an unconditional probability p_uncond = 0.1, matching the training algorithm in the handout that randomly drops the class conditioning during training. The optimizer is Adam, and I applied an exponential LR scheduler with decay factor 0.1^(1/num_epochs) so the learning rate decays smoothly across 20 epochs. In each epoch, for every minibatch of images and labels, I computed the class-conditional flow-matching loss loss = cc_fm_model(images, labels), backpropagated, and updated the parameters while accumulating both per-iteration and per-epoch losses; I also checkpointed the model at epochs 1, 5, 10, and 20. After training, I plotted the full history of train_losses_cc versus iteration.
Sampling from the UNet
For the last section, I used my trained class-conditional flow-matching model to generate MNIST samples with classifier-free guidance. I first wrote a small loader that reconstructs the ClassConditionalUNet with hidden size D, wraps it in ClassConditionalFM with num_ts steps and p_uncond = 0.1, and then loads the saved checkpoints from epochs 1, 5, and 10. For each checkpoint, I created a label tensor c that repeats the digits 0–9 four times, so the sampler produces exactly four instances of every class, as required. I then called cc_fm.sample with this label tensor, image size 28×28, and a classifier-free guidance scale γ, which internally computes both conditional and unconditional flows and combines them as u = u_uncond + γ(u_cond − u_uncond) while integrating forward in time. The resulting samples are clamped to [0, 1], reshaped into a 4×10 grid (rows = different draws, columns = digits 0–9), and saved as images for epochs 1, 5, and 10. These panels visually demonstrate how class-conditional sampling quality improves over training.
Bells and Whistles: Removing the Learning-Rate Scheduler
I explored whether we could do without the learning-rate scheduler by removing the ExponentialLR schedule and instead using a fixed Adam learning rate. To compensate for the lack of annealing, I lowered the base learning rate to a smaller constant value (so the effective step size late in training matches what the scheduled run would have seen) and kept the same batch size, number of timesteps, and 20-epoch budget. After training, I regenerated the same 4×10 class-conditional sample grids at epoch 10 and compared them to the scheduler run. Qualitatively, the no-scheduler samples at epoch 10 still show clean, well-separated digits 0–9 with similar sharpness and diversity. Overall, tuning the fixed learning rate to be slightly smaller allowed me to match the performance of the exponential scheduler while keeping the optimization setup simpler; the accompanying grids from the no-scheduler run demonstrate that good generative quality can be maintained without LR decay.