Hi! I'm trying to sanity-check an assumption about reproducibility in diffusion video generation.
Suppose I run the same video diffusion model on two different GPU architectures, with:
- identical model weights and implementation (same attention backend, etc.)
- identical prompt and sampling parameters (same number of denoising steps, etc.)
- a deterministic sampler (no extra noise injected during inference)
- the exact same starting noise latent (pinned as in the sketch below)
Could I expect more or less the same generated video?
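To make the setup concrete, here's a minimal sketch of what I mean by pinning the starting latent, assuming a PyTorch pipeline that takes a precomputed latents tensor (the latent shape and the pipeline call are placeholders, not any specific model's API):

```python
import torch

# Ask PyTorch for deterministic kernels where they exist
# (helps within one machine; it can't make two GPU architectures agree bitwise).
torch.use_deterministic_algorithms(True, warn_only=True)

# Hypothetical latent shape: (batch, channels, frames, height, width)
shape = (1, 4, 16, 64, 64)
gen = torch.Generator(device="cpu").manual_seed(1234)
latents = torch.randn(shape, generator=gen, dtype=torch.float32)
torch.save(latents, "init_latents.pt")

# On each machine: load the identical tensor and hand it to the sampler.
# (The pipeline call below is a placeholder, not a specific library's API.)
# latents = torch.load("init_latents.pt").to("cuda", dtype=torch.float16)
# video = pipe(prompt=prompt, num_inference_steps=50, latents=latents)
```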
I understand there's no way to guarantee bitwise-identical outputs, because floating-point arithmetic differs across architectures. But could those differences realistically make the generated videos so different that it would be immediately noticeable to the human eye, or would one normally expect only tiny pixel-level / minor perceptual differences? My main worry is small numerical discrepancies compounding over the iterative denoising steps.
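If it helps, this is roughly how I was planning to quantify the drift between the two renders, assuming both videos are decoded to uint8 arrays of identical shape:

```python
import numpy as np

def per_frame_psnr(video_a: np.ndarray, video_b: np.ndarray) -> np.ndarray:
    """PSNR of each frame between two uint8 videos of shape (frames, H, W, 3)."""
    a = video_a.astype(np.float64)
    b = video_b.astype(np.float64)
    mse = ((a - b) ** 2).mean(axis=(1, 2, 3))  # one MSE value per frame
    return 10.0 * np.log10((255.0 ** 2) / np.maximum(mse, 1e-12))

# psnr = per_frame_psnr(video_gpu_a, video_gpu_b)
# print(psnr.min(), psnr.mean())  # consistently high PSNR (roughly 40 dB+) would suggest near-imperceptible drift
```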