Hi! I'm trying to sanity-check an assumption about reproducibility in diffusion video generation.
Suppose I run the same video diffusion model on two different GPU architectures, with:
- identical model weights and implementation (same attention backend, etc.)
- identical prompt and sampling parameters (same number of denoising steps, etc.)
- a deterministic sampler (no extra noise injected during inference)
- the exact same starting noise latent (pinned as in the sketch below)
Could I expect more or less the same generated video?
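To make the setup concrete, here's a minimal sketch of what I mean by pinning the starting latent, assuming a PyTorch pipeline that takes a precomputed latents tensor (the latent shape and the pipeline call are placeholders, not any specific model's API):

```python
import torch

# Ask PyTorch for deterministic kernels where they exist
# (helps within one machine; it can't make two GPU architectures agree bitwise).
torch.use_deterministic_algorithms(True, warn_only=True)

# Hypothetical latent shape: (batch, channels, frames, height, width)
shape = (1, 4, 16, 64, 64)
gen = torch.Generator(device="cpu").manual_seed(1234)
latents = torch.randn(shape, generator=gen, dtype=torch.float32)
torch.save(latents, "init_latents.pt")

# On each machine: load the identical tensor and hand it to the sampler.
# (The pipeline call below is a placeholder, not a specific library's API.)
# latents = torch.load("init_latents.pt").to("cuda", dtype=torch.float16)
# video = pipe(prompt=prompt, num_inference_steps=50, latents=latents)
```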
I understand there's no way to guarantee bitwise-identical outputs, because floating-point arithmetic differs across architectures. But could those differences realistically make the generated videos so different that it would be immediately noticeable to the human eye, or would one normally expect only tiny pixel-level / minor perceptual differences? My main worry is small numerical discrepancies compounding over the iterative denoising steps.
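If it helps, this is roughly how I was planning to quantify the drift between the two renders, assuming both videos are decoded to uint8 arrays of identical shape:

```python
import numpy as np

def per_frame_psnr(video_a: np.ndarray, video_b: np.ndarray) -> np.ndarray:
    """PSNR of each frame between two uint8 videos of shape (frames, H, W, 3)."""
    a = video_a.astype(np.float64)
    b = video_b.astype(np.float64)
    mse = ((a - b) ** 2).mean(axis=(1, 2, 3))  # one MSE value per frame
    return 10.0 * np.log10((255.0 ** 2) / np.maximum(mse, 1e-12))

# psnr = per_frame_psnr(video_gpu_a, video_gpu_b)
# print(psnr.min(), psnr.mean())  # consistently high PSNR (roughly 40 dB+) would suggest near-imperceptible drift
```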