Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
Foley Control is a lightweight approach to video-guided Foley that keeps
pretrained single-modality models frozen and learns only a small
cross-attention bridge between them.
We present Stable Video Materials 3D (SViM3D), a framework to predict
multi-view consistent physically based rendering (PBR) materials, given a
single image. Recently, video diffusion models have been used to
efficiently reconstruct 3D objects from a single image.
We introduce Reservoir SWD (ReSWD), which integrates Weighted Reservoir
Sampling into Sliced Wasserstein Distance (SWD) to adaptively retain
informative projection directions across optimization steps, yielding
stable gradients while remaining unbiased.
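The blurb does not spell out how directions are retained; as a rough illustration, weighted reservoir sampling in the A-Res style (key u^(1/w), keep the k largest keys) is the standard way to keep a fixed-size, unbiased weighted sample from a stream. The function name and the idea of weighting directions by an informativeness score are assumptions for this sketch, not the paper's implementation.

```python
import heapq
import random


def weighted_reservoir_sample(items, weights, k, rng=random):
    """A-Res weighted reservoir sampling: each item draws a key
    u**(1/w) for u ~ Uniform(0, 1); the k items with the largest
    keys form an unbiased weighted sample of the stream."""
    heap = []  # min-heap of (key, item); root is the weakest survivor
    for item, w in zip(items, weights):
        if w <= 0:
            continue  # zero-weight items can never be selected
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

In a ReSWD-like setting, `items` would be random projection directions and `weights` a per-direction informativeness score (e.g. its contribution to the sliced loss), so informative directions survive across optimization steps.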
We introduce Stable Cinemetrics, a structured evaluation framework that
formalizes filmmaking controls into four disentangled, hierarchical
taxonomies: Setup, Event, Lighting, and Camera.
We study how musicians use artificial intelligence (AI) across formats such
as singles, albums, performances, installations, voices, ballets, operas,
and soundtracks.
We present SD3.5-Flash, an efficient few-step distillation framework that
brings high-quality image generation to accessible consumer devices.
We present Stable Part Diffusion 4D (SP4D), a framework for generating
paired RGB and kinematic part videos from monocular inputs.
Editing the materials of objects in images based on exemplar images is an
active area of research in computer vision and graphics. We propose MARBLE,
a method for blending materials and recomposing fine-grained material
properties by finding material embeddings in CLIP-space and using them to
control pre-trained text-to-image models.
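The blurb only says that material embeddings live in CLIP-space; a minimal sketch of what "blending" two such embeddings could look like is spherical-surface-friendly linear interpolation with renormalization (CLIP embeddings are conventionally L2-normalized). The function name and the choice of plain lerp-plus-renormalize are assumptions, not MARBLE's actual method.

```python
import numpy as np


def blend_material_embeddings(emb_a, emb_b, alpha):
    """Blend two (hypothetical) CLIP-space material embeddings.

    Normalizes both inputs, linearly interpolates with weight
    `alpha` toward `emb_b`, and projects back to the unit sphere.
    """
    emb_a = emb_a / np.linalg.norm(emb_a)
    emb_b = emb_b / np.linalg.norm(emb_b)
    mixed = (1.0 - alpha) * emb_a + alpha * emb_b
    return mixed / np.linalg.norm(mixed)
```

The blended vector could then be fed to whatever conditioning pathway the text-to-image model exposes for image or material embeddings.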
We present Adversarial Relativistic-Contrastive (ARC) post-training, the
first adversarial acceleration algorithm for diffusion/flow models not
based on distillation.
We present a novel framework for generating high-quality, animatable 4D
avatars from a single image. While recent advances have shown promising
results in 4D avatar creation, existing methods either require extensive
multiview data or struggle with shape accuracy and identity consistency.