Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

arXiv:2507.18064v2

Abstract: Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness under complex lighting conditions. In this paper, we propose VLM-IMI, a framework that adapts large vision-language models with iterative and manual instructions for generative LLIE. VLM-IMI comprises two main branches: Normal-Light Instruction Prior Generation (NL-IPG) and Instruction-aware Light Enhancement Diffusion (IA-LED). NL-IPG derives textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. IA-LED incorporates these instruction priors to guide the diffusion process, enabling precise illumination enhancement. To integrate the cross-modal priors effectively, we introduce a learnable instruction prior fusion module that dynamically aligns and fuses image and text features, promoting detailed and semantically coherent outputs. Because ground-truth normal-light images are unavailable at inference time, we propose an iterative instruction strategy that progressively refines the textual instructions, improving visual quality round by round. VLM-IMI also inherently supports manual instruction control, allowing users to input custom instructions into the LLM to obtain user-expected outputs. Experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in perceptual quality and realism. The source code is available at: https://github.com/sunxiaoran01/VLM-IMI.
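The iterative instruction strategy at inference can be pictured as a simple loop: predict a normal-light description from the current result, use it to condition the diffusion-based enhancer, and repeat. The sketch below is only an illustration of that loop under stated assumptions; all names (`generate_instruction`, `enhance`, `num_iters`, `manual_instruction`) are hypothetical and not taken from the official VLM-IMI code, which should be consulted in the linked repository.

```python
# Minimal sketch of the iterative-instruction inference loop described in the
# abstract. The callables stand in for the VLM (image -> textual instruction)
# and the instruction-aware diffusion branch (image + instruction -> enhanced
# image); their names and signatures are assumptions, not the paper's API.

from typing import Any, Callable, Optional


def iterative_instruction_enhance(
    low_light_image: Any,
    generate_instruction: Callable[[Any], str],
    enhance: Callable[[Any, str], Any],
    num_iters: int = 3,
    manual_instruction: Optional[str] = None,
) -> Any:
    """Refine the textual instruction over several rounds at inference time.

    No ground-truth normal-light image exists at inference, so the instruction
    is first predicted from the low-light input (or supplied manually), used to
    guide enhancement, and then re-estimated from the improved result. Each
    round should yield a more faithful description and a better enhancement.
    """
    # Manual instruction control: a user-supplied description overrides the
    # first automatically generated instruction.
    instruction = manual_instruction or generate_instruction(low_light_image)
    image = low_light_image
    for _ in range(num_iters):
        # Always enhance the original low-light input, conditioned on the
        # current instruction.
        image = enhance(low_light_image, instruction)
        # Re-describe the cleaner output to refine the instruction.
        instruction = generate_instruction(image)
    return image
```

In this reading, the enhancement is always applied to the original low-light input while only the conditioning text is updated; whether VLM-IMI re-enhances the original or the intermediate result is a design detail best checked against the released implementation.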
