DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

University of New South Wales (UNSW Sydney)

Two monkeys are piloting an airplane.

Two cats watering roses in a greenhouse.

a toy poodle as a rocket scientist.

a tower of cheese.

A painting of a koala wearing a princess dress and crown, with a confetti background.

Harry potter as a cat, pixar style, octane render, HD, high-detail.

Gnomes are playing music during Independence Day festivities in a forest near Lake George.

paw patrol. ’This is some serious gourmet’. 2 dogs holding mugs.

Disease Monitoring: Through big data technology, trends in specific diseases can be monitored and predicted, thus improving disease prevention and treatment effectiveness.

A small green dinosaur toy with orange spots standing on its hind legs and roaring with its mouth open.

Award-winning Kawaii illustration of a cat samurai, holding two swords, background cyberpunk Styles, 4k, golden hour, cinematic light.

A slime monster.

crop top skinny russian 12 years old teen girl at the water mountain, HDR magazine photo.

A young woman witch cosplaying with a magic wand and broom, wearing boots, and posing in a full body shot with a detailed face.

A happy daffodil with big eyes, multiple leaf arms and vine legs, rendered in 3D Pixar style.

A 3D Rendering of a cockatoo wearing sunglasses. The sunglasses have a deep black frame with bright pink lenses. Fashion photography, volumetric lighting, CG rendering.

The image is a portrait of Homer Simpson as a Na’vi from Avatar, created with vibrant colors and highly detailed in a cinematic style reminiscent of romanticism by Eugene de Blaas and Ross Tran, available on Artstation with credits to Greg Rutkowski.

Anthropomorphic beagle dog wearing steampunk time traveller outfit, clocks and large round window above, photoreal epic composition, old world deco, tv commercial, sebastian kruger, artem, epic lighting, by Heinz Anger, wow factor, aardman animations, blocking the sun, very artistic pose, alexander abdulov.

Chic Fantasy Compositions, Ultra Detailed Artistic, Midnight Aura, Night Sky, Dreamy, Glowing, Glamour, Glimmer, Shadows, Oil On Canvas, Brush Strokes, Smooth, Ultra High Definition, 8k, Unreal Engine 5, Ultra Sharp Focus, Art By magali villeneuve, rossdraws, Intricate Artwork Masterpiece, Matte Painting Movie Poster.

Full Portrait of Consort Chunhui by Giuseppe Castiglione, symmetrical face, ancient Chinese painting, single face, insanely detailed and intricate, beautiful, elegant, artstation, character concept in the style illustration by Miho Hirano, Giuseppe Castiglione –ar 9:16.



Abstract

Aligning text-to-image diffusion models is critical for bringing generated images in line with human preferences. Training-based methods are constrained by high computational costs and dataset requirements, while training-free alignment methods remain underexplored and are often limited by inaccurate guidance.

We propose DyMO, a plug-and-play, training-free alignment method that aligns generated images with human preferences during inference. In addition to text-aware human preference scores, we introduce a semantic alignment objective that strengthens semantic alignment in the early stages of diffusion, based on the observation that attention maps effectively reflect the semantics of noisy images. We further propose dynamic scheduling of the multiple objectives and of intermediate recurrent steps, to reflect the differing requirements at different denoising steps.

Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.



Method Overview

The framework of our method. (a) Given a user prompt, we use LLMs to identify the entities and their corresponding attributes and construct a knowledge graph. We then design a semantic alignment objective that aligns cross-attention maps based on this graph; it works together with a pre-trained preference model to dynamically guide the denoising process toward high-quality image generation. (b) The entire denoising process, shown as one-step predicted clean images, under the guidance of our method.
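As a rough illustration of what dynamically scheduling two guidance objectives could look like, the sketch below blends a semantic alignment gradient and a preference gradient with step-dependent weights. Everything in it is an assumption made for illustration: the function names `objective_weights` and `combined_guidance`, the linear ramp, and the use of plain Python lists in place of latent tensors are hypothetical and are not taken from the DyMO implementation. The only property borrowed from the text is that semantic alignment is emphasized in the early denoising steps.

```python
from typing import List, Tuple


def objective_weights(step: int, total_steps: int) -> Tuple[float, float]:
    """Return (semantic_weight, preference_weight) for one denoising step.

    ASSUMPTION: a linear ramp, chosen only for illustration. The semantic
    weight is largest at the first step and the preference weight is
    largest at the last step.
    """
    progress = step / max(total_steps - 1, 1)  # 0.0 at first step, 1.0 at last
    return 1.0 - progress, progress


def combined_guidance(
    grad_semantic: List[float],
    grad_preference: List[float],
    step: int,
    total_steps: int,
) -> List[float]:
    """Blend two guidance gradients using the step-dependent weights.

    The gradients are plain lists standing in for latent-shaped tensors.
    """
    w_sem, w_pref = objective_weights(step, total_steps)
    return [
        w_sem * gs + w_pref * gp
        for gs, gp in zip(grad_semantic, grad_preference)
    ]


if __name__ == "__main__":
    # Toy gradients; in a real sampler these would come from the two objectives.
    g_sem, g_pref = [1.0, 2.0], [3.0, 4.0]
    for t in (0, 5, 10):
        print(t, objective_weights(t, 11), combined_guidance(g_sem, g_pref, t, 11))
```

In a real sampler the blended gradient would be applied to the latent at each step; other schedule shapes (step functions, cosine ramps) would slot into `objective_weights` without changing the rest.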



Quantitative Results

Comparison of AI feedback on SD V1.5-based methods.

| Methods | PickScore | HPSv2 | ImageReward | Aesthetics |
|---|---|---|---|---|
| SD V1.5 | 20.73 | 0.2341 | 0.1697 | 5.337 |
| DNO | 20.05 | 0.2591 | -0.3212 | 5.597 |
| PromptOpt | 20.26 | 0.2490 | -0.3366 | 5.465 |
| FreeDom | 21.96 | 0.2605 | 0.3963 | 5.515 |
| AlignProp | 20.56 | 0.2627 | 0.1128 | 5.456 |
| Diffusion-DPO | 20.97 | 0.2656 | 0.2989 | 5.594 |
| Diffusion-KTO | 21.15 | 0.2719 | 0.6156 | 5.697 |
| SPO | 21.46 | 0.2671 | 0.2321 | 5.702 |
| SD V1.5+Ours | 23.07 | 0.2755 | 0.7170 | 5.831 |

Comparison of AI feedback on SDXL-based methods.

| Methods | PickScore | HPSv2 | ImageReward | Aesthetics |
|---|---|---|---|---|
| SDXL | 21.91 | 0.2602 | 0.7755 | 5.960 |
| DNO | 22.14 | 0.2725 | 0.9053 | 6.042 |
| PromptOpt | 21.98 | 0.2708 | 0.8671 | 5.881 |
| FreeDom | 22.13 | 0.2719 | 0.7722 | 5.908 |
| SDXL+Ours | 24.90 | 0.2839 | 1.074 | 6.138 |
| Diffusion-DPO | 22.30 | 0.2741 | 0.9789 | 5.891 |
| Diff-DPO+Ours | 24.46 | 0.2836 | 1.049 | 6.116 |
| SPO | 22.81 | 0.2778 | 1.082 | 6.319 |
| SPO+Ours | 23.85 | 0.2821 | 1.166 | 6.278 |
| SD V3.5 | 21.93 | 0.2726 | 0.9697 | 5.775 |
| FLUX | 22.04 | 0.2760 | 1.011 | 6.077 |
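
To read the reported numbers as gains over the backbone, the short sketch below subtracts the SD V1.5 baseline row from the SD V1.5+Ours row. The values are copied from the SD V1.5 comparison; the helper name `absolute_gains` is hypothetical and the script is only a convenience for the arithmetic, not part of the evaluation protocol.

```python
from typing import Dict

# Values copied from the SD V1.5 comparison (baseline row and "+Ours" row).
SD15_BASELINE: Dict[str, float] = {
    "PickScore": 20.73, "HPSv2": 0.2341, "ImageReward": 0.1697, "Aesthetics": 5.337,
}
SD15_OURS: Dict[str, float] = {
    "PickScore": 23.07, "HPSv2": 0.2755, "ImageReward": 0.7170, "Aesthetics": 5.831,
}


def absolute_gains(baseline: Dict[str, float], method: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference (method minus baseline), rounded to 4 decimals."""
    return {name: round(method[name] - baseline[name], 4) for name in baseline}


if __name__ == "__main__":
    for metric, gain in absolute_gains(SD15_BASELINE, SD15_OURS).items():
        print(f"{metric}: {gain:+.4f}")
```

The four metrics use different scales, so the differences are best compared within a metric across methods rather than across metrics.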


User Study Results

User preference distribution on the HPSv2 benchmark.

Human evaluation with and without our method on the PartiPrompts and HPSv2 datasets.


Qualitative Results

Qualitative comparison based on SD V1.5 backbones. The text prompts used to generate the images from top to bottom are:
(1) A photo of a frog holding an apple while smiling in the forest.
(2) little tiny cub beautiful light color White fox soft fur kawaii chibi Walt Disney style, beautiful smiley face and beautiful eyes sweet and smiling features, snuggled in its soft and soft pastel pink cover, magical light background, style Thomas kinkade Nadja Baxter Anne Stokes Nancy Noel realistic.
(3) On the Mid-Autumn Festival, the bright full moon hangs in the night sky. A quaint pavilion is illuminated by dim lights, resembling a beautiful scenery in a painting. Camera type: close-up. Camera lens type: telephoto. Time of day: night. Style of lighting: bright. Film type: ancient style. HD.
(4) a gopro snapshot of an anthropomorphic cat dressed as a firefighter putting out a building fire.
(5) A rock formation in the shape of a horse, insanely detailed.
(6) A 3D Rendering of a cockatoo wearing sunglasses. The sunglasses have a deep black frame with bright pink lenses. Fashion photography, volumetric lighting, CG rendering.
(7) A swirling, multicolored portal emerges from the depths of an ocean of coffee, with waves of the rich liquid gently rippling outward. The portal engulfs a coffee cup, which serves as a gateway to a fantastical dimension. The surrounding digital art landscape reflects the colors of the portal, creating an alluring scene of endless possibilities.

Qualitative comparison based on SDXL backbones. The text prompts used to generate the images from top to bottom are:
(1) A smiling beautiful sorceress wearing a high necked blue suit surrounded by swirling rainbow aurora, hyper-realistic, cinematic, post-production.
(2) a golden retriever dressed like a General in the north army of the American Civil war. Portrait style, looking proud detailed 8k realistic super realistic Ultra HD cinematography photorealistic epic composition Unreal Engine Cinematic Color Grading portrait Photography UltraWide Angle Depth of Field hyperdetailed beautifully colorcoded insane details intricate details beautifully color graded Unreal Engine Editorial Photography Photography Photoshoot DOF Tilt Blur White Balance 32k SuperResolution Megapixel ProPhoto RGB VR Halfrear Lighting Backlight Natural Lighting Incandescent Optical Fiber Moody Lighting Cinematic Lighting Studio Lighting Soft Lighting Volumetric ContreJour Beautiful Lighting Accent Lighting Global Illumination Screen Space Global Illumination Ray Tracing Optics Scattering Glowing Shadows Rough Shimmering Ray Tracing Reflections Lumen Reflections Screen Space Reflections Diffraction Grading Chromatic Aberration GB Displacement Scan Lines Ray Traced Ray Tracing Ambient Occlusion AntiAliasing FKAA TXAA RTX SSAO Shaders.
(3) A profile picture of an anime boy, half robot, brown hair.
(4) Full body, a Super cute little girl, wearing cute little giraffe pajamas, Smile and look ahead, ultra detailed sky blue eyes, 8k bright front lighting, fine luster, ultra detail, hyper detailed 3D rendering s750.
(5) little tiny cub beautiful light color White fox soft fur kawaii chibi Walt Disney style, beautiful smiley face and beautiful eyes sweet and smiling features, snuggled in its soft and soft pastel pink cover, magical light background, style Thomas kinkade Nadja Baxter Anne Stokes Nancy Noel realistic.
(6) A swirling, multicolored portal emerges from the depths of an ocean of coffee, with waves of the rich liquid gently rippling outward. The portal engulfs a coffee cup, which serves as a gateway to a fantastical dimension. The surrounding digital art landscape reflects the colors of the portal, creating an alluring scene of endless possibilities.
(7) a white polar bear cub wearing sunglasses sits in a meadow with flowers.



More Visual Results

The entire denoising process of SD V1.5 and DyMO, including the noisy images and one-step predicted clean images at step t, respectively.

A photo of a frog holding an apple while smiling in the forest.
a white polar bear cub wearing sunglasses sits in a meadow with flowers.


BibTeX


@article{xin2024dymo,
  title={DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling},
  author={Xie, Xin and Gong, Dong},
  journal={arXiv preprint arXiv:2412.00759},
  year={2024}
}
            

