AI-generated images have gone from a technical curiosity to a production tool in under three years. A conceptual grasp of how they work, without the mathematics, is what lets you use them effectively and anticipate why a given prompt produces a given result.

The logic of reverse noise

Diffusion models, which underlie most current image generation tools, learn to generate images by first learning to destroy them.

During training, millions of real images are progressively “corrupted” by adding Gaussian noise, a kind of static grain, in small steps until each image is completely unrecognisable. The model then learns to reverse the process: given a noisy image at a certain stage of corruption, it predicts the noise that was added, which tells it how to move back towards the clean image.
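The corruption step can be sketched in a few lines of Python. This is a minimal illustration, not any particular model's training code: the linear noise schedule, its start and end values, and the names make_alpha_bars and add_noise are assumptions chosen for the example, and a four-number list stands in for an image.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Corrupt pixels to a given level: sqrt(a) * pixel + sqrt(1 - a) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
             for x, n in zip(pixels, noise)]
    return noisy, noise  # the noise is what the model is trained to predict

rng = random.Random(0)
image = [0.8, -0.3, 0.5, 0.1]  # stand-in for pixel values scaled to [-1, 1]
alpha_bars = make_alpha_bars(1000)

# Early steps barely change the image; by the last step almost no signal is left.
for t in (0, 499, 999):
    noisy, _ = add_noise(image, alpha_bars[t], rng)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy])
```

The pair (noisy image, noise that was added) is the kind of training example the paragraph above describes: the model sees the first and is asked to predict the second.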

Once trained, the model can generate new images by starting from pure noise and applying the reverse process iteratively: at each step, it removes a little noise guided by a text signal, until it obtains a coherent image.

Image generation (simplified process):

Pure noise → [Step 1: remove noise] → Very blurry image
           → [Step 2: remove noise] → Vaguely recognisable form
           → [Step 3: remove noise] → Clear structure
           ...
           → [Step N: remove noise] → Final image
           
At each step, the process is guided by the text of the prompt.

The number of steps is a configurable parameter. More steps generally means a more detailed image but a slower process, with diminishing returns beyond a certain point. Between 20 and 50 steps is the usual range for quality images.
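The loop itself can be sketched as follows. This is a toy under stated assumptions, not a real sampler: the noise schedule is an assumed linear one, the update rule is a simple deterministic one, and toy_noise_predictor is a hypothetical stand-in that cheats by already knowing the target image. Because it cheats, every step count reaches the same result here; with a real, imperfect network, extra steps are what give it room to correct its own errors.

```python
import math
import random

TRAIN_STEPS = 1000  # assumed length of the training noise schedule

def build_alpha_bars(num_steps):
    """Fraction of signal left at each training step (assumed linear schedule)."""
    out, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (num_steps - 1))
        out.append(running)
    return out

ALPHA_BARS = build_alpha_bars(TRAIN_STEPS)
TARGET = [0.8, -0.3, 0.5, 0.1]  # the "clean image" the toy predictor steers towards

def toy_noise_predictor(x, t):
    """Hypothetical stand-in for the trained network: it knows TARGET."""
    a = ALPHA_BARS[t]
    return [(xi - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for xi, x0 in zip(x, TARGET)]

def generate(num_steps, seed=0):
    """Start from pure noise and denoise over num_steps (must be at least 2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    # Visit num_steps evenly spaced timesteps, from most noisy to least noisy.
    timesteps = [round(i * (TRAIN_STEPS - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    for idx, t in enumerate(timesteps):
        eps = toy_noise_predictor(x, t)
        a_t = ALPHA_BARS[t]
        # Estimate the clean image implied by the predicted noise.
        x0_est = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        if idx + 1 < len(timesteps):
            # Re-noise the estimate to the next, lower noise level.
            a_prev = ALPHA_BARS[timesteps[idx + 1]]
            x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
                 for x0, e in zip(x0_est, eps)]
        else:
            x = x0_est
    return x

print([round(v, 3) for v in generate(num_steps=20)])
```

The num_steps argument plays the role of the configurable parameter: it decides how many of the training noise levels the loop actually visits on the way from noise to image.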

CLIP: connecting text and image

For the diffusion process to be guided by text, the model needs a component that understands the relationship between words and images. That component is CLIP (Contrastive Language-Image Pre-training), developed by OpenAI.

CLIP was trained on hundreds of millions of (image, descriptive text) pairs and learned to represent images and text in the same mathematical space. This means it can measure how “similar” an image is to a textual description.
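The idea of a shared space can be illustrated with a toy similarity calculation. The three-dimensional vectors below are invented for the example (real CLIP embeddings have hundreds of dimensions and come from trained encoders), and cosine similarity is used here as the usual way of comparing such vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: pretend the image shows a dog on a beach.
image_embedding = [0.9, 0.1, 0.2]
captions = {
    "a dog running on a beach": [0.8, 0.2, 0.1],
    "a bowl of ramen": [0.1, 0.9, 0.3],
}

scores = {text: cosine_similarity(image_embedding, vec) for text, vec in captions.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because images and text live in the same space, the same function compares an image with a caption, two captions with each other, or two images.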

In most current diffusion models, the text encoder from CLIP (or a similar model) acts as the guide: it converts the prompt into an embedding, and at each step of the denoising process the model is conditioned on that embedding, so the image is steered towards content that matches the text of the prompt.

That explains why the prompt matters so much: it is literally the signal that guides each step of the generation.
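One common way this steering is implemented in practice is classifier-free guidance, which the text above does not name; the sketch below assumes that technique. At each step the model predicts the noise twice, once with the prompt and once without, and the difference is amplified by a guidance scale. The function name apply_guidance and the example numbers are illustrative only.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Push the prediction away from 'no prompt' and towards 'with prompt'."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

# Invented noise predictions for a two-pixel "image".
noise_without_prompt = [0.0, 1.0]
noise_with_prompt = [1.0, 3.0]

# Scale 1.0 simply returns the prompted prediction; higher values exaggerate
# whatever the prompt changed, which is why the prompt has so much leverage.
print(apply_guidance(noise_without_prompt, noise_with_prompt, 1.0))
print(apply_guidance(noise_without_prompt, noise_with_prompt, 7.5))
```

A higher scale makes the image follow the prompt more literally, usually at some cost in variety.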

The visual prompt: how it works

Visual prompting is different from text prompting for one fundamental reason: image models were trained on images paired with captions that describe them, not on instructions.

What works for text, such as “explain in detail how X works”, does not produce the best results for images. What works for images is a dense visual description: what is in the image, how it is lit, from what angle, in what artistic style, with what technical quality.

Elements of an effective visual prompt:

  • Subject: what is in the image and what it looks like (“a woman in her forties, casual clothes, smiling”)
  • Action or position: what it is doing or how it is situated
  • Environment: where the scene takes place
  • Lighting: natural, studio lighting, golden hour, dramatic backlighting
  • Style: documentary photography, editorial illustration, oil on canvas, pixel art
  • Technical quality: 4k, hyperrealistic, highly detailed, sharp focus
  • Reference artists (with caution): “in the style of Edward Hopper” activates specific visual patterns from training

Negative modifiers. Most models allow you to specify what you do NOT want in the image (negative prompt): “blurry, distorted, extra fingers, low quality, watermark.”

Fingers and hands are notoriously difficult for current models — they frequently produce more or fewer fingers than normal — and appear regularly in negative prompts.
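The elements above, together with a negative prompt, can be assembled programmatically. The sketch below is one possible way to organise them, not a format any specific tool requires: the class VisualPrompt, its field names and the comma-separated output are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class VisualPrompt:
    subject: str
    action: str = ""
    environment: str = ""
    lighting: str = ""
    style: str = ""
    quality: str = ""
    negative: list = field(default_factory=list)

    def positive_text(self) -> str:
        """Join the non-empty elements into one comma-separated description."""
        parts = [self.subject, self.action, self.environment,
                 self.lighting, self.style, self.quality]
        return ", ".join(p.strip() for p in parts if p.strip())

    def negative_text(self) -> str:
        """Join negative terms, lower-cased, dropping blanks and duplicates."""
        seen = []
        for term in self.negative:
            cleaned = term.strip().lower()
            if cleaned and cleaned not in seen:
                seen.append(cleaned)
        return ", ".join(seen)

prompt = VisualPrompt(
    subject="a woman in her forties, casual clothes, smiling",
    action="leaning on a balcony railing",
    environment="narrow street in an old town",
    lighting="golden hour",
    style="documentary photography",
    quality="sharp focus, highly detailed",
    negative=["blurry", "extra fingers", "Blurry", "watermark"],
)
print(prompt.positive_text())
print(prompt.negative_text())
```

Keeping the elements as separate fields makes it easy to change one of them, for example the lighting, while holding the rest of the description constant.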

The main models

Stable Diffusion (open source). The base of the open ecosystem. Runs locally with no API cost. Thousands of specialised variants (checkpoints) for photography, anime, architecture, fashion. Maximum control but the steepest learning curve.

Midjourney. The artistic standard. Produces images with very polished aesthetics by default, even with simple prompts. Originally operated only through Discord and now also offers a web interface. Especially good for images with strong artistic intent.

DALL·E 3 (OpenAI). Integrated in ChatGPT. The most accessible option. Understands natural language better without needing to master visual prompting. Less artistic control than Midjourney but much easier for users without experience.

Adobe Firefly. Integrated into the Adobe ecosystem and trained on appropriately licensed images. The safest option for commercial use from a rights perspective.

Flux. A more recent model family from Black Forest Labs, with openly released weights for some variants, that has matched and in some respects surpassed the quality of closed models. It has reinvigorated the open-source ecosystem.

Current limitations

Text in images. Current diffusion models produce text within images inconsistently — deformed letters, invented words. This improves with each generation of models.

Consistency across multiple images. Generating several images showing the same character with a consistent appearance requires additional techniques (ControlNet, IP-Adapter) and does not work reliably without them.

Complex composition. Scenes with many elements, precise spatial relationships or specific interactions between objects are harder to control.

Rights and ownership. Most models were trained on images collected from the internet, many of them under copyright. The legal status of generated images and the use of specific artists’ styles remain subjects of legal and ethical debate.

Image generation is probably the area where AI has advanced most visibly in the last three years. The images a quality diffusion model produces today were impossible, or required weeks of human work, just a few years ago.