
AI Creative Glossary: 85+ Terms for AI Creators

Complete reference glossary for AI art, music, and video creation. Definitions for LoRA, CFG scale, diffusion models, prompt engineering, and 85+ more terms.

The AI creative space moves fast, and the terminology can be overwhelming. LoRA, CFG scale, latent space, embeddings, inference: if you're hearing these terms in tutorials or community forums but don't fully understand them, you're not alone. This glossary is your decoder ring for the AI creative world, covering 85+ essential terms across image, music, video, and text generation.

Use the search box below to quickly find definitions, or browse alphabetically. Each term includes a plain-English explanation, context about where it's used, and links to related concepts and full articles. Bookmark this page; you'll refer back to it constantly.

A

AI Art

Visual artwork created using artificial intelligence tools, typically through text-to-image models like Midjourney, DALL-E, or Stable Diffusion. AI art generation works by training neural networks on millions of images, then using text prompts to guide the creation of new images. Controversial in some art communities but rapidly gaining mainstream acceptance.

Used in: Midjourney, DALL-E, Stable Diffusion

Aesthetic Score

A numerical rating (typically 1-10) that some AI models use to evaluate the visual appeal of generated images. Based on training data that includes human aesthetic preferences. Higher scores generally correlate with more "pleasing" compositions, but aesthetic is subjective. Some Stable Diffusion models allow you to target specific aesthetic score ranges.

Used in: Stable Diffusion (some models)

AIVA

AI music composition tool specializing in orchestral and soundtrack music. AIVA (Artificial Intelligence Virtual Artist) generates original compositions based on genre, mood, and style preferences. Popular for film scoring, video game music, and content creators needing royalty-free tracks. Offers more fine-tuned control than Suno but less vocal capability.

Category: Music generation

Aspect Ratio

The proportional relationship between width and height of an image or video, expressed as two numbers (e.g., 16:9, 4:3, 1:1). Critical for AI image generation because different ratios suit different purposes: 16:9 for landscape/YouTube, 9:16 for vertical/TikTok, 1:1 for Instagram. In Midjourney, set with --ar. In DALL-E, describe in natural language. In Stable Diffusion, set explicit width/height values.

Used in: All image and video tools
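
Because Stable Diffusion takes explicit width and height values, a ratio has to be converted into pixel dimensions first. The sketch below is one way to do that; the helper name `dimensions_for_ratio`, the 1024×1024 pixel budget, and the rounding to multiples of 64 are illustrative assumptions rather than settings from any particular tool.

```python
import math

def dimensions_for_ratio(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=64):
    """Return (width, height) near target_pixels with the requested ratio.

    Hypothetical helper: the pixel budget and rounding step are assumptions.
    """
    # Ideal (unrounded) sides for this ratio and pixel budget.
    ideal_width = math.sqrt(target_pixels * ratio_w / ratio_h)
    ideal_height = ideal_width * ratio_h / ratio_w
    # Snap each side to the nearest multiple, never below one multiple.
    width = max(multiple, round(ideal_width / multiple) * multiple)
    height = max(multiple, round(ideal_height / multiple) * multiple)
    return width, height

print(dimensions_for_ratio(16, 9))  # landscape
print(dimensions_for_ratio(9, 16))  # vertical
print(dimensions_for_ratio(1, 1))   # square
```

Rounding means the result only approximates the requested ratio, which is usually acceptable for generation.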

AUTOMATIC1111

The most popular web interface for running Stable Diffusion locally. Often called "A1111" or "SD WebUI." Provides a full-featured UI for text-to-image, image-to-image, inpainting, and more. Requires installation and a decent GPU but gives you complete control and unlimited generations. Alternative to ComfyUI and commercial services.

Used in: Stable Diffusion (local installs)

B

Batch Generation

Creating multiple images, audio tracks, or outputs in a single operation. Useful for testing variations or producing volume. Most tools support batch generation: Midjourney creates 4 images per prompt by default, Stable Diffusion lets you specify batch count, Suno can generate multiple songs at once. Be aware of credit consumption when batching on paid plans.

Used in: All AI creative tools

Bokeh

The aesthetic quality of out-of-focus areas in photography, characterized by soft, blurred backgrounds that make subjects pop. In AI image generation, requesting "bokeh" or "shallow depth of field" helps create this effect. Particularly effective for portrait prompts. Quality varies by model: photorealistic models handle bokeh better than illustration-focused ones.

Used in: Photorealistic image generation

C

CFG Scale (Classifier-Free Guidance Scale)

Controls how closely an AI model follows your prompt versus generating freely. Higher CFG values (13-20) force strict prompt adherence but can cause oversaturation and artifacts. Lower values (1-5) give the model creative freedom but may stray from your intent. The sweet spot for most use cases is 7-12. Midjourney does not expose CFG directly; its --stylize flag is a loose analogue that works in the opposite direction (lower stylize tends to mean more literal prompt adherence). Critical parameter in Stable Diffusion.

Used in: Stable Diffusion, Midjourney (as --stylize)
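
Under the hood, classifier-free guidance combines two noise predictions at every step: one made with your prompt and one made without it. The sketch below shows that combination on plain Python lists; the function name `apply_cfg` and the toy values are illustrative assumptions, and real pipelines do the same arithmetic on tensors.

```python
def apply_cfg(uncond, cond, scale):
    """Blend unconditional and prompt-conditioned predictions.

    guided = uncond + scale * (cond - uncond)
    scale = 1 returns the conditioned prediction unchanged; larger values
    push further in the direction the prompt suggests.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy two-element "predictions" (made-up numbers for illustration).
uncond = [0.0, 1.0]
cond = [1.0, 1.0]
print(apply_cfg(uncond, cond, 1.0))
print(apply_cfg(uncond, cond, 7.0))
```

The further the scale pushes past 1, the more the difference between the two predictions is exaggerated, which is one way to see why very high values can oversaturate.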

CLIP Skip

In Stable Diffusion, determines which layer of the CLIP text encoder to stop at when processing your prompt. CLIP Skip 1 (default) uses the final layer for maximum prompt fidelity. CLIP Skip 2 uses an earlier layer, often producing more artistic, less literal interpretations. Anime and illustration models often recommend CLIP Skip 2. Technical setting most beginners can ignore.

Used in: Stable Diffusion

ComfyUI

A node-based interface for Stable Diffusion that lets you build complex generation workflows by connecting visual nodes. More powerful than AUTOMATIC1111 for advanced users, but steeper learning curve. Ideal for automating multi-step processes like upscaling + face restoration + style transfer. Preferred by professionals for its flexibility and customization.

Used in: Stable Diffusion (advanced)

Concept Art

Visual representations created during early creative development, typically for games, films, or products. AI tools excel at rapid concept art iteration, letting you explore dozens of design directions in minutes. Prompting for "concept art" typically yields painterly, detailed illustrations with clear subject focus. Popular style descriptor across all image generation tools.

Used in: Game dev, film pre-production, product design

Conditioning

The process of influencing AI generation using additional information beyond the text prompt, such as reference images, masks, depth maps, or control signals. ControlNet is the most common conditioning system. Think of it as giving the AI extra "context clues" about what you want. Powerful for maintaining consistency or precise control over composition.

Used in: Stable Diffusion (ControlNet), advanced generation

ControlNet

An extension for Stable Diffusion that allows precise control over image composition using reference images. You provide a "control image" (edge map, depth map, pose skeleton, etc.) and ControlNet forces the generated image to match that structure while following your text prompt. Game-changer for consistency and complex compositions. Essentially a way to tell the AI "make this, but with this exact layout."

Used in: Stable Diffusion

CLIP

OpenAI's "Contrastive Language-Image Pre-training" model that understands connections between text and images. The "translator" that converts your text prompt into a format the image generation model can understand. CLIP's training on 400 million image-text pairs is why AI models "know" what concepts look like. All major image generators use some form of CLIP or similar text encoding.

Used in: Virtually all text-to-image models

D

DALL-E

OpenAI's text-to-image AI model. DALL-E 3 (current version, available via ChatGPT) excels at following complex prompts and generating text within images. Known for safety guardrails and content policy enforcement. Prefers natural language prompts over keyword lists. Integrated with ChatGPT Plus ($20/mo) or available via API. Generally more "literal" than Midjourney's artistic interpretation.

Category: Image generation

Denoising Strength

In image-to-image generation (img2img), controls how much the AI alters the input image. Value from 0 to 1: 0.3 makes subtle changes (recolor, refine), 0.7 significantly transforms (change style, composition), 0.9+ nearly ignores the input. The key parameter for balancing "keep the original structure" vs "creative reinterpretation." Start around 0.5-0.7 and adjust.

Used in: Stable Diffusion img2img, similar tools
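
One common way implementations realize denoising strength is by skipping the early part of the schedule: the input image is noised partway and only the remaining steps are run. The sketch below shows that arithmetic; the name `img2img_steps` is hypothetical and individual tools may round or schedule differently.

```python
def img2img_steps(num_steps, strength):
    """Return (start_step, steps_to_run) for a given denoising strength.

    Assumption: the tool runs roughly num_steps * strength denoising steps
    and skips the rest, which is how some open-source pipelines behave.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    start_step = num_steps - steps_to_run
    return start_step, steps_to_run

for strength in (0.25, 0.5, 1.0):
    print(strength, img2img_steps(40, strength))
```

This also explains why low strengths finish faster: fewer steps are actually executed.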

Deep Dream

Google's early neural network visualization technique that creates surreal, psychedelic images by amplifying patterns the network detects. Characterized by recursive, fractal-like patterns and lots of "eyes" and "dog faces." Mostly a historical curiosity now, but influenced modern AI art aesthetics. Predates modern diffusion models.

Category: Early AI art (2015)

Diffusion Model

The underlying architecture powering most modern AI image generators. Works by gradually adding noise to images during training, then learning to reverse that process. Generation starts with pure noise and progressively "de-noises" it into a coherent image matching your prompt. Stable Diffusion, Midjourney, and DALL-E all use variants of diffusion models. More technically: DDPM (Denoising Diffusion Probabilistic Models).

Used in: Stable Diffusion, Midjourney, DALL-E, most modern image AI
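
The "gradually adding noise" half of the process can be sketched in a few lines. This is a toy illustration on a list of numbers, assuming a linear noise schedule with the default values from the DDPM paper; the function names are hypothetical, and the learned denoising network (the hard part) is not shown.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative fraction of signal kept at each step (linear beta schedule)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: scale the signal down and mix in Gaussian noise."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
clean = [1.0, -1.0, 0.5]          # stand-in for image values
print(add_noise(clean, 10, alpha_bars, rng))   # still mostly signal
print(add_noise(clean, 999, alpha_bars, rng))  # almost pure noise
```

Training teaches a network to undo these steps; generation then runs that learned reversal starting from pure noise.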

E

Embedding

A custom-trained concept you can inject into Stable Diffusion prompts using a short trigger word. Created through "textual inversion": learning a new token embedding from 5-20 images of a specific subject, style, or person while the base model stays frozen. Lighter weight than LoRA and faster to train, but less versatile. Example: train an embedding on your face, then use it with <myface> in prompts. Popular on Civitai and HuggingFace.

Used in: Stable Diffusion

ElevenLabs

Leading AI voice synthesis and text-to-speech platform. Creates incredibly realistic human-sounding voices from text or clones voices from short audio samples. Popular for audiobooks, video voiceovers, and character voices. Tiered pricing from free (10K characters/mo) to Creator ($22/mo, 100K characters). Controversial due to potential misuse for deepfakes.

Category: Voice/audio generation

Epochs

In AI training, one complete pass through the entire training dataset. More epochs = more learning, but too many causes "overfitting" where the model memorizes training data instead of generalizing. When training custom LoRAs or embeddings, you'll typically run 10-50 epochs. For end users (not training models), this is mostly invisible.

Category: Model training (advanced)
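
Training tools often report progress in steps rather than epochs, so it helps to convert between the two. A minimal sketch, assuming one optimizer step per batch and that a final partial batch still counts as a step; `training_steps` is a hypothetical helper, and some trainers add extra factors such as per-image repeats.

```python
import math

def training_steps(num_images, batch_size, epochs):
    """Total steps = batches per epoch * number of epochs."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    return batches_per_epoch * epochs

# 25 training images, batch size 4, 10 epochs.
print(training_steps(25, 4, 10))
```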

Euler Sampler

A sampling algorithm used during Stable Diffusion image generation. Euler (and Euler A) are fast, stable, and produce good results with fewer steps. Alternative samplers include DPM++ 2M Karras, LMS, DDIM, each with different speed/quality tradeoffs. Most beginners can stick with Euler or DPM++ 2M Karras. Choice of sampler affects style subtly, so experiment to find your preference.

Used in: Stable Diffusion
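
The Euler update itself is short. The sketch below follows the form used by common open-source samplers: estimate a direction from the current sample to the model's denoised guess, then take a straight-line step to the next noise level. The toy denoiser, the noise levels, and the function names are assumptions for illustration; a real sampler calls the diffusion model instead.

```python
def euler_sample(x, sigmas, denoise):
    """Run plain Euler steps over a decreasing list of noise levels.

    sigmas must end at 0.0 and every earlier entry must be positive.
    """
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = denoise(x, sigma)
        # Direction pointing from the denoised estimate toward the sample.
        direction = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Step to the next (lower) noise level.
        x = [xi + di * (sigma_next - sigma) for xi, di in zip(x, direction)]
    return x

TARGET = [0.5, -0.25]

def toy_denoise(x, sigma):
    """Hypothetical stand-in for the model: always predicts the same clean values."""
    return TARGET

print(euler_sample([3.0, -2.0], [10.0, 5.0, 1.0, 0.0], toy_denoise))
```

With a perfect denoiser like this toy one, the sample lands on the target after the last step; with a real model each step only improves the estimate, which is why step count matters.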

F

Fine-tuning

Further training a pre-trained AI model on specialized data to adapt it for specific styles, subjects, or domains. More involved than LoRA or embeddings: essentially creating a custom model variant. Example: fine-tuning Stable Diffusion on 10,000 anime images creates an anime-specialized model. Requires technical knowledge, GPU resources, and significant time. Most users consume fine-tuned models rather than create them.

Category: Model customization (advanced)

Firefly (Adobe)

Adobe's generative AI suite integrated into Creative Cloud. Includes image generation, generative fill (Photoshop), and more. Key differentiator: trained only on Adobe Stock and public domain content, giving it the cleanest commercial license for business use. Quality trails Midjourney but legal clarity is superior. Included with Creative Cloud subscription ($20-55/mo depending on plan).

Category: Image generation (commercial)

Flux

A new open-source image generation model from Black Forest Labs (former Stability AI team) released in 2024. Flux.1 competes with SDXL in quality with better prompt adherence and text rendering. Three versions: Flux.1 [pro] (paid API), Flux.1 [dev] (non-commercial), Flux.1 [schnell] (fast, Apache 2.0 license). Gaining rapid adoption in the open-source community.

Category: Image generation model

G

GAN (Generative Adversarial Network)

An earlier AI architecture where two neural networks compete: a "generator" creates images, a "discriminator" judges if they're real or fake. The generator improves by trying to fool the discriminator. GANs powered pre-2022 AI art (like StyleGAN, BigGAN) but have largely been superseded by diffusion models for image generation. Still used in some specialized applications.

Category: Legacy AI architecture

Guidance Scale

Another term for CFG Scale. See: CFG Scale above. Different tools use different names for the same concept: guidance scale, CFG scale, and prompt strength all refer to how strictly the model follows your prompt.

Used in: Stable Diffusion, various tools

Generation Credits

The "currency" many AI platforms use to meter usage. Example: Midjourney Basic gives 3.3 GPU hours ≈ 200 images. Leonardo gives 150 daily tokens ≈ 30 images. Suno Pro gives 500 credits ≈ 100 songs. Different actions consume different amounts: higher resolution costs more, video costs more than images. Monitor credit usage to avoid unexpected overages or throttling.

Used in: Most subscription AI services
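
To compare plans, it helps to reduce each allowance to a cost per output. The sketch below does that with the approximate figures quoted in this entry; `cost_per_output` is a hypothetical helper, and actual consumption varies with resolution and settings.

```python
def cost_per_output(allowance, outputs):
    """Units of allowance consumed by one output, on average."""
    if outputs <= 0:
        raise ValueError("outputs must be positive")
    return allowance / outputs

# Figures from this entry (approximate).
print("Suno credits per song:", cost_per_output(500, 100))
print("Leonardo tokens per image:", cost_per_output(150, 30))
print("Midjourney GPU minutes per image:", round(cost_per_output(3.3 * 60, 200), 2))
```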

H

Hallucination (AI)

When an AI generates content that seems plausible but is incorrect or fabricated. In text models, this means making up facts. In image models, it manifests as anatomical errors (extra fingers, warped faces), impossible physics, or nonsensical objects. Not technically a "mistake": the model is functioning as designed, but its training data or prompt interpretation led to implausible output. Quality has improved significantly in 2024-2026.

Category: AI behavior pattern

HuggingFace

A platform for sharing, discovering, and deploying open-source AI models. The "GitHub of AI models," home to thousands of Stable Diffusion checkpoints, LoRAs, embeddings, and text models. Also provides APIs and inference hosting. Essential resource for anyone working with open-source AI. Alternative to Civitai (which focuses more on image models specifically).

Category: Model repository platform

Hypernetwork

An older method for customizing Stable Diffusion models, now largely replaced by LoRA. Works by training a small network that modifies the main model's behavior. Hypernetworks were popular in 2022-2023 but have fallen out of favor due to LoRA's superior results and flexibility. Mentioned here for completeness; new users should use LoRA instead.

Used in: Stable Diffusion (legacy)

I

Image-to-Image (img2img)

Using an existing image as a starting point for AI generation rather than starting from noise. The model uses the input image's composition, colors, or structure as guidance while applying your text prompt. Controlled by "denoising strength" parameter. Perfect for variations, style transfer, or refining existing images. Available in most image generation tools.

Used in: All major image generation tools

Inpainting

Selectively editing or replacing parts of an image while keeping the rest unchanged. You "mask" the area to change, provide a new prompt, and the AI regenerates only that region while blending seamlessly with surroundings. Essential tool for fixing mistakes (removing extra fingers), changing elements (swap a red car for blue), or adding objects. DALL-E, Stable Diffusion, and most tools support inpainting.

Used in: All major image tools

Inference

The process of using a trained AI model to generate outputs, as opposed to training the model itself. When you hit "generate" in Midjourney or Stable Diffusion, you're running inference. "Inference time" is how long generation takes. Cloud platforms charge for inference; local setups require a GPU capable of running it. Not to be confused with training (which is far more resource-intensive).

Category: AI operation

Iteration

The process of repeatedly generating, evaluating, and refining outputs to achieve desired results. Central to AI creative workflows: generate → review → adjust prompt → regenerate. Professional AI artists often iterate 10-50+ times per final image. Efficient iteration means systematic prompt refinement rather than random changes. Also: in technical terms, one step in the diffusion denoising process.

Category: Creative process

J

JPEG Artifacts

Visual distortions caused by lossy image compression: blocky patterns, color banding, blurring around edges. AI models can inadvertently generate images with artifact-like patterns, especially at lower quality settings or with poor prompts. Include "high quality, sharp, no artifacts" in prompts to minimize. Also: actual compression artifacts appear when you save/re-save AI outputs at low quality settings.

Category: Image quality issue

K

Kling AI

Chinese AI video generation model by Kuaishou that rivals Runway Gen-3 in quality. Known for excellent motion coherence and cinematic output. Available internationally at $10-30/mo. Prompt syntax and aesthetic differ from Western tools, and output tends toward certain visual styles. Worth exploring if Runway's results don't match your needs. Access sometimes restricted by region.

Category: Video generation

KNN (K-Nearest Neighbor)

A machine learning algorithm sometimes used in AI image tools for similarity search or style matching. Finds the K most similar items to a query. Not central to generation itself, but used in some backend systems for organizing and retrieving training data or style references. Mostly invisible to end users.

Category: Technical (background)
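
The core of KNN fits in a few lines: measure the distance from the query to every item and keep the K closest. The sketch below uses made-up two-number "style vectors" purely for illustration; `k_nearest` is a hypothetical helper, and production systems use much longer vectors and faster index structures.

```python
import math

def k_nearest(query, items, k):
    """Return the names of the k items whose vectors are closest to query."""
    ranked = sorted(items, key=lambda name: math.dist(query, items[name]))
    return ranked[:k]

# Made-up style vectors (illustrative only).
styles = {
    "watercolor": (0.9, 0.1),
    "oil painting": (0.8, 0.3),
    "pixel art": (0.1, 0.9),
}
print(k_nearest((1.0, 0.0), styles, k=2))
```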

L

Latent Diffusion

The specific type of diffusion model used by Stable Diffusion. Instead of working directly in pixel space (slow, expensive), latent diffusion compresses images into a smaller "latent space," runs diffusion there (fast, cheap), then decodes back to pixels. This innovation makes Stable Diffusion runnable on consumer GPUs. The "Stable" in the name comes from Stability AI, the company behind the model's release, not from the latent approach.

Used in: Stable Diffusion
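
The savings from working in latent space are easy to quantify. A minimal sketch, assuming the 8x spatial downsampling and 4 latent channels used by Stable Diffusion 1.5 and SDXL; the helper names are hypothetical.

```python
def latent_shape(width, height, downscale=8, channels=4):
    """Shape of the latent grid for an image of the given pixel size."""
    return width // downscale, height // downscale, channels

def compression_ratio(width, height, downscale=8, channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    pixel_values = width * height * 3  # three color channels
    w, h, c = latent_shape(width, height, downscale, channels)
    return pixel_values / (w * h * c)

print(latent_shape(512, 512))
print(compression_ratio(512, 512))
```

Every denoising step operates on that much smaller grid, which is where most of the speed and memory savings come from.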

Latent Space

A compressed, abstract mathematical representation of data used internally by neural networks. Think of it as a "concept map" where similar concepts are located near each other. Moving through latent space produces smooth transitions between concepts, which is why interpolating between two AI-generated images creates coherent intermediate images. Fundamental to how generative models work, but mostly invisible to users.

Category: Technical concept

LoRA (Low-Rank Adaptation)

A lightweight method for fine-tuning AI models without retraining the entire model. LoRAs add specific capabilities (new art styles, characters, concepts) to base models with small file sizes (10-200MB vs 2-7GB for full models). You can load multiple LoRAs simultaneously. Installed by placing files in a folder and referencing them in prompts with <lora:name:weight>. The dominant customization method for Stable Diffusion in 2024-2026.

Used in: Stable Diffusion
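
The small file sizes follow from the "low-rank" idea: instead of storing a full replacement for a weight matrix, a LoRA stores two thin matrices whose product is added to the original. The sketch below shows the parameter arithmetic and the update on tiny nested lists; the function names and the example layer size are illustrative assumptions, and the `scale` argument plays a role similar to the weight in a prompt tag.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full matrix parameters, LoRA parameters) for one layer."""
    return d_out * d_in, rank * (d_out + d_in)

def apply_lora(weight, down, up, scale=1.0):
    """Return weight + scale * (up @ down) using plain nested lists.

    weight: d_out x d_in, down: rank x d_in, up: d_out x rank.
    """
    rank = len(down)
    return [
        [
            weight[i][j] + scale * sum(up[i][r] * down[r][j] for r in range(rank))
            for j in range(len(weight[0]))
        ]
        for i in range(len(weight))
    ]

print(lora_param_counts(768, 768, 8))  # hypothetical layer size and rank

weight = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]     # rank 1
up = [[1.0], [0.0]]
print(apply_lora(weight, down, up, scale=0.5))
```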

LDRM

Latent Diffusion Resolution Model: a label occasionally applied to variants of latent diffusion models optimized for specific resolutions. It is not a standard term; research papers usually abbreviate latent diffusion models as LDM, and in practice people just say "Stable Diffusion" with the resolution specified separately. Included here in case you encounter it in technical documentation.

Category: Technical (rarely used)

Luma AI

AI company known for 3D capture technology and emerging video generation tools. Luma Dream Machine competes with Runway and Pika for text-to-video. Known for strong motion and camera movement. Pricing and availability have fluctuated, so check current offerings. Worth monitoring as the video generation space evolves rapidly.

Category: Video/3D generation

M

Midjourney

The most popular AI image generator, known for stunning artistic results and strong aesthetic sensibility. Operates through Discord (divisive UX). Three paid tiers: Basic ($10/mo), Standard ($30/mo), Pro ($60/mo). Excels at: fantasy art, concept art, surrealism, cinematic imagery. Weaker at: photorealism (improving), text rendering, precise control. Best for creative exploration and polished artistic output.

Category: Image generation (commercial)

Model (AI)

The trained neural network that performs AI generation. "The model" can refer to: (1) The architecture (e.g., "Stable Diffusion"), (2) A specific checkpoint/version (e.g., "SDXL 1.0"), or (3) A fine-tuned variant (e.g., "Realistic Vision v5"). Models are large files (2-7GB typically) containing billions of learned parameters. Swapping models dramatically changes output style and capability.

Category: Core concept

Multimodal

AI models that can understand and generate multiple types of content (text, images, audio, video). Examples: GPT-4V understands images and text, Runway Gen-3 uses text to generate video, DALL-E uses text to create images. The future of AI is increasingly multimodal: single models handling diverse inputs and outputs seamlessly.

Category: AI capability

MIDI

Musical Instrument Digital Interface: a protocol for encoding music as note events rather than audio waveforms. Some AI music tools (like AIVA) export MIDI files you can edit in DAWs. More flexible than fixed audio: change tempo, instrumentation, individual notes. Other tools (like Suno) only export audio files. MIDI support indicates production-oriented features.

Used in: Music production tools

N

Negative Prompt

Text describing what you DON'T want in your generation. Essential tool for quality control. Common negative prompts: "blurry, low quality, distorted, extra fingers, text, watermark." In Midjourney: --no hands. In DALL-E: "without hands." In Stable Diffusion: separate "negative prompt" field. Most effective when specific: "malformed hands" works better than "bad hands."

Used in: All image generation tools

Noise

Random values (pixel values, or latent values in latent diffusion models) that serve as the starting point for diffusion model generation. Pure static is gradually refined into a coherent image. Controlled by the seed value: same seed = same initial noise = reproducible results (if other parameters match). Understanding noise helps explain why results change from run to run unless the seed is fixed.

Category: Technical concept

Neural Network

The fundamental architecture of modern AI, inspired by biological brains. Consists of interconnected layers of "neurons" that process and transform data. "Deep learning" refers to neural networks with many layers. All AI creative tools (Midjourney, Stable Diffusion, Suno, etc.) are powered by various types of neural networks. Understanding this isn't necessary to use the tools, but explains their "black box" nature.

Category: AI foundation

NSFW Filter

Content moderation system that blocks or flags "Not Safe For Work" (adult/explicit) content. All major commercial AI tools enforce NSFW filters: DALL-E (strict), Midjourney (strict, auto-bans violations), Stable Diffusion (optional: local installs can disable, but most hosted services enforce). Filters also block violence, hate speech, celebrity likeness. Occasionally triggers false positives on innocent prompts.

Category: Content policy

O

Outpainting

Extending an image beyond its original borders by generating new content that seamlessly continues the scene. Opposite of inpainting (which edits within borders). The AI analyzes the existing image and imagines what might exist beyond the frame. Useful for changing aspect ratios, expanding compositions, or creating panoramas. Available in DALL-E, Stable Diffusion, and specialized tools.

Used in: DALL-E, Stable Diffusion, specialized tools

Overfitting

When an AI model trains "too well" on its dataset and memorizes rather than generalizes. Results in poor performance on new inputs. In custom LoRA training, overfitting happens with too many epochs or too few training images. Signs: generated images look almost identical to training data, no variation. Prevented by proper training techniques, regularization, and adequate dataset size. Mostly relevant if you're training models, not using them.

Category: Model training issue

P

Parameters

Settings that control AI generation behavior. In Midjourney: flags like --ar, --stylize, --chaos. In Stable Diffusion: values like steps, CFG scale, sampler, seed. In Suno: genre, vocals, tempo. Understanding parameters is key to consistent, predictable results. Each tool has different parameters; see our comparison tools for syntax across platforms.

Used in: All AI tools

Pika

Budget-friendly AI video generation tool ($8/mo Standard tier). Produces 3-10 second clips from text or image prompts. Quality trails Runway Gen-3 but at 1/4 the price. Good for: social media clips, motion graphics, creative experiments. Improving rapidly. Best value option if video is occasional need rather than primary focus.

Category: Video generation

Prompt

The text description you provide to guide AI generation. The single most important factor in output quality. Good prompts balance specificity with flexibility, include style/mood/technical details, and match the platform's preferred syntax. Prompt engineering is the skill of crafting effective prompts. Different platforms respond to different prompt styles: Midjourney likes poetic descriptors, DALL-E prefers natural language, Stable Diffusion wants keywords.

Category: Core concept

Prompt Engineering

The practice of systematically designing prompts to reliably produce desired AI outputs. Involves understanding: how models interpret language, which descriptors produce which effects, how to structure multi-concept prompts, platform-specific syntax. A learnable skill that dramatically improves results. Professional AI artists spend 80% of their time on prompt engineering, 20% on post-processing.

Category: Skill/discipline

Prompt Weighting

Adjusting the relative importance of different parts of your prompt. In Stable Diffusion: (keyword:1.5) emphasizes, (keyword:0.8) de-emphasizes. In Midjourney: implicit through word order and phrasing (earlier words carry more weight). Mastering weighting lets you fine-tune which elements dominate your composition versus serve as subtle influences.

Used in: Stable Diffusion (explicit), Midjourney (implicit)
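
To make the explicit syntax concrete, the sketch below splits a prompt into (text, weight) pairs, treating anything outside a `(keyword:number)` group as weight 1.0. It handles only that one simple form; the function name `parse_weights` is hypothetical, and real interfaces support more syntax (nesting, bare parentheses, escapes) than this covers.

```python
import re

# Matches "(some text:1.5)" with no nested parentheses.
_WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")

def parse_weights(prompt):
    """Split a prompt into (text, weight) pairs; unweighted text gets 1.0."""
    parts = []
    pos = 0
    for match in _WEIGHTED.finditer(prompt):
        plain = prompt[pos:match.start()].strip(" ,")
        if plain:
            parts.append((plain, 1.0))
        parts.append((match.group(1).strip(), float(match.group(2))))
        pos = match.end()
    tail = prompt[pos:].strip(" ,")
    if tail:
        parts.append((tail, 1.0))
    return parts

print(parse_weights("portrait, (red hair:1.5), (background:0.8), soft light"))
```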

Q

Quality Settings

Parameters controlling output fidelity, detail level, and generation time. Higher quality = better results but slower/costlier. In Midjourney: --quality 2. In Stable Diffusion: higher step counts (30-50), higher resolution. In Suno: "high fidelity" mode. Balance quality vs speed based on use case: high quality for final outputs, lower for rapid iteration.

Used in: All AI tools

R

Random Seed

A number (typically 0-4294967295) that initializes the random number generator, controlling the "randomness" of generation. Same seed + same prompt + same parameters = identical output (reproducible). Essential for: iterating on specific results, creating variations, debugging. In Midjourney/SD: --seed or seed parameter. DALL-E doesn't expose seeds. Pro tip: save seeds of successful generations.

Used in: Midjourney, Stable Diffusion, Suno
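
The reproducibility rule can be demonstrated with any seeded random number generator. The sketch below uses Python's standard `random` module as a stand-in; image tools use their own generators, so a seed from one tool will not reproduce the same noise in another. The helper name `initial_noise` is hypothetical.

```python
import random

def initial_noise(seed, size=4):
    """Generate a short list of Gaussian noise values from a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

print(initial_noise(42) == initial_noise(42))  # same seed, same noise
print(initial_noise(42) == initial_noise(43))  # different seed, different noise
```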

Resolution

The dimensions of generated images in pixels (e.g., 1024×1024, 1920×1080). Higher resolution = more detail but slower generation and more memory. Most models have native training resolution: Stable Diffusion 1.5 (512×512), SDXL (1024×1024), Midjourney v6 (~1456×1456). Generating far from native resolution risks quality degradation. Use upscaling for larger final outputs.

Category: Image parameter

Refiner

In SDXL (Stable Diffusion XL), an optional second model that adds fine details after the base model generates the composition. Used in a two-stage process: base model creates layout/structure, refiner adds texture/detail. Not always necessary; many workflows skip the refiner. Costs extra time/memory. Experiment to see if refiner improves your specific style.

Used in: Stable Diffusion XL

Runway Gen-3

State-of-the-art AI video generation model from Runway ML. Produces 5-10 second clips with exceptional motion coherence and cinematic quality. Three tiers: Standard ($15/mo), Pro ($35/mo), Unlimited ($95/mo). Used by professionals in film, advertising, music videos. Expensive but currently the quality benchmark. Gen-3 Turbo (faster, lower quality) available for previewing.

Category: Video generation

S

Sampling Method

The algorithm used to progressively denoise images during diffusion generation. Different samplers produce subtly different aesthetics and converge at different speeds. Popular options: Euler, Euler A, DPM++ 2M Karras, LMS, DDIM. Most users stick with Euler or DPM++ 2M Karras. Affects both speed (steps needed) and style (subtle). Stable Diffusion specific: Midjourney and DALL-E handle sampling internally.

Used in: Stable Diffusion

Scheduler

Controls how aggressively noise is removed at each step during diffusion generation. Related to sampler but technically distinct. Common schedulers: Karras, Exponential, Normal. Karras is most popular. Scheduler choice affects: speed to convergence, final detail level, tendency toward artifacts. Another "sampler-adjacent" setting in Stable Diffusion that most users set once and forget.

Used in: Stable Diffusion

Seed

See Random Seed above. The two terms are used interchangeably.

Used in: Most AI generation tools

Stable Diffusion

The leading open-source text-to-image AI model, developed by Stability AI. Unlike Midjourney or DALL-E, you can run it locally for free (requires GPU) or use hosted services. Fully customizable: swap models, load LoRAs, adjust every parameter. Steeper learning curve but unmatched flexibility. Versions: SD 1.5 (legacy), SDXL (current), SD3 (recent). Powers Leonardo, Clipdrop, and dozens of services.

Category: Image generation (open-source)

SDXL

Stable Diffusion XL: the current flagship version of Stable Diffusion, released mid-2023. Native 1024×1024 resolution (vs 512×512 in SD 1.5), better prompt understanding, improved image quality. Larger model (7GB vs 2GB) requires more VRAM but produces noticeably better results. Most new models and LoRAs target SDXL. SD 1.5 still used for some specialized workflows.

Used in: Stable Diffusion ecosystem

Stems (Audio)

Individual instrument/vocal tracks that make up a complete song (drums, bass, vocals, etc.). Having stems allows remixing, volume balancing, or using parts separately. Most AI music generators (Suno, Udio) only export mixed stereo audio, not stems. Some offer "stem separation" features to extract drums/vocals after generation. True multitrack stems are rare in AI music generation currently.

Category: Music production

Spectrogram

A visual representation of audio showing frequency content over time. Looks like a heatmap: time on X-axis, frequency on Y-axis, brightness shows intensity. Some AI audio tools use spectrograms internally or display them for analysis. Useful for understanding audio structure but not essential for most users. More relevant for audio engineers than creators.

Category: Audio visualization

Style Transfer

Applying the visual style of one image to the content of another. Example: make a photo look like a Van Gogh painting, or render a portrait in anime style. Core capability of AI image tools. In practice: use img2img with style-focused prompt, or use LoRAs trained on specific art styles. Some tools have dedicated "style transfer" modes. Powerful for consistent aesthetic across multiple images.

Used in: All image generation tools

Suno

Leading AI music generation platform. Creates full songs with vocals, lyrics, and production in any genre in ~2 minutes. Free tier: 10 songs/day. Pro ($10/mo): 500 credits/mo (100 songs). Best for: quick music creation, demos, content soundtracks. Limitations: songs sound "AI-ish" on close listen, no stems export, limited control over structure. Still the most capable AI music tool available in 2026.

Category: Music generation

T

Text-to-Image

The core capability of AI image generators: creating images from text descriptions. Abbreviated txt2img. The "default mode" for Midjourney, DALL-E, Stable Diffusion. Opposite of image-to-image (which starts from an existing image). Text-to-image starts with pure noise, guided only by your prompt. The foundational AI creative capability that kicked off the current AI art revolution.

Category: Core capability

Text-to-Video

Generating video clips from text prompts. Significantly harder than text-to-image due to temporal consistency requirements (motion coherence, no flickering, logical progression). Current leaders: Runway Gen-3, Kling, Pika. Quality improving rapidly but still far from photorealistic human motion. Best current use cases: abstract motion, landscapes, establishing shots, motion graphics.

Category: Video generation

Text-to-Speech (TTS)

AI-generated human-like voice from written text. ElevenLabs is the market leader. Applications: audiobooks, video voiceovers, character voices, accessibility. Quality is now nearly indistinguishable from real voices. Ethical concerns around voice cloning and deepfakes. Most platforms require consent for voice cloning and ban impersonation. Commercial TTS starts at $5-22/mo.

Category: Voice synthesis

Textual Inversion

The training technique used to create embeddings in Stable Diffusion. You provide 5-20 images of a concept, and training learns a new embedding for a trigger word that stands for that concept, while the base model's weights stay unchanged. Lighter-weight than LoRA training but less versatile. Useful for: specific faces, unique objects, niche art styles. Results saved as small embedding files (.pt or .safetensors). Mostly superseded by LoRA for new use cases.

Used in: Stable Diffusion

Token

The basic unit of text that AI models process. Roughly: 1 token ≈ 0.75 words. Models have token limits: the original GPT-4 handled 8K-32K tokens (newer models accept far more), and CLIP (for image prompts) takes about 75 tokens. Extremely long prompts get truncated. Also: some services call generation credits "tokens" (confusing dual meaning). When discussing prompts, token count matters for very long descriptions.

Category: Technical unit
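
The rule of thumb above turns into a quick back-of-the-envelope check. This sketch assumes the 0.75 words-per-token ratio and the roughly 75-token CLIP limit quoted in the definition. The function names are hypothetical, and real tokenizers split text differently, so treat the result as an estimate only.

```python
import math


def estimate_tokens(text, words_per_token=0.75):
    """Estimate token count from word count using the 0.75 rule of thumb."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


def likely_truncated(text, token_limit=75):
    """Guess whether a prompt exceeds a limit (75 assumed, per the CLIP figure above)."""
    return estimate_tokens(text) > token_limit


short_prompt = "a red fox in snow"
long_prompt = " ".join(["detailed"] * 60)  # a 60-word prompt

short_estimate = estimate_tokens(short_prompt)  # 5 words
long_estimate = estimate_tokens(long_prompt)    # 60 words
```

A 60-word prompt already lands around 80 estimated tokens, which is why the tail end of a long image prompt may be ignored.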

Training Data

The images, text, or audio used to teach an AI model. Stable Diffusion trained on billions of image-text pairs from the internet (LAION dataset). Midjourney's training data is proprietary. Training data determines: what the model "knows," potential biases, licensing questions. Major controversy: artists unhappy about their work in training sets without permission. Commercial tools increasingly use licensed-only data (Adobe Firefly).

Category: Model foundation

U

Upscaling

Increasing image resolution while adding detail and sharpness (not just stretching pixels). AI upscalers (ESRGAN, Real-ESRGAN, Topaz Gigapixel) analyze and enhance. In Midjourney: click U1-U4 buttons for 2-4x upscale. In Stable Diffusion: use separate upscaler models or built-in options. Essential for print-ready outputs. Some upscaling is "creative" (adds invented detail), some is "conservative" (preserves exactly).

Used in: All image tools + dedicated upscaling AI
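
For contrast, this is what "just stretching pixels" looks like in code. The sketch below is a plain nearest-neighbor resize on a tiny grayscale grid. The `stretch_nearest` helper and the 2x2 image are made up for illustration. Note that the output contains only pixel values copied from the input, whereas an AI upscaler predicts new in-between detail.

```python
def stretch_nearest(pixels, factor):
    """Enlarge a 2D grid of pixel values by repeating each pixel factor x factor times."""
    enlarged = []
    for row in pixels:
        wide_row = [value for value in row for _ in range(factor)]
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


# A hypothetical 2x2 grayscale image (0 = black, 255 = white).
image = [
    [0, 255],
    [255, 0],
]

big = stretch_nearest(image, 2)

# No new values appear: the result is larger but carries no extra detail.
values_before = {value for row in image for value in row}
values_after = {value for row in big for value in row}
```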

Udio

Suno's main competitor in AI music generation. Similar capabilities: full songs with vocals in any genre. Udio offers slightly better structural control and slightly worse vocal quality (subjective). Pricing matches Suno: ~$10/mo. Worth trying both to see which aesthetic you prefer. Some genres work better in one vs the other. Both improving rapidly.

Category: Music generation

V

VAE (Variational Autoencoder)

The component of Stable Diffusion that compresses images into latent space (encoder) and decodes them back to pixels (decoder). Different VAEs affect color saturation, sharpness, and overall aesthetic. Most users stick with the model's default VAE. Custom VAEs available for specific effects (more vibrant colors, etc.). Technical component most beginners don't need to worry about.

Used in: Stable Diffusion (technical component)
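
To give a feel for how much the encoder compresses, here is some rough arithmetic. It assumes the layout used by Stable Diffusion 1.x-style VAEs, where each side shrinks by a factor of 8 and the latent has 4 channels. Those two numbers are not stated in the definition above and other model families differ, so treat them as assumptions. The helper names are hypothetical.

```python
def latent_shape(width, height, downscale=8, channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    The defaults assume an SD 1.x-style VAE; other models may use other values.
    """
    if width % downscale or height % downscale:
        raise ValueError("image sides should be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(width, height, downscale=8, channels=4):
    """Compare the count of RGB pixel values with the count of latent values."""
    c, h, w = latent_shape(width, height, downscale, channels)
    return (width * height * 3) / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
```

Under these assumptions the model works on about 48 times fewer numbers than the final image contains, which helps explain why diffusion in latent space is practical on consumer GPUs.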

Variation

Creating alternate versions of a generated output while maintaining core elements. In Midjourney: V1-V4 buttons create variations of one image from the grid. In Stable Diffusion: use the same seed with a slightly changed prompt, or use the "variation seed" and "variation strength" settings. Essential for iterative refinement: generate, pick best, create variations, repeat until perfect.

Used in: All AI tools (creative workflow)
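
Why does reusing a seed keep the core of an image? Because the seed fixes the starting noise. The toy sketch below shows the idea with Python's own random generator: the same seed always yields the same noise, and a "strength" value blends in noise from a second seed. This is a simplified illustration, not how any particular tool implements it. Real tools work on large noise tensors and may blend them differently (for example with spherical interpolation), and the function names here are hypothetical.

```python
import random


def starting_noise(seed, size=4):
    """Generate reproducible Gaussian noise from a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def blend_noise(base, other, strength):
    """Linearly mix two noise lists: 0.0 keeps the base, 1.0 uses the other."""
    return [(1 - strength) * b + strength * o for b, o in zip(base, other)]


base = starting_noise(42)
same_again = starting_noise(42)       # identical to base
different = starting_noise(43)        # a different starting point

subtle = blend_noise(base, different, 0.1)     # stays close to the original
unchanged = blend_noise(base, different, 0.0)  # exactly the original noise
```

Low strength values nudge the starting point slightly, which is the intuition behind getting "the same image, but a little different."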

Vector

Graphics defined by mathematical paths rather than pixels, allowing infinite scaling without quality loss. Most AI generators produce raster (pixel) images, not vectors. Some tools claim "AI vector generation" but actually output rasters. For true vectors: generate raster AI image, then trace in Illustrator or use specialized tools. Important for logos, icons, print graphics needing extreme scaling.

Category: Graphics format

W

Weighting

See Prompt Weighting above.

Used in: Stable Diffusion, implicit in other tools

Workflow

A systematic process for AI creative work, typically involving: ideation → prompt drafting → batch generation → selection → refinement → post-processing → export. Professional AI creators develop efficient workflows to produce consistent quality at scale. Also: in ComfyUI, "workflow" refers to saved node-based pipelines. Developing good workflows dramatically improves output quality and speed.

Category: Creative process

X, Y, Z

X/Y/Z Plot

A Stable Diffusion feature that generates grids of images by systematically varying up to three parameters. Example: X-axis varies CFG scale (5, 7, 10, 15), Y-axis varies steps (20, 30, 50), generating a 4×3 grid showing all combinations. Invaluable for testing which parameter combinations work best for your prompt. Sometimes called a "parameter grid." The related "prompt matrix" script is a separate feature that varies parts of the prompt rather than settings. Helps find optimal settings efficiently.

Used in: Stable Diffusion (testing/optimization)
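
The grid in the example is simply every combination of the two value lists. A minimal sketch of that bookkeeping follows, using the CFG and step values from the definition. It only builds the list of settings and does not call any image generator. The dictionary keys and the `build_grid` name are hypothetical.

```python
from itertools import product

cfg_scales = [5, 7, 10, 15]   # X-axis values from the example above
step_counts = [20, 30, 50]    # Y-axis values from the example above


def build_grid(x_values, y_values):
    """Return rows (one per Y value) of settings, one cell per X value."""
    return [[{"cfg_scale": x, "steps": y} for x in x_values] for y in y_values]


grid = build_grid(cfg_scales, step_counts)

# The same combinations as a flat job list, in row-by-row order.
jobs = list(product(step_counts, cfg_scales))

columns = len(grid[0])
rows = len(grid)
```

Each cell would be rendered with the same prompt and seed, so any visible difference between cells comes from the two settings being tested.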


How to Use This Glossary Effectively

Bookmark this page and return often: the AI creative space evolves rapidly, and terminology shifts as new tools emerge. When you encounter an unfamiliar term in a tutorial, community discussion, or tool documentation, search here first before Googling (which often returns outdated or context-less definitions). If you're new to AI creation, start by reading the definitions for: AI Art, Prompt, Prompt Engineering, Model, CFG Scale, Seed, and the specific tools you're using (Midjourney, Stable Diffusion, Suno, etc.).

Cross-reference with our guides: Each glossary entry links to related concepts and full articles where relevant. After understanding a term's definition, dive deeper with our comprehensive guides: Prompt Anatomy Guide, Midjourney Prompts Guide, Stable Diffusion Prompts, and AI Music Prompts Guide. Understanding terminology accelerates learning, but practical experimentation solidifies knowledge, so read the definition, then immediately test it in your tool of choice.

