AI Video Reference Image Checklist: How to Get Reusable Results

GemiOmni Team, May 18, 2026

Reference images can make AI video generation dramatically more controllable, but only when they are prepared like production assets. A messy reference set forces the model to guess. A clean reference set tells the model what to preserve, what to animate, and what to ignore.

This checklist is for product marketers, creators, and teams building repeatable image-to-video workflows.

The five-reference rule

Before uploading anything, label each reference with one of five roles:

Identity: the person, character, mascot, or product that must remain recognizable.
Geometry: shape, silhouette, packaging, layout, or room structure.
Material: fabric, glass, metal, skin texture, food surface, or lighting texture.
Environment: location, background, weather, time of day.
Motion: a pose, frame, or previous clip that suggests movement.

If a reference has no role, remove it. More references do not automatically create more control.
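
The labelling rule can be enforced mechanically before anything is uploaded. The following is a minimal Python sketch, not tied to any particular video API; the `Role` enum, the `Reference` class, and the file names are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    """The five roles from the checklist above."""
    IDENTITY = "identity"
    GEOMETRY = "geometry"
    MATERIAL = "material"
    ENVIRONMENT = "environment"
    MOTION = "motion"


@dataclass(frozen=True)
class Reference:
    path: str                    # hypothetical local file name
    role: Optional[Role] = None  # None means no role has been assigned


def keep_labelled(references: list[Reference]) -> list[Reference]:
    """Drop every reference that has no assigned role."""
    return [ref for ref in references if ref.role is not None]


candidates = [
    Reference("portrait.png", Role.IDENTITY),
    Reference("studio.png", Role.ENVIRONMENT),
    Reference("moodboard.png"),  # no role, so it is removed
]
upload_set = keep_labelled(candidates)
```

Making the role a required decision per file keeps "just in case" images out of the upload set.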

Clean input beats clever prompting

Use reference images that are:

High enough in resolution to show the detail you care about.
Not heavily filtered unless the filter is the style target.
Free of watermarks, UI overlays, and random text.
Cropped around the important subject.
Consistent in lighting when identity or product accuracy matters.

If the product label is tiny in the uploaded photo, do not expect the model to preserve it. Upload a clean packshot and tell the model which details matter.
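
One way to catch a too-small label before uploading is to measure how many pixels the important detail actually occupies. The sketch below assumes you already have a bounding box for that detail (for example, from a manual crop). The function name and the 256-pixel threshold are illustrative assumptions, not a documented requirement of any model.

```python
def detail_is_legible(box: tuple[int, int, int, int], min_side_px: int = 256) -> bool:
    """Return True if the detail's shorter side is at least min_side_px.

    box is (left, top, right, bottom) in pixels.
    min_side_px is an assumed threshold; tune it to your own results.
    """
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    return min(width, height) >= min_side_px


# Label covering 120 x 80 px inside a wide lifestyle photo.
lifestyle_label = (900, 400, 1020, 480)
# The same label covering 1200 x 800 px in a tight packshot.
packshot_label = (200, 300, 1400, 1100)

lifestyle_ok = detail_is_legible(lifestyle_label)
packshot_ok = detail_is_legible(packshot_label)
```

With these example boxes the lifestyle photo fails the check and the packshot passes, which matches the advice to upload the packshot.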

Prompt each reference explicitly

Bad:

Use these references to make a cool fashion video.

Better:

Use reference 1 for the model's face and outfit. Use reference 2 for the studio lighting and gray background. Use reference 3 only for the handbag shape and leather texture. Create an 8-second slow push-in with subtle fabric movement. Do not change the face, outfit color, or handbag proportions.
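
Teams that write many prompts of this shape may find it easier to assemble them from parts so that no reference is left without a job. This is a small sketch of that idea; `build_prompt` and its parameters are hypothetical, and the wording it produces is only one possible phrasing.

```python
def build_prompt(jobs: list[str], shot: str, locked: list[str]) -> str:
    """Compose a prompt that gives every reference an explicit job.

    jobs:   one description per reference, in upload order
    shot:   the camera and motion instruction
    locked: details the model must not change
    """
    if not jobs:
        raise ValueError("assign at least one reference a job")
    # References are numbered from 1 in the order they are listed.
    parts = [f"Use reference {i} for {job}." for i, job in enumerate(jobs, start=1)]
    parts.append(shot)
    if locked:
        parts.append("Do not change " + ", ".join(locked) + ".")
    return " ".join(parts)


prompt = build_prompt(
    jobs=[
        "the model's face and outfit",
        "the studio lighting and gray background",
        "the handbag shape and leather texture only",
    ],
    shot="Create an 8-second slow push-in with subtle fabric movement.",
    locked=["the face", "outfit color", "handbag proportions"],
)
```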

Preserve successful inputs

A good reference workflow depends on more than upload quality. It also needs persistence: when a generation works, save the full setup.

Field	Why it matters
Prompt	Captures the creative instruction.
Model and mode	Text-to-video and image-to-video behave differently.
Aspect ratio	Vertical and landscape shots compose differently.
Duration	Motion pacing changes with length.
Resolution	Affects finishing quality and credit cost.
Sound setting	Determines whether audio must be directed.
Reference URLs	Lets the team regenerate or iterate later.
Output URLs	Keeps the generated asset available after temporary links expire.

If these inputs are stored, history becomes a production tool instead of a gallery. A teammate can click an old generation, recover the original prompt and references, adjust one variable, and generate a controlled variation.
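
As a rough illustration, the fields in the table can be stored as one record, serialized to JSON, and reloaded later to produce a variation that changes a single setting. Everything here is a sketch under assumptions: `GenerationRecord`, `load`, and `changed_fields` are hypothetical names, and the model name and URLs are placeholders rather than real endpoints.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class GenerationRecord:
    prompt: str
    model: str            # placeholder name, not a real model id
    mode: str             # "text-to-video" or "image-to-video"
    aspect_ratio: str
    duration_s: int
    resolution: str
    sound: bool
    reference_urls: tuple[str, ...] = ()
    output_urls: tuple[str, ...] = ()


def load(text: str) -> GenerationRecord:
    """Rebuild a record from JSON, restoring tuples that JSON stored as lists."""
    data = json.loads(text)
    data["reference_urls"] = tuple(data["reference_urls"])
    data["output_urls"] = tuple(data["output_urls"])
    return GenerationRecord(**data)


def changed_fields(old: GenerationRecord, new: GenerationRecord) -> list[str]:
    """List the field names whose values differ between two records."""
    a, b = asdict(old), asdict(new)
    return [name for name in a if a[name] != b[name]]


original = GenerationRecord(
    prompt="Slow push-in on the handbag with subtle fabric movement.",
    model="example-video-model",
    mode="image-to-video",
    aspect_ratio="16:9",
    duration_s=8,
    resolution="720p",
    sound=False,
    reference_urls=("https://example.com/refs/handbag.png",),
    output_urls=("https://example.com/out/clip-001.mp4",),
)

saved = json.dumps(asdict(original))           # what gets written to history
restored = load(saved)                         # what a teammate recovers later
variant = replace(restored, aspect_ratio="9:16")  # adjust exactly one variable
```

Checking `changed_fields(original, variant)` before generating is a cheap way to confirm that a "controlled variation" really differs in only one setting.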

A repeatable workflow

Use this operating rhythm:

1. Upload only the references that have a clear role.
2. Write a prompt that assigns each reference a job.
3. Generate the first clip at the cheapest acceptable setting.
4. Fix composition before fixing detail.
5. Save the working setup before increasing resolution.
6. Reuse the same references for variants instead of re-uploading different crops.
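
To check that a variant really reuses the same reference files rather than a fresh crop, one option is to compare content fingerprints. The sketch below uses SHA-256 from Python's standard library; `fingerprint` and `same_references` are hypothetical helper names, and the byte strings stand in for real image file contents.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a stable SHA-256 hex digest for a file's contents."""
    return hashlib.sha256(data).hexdigest()


def same_references(first: list[bytes], second: list[bytes]) -> bool:
    """True if both runs used byte-identical references in the same order."""
    return [fingerprint(d) for d in first] == [fingerprint(d) for d in second]


# Stand-ins for image bytes read from disk.
portrait = b"portrait-image-bytes"
portrait_recropped = b"portrait-image-bytes-cropped"

reused = same_references([portrait], [portrait])
recropped = same_references([portrait], [portrait_recropped])
```

A byte-level comparison is strict on purpose: even a slightly different crop counts as a different reference.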

Common failure modes

Failure	Likely cause	Fix
Face changes between shots	Identity reference is unclear or mixed with style references	Use one clean portrait and say "preserve identity."
Product shape changes	Prompt asks for motion that deforms the product	Add "keep proportions unchanged" and reduce action.
Scene looks generic	Environment reference is weak	Add a location reference and describe time of day.
Audio feels random	Sound was not directed	Name ambience, foley, music, and dialogue separately.
Re-run cannot match old result	Inputs were not saved	Store prompt, settings, references, and output URLs.
