Multimodal Search

Multimodal search is a way of searching that combines inputs in several formats (text, images, voice, and video) into a single query. Unlike traditional search, which relies on keyword matching, it compares inputs of different formats by meaning to find an answer.

  • Multimodal search combines several input formats into one query — for example, snapping a photo and asking a question in text at the same time.
  • Google launched multisearch, which pairs text with images, as a U.S. English beta in April 2022, building on the multimodal AI work it introduced with MUM.
  • The core idea is to compare photos, voice, and text in a shared embedding space by meaning rather than by format.
  • In the era of AI search, an image's alt text, captions, structured data, and surrounding context become the signals a model actually reads.
  • When image resolution or OCR legibility is poor, a model can misread the visual tokens and hallucinate as a result.

Multimodal search interprets inputs in different formats (text, images, voice, video) together as a single query to understand intent and find an answer. Take a dress you like: photograph it, add the word "green," and the search engine interprets both inputs together to surface the same design in green. Traditional search requires you to type the right keywords; multimodal search lets you find things that are hard to put into words simply by showing or speaking them.

The technical foundation for this shift is MUM (Multitask Unified Model), which Google announced in May 2021. MUM was trained across 75 languages and designed to understand information in multiple formats — text, images, video — all at once. At its Search On event in September 2021, Google demonstrated multimodal search by combining MUM with Google Lens, and on April 7, 2022 it launched multisearch as a beta for U.S. English users. In the Google app, you take a photo with Lens and then append text via "+ Add to your search," merging a visual query and a text query into one.

What Makes It Different From Traditional Search

The biggest differences are the input format and the matching method. Traditional search checks whether text keywords match; multimodal search converts inputs of differing formats into units of meaning and compares those.

| Dimension | Traditional (Keyword) Search | Multimodal Search |
| --- | --- | --- |
| Input format | Text keywords | A combination of text, image, voice, and video |
| Matching method | Keyword and string matching | Meaning-based comparison in a shared embedding space |
| Example query | "Green dress" | A photo of a dress + the text "green" |
| Best suited for | When you can describe the target precisely in words | When you can show or speak a target that is hard to describe in words |
| Representative technology | Inverted index, ranking algorithms | Multimodal AI models such as MUM, multimodal embeddings |
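
To make "meaning-based comparison in a shared embedding space" more concrete, here is a minimal sketch in Python. It is a toy illustration, not how Google's systems work: the three-dimensional vectors, the catalog, and the `combine` and `search` helpers are all hypothetical, and averaging the image and text vectors is just one simple assumption about how two inputs might be merged into one query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1.0 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combine(image_vec, text_vec):
    """Merge an image embedding and a text embedding into one query vector.
    Plain averaging is an assumption made for illustration only."""
    return [(i + t) / 2 for i, t in zip(image_vec, text_vec)]

def search(query_vec, items):
    """Return item names ranked by similarity to the query, best match first."""
    return sorted(
        items,
        key=lambda name: cosine_similarity(query_vec, items[name]),
        reverse=True,
    )

# Hypothetical embeddings with made-up axes: [dress-ness, green-ness, red-ness].
# A real model would produce vectors with hundreds of learned dimensions.
photo_of_red_dress = [0.9, 0.0, 0.8]   # stands in for an encoded photo
text_green = [0.0, 1.0, 0.0]           # stands in for the encoded word "green"

catalog = {
    "red dress": [0.9, 0.0, 0.8],
    "green dress": [0.9, 0.8, 0.0],
    "green sofa": [0.0, 0.9, 0.0],
}

image_only_ranking = search(photo_of_red_dress, catalog)
multimodal_ranking = search(combine(photo_of_red_dress, text_green), catalog)
```

With the photo alone, the closest item in this toy catalog is the red dress; once the text vector for "green" is merged in, the green dress ranks first. No keyword is matched at any point, only distances between vectors.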

Real Examples and Evidence

The examples Google offered in its official blog and Search On presentations make the uses of multimodal search concrete: finding a dress you like in a different color; photographing a dining table and adding "coffee table" to find furniture that goes with it; snapping an unfamiliar plant to learn how to care for it. In the Search On 2021 demo, Google also showed taking a photo of an unknown bike part and asking "how do I fix this?" — the system matched that image to the exact moment in a video and surfaced the repair method. Another demo photographed the pattern on a shirt to find socks with the same pattern. Google later added a "multisearch near me" feature that finds a product you see through your camera at nearby stores.

As AI search becomes routine, multimodal search is expanding into a question of how you optimize image, video, and voice assets. In a December 2025 article, Search Engine Land's Myriam Jessier explains that when AI systems convert an image into "visual tokens," low resolution can cause those tokens to be misread and lead to hallucination. According to the same piece, for text inside an image to be read by OCR, the character height should be at least 30 pixels and the contrast between text and background should be roughly 40 grayscale values or more; heavily stylized fonts also interfere with OCR. She also stresses that alt text is more than accessibility text: it acts as a "semantic signpost" that helps a model pin down the meaning of ambiguous visual tokens.
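
The two OCR thresholds quoted above can be turned into a simple pre-publication check. The sketch below is an illustration only: the `TextRegion` type and the `legibility_issues` function are hypothetical, the measurements are assumed to have been taken by hand or with a separate tool, and treating "grayscale values" as levels on a 0 to 255 scale is an assumption about what the article means.

```python
from dataclasses import dataclass

# Thresholds as quoted from the cited article.
MIN_CHAR_HEIGHT_PX = 30
MIN_CONTRAST_GRAY = 40  # assumed to mean levels on a 0-255 grayscale

@dataclass
class TextRegion:
    """Measurements for one block of text rendered inside an image."""
    char_height_px: int
    text_gray: int        # average gray level of the glyphs, 0-255
    background_gray: int  # average gray level behind the glyphs, 0-255

def legibility_issues(region):
    """Return a list of human-readable problems; an empty list means none found."""
    issues = []
    if region.char_height_px < MIN_CHAR_HEIGHT_PX:
        issues.append(
            f"character height {region.char_height_px}px is below {MIN_CHAR_HEIGHT_PX}px"
        )
    contrast = abs(region.text_gray - region.background_gray)
    if contrast < MIN_CONTRAST_GRAY:
        issues.append(
            f"contrast {contrast} is below {MIN_CONTRAST_GRAY} gray levels"
        )
    return issues

# Example: dark, large text on a light background versus small gray-on-gray text.
clear_label = TextRegion(char_height_px=32, text_gray=20, background_gray=230)
faint_label = TextRegion(char_height_px=18, text_gray=120, background_gray=140)
```

Here `legibility_issues(clear_label)` returns an empty list, while `faint_label` is flagged on both counts. The check does not cover stylized fonts, which the article also names as a problem for OCR.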

Execution Checklist

  • Write alt text for every key image that specifically captures the subject and context (decorative images are the exception).
  • Use sufficiently high image resolution, and for text inside images aim for a character height of at least 30 pixels and a contrast of roughly 40 grayscale values or more.
  • Place descriptive captions and body context around images and videos to add text signals the model can read.
  • Apply structured data (schema) to products and images to make format and relationship information explicit.
  • Add captions, timestamps, and descriptions of key scenes to videos so specific moments can be matched.
  • Use original images to establish source credibility, and avoid duplicate or stolen images.
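
Two of the checklist items, alt text and structured data, lend themselves to a quick script. The following sketch uses only Python's standard library; the helper names `images_missing_alt` and `product_jsonld`, the sample markup, and the example.com URL are hypothetical. The JSON-LD uses the schema.org `Product` and `ImageObject` types, but which properties a given search engine actually reads is not something this example can confirm.

```python
import json
from html.parser import HTMLParser

class _AltAudit(HTMLParser):
    """Collects the src of every <img> tag that has no alt attribute value."""
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # alt="" is left alone: it is the usual way to mark a decorative image.
        if attr.get("alt") is None:
            self.missing.append(attr.get("src", "(no src)"))

def images_missing_alt(html):
    """Return the src values of images that lack alt text."""
    audit = _AltAudit()
    audit.feed(html)
    return audit.missing

def product_jsonld(name, image_url, caption, color):
    """Build JSON-LD that ties a product to its image and a descriptive caption."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "color": color,
        "image": {
            "@type": "ImageObject",
            "contentUrl": image_url,
            "caption": caption,
        },
    }
    return json.dumps(data, indent=2)

# Hypothetical page fragment: one image without alt, one described, one decorative.
sample_html = """
<img src="dress.jpg">
<img src="dress-green.jpg" alt="Green wrap dress with long sleeves, front view">
<img src="divider.png" alt="">
"""

missing = images_missing_alt(sample_html)
markup = product_jsonld(
    name="Wrap dress",
    image_url="https://example.com/dress-green.jpg",
    caption="Green wrap dress with long sleeves, front view",
    color="Green",
)
```

In this sample only `dress.jpg` is reported, since the decorative divider carries an intentionally empty alt attribute. The string in `markup` would typically be placed inside a `<script type="application/ld+json">` element on the product page.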

References