Table of Contents

⸻

1. Overview

Modern "multimodal" LLMs such as GPT-4o, Gemini 2, and Claude 3 Opus are often described as seeing or understanding images.

In reality, the LLM never looks at pixels.

It receives dense visual embeddings produced by a separate vision transformer encoder (ViT).

The reasoning and interpretation, the actual "understanding," happen entirely inside the LLM.

⸻

LLM Image Recognition

2. System Architecture

Image (RGB pixels)

↓

Patchify (e.g., 16×16 patches)

↓

Vision Transformer (ViT / CLIP-style encoder)

↓

Visual Tokens [v₁, v₂, …, vₙ]

↓

Projection Layer (dimension alignment)

↓

Language Transformer (LLM backbone)

↓

Text Output (caption, classification, reasoning)

⸻

3. Roles of Each Component

| Component | Function | Analogy |
|-----------|----------|---------|
| Vision Transformer (ViT) | Encodes raw pixels into high-dimensional vectors; captures spatial and texture patterns | Retina / early visual cortex |
| Projection Layer | Maps visual embeddings into the same vector space as text tokens | Optic nerve |
| Language Transformer (LLM) | Performs reasoning, abstraction, explanation, and dialogue | Prefrontal cortex |

⸻

4. How the "Label" Emerges

The ViT does not send a label such as "cat" or "dog."

Instead, it sends a set of learned embeddings that geometrically encode "cat-ness": shapes, edges, fur texture, posture.

The LLM then interprets those embeddings through attention over the visual tokens and generates the word "cat" when prompted.

During multimodal training, models learn these mappings through paired datasets:

Input: [image embeddings] + "Describe the image."

Target: "A brown cat wearing glasses."
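A toy numpy sketch of that objective: cross-entropy is computed only over the target caption tokens. The five-word vocabulary and the random logits here are made-up stand-ins for a real tokenizer and a real model's predictions.

```python
import numpy as np

# Hypothetical 5-word vocabulary; a real tokenizer has tens of thousands of entries.
vocab = {"a": 0, "brown": 1, "cat": 2, "wearing": 3, "glasses": 4}
target = ["a", "brown", "cat", "wearing", "glasses"]

# Stand-in for the model's per-position vocabulary logits.
logits = np.random.default_rng(0).standard_normal((len(target), len(vocab)))

# Log-softmax, then average negative log-likelihood of the target tokens.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean([log_probs[i, vocab[w]] for i, w in enumerate(target)])
print(round(float(loss), 3))  # training pushes this toward 0 on the paired data
```

Minimizing this loss over millions of image–caption pairs is what ties "cat-shaped geometry" in the embeddings to the token "cat."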

⸻

5. Fusion Mechanisms

| Mode | Description | Used By (examples) |
|------|-------------|--------------------|
| Feature Projection (Concat) | Project visual tokens into the LLM embedding space, concatenate before the text | LLaVA, Kosmos-2; widely assumed for GPT-4o and Gemini (architectures not public) |
| Cross-Attention Fusion | The LLM has vision-specific attention layers that attend to visual tokens directly | Flamingo, Llama 3.2 Vision |
| Joint Token Space (Unified Encoder) | Vision and text share the same transformer blocks | Fuyu, Chameleon |

⸻

6. Tensor-Level View

At runtime:

  1. ViT output → sequence of embeddings V ∈ ℝ^(N×d_v)

  2. Projection layer → map to text dimension P(V) ∈ ℝ^(N×d_t)

  3. Token concatenation:

[ P(v₁) … P(vₙ) t₁ … tₘ ]

  4. LLM attention operates jointly:
  • Queries (Q), Keys (K), Values (V) built over both image and text tokens

  • Cross-modal reasoning emerges naturally

  5. Output logits → textual response (caption, classification, or reasoning chain)
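The first three steps can be sketched end to end in numpy. The dimensions are the illustrative values used later in Section 9, and the random matrices stand in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_v, d_t, M = 1024, 1024, 4096, 512   # visual tokens, ViT dim, LLM dim, text tokens

V = rng.standard_normal((N, d_v))                 # 1. ViT output
W_proj = rng.standard_normal((d_v, d_t)) / np.sqrt(d_v)
P_V = V @ W_proj                                  # 2. projection into text space
T = rng.standard_normal((M, d_t))                 # embedded text tokens t_1..t_M
seq = np.concatenate([P_V, T], axis=0)            # 3. fused sequence fed to the LLM

print(P_V.shape, seq.shape)  # (1024, 4096) (1536, 4096)
```

From step 4 onward the fused sequence is processed like any other token sequence; the LLM's attention layers make no hard distinction between image-derived and text-derived positions.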

⸻

7. Example

Prompt:

"What's unusual about this image?"

ViT encodes patches → [fur pattern] [eyes] [object: glasses] [pose] [background] …

LLM cross-attends to those embeddings and concludes:

"Cats don't usually wear glasses, so that's unusual."

The reasoning and judgment occur in the language model, not the vision model.

⸻

8. Key Takeaways

  • 🧠 LLM = reasoning engine, not sensor.

  • πŸ–ΌοΈ Vision Transformer = sensory encoder, not interpreter.

  • βš™οΈ Projection layer = bridge between modalities.

  • πŸ”„ Training objective teaches the LLM to map from visual geometry to linguistic meaning.

⸻

TL;DR

Multimodal LLMs don't "see"; they think about what's been seen.

The vision encoder translates pixels into a token language, and the LLM performs all higher-order cognition on that internal code. The ViT has no idea what it's looking at in human terms; it's just compressing visual structure into embeddings, dense geometric coordinates in feature space.

It's the LLM that translates that geometry into language and reasoning.

So when a multimodal model says "a cat sitting on a couch," here's what actually happened:

Vision Transformer:

Encodes the image into ~1,000 visual tokens, each representing patch-level structure (edges, textures, color gradients).

It has never been told "this is a cat"; it only knows patterns.

Projection layer:

Aligns the ViT's embedding space with the LLM's token space: same dimensionality, compatible distributions.

⸻

LLM (text transformer): Performs attention over both text and visual embeddings, infers relationships ("round ears + fur + tail = cat"), then produces the natural-language description.

So the LLM is really doing "vision-language reasoning."

The ViT is a preprocessor, not a thinker; the LLM is the cognitive core.

⸻

9. Tensor-Level Anatomy of Multimodal Fusion

Let's trace a single RGB image as it moves through a modern multimodal LLM pipeline, from raw pixels to natural-language reasoning.

⸻

🔹 Step 1 — Input Image

Image tensor: (H, W, 3)

Example: (512, 512, 3)

Each pixel has 3 channels (R, G, B).

⸻

🔹 Step 2 — Patchify

The image is divided into small fixed-size patches.

Patch size: 16×16

Number of patches: N = (H/16) × (W/16) = 32 × 32 = 1024

Each patch is flattened → 16×16×3 = 768-D vector.
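Patchify is pure reshaping, so it can be sketched in a few lines of numpy (a real ViT would follow this with a learned patch-embedding layer):

```python
import numpy as np

img = np.zeros((512, 512, 3))   # stand-in RGB image (H, W, C)
p = 16                          # patch size
H, W, C = img.shape

# Cut the image into an (H/p) x (W/p) grid of p x p patches, flatten each one.
patches = (img.reshape(H // p, p, W // p, p, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, p * p * C))

print(patches.shape)  # (1024, 768)
```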

⸻

🔹 Step 3 — Vision Transformer (ViT)

Each flattened patch is linearly embedded, then passes through multiple transformer blocks:

Input: flattened patches (N, 768), embedded to the ViT width D_v

Output: (N, D_v) = (1024, 1024)

The ViT learns rich spatial features (edges, texture, color, geometry) but no language semantics.

Result: a sequence of visual embeddings

V ∈ ℝ^(N×D_v) = (1024, 1024)

⸻

🔹 Step 4 — Projection Layer

We map the visual embeddings into the text-token dimension D_t.

Linear projection: W_proj ∈ ℝ^(D_v×D_t)

P(V) = V × W_proj

Output: (N, D_t) = (1024, 4096)

Now the visual vectors live in the same space as the LLM's word tokens.
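The single linear map above is the simplest projector; some open models (LLaVA-1.5, for example) use a small two-layer MLP instead. A sketch of that variant, with random stand-in weights and only 64 of the 1024 visual tokens to keep it light:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_v, d_t = 1024, 4096
W1 = rng.standard_normal((d_v, d_t)) / np.sqrt(d_v)
W2 = rng.standard_normal((d_t, d_t)) / np.sqrt(d_t)

def project(V):
    """MLP projector: Linear -> GELU -> Linear, mapping d_v to d_t."""
    return gelu(V @ W1) @ W2

V = rng.standard_normal((64, d_v))    # 64 stand-in visual tokens
print(project(V).shape)  # (64, 4096)
```

Either way, the projector is tiny compared to the two transformers it connects; its only job is dimension and distribution alignment.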

⸻

🔹 Step 5 — Token Fusion

The visual tokens are concatenated before the text tokens.

[ P(v₁) … P(v₁₀₂₄) t₁ … tₘ ]

Sequence length L = N + M

Typical values: N ≈ 1024, M ≈ 512–2048.

⸻

🔹 Step 6 — Joint Attention in the LLM

Within the transformer blocks:

Q, K, V ∈ ℝ^(L×D_t) (e.g., D_t = 4096)

Attention weights: softmax(QKᵀ / √D_t)

The LLM now attends across both modalities:

  • attends from words → visual tokens (cross-attention)

  • attends within visual tokens (self-attention)

  • attends within words (language context)

This is where semantic grounding happens.
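A single-head numpy sketch of that joint attention (dimensions scaled down; production models split this into many heads and scale by the per-head dimension rather than D_t):

```python
import numpy as np

def joint_attention(x, rng):
    """Self-attention over the fused [visual; text] sequence: one (L, L)
    weight matrix covers image->image, text->image, and text->text lookups."""
    L, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)                          # (L, L)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                     # softmax over each row
    return w @ V, w

rng = np.random.default_rng(0)
N, M, d = 16, 8, 32                 # scaled-down token counts and width
fused = rng.standard_normal((N + M, d))
out, w = joint_attention(fused, rng)
print(out.shape, w.shape)  # (24, 32) (24, 24)
```

Rows of `w` corresponding to text positions show how strongly each word attends to each image patch; that is exactly what cross-attention visualizations plot.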

⸻

🔹 Step 7 — Feed-Forward Reasoning

Each transformer block integrates visual + textual context:

Residual + MLP → LayerNorm → Next Block

Through dozens of blocks, the model builds an internal conceptual map:

"Fur texture + eye shape + whisker geometry → cat."
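A minimal numpy version of one such block, following the post-norm flow shown above (real blocks also contain the attention sub-layer, and modern LLMs typically use pre-norm and GELU/SwiGLU MLPs; this is a sketch, not any specific model):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn_block(x, W1, W2):
    """Residual + MLP -> LayerNorm, matching the flow above."""
    h = np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP stand-in
    return layer_norm(x + h)

rng = np.random.default_rng(0)
d = 64                                  # scaled-down hidden width
x = rng.standard_normal((24, d))        # 24 fused tokens
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

print(ffn_block(x, W1, W2).shape)  # (24, 64)
```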

⸻

🔹 Step 8 — Output Projection

Finally, logits are produced for the vocabulary:

LLM hidden state: (L, D_t)

→ Linear head W_vocab ∈ ℝ^(D_t×V) (V = vocab size)

→ Softmax → Text output

Example Output:

"A brown cat wearing glasses."
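The output head is one more matrix multiply plus a softmax. A numpy sketch, with D_t and the vocabulary size scaled down from 4096 and ~32k (illustrative values, not any specific model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, vocab_size = 512, 1000              # scaled-down stand-ins
h_last = rng.standard_normal(d_t)        # final hidden state at the last position
W_vocab = rng.standard_normal((d_t, vocab_size)) / np.sqrt(d_t)

logits = h_last @ W_vocab                # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax
next_token_id = int(probs.argmax())      # greedy decoding picks the top token

print(logits.shape)  # (1000,)
```

Repeating this step token by token, feeding each choice back in, is what produces the full caption.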

⸻

⚙️ Shape Summary

| Stage | Tensor Shape | Description |
|-------|--------------|-------------|
| Raw image | (512, 512, 3) | RGB pixels |
| Patches | (1024, 768) | Flattened 16×16 regions |
| ViT output | (1024, 1024) | Visual embeddings |
| Projected visual tokens | (1024, 4096) | Aligned to LLM dimension |
| Text tokens | (M, 4096) | Natural-language embeddings |
| Fused sequence | (N + M, 4096) | Multimodal context |
| Output logits | (V) | Vocabulary predictions |

⸻

🧩 Takeaway

  • ViT encodes structure → LLM interprets meaning.

  • Every image becomes a language of embeddings, and the transformer learns to speak that language.

  • Once fused, text and vision are no longer separate; both live as tokens in the same high-dimensional reasoning space.

Visuals {#visuals}

visual-one

✅

Vision LLM

VISUAL 1 — Full-Multimodal Pipeline Diagram

This is the canonical diagram every serious multimodal doc needs.

     ┌─────────────────────────┐
     │       Image (RGB)       │
     └────────────┬────────────┘
                  │
          Patchify (16×16)
                  │
     ┌────────────▼────────────┐
     │ Vision Transformer (ViT)│
     │ Spatial → Feature Embeds│
     └────────────┬────────────┘
                  │ V ∈ ℝ^(N×d_v)
          Linear Projection
                  │ P(V) ∈ ℝ^(N×d_t)
     ┌────────────▼────────────┐
     │  Fused Token Sequence   │
     │  [P(v₁)…P(vₙ) t₁…tₘ]    │
     └────────────┬────────────┘
                  │
        Language Transformer
                  │
     ┌────────────▼────────────┐
     │    Reasoning + Text     │
     └────────────┬────────────┘
                  │
            Output Logits
                  │
     "A brown cat wearing glasses."

Where to place it:

Directly after Section 2: System Architecture.

visual-two {#visual-two}

⸻

✅

VISUAL 2 — Tensor Shape Evolution Table

A crisp table lets the reader see the pipeline in one shot.

Stage                 Shape            Meaning
────────────────────────────────────────────────────────────
Raw image             (512, 512, 3)    Pixels
Patchify              (1024, 768)      16×16 RGB patches
ViT output            (1024, 1024)     Visual embeddings
Project to text dim   (1024, 4096)     Aligned tokens
Text tokens           (M, 4096)        Normal LLM embeddings
Fused sequence        (1024+M, 4096)   Joint context
Attention matrices    (L, L)           Full multimodal attention
LLM hidden state      (L, 4096)        Mixed-mode reasoning
Output logits         (V)              Vocabulary prediction

Drop this under Section 9.

⸻

✅

VISUAL 3 — Cross-Attention Map

This visual explains exactly how the model "looks" at the image while generating text.

 Text Token Query "cat"
          │
          ▼
┌─────────────────────────────────┐
│        Attention Matrix         │
│      Q(text) · K(visual)ᵀ       │
└─────────────────────────────────┘
   ▲       ▲       ▲       ▲
   │       │       │       │
 [p₁]    [p₂]    [p₃]  …  [pₙ]

Patch embeddings from ViT

Strong activations → ears, fur, tail

Weak activations → background, couch

This makes your page look like a real multimodal interpretability document.

Place this right after "How the Label Emerges."

⸻

✅

VISUAL 4 — "LLM Reasoning Stack"

This shows the conceptual climb from geometry → semantics.

┌──────────────────────────────┐
│       Natural Language       │
│  ("A brown cat wearing…")    │
└──────────────────────────────┘
               ▲
               │
┌──────────────────────────────┐
│      Semantic Concepts       │
│  cat, glasses, object roles  │
└──────────────────────────────┘
               ▲
               │
┌──────────────────────────────┐
│     Multimodal Attention     │
│  links visual + text tokens  │
└──────────────────────────────┘
               ▲
               │
┌──────────────────────────────┐
│   ViT Geometry + Patterns    │
│ fur, edges, colors, posture  │
└──────────────────────────────┘
               ▲
               │
┌──────────────────────────────┐
│       Raw Pixels (RGB)       │
└──────────────────────────────┘

This makes your article visually explain the cognitive ladder inside multimodal models.

⸻

✅

VISUAL 5 — "What the ViT Actually Sees vs What the LLM Thinks"

Very powerful for readers:

Vision Encoder (ViT) Output

┌────────────────────────────────────┐
│ Patch embeddings:                  │
│  • edge orientations               │
│  • fur texture gradients           │
│  • local color histograms          │
│  • blob and contour activations    │
└────────────────────────────────────┘
     ✘ No concepts
     ✘ No objects
     ✘ No language
     ✘ No "cat"

▼ Passed into LLM as vectors ▼

Language Model Interpretation

┌────────────────────────────────────┐
│ "round ears + whisker pattern +    │
│  bilateral symmetry → cat"         │
│                                    │
│ "fur + eyes + object = glasses"    │
└────────────────────────────────────┘
     ✔ Concepts
     ✔ Objects
     ✔ Relationships
     ✔ Explanations

This pair nails the entire philosophy of multimodal alignment.