Table of Contents
- 1. Overview
- 2. System Architecture
- 3. Roles of Each Component
- 4. How the Label Emerges
- 5. Fusion Mechanisms
- 6. Tensor-Level View
- 7. Example
- 8. Key Takeaways
- 9. Tensor-Level Anatomy of Multimodal Fusion
- Visuals

⸻
1. Overview
Modern "multimodal" LLMs such as GPT-4o, Gemini 2, and Claude 3 Opus are often described as seeing or understanding images.
In reality, the LLM never looks at pixels.
It receives dense visual embeddings produced by a separate Vision Transformer (ViT) encoder.
The reasoning and interpretation (the actual "understanding") happen entirely inside the LLM.
⸻
2. System Architecture
Image (RGB pixels)
  ↓
Patchify (e.g., 16×16 patches)
  ↓
Vision Transformer (ViT / CLIP-style encoder)
  ↓
Visual Tokens [v₁, v₂, …, v_N]
  ↓
Projection Layer (dimension alignment)
  ↓
Language Transformer (LLM backbone)
  ↓
Text Output (caption, classification, reasoning)
⸻
3. Roles of Each Component
- Vision Transformer (ViT): sensory encoder. Turns pixels into feature embeddings; carries no language semantics.
- Projection layer: bridge. Maps visual embeddings into the LLM's token space.
- Language Transformer (LLM): reasoning engine. Attends over visual and text tokens and produces the output.
⸻
4. How the "Label" Emerges
The ViT does not send a label such as "cat" or "dog."
Instead, it sends a set of learned embeddings that geometrically encode cat-ness: shapes, edges, fur texture, posture.
The LLM then interprets those embeddings through cross-attention and generates the word "cat" when prompted.
During multimodal training, models learn these mappings from paired datasets:
Input: [image embeddings] + "Describe the image."
Target: "A brown cat wearing glasses."
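The pairing above can be sketched as next-token prediction with the loss restricted to the caption. This is a minimal illustrative mock-up (the `<img_i>` placeholder names and tiny token counts are assumptions, not any model's real format):

```python
# Illustrative sketch of a multimodal training pair framed as next-token
# prediction. Real models use ~1000 visual tokens and subword vocabularies.
N_VISUAL = 4                                   # toy count of visual tokens
prompt = ["Describe", "the", "image", "."]
target = ["A", "brown", "cat", "wearing", "glasses", "."]

# Full sequence: visual tokens first, then prompt, then target caption.
sequence = [f"<img_{i}>" for i in range(N_VISUAL)] + prompt + target
labels = sequence[1:] + ["<eos>"]              # each position predicts the NEXT token
n_ctx = N_VISUAL + len(prompt)                 # context positions carry no loss
mask = [i >= n_ctx - 1 for i in range(len(labels))]  # supervise caption + <eos> only
supervised = [lab for lab, m in zip(labels, mask) if m]
```

Only the `supervised` positions contribute to the cross-entropy loss; the visual tokens and prompt are context, not targets.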
⸻
5. Fusion Mechanisms
Two common designs appear in practice:
- Token concatenation (early fusion): projected visual tokens are prepended to the text tokens, and the LLM's ordinary self-attention mixes both modalities.
- Cross-attention: text queries attend to visual keys and values in dedicated layers, keeping visual tokens out of the main sequence.
The pipeline traced in this document uses concatenation.
⸻
6. Tensor-Level View
At runtime:
- ViT output → sequence of embeddings: V ∈ ℝ^{N×D_v}
- Projection layer → map to text dimension: P(V) ∈ ℝ^{N×D_t}
- Token concatenation: [ P(v₁)…P(v_N) t₁…t_M ]
- LLM attention operates jointly: Queries (Q), Keys (K), and Values (V) are built over both image and text tokens, and cross-modal reasoning emerges naturally.
- Output logits → textual response (caption, classification, or reasoning chain)
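The projection and concatenation steps can be traced shape-by-shape. A pure-Python sketch with toy dimensions (real models use roughly N≈1024, D_v≈1024, D_t≈4096; the `matmul` helper is a stand-in for a real linear layer):

```python
import random

random.seed(0)
N, M = 6, 3          # visual tokens, text tokens (toy sizes)
D_v, D_t = 8, 12     # ViT dim, LLM dim (toy sizes)

def matmul(A, B):
    """(n, k) x (k, m) -> (n, m) nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

V = [[random.random() for _ in range(D_v)] for _ in range(N)]        # ViT output
W_proj = [[random.random() for _ in range(D_t)] for _ in range(D_v)] # projection
PV = matmul(V, W_proj)                                               # (N, D_t)
T = [[random.random() for _ in range(D_t)] for _ in range(M)]        # text embeds
fused = PV + T                                                       # (N+M, D_t)
```

After projection, every visual token has the same width as a text token, so concatenation yields one homogeneous sequence of length N + M.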
⸻
7. Example
Prompt:
"What's unusual about this image?"
ViT encodes patches → [fur pattern] [eyes] [object: glasses] [pose] [background] …
LLM cross-attends to those embeddings and concludes:
"Cats don't usually wear glasses; that's unusual."
The reasoning and judgment occur in the language model, not the vision model.
⸻
8. Key Takeaways
🧠 LLM = reasoning engine, not sensor.
🖼️ Vision Transformer = sensory encoder, not interpreter.
⚙️ Projection layer = bridge between modalities.
🎯 Training objective teaches the LLM to map visual geometry to linguistic meaning.
⸻
TL;DR
Multimodal LLMs don't "see"; they think about what's been seen.
The vision encoder translates pixels into a token language, and the LLM performs all higher-order cognition on that internal code. The ViT has no idea what it's looking at in human terms; it just compresses visual structure into embeddings, dense geometric coordinates in feature space.
It's the LLM that translates that geometry into language and reasoning.
So when a multimodal model says "a cat sitting on a couch," here's what actually happened:
Vision Transformer:
Encodes the image into ~1,000 visual tokens, each representing patch-level structure (edges, textures, color gradients).
It has never been told "this is a cat"; it only knows patterns.
Projection layer:
Aligns the ViT's embedding space with the LLM's token space: same dimensionality, compatible distributions.
LLM (text transformer):
Performs attention over both text and visual embeddings, infers relationships ("round ears + fur + tail = cat"), then produces the natural-language description.
So the LLM is really doing "vision-language reasoning."
The ViT is a preprocessor, not a thinker; the LLM is the cognitive core.
⸻
9. Tensor-Level Anatomy of Multimodal Fusion
Let's trace a single RGB image as it moves through a modern multimodal LLM pipeline, from raw pixels to natural-language reasoning.
⸻
🔹 Step 1: Input Image
Image tensor: (H, W, 3)
Example: (512, 512, 3)
Each pixel has 3 channels (R, G, B).
⸻
🔹 Step 2: Patchify
The image is divided into small fixed-size patches.
Patch size: 16×16
Number of patches: N = (H/16) × (W/16) = 32 × 32 = 1024
Each patch is flattened → a 16×16×3 = 768-D vector.
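The patch arithmetic above is worth checking by hand; a few lines make it concrete:

```python
# Patchify arithmetic for a 512x512 RGB image with 16x16 patches.
H, W, C = 512, 512, 3        # image height, width, channels
P = 16                       # patch side length
n_patches = (H // P) * (W // P)   # patches per image: 32 * 32
patch_dim = P * P * C             # flattened patch vector length
```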
⸻
🔹 Step 3: Vision Transformer (ViT)
Each flattened patch is first linearly embedded into the ViT's hidden dimension D_v, then passes through multiple transformer blocks:
Input (flattened patches): (N, 768) = (1024, 768)
Output (after patch embedding + transformer blocks): (N, D_v) = (1024, 1024)
The ViT learns rich spatial features (edges, texture, color, geometry) but no language semantics.
Result: a sequence of visual embeddings
V ∈ ℝ^{N×D_v} = (1024, 1024)
⸻
🔹 Step 4: Projection Layer
We map the visual embeddings into the text-token dimension D_t.
Linear projection: W_proj ∈ ℝ^{D_v×D_t}
P(V) = V × W_proj
Output: (N, D_t) = (1024, 4096)
Now the visual vectors live in the same space as the LLM's word tokens.
⸻
🔹 Step 5: Token Fusion
The visual tokens are concatenated before the text tokens:
[ P(v₁)…P(v₁₀₂₄) t₁…t_M ]
Sequence length: L = N + M
Typical values: N ≈ 1024, M ≈ 512–2048.
⸻
🔹 Step 6: Joint Attention in the LLM
Within the transformer blocks:
Q, K, V ∈ ℝ^{L×d_t} (e.g., d_t = 4096)
Attention weights: softmax(QKᵀ / √d_t)
The LLM now attends across both modalities:
- from words → visual tokens (cross-modal attention)
- within visual tokens (self-attention)
- within words (language context)
This is where semantic grounding happens.
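The softmax(QKᵀ / √d_t) step above can be sketched directly. A minimal pure-Python scaled dot-product attention over a tiny fused sequence (dimensions and values are toy, single-head, no learned weight matrices):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: lists of L vectors of dimension d."""
    d = len(Q[0])
    out = []
    for q in Q:
        # one row of the (L, L) attention matrix: q against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(d)])
    return out

# tiny demo: 2 "visual" + 2 "text" tokens in one fused sequence, d = 3
X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
out = attention(X, X, X)
```

Because visual and text tokens sit in the same sequence, the same loop covers cross-modal and within-modality attention; no special-casing is needed.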
⸻
🔹 Step 7: Feed-Forward Reasoning
Each transformer block integrates visual + textual context:
Residual + MLP → LayerNorm → Next Block
Through dozens of blocks, the model builds an internal conceptual map:
"Fur texture + eye shape + whisker geometry → cat."
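The "Residual + MLP → LayerNorm" pattern can be sketched for a single token vector. This toy version uses ReLU and omits the attention sublayer; real blocks typically use GELU and place the norm differently depending on the architecture:

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def mlp(x, W1, W2):
    """Two-layer feed-forward net with ReLU (toy stand-in for GELU)."""
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)))
         for j in range(len(W1[0]))]
    return [sum(hi * W2[i][j] for i, hi in enumerate(h))
            for j in range(len(W2[0]))]

def block(x, W1, W2):
    """Residual + MLP, then LayerNorm, as in Step 7 (attention omitted)."""
    return layer_norm([xi + yi for xi, yi in zip(x, mlp(x, W1, W2))])

random.seed(0)
d = 4
W1 = [[random.uniform(-1, 1) for _ in range(4 * d)] for _ in range(d)]
W2 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4 * d)]
y = block([0.5, -0.2, 0.1, 0.3], W1, W2)
```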
⸻
🔹 Step 8: Output Projection
Finally, logits are produced over the vocabulary:
LLM hidden state: (L, D_t)
→ Linear head: W_vocab ∈ ℝ^{D_t×V} (V = vocab size)
→ Softmax → Text output
Example output:
"A brown cat wearing glasses."
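The final softmax-and-pick step looks like this in miniature. The four-word vocabulary and the logit values are made up for illustration; real heads project to tens of thousands of subword tokens:

```python
import math

# Toy vocabulary and toy logits (hidden state x W_vocab for one position).
vocab = ["cat", "dog", "glasses", "couch"]
logits = [2.1, 0.3, 1.4, -0.5]

exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]                      # softmax over vocab
next_token = vocab[max(range(len(probs)), key=probs.__getitem__)]  # greedy pick
```

Greedy argmax is the simplest decoding rule; samplers (temperature, top-p) replace the `max` with a weighted draw from `probs`.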
⸻
⚙️ Shape Summary
| Stage | Shape | Meaning |
|-------|-------|---------|
| Raw image | (512, 512, 3) | Pixels |
| Patchify | (1024, 768) | 16×16 RGB patches |
| ViT output | (1024, 1024) | Visual embeddings |
| Project to text dim | (1024, 4096) | Aligned tokens |
| Fused sequence | (1024+M, 4096) | Joint context |
| Output logits | (V) | Vocabulary prediction |
⸻
🧩 Takeaway
ViT encodes structure → LLM interprets meaning.
Every image becomes a language of embeddings, and the transformer learns to speak that language.
Once fused, text and vision are no longer separate; both live as tokens in the same high-dimensional reasoning space.
Visuals {##Visuals}
visual-one {##Visual-one}

VISUAL 1: Full Multimodal Pipeline Diagram
This is the canonical diagram every serious multimodal doc needs.
┌────────────────────────────┐
│         Image (RGB)        │
└──────────────┬─────────────┘
               ↓
        Patchify (16×16)
               ↓
┌──────────────┴─────────────┐
│  Vision Transformer (ViT)  │
│  Spatial → Feature Embeds  │
└──────────────┬─────────────┘
               ↓  V ∈ ℝ^{N×D_v}
        Linear Projection
               ↓  P(V) ∈ ℝ^{N×D_t}
┌──────────────┴─────────────┐
│    Fused Token Sequence    │
│   [P(v₁)…P(v_N) t₁…t_M]    │
└──────────────┬─────────────┘
               ↓
      Language Transformer
               ↓
┌──────────────┴─────────────┐
│      Reasoning + Text      │
└──────────────┬─────────────┘
               ↓
         Output Logits
               ↓
"A brown cat wearing glasses."
Where to place it:
Directly after Section 2: System Architecture.
visual-two {##Visual-two}
⸻
VISUAL 2: Tensor Shape Evolution Table
A crisp table lets the reader see the pipeline in one shot.

| Stage | Shape | Meaning |
|-------|-------|---------|
| Raw image | (512, 512, 3) | Pixels |
| Patchify | (1024, 768) | 16×16 RGB patches |
| ViT output | (1024, 1024) | Visual embeddings |
| Project to text dim | (1024, 4096) | Aligned tokens |
| Text tokens | (M, 4096) | Normal LLM embeddings |
| Fused sequence | (1024+M, 4096) | Joint context |
| Attention matrices | (L, L) | Full multimodal attention |
| LLM hidden state | (L, 4096) | Mixed-mode reasoning |
| Output logits | (V) | Vocabulary prediction |

Drop this under Section 9.
⸻
VISUAL 3: Cross-Attention Map
This visual explains exactly how the model "looks" at the image while generating text.
        Text Token Query "cat"
                 │
                 ▼
┌─────────────────────────────────┐
│        Attention Matrix         │
│      Q(text) · K(visual)ᵀ       │
└─────────────────────────────────┘
    ▲       ▲       ▲         ▲
    │       │       │         │
  [p₁]    [p₂]    [p₃]   …  [p_N]
Patch embeddings from ViT
Strong activations → ears, fur, tail
Weak activations → background, couch
This makes your page look like a real multimodal interpretability document.
Place this right after "How the Label Emerges."
⸻
VISUAL 4: "LLM Reasoning Stack"
This shows the conceptual climb from geometry → semantics.
┌──────────────────────────────┐
│       Natural Language       │
│   ("A brown cat wearing…")   │
└──────────────▲───────────────┘
               │
┌──────────────┴───────────────┐
│      Semantic Concepts       │
│  cat, glasses, object roles  │
└──────────────▲───────────────┘
               │
┌──────────────┴───────────────┐
│     Multimodal Attention     │
│  links visual + text tokens  │
└──────────────▲───────────────┘
               │
┌──────────────┴───────────────┐
│   ViT Geometry + Patterns    │
│ fur, edges, colors, posture  │
└──────────────▲───────────────┘
               │
┌──────────────┴───────────────┐
│       Raw Pixels (RGB)       │
└──────────────────────────────┘
This makes your article visually explain the cognitive ladder inside multimodal models.
⸻
VISUAL 5: "What the ViT Actually Sees vs. What the LLM Thinks"
Very powerful for readers:
Vision Encoder (ViT) Output
┌────────────────────────────────────────┐
│ Patch embeddings:                      │
│  • edge orientations                   │
│  • fur texture gradients               │
│  • local color histograms              │
│  • blob and contour activations        │
└────────────────────────────────────────┘
❌ No concepts
❌ No objects
❌ No language
❌ No "cat"
▼ Passed into LLM as vectors ▼
Language Model Interpretation
┌────────────────────────────────────────┐
│ "round ears + whisker pattern +        │
│  bilateral symmetry → cat"             │
│                                        │
│ "fur + eyes + object = glasses"        │
└────────────────────────────────────────┘
✅ Concepts
✅ Objects
✅ Relationships
✅ Explanations
This pair nails the entire philosophy of multimodal alignment.