Camera Models & Projections

Understanding how 3D world coordinates map to 2D image pixels - the foundation of computer vision

Pinhole Camera Geometry

Coordinate System Transformations

[Diagram: the world frame (X_w, Y_w, Z_w), an arbitrary global reference, maps through the extrinsics [R | t] to the camera frame (X_c, Y_c, Z_c), whose origin is at the optical center with Z pointing forward, and then through the intrinsics K to 2D pixel coordinates (u, v) on the image plane.]
The Complete Pipeline:
World Frame → Extrinsics [R|t] → Camera Frame → Intrinsics K → Image Plane

This transforms a 3D point $(X_w, Y_w, Z_w)$ to 2D pixel coordinates $(u, v)$

What is Camera Projection?

The Big Picture
Simple idea: Camera projection is how we go from a 3D world to a 2D photo. Just like your shadow on the ground is a 2D "projection" of your 3D body, an image is a 2D projection of the 3D scene.

The pinhole camera model is the mathematical way to describe this: light rays pass through a single point (optical center) and hit the image sensor. Understanding this is the foundation for 3D reconstruction, AR, robot vision, and self-driving cars.

Visual Analogy: The Pinhole Camera

[Diagram: a real 3D scene containing a sun ☀️, a tree 🌲 far away (Z=10m), a house 🏠 at medium distance (Z=5m), and a person 🚶 close up (Z=2m), shown beside the photo produced by its 2D projection.]
🎯 What Happened?
3D → 2D transformation loses information:
  • Depth is lost! We can't tell whether the tree is 10m or 100m away from the photo alone
  • Size becomes relative: the close person (🚶) looks bigger than the far tree (🌲) even though the tree is actually much taller
  • Perspective effect: Objects farther away appear smaller (that's the "division by Z")
  • All rays converge: Light from all objects passes through one point (camera center)
💡 The math question: If we know camera parameters (K, R, t), can we predict where a 3D point (X,Y,Z) will appear in the image (u,v)? YES! That's what the projection equation tells us.

Algorithm Overview

Goal
Given a 3D point in the world and camera parameters, calculate exactly which pixel it will appear at in the image. This lets us: render 3D graphics, understand what the camera sees, reconstruct 3D from images, track objects, and much more.
Input
• 3D point in world coordinates: \(\mathbf{X}_w = [X, Y, Z]^T\)
• Camera intrinsic parameters: \(K\) matrix
• Camera extrinsic parameters: Rotation \(R\), Translation \(\mathbf{t}\)
Output
• 2D pixel coordinates: \(\mathbf{x} = [u, v]^T\)
• Homogeneous image coordinates: \([u, v, 1]^T\)
• Projection transformation

Pinhole Camera Model

Complete Projection Equation
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K [R | \mathbf{t}] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
Complete pipeline: World → Camera → Image
• \(s\) - scale factor (depth); equals \(Z_c\), the point's depth in the camera frame
• \([u, v]^T\) - pixel coordinates in image
• \(K\) - intrinsic matrix (camera internals)
• \([R | \mathbf{t}]\) - extrinsic matrix (camera pose)
• \([X, Y, Z, 1]^T\) - 3D point in homogeneous coordinates
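
As a concrete sketch of this equation, here is a minimal NumPy example; the values of K, R, t, and the world point are made up for illustration, not from a real camera:

```python
import numpy as np

# Illustrative parameters (not from a real camera)
K = np.array([[800.0,   0.0, 320.0],   # fx,  0, cx
              [  0.0, 800.0, 240.0],   #  0, fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera aligned with the world axes
t = np.zeros((3, 1))                   # camera at the world origin

X_w = np.array([[0.5], [0.2], [4.0], [1.0]])   # homogeneous world point

P = K @ np.hstack([R, t])              # 3x4 projection matrix K [R | t]
x = P @ X_w                            # [s*u, s*v, s]^T with s = Z_c
u, v = x[:2, 0] / x[2, 0]              # perspective divide
print(u, v)                            # 420.0 280.0
```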
Intrinsic Matrix (K)
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
Intrinsic parameters (internal camera properties):
• \(f_x\) - focal length in x direction (pixels)
• \(f_y\) - focal length in y direction (pixels)
• \(c_x\) - principal point x-coordinate (image center x)
• \(c_y\) - principal point y-coordinate (image center y)

Physical meaning:
• Focal length relates to zoom: larger \(f\) = more zoom
• Principal point is where optical axis intersects image plane
• Usually \(c_x \approx \text{width}/2\), \(c_y \approx \text{height}/2\)
• If pixels are square: \(f_x = f_y\)
Extrinsic Matrix [R | t]
$$[R | \mathbf{t}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}$$
Extrinsic parameters (camera pose in world):
• \(R \in \mathbb{R}^{3 \times 3}\) - rotation matrix (3×3, orthogonal)
• \(\mathbf{t} \in \mathbb{R}^3\) - translation vector (3×1)
• Together define camera position and orientation

Transformation:
\(\mathbf{X}_c = R \mathbf{X}_w + \mathbf{t}\)
Converts world coordinates to camera coordinates

Degrees of Freedom: 6 DoF (3 rotation + 3 translation)
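
A small sketch showing that \(\mathbf{X}_c = R \mathbf{X}_w + \mathbf{t}\) and the homogeneous [R | t] form agree; the 90° rotation and 1 m translation below are arbitrary illustrative values:

```python
import numpy as np

# Hypothetical pose: 90° rotation about the Z axis, then a 1 m shift along Z
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 1.0])

X_w = np.array([1.0, 2.0, 3.0])

X_c   = R @ X_w + t                                        # direct form
X_c_h = np.hstack([R, t[:, None]]) @ np.append(X_w, 1.0)   # [R | t] on the homogeneous point
print(X_c, X_c_h)                                          # both: [-2.  1.  4.]
```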
Simplified Projection (Camera Coordinates)
$$u = f_x \frac{X_c}{Z_c} + c_x$$ $$v = f_y \frac{Y_c}{Z_c} + c_y$$
Perspective projection (camera frame):
• \((X_c, Y_c, Z_c)\) - 3D point in camera coordinates
• Division by \(Z_c\) creates perspective effect
• Closer objects (\(Z_c\) small) → larger in image
• Farther objects (\(Z_c\) large) → smaller in image

Key insight: Depth is lost in projection! Multiple 3D points can project to same 2D pixel.
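
The two equations above in a few lines of Python, showing how the division by \(Z_c\) shrinks the same 1 m lateral offset as depth grows (the assumed intrinsics fx = fy = 800 px, cx = 320, cy = 240 are illustrative):

```python
# Perspective projection in the camera frame: u = fx*Xc/Zc + cx, v = fy*Yc/Zc + cy
def project(Xc, Yc, Zc, fx=800.0, fy=800.0, cx=320.0, cy=240.0):
    return fx * Xc / Zc + cx, fy * Yc / Zc + cy

for Zc in (2.0, 5.0, 10.0):            # same 1 m offset at increasing depth
    print(Zc, project(1.0, 0.0, Zc))   # 2.0 -> (720, 240), 5.0 -> (480, 240), 10.0 -> (400, 240)
```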

Homogeneous Coordinates

Why Homogeneous Coordinates?
Homogeneous coordinates allow us to represent projective transformations (including perspective) as linear operations (matrix multiplications). This is crucial for computer vision!
Euclidean ↔ Homogeneous Conversion
$$\text{Euclidean to Homogeneous: } [X, Y] \rightarrow [X, Y, 1]$$ $$\text{Homogeneous to Euclidean: } [X, Y, W] \rightarrow \left[\frac{X}{W}, \frac{Y}{W}\right]$$
Key properties:
• Add extra dimension with value 1
• Scale invariance: \([X, Y, W]\) ≡ \([kX, kY, kW]\) for any \(k \neq 0\)
• All represent the same Euclidean point

Examples:
• \([2, 3, 1]\) ≡ \([4, 6, 2]\) ≡ \([6, 9, 3]\) → all represent point (2, 3)
• \([4, 6, 2]\) → Euclidean: \((4/2, 6/2) = (2, 3)\)
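
A tiny pair of helper functions for this conversion (a sketch; the function names are ours, not a standard API):

```python
import numpy as np

def to_homogeneous(p):
    """[X, Y] -> [X, Y, 1]: append a 1."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(p):
    """[X, Y, W] -> [X/W, Y/W]; W == 0 would be a point at infinity (a direction)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(p[-1], 0.0):
        raise ValueError("point at infinity: no Euclidean equivalent")
    return p[:-1] / p[-1]

print(to_homogeneous([2, 3]))          # [2. 3. 1.]
print(from_homogeneous([4, 6, 2]))     # [2. 3.]  (same point as [2, 3, 1])
```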
Points at Infinity
$$\text{Point at infinity: } [X, Y, 0]$$ $$\text{Direction vector: } (X, Y)$$
Special case when W = 0:
• Cannot convert back to Euclidean (division by zero)
• Represents a direction, not a point
• Useful for representing parallel lines and vanishing points
• Example: \([1, 0, 0]\) = point at infinity in x-direction
Why Useful in Vision?
1. Linear Transformations:
Translation becomes matrix multiplication (not possible with Euclidean coordinates; see the sketch after this list)

2. Unified Framework:
Rotation, translation, scaling, perspective → all matrix operations

3. Handle Infinity:
Vanishing points in perspective images are at infinity
Parallel lines in 3D meet at infinity in 2D projection

4. Projective Geometry:
Natural representation for perspective projection
Handles division by depth elegantly
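
A sketch of point 1 from the list above: in homogeneous coordinates a 2D translation by (tx, ty) is a single 3×3 matrix, so it composes with other transforms by plain matrix multiplication (the values are illustrative):

```python
import numpy as np

tx, ty = 4.0, -1.0
T = np.array([[1.0, 0.0, tx],      # homogeneous 2D translation matrix
              [0.0, 1.0, ty],
              [0.0, 0.0, 1.0]])

p = np.array([2.0, 3.0, 1.0])      # point (2, 3) in homogeneous form
print(T @ p)                       # [6. 2. 1.]  -> translated point (6, 2)

d = np.array([1.0, 0.0, 0.0])      # point at infinity (x-direction)
print(T @ d)                       # [1. 0. 0.]  -> directions are unaffected by translation
```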

Visual Explanation: Euclidean vs Homogeneous Coordinates

[Diagram: on the left, the point (3, 2) plotted in the standard Cartesian plane; on the right, the same point in projective space as the ray from the origin through (3, 2, 1) and (6, 4, 2), intersecting the plane w = 1.]

Euclidean Coordinates (2D):
• Representation: (x, y)
• Example: Point P = (3, 2)
• Dimensions: 2 values needed
• Unique: one way to represent each point

Homogeneous Coordinates (2D):
• Representation: [x, y, w]
• Example: P = [3, 2, 1] ≡ [6, 4, 2]
• Dimensions: 3 values (for a 2D point!)
• Scale-invariant: infinitely many representations of the same point
🎯 Simple Analogy: The Flashlight Trick
Imagine you have a flashlight at the origin (0,0,0) shining through space:
  • Euclidean point (3, 2) is a specific dot on a wall
  • Homogeneous [3, 2, 1] is the ray of light from origin through that dot
  • All points on this ray [6, 4, 2], [9, 6, 3], [3k, 2k, k] represent the SAME 2D point
  • To get back to 2D: see where the ray hits the plane w=1 → (3/1, 2/1) = (3, 2)
💡 Key insight: The "w" coordinate is like a projector-screen distance. When w=1, the screen is at the standard distance. When w=2, the same [3, 2] lands at (3/2, 2/2) because the screen is twice as far away!
↔️ Conversion Examples
Euclidean → Homogeneous
(2, 3) → [2, 3, 1]
(5, -1) → [5, -1, 1]
(0, 0) → [0, 0, 1]
Just add a 1 at the end!
Homogeneous → Euclidean
[4, 6, 2] → (4/2, 6/2) = (2, 3)
[10, -2, 2] → (10/2, -2/2) = (5, -1)
[0, 0, 5] → (0/5, 0/5) = (0, 0)
Divide x and y by w!

Why We Need This for Camera Projection

❌ Without Homogeneous (Doesn't Work!)
Problem: Perspective division

To project 3D → 2D, we need:
u = f · (X / Z)
v = f · (Y / Z)
Issue: Division by Z is NOT a linear operation!

❌ Can't write as simple matrix multiplication
❌ Need special case handling
❌ Messy math with if-statements
✅ With Homogeneous (Works!)
Solution: Embed division in coordinates

Matrix multiplication gives:
[u·Z, v·Z, Z] = K [X, Y, Z]
Convert to Euclidean:
(u, v) = (u·Z/Z, v·Z/Z)
✅ Everything is matrix multiplication!
✅ Clean, elegant math
✅ GPU-friendly operations
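
A quick numerical check of this identity, using the same illustrative K as in the earlier sketch:

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
X_c = np.array([0.5, 0.2, 4.0])        # point in camera coordinates

x = K @ X_c                            # [u*Z, v*Z, Z] = [1680. 1120.    4.]
u, v = x[:2] / x[2]                    # perspective divide by the last entry
print(u, v)                            # 420.0 280.0 (matches the earlier projection)
```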
🎓 The Big Picture:
Homogeneous coordinates are like a "trick" that lets us do perspective projection (which requires division) using only matrix multiplication. We pay with an extra dimension, but gain:
  • Simplicity: All transformations (rotation, translation, projection) are just matrix × vector
  • Composability: Chain transformations by multiplying matrices
  • Infinity handling: Points at infinity have w=0 (represents directions like vanishing points)
  • Unified framework: Same math works for 2D, 3D, projections, everything!

Camera Calibration

What is Camera Calibration?
Finding the intrinsic (\(K\)) and extrinsic (\(R, \mathbf{t}\)) parameters of a camera. Essential for accurate 3D reconstruction and measurement.
Calibration Process
1. Capture images of known pattern:
• Usually checkerboard pattern
• Known 3D coordinates of corners
• Detect 2D pixel positions in image

2. Establish correspondences:
• Match 3D world points to 2D image points
• Need multiple views (typically 10-20 images)

3. Solve for parameters:
• Use optimization (e.g., Zhang's method)
• Minimize reprojection error
• Output: \(K\), distortion coefficients, \(R\), \(\mathbf{t}\)

4. Validate:
• Check reprojection error (should be < 1 pixel)
• Test on new images
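
A condensed OpenCV sketch of steps 1-3 (OpenCV's calibrateCamera is based on Zhang's method); the image folder, board size, and square size below are placeholders to replace with your own:

```python
import glob
import cv2
import numpy as np

board = (9, 6)                 # inner corners per row/column (assumed board)
square = 0.025                 # square size in metres (assumed)

# Known 3D corner coordinates on the board plane (Z = 0)
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.jpg"):          # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:                                  # 3D-2D correspondence for this view
        obj_pts.append(objp)
        img_pts.append(corners)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("RMS reprojection error (pixels):", rms)  # should be well under 1 pixel
```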
Lens Distortion
Real cameras have lens distortion (pinhole is idealized):

Radial Distortion:
• Barrel distortion (wide-angle lenses)
• Pincushion distortion (telephoto lenses)
• Modeled with coefficients \(k_1, k_2, k_3\)

Tangential Distortion:
• Lens not parallel to image plane
• Modeled with coefficients \(p_1, p_2\)

Correction: Apply inverse distortion model after calibration
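
With K and the distortion coefficients (k1, k2, p1, p2, k3) in hand, OpenCV applies the correction with cv2.undistort; a minimal sketch, with illustrative values standing in for real calibration output:

```python
import cv2
import numpy as np

# K and dist as produced by cv2.calibrateCamera (illustrative values)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.array([-0.25, 0.07, 0.001, 0.0005, 0.0])   # k1, k2, p1, p2, k3

img = cv2.imread("frame.jpg")                        # placeholder image path
undistorted = cv2.undistort(img, K, dist)            # remove lens distortion
cv2.imwrite("frame_undistorted.jpg", undistorted)
```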

Field of View (FOV) Calculations

📐 Field of View - The Complete Picture

What: Angular extent of the observable scene captured by the camera

Why it matters: Determines coverage area at a given distance - critical for camera selection

Depends on: Sensor size, focal length, and working distance

FOV Formula (Angle)
$$\text{FOV} = 2 \times \arctan\left(\frac{d}{2f}\right)$$
Where:
• $d$ = sensor dimension (width for horizontal FOV, height for vertical FOV)
• $f$ = focal length (in same units as sensor dimension)
• Result is in radians (multiply by $180/\pi$ for degrees)

Horizontal FOV (use sensor width):
$$\text{FOV}_h = 2 \times \arctan\left(\frac{w_{\text{sensor}}}{2f}\right)$$
Vertical FOV (use sensor height):
$$\text{FOV}_v = 2 \times \arctan\left(\frac{h_{\text{sensor}}}{2f}\right)$$
Diagonal FOV (use sensor diagonal):
$$\text{FOV}_d = 2 \times \arctan\left(\frac{\sqrt{w^2 + h^2}}{2f}\right)$$
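
These formulas in a few lines of Python, evaluated for a full-frame sensor (36 × 24 mm) with a 50 mm lens as an illustrative case:

```python
import math

def fov_deg(d_mm, f_mm):
    """FOV = 2 * arctan(d / (2f)), returned in degrees."""
    return math.degrees(2 * math.atan(d_mm / (2 * f_mm)))

w, h, f = 36.0, 24.0, 50.0                 # full-frame sensor, 50 mm lens
print(fov_deg(w, f))                       # horizontal ~39.6°
print(fov_deg(h, f))                       # vertical   ~27.0°
print(fov_deg(math.hypot(w, h), f))        # diagonal   ~46.8°
```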
FOV in Pixels (Computer Vision)
$$\text{FOV} = 2 \times \arctan\left(\frac{\text{image width}}{2 \times f_x}\right)$$
When using calibrated intrinsics:
• Use image dimensions in pixels
• Use focal length in pixels ($f_x, f_y$ from camera matrix $K$)
• No need to know physical sensor size!

Example: Image 640×480, $f_x = 500$ pixels
$$\text{FOV}_h = 2 \times \arctan(640 / (2 \times 500)) = 2 \times \arctan(0.64) \approx 65.2°$$
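
The same worked example in code:

```python
import math

width_px, fx = 640, 500.0                  # image width and focal length, both in pixels
fov_h = math.degrees(2 * math.atan(width_px / (2 * fx)))
print(round(fov_h, 1))                     # 65.2 (degrees)
```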
Coverage Area at Working Distance
$$W = 2 \times D \times \tan\left(\frac{\text{FOV}}{2}\right)$$
Where:
• $W$ = width (or height) of coverage area at distance $D$
• $D$ = working distance (distance from camera to scene)
• FOV = field of view angle (horizontal or vertical)

Alternative formula (direct from sensor & focal length):
$$W = \frac{d \times D}{f}$$ where $d$ is sensor dimension, $f$ is focal length, $D$ is working distance

Example: 35mm sensor (36mm wide), 50mm focal length, 5m distance
$$W = \frac{36 \times 5000}{50} = 3600\text{ mm} = 3.6\text{ m width}$$
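
A two-line check of this worked example, confirming that the two coverage formulas agree:

```python
import math

d, D, f = 36.0, 5000.0, 50.0                 # sensor width (mm), distance (mm), focal length (mm)
fov = 2 * math.atan(d / (2 * f))             # horizontal FOV in radians
print(2 * D * math.tan(fov / 2), d * D / f)  # 3600.0 3600.0 -> both give 3.6 m of coverage
```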
Inverse: Calculate Focal Length from Required FOV
$$f = \frac{d}{2 \times \tan(\text{FOV}/2)}$$
Use case: Select lens focal length to achieve desired coverage

Or directly from coverage requirements:
$$f = \frac{d \times D}{W}$$ where $W$ is desired coverage width at working distance $D$

Example: Need to cover 2m width at 3m distance, sensor width 23mm
$$f = \frac{23 \times 3000}{2000} = 34.5\text{ mm focal length}$$
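
And the lens-selection example computed directly from the coverage requirement:

```python
d, D, W = 23.0, 3000.0, 2000.0             # sensor width (mm), working distance (mm), required coverage (mm)
print(d * D / W)                           # 34.5 -> choose a ~35 mm lens
```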

Common Sensor Sizes (Reference)

Sensor Type        | Width × Height (mm) | Diagonal (mm) | Common Use
Full Frame (35mm)  | 36 × 24             | 43.3          | Pro cameras, high-end DSLRs
APS-C (Crop)       | 23.6 × 15.7         | 28.3          | Consumer DSLRs, mirrorless
Micro Four Thirds  | 17.3 × 13.0         | 21.6          | Olympus, Panasonic cameras
1" Sensor          | 13.2 × 8.8          | 15.9          | Premium compacts, drones
1/2.3" Sensor      | 6.17 × 4.55         | 7.66          | Smartphones, action cameras
💡 Practical Tips:
  • Wider FOV (shorter focal length): more scene coverage, more distortion, less detail per pixel
  • Narrower FOV (longer focal length): less coverage, less distortion, more detail (telephoto)
  • Machine vision: Calculate FOV to ensure entire inspection area is visible
  • Robotics: Wide FOV for navigation, narrow FOV for object recognition
  • Crop factor: APS-C has a 1.5× crop factor (a 50mm lens on APS-C gives roughly the field of view of a 75mm lens on full frame)

Key Insights

Focal Length & Field of View:
Larger focal length = narrower field of view (telephoto)
Smaller focal length = wider field of view (wide-angle)
See detailed FOV calculations in section above

Principal Point:
Ideally at image center, but manufacturing imperfections can shift it
Important for accurate reconstruction

Depth Ambiguity:
Single camera loses depth information!
Any point along a ray projects to same pixel
Solutions: stereo vision, structure from motion, depth sensors

Coordinate Frames:
World: arbitrary global reference
Camera: origin at optical center, Z-axis along viewing direction
Image: 2D pixel coordinates

Why Matrix Formulation?
Efficient computation (GPU-friendly)
Easy to compose transformations
Standard in computer vision libraries (OpenCV, etc.)