Deep Dive Technical Guide

Understanding Kubernetes Pod Creation

The complete under-the-hood sequence — from kubectl apply to a running container, covering API processing, scheduling, cgroups, namespaces, CNI, CRI, and health probes.

11 Phases · 12+ Components · ~2s Pod Startup · 5 Namespaces
Phase 01

API Processing & Validation

Every pod begins its life as a POST request to the API server. Before anything is scheduled, the API server authenticates and authorizes the request, then mutates and validates the object through a pipeline of admission controllers.

API Server Processing Pipeline
sequenceDiagram
    participant User
    participant APIServer as API Server
    participant etcd
    Note over User: kubectl apply -f pod.yaml
    User->>APIServer: POST /api/v1/namespaces/default/pods
    APIServer->>APIServer: Parse YAML/JSON
    APIServer->>APIServer: Authentication (certificates/tokens)
    APIServer->>APIServer: Authorization (RBAC checks)
    APIServer->>APIServer: Mutating admission (add defaults)
    Note over APIServer: LimitRange defaults, ServiceAccount, mutating webhooks
    APIServer->>APIServer: Schema validation
    APIServer->>APIServer: Validating admission (final checks)
    Note over APIServer: ResourceQuota, Pod Security Admission, validating webhooks
    APIServer->>etcd: Store pod spec (phase: Pending, nodeName unset)
    etcd-->>APIServer: Confirm stored with resourceVersion
    APIServer-->>User: 201 Created + Pod UID
Key Concept
The Admission Controller Pipeline

Admission controllers are plugins that intercept requests to the API server after authentication and authorization but before the object is persisted to etcd. They run in two phases:

Mutating controllers modify the object (e.g., injecting default resource limits, adding sidecar containers). Validating controllers reject invalid objects (e.g., enforcing security policies). This is where tools like Istio inject sidecars and OPA/Gatekeeper enforce policies.
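The two-phase ordering can be illustrated with a minimal Python sketch. This is not the API server's implementation: `admit`, `default_resources`, and `forbid_privileged` are hypothetical names, the pod is a plain dict, and the 256Mi default is an arbitrary value chosen for the example.

```python
from copy import deepcopy


class AdmissionError(Exception):
    """Raised when a validating step rejects the object."""


def default_resources(pod):
    # Hypothetical mutating step: fill in a memory limit where none is set.
    for container in pod["spec"]["containers"]:
        limits = container.setdefault("resources", {}).setdefault("limits", {})
        limits.setdefault("memory", "256Mi")
    return pod


def forbid_privileged(pod):
    # Hypothetical validating step: reject privileged containers.
    for container in pod["spec"]["containers"]:
        if container.get("securityContext", {}).get("privileged"):
            raise AdmissionError(f"container {container['name']!r} must not be privileged")


def admit(pod, mutators, validators):
    """Run every mutator first, then every validator, on a copy of the pod."""
    pod = deepcopy(pod)
    for mutate in mutators:
        pod = mutate(pod)
    for validate in validators:
        validate(pod)  # raises AdmissionError to reject the request
    return pod  # the object that would be persisted
```

Running validators only after all mutators have finished means they see the final object, including anything a mutator injected.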


Phase 02

Scheduling Decision

The scheduler watches for unscheduled pods, evaluates candidate nodes through a filter-then-score pipeline, and binds the pod to the highest-scoring node.

Scheduler Pipeline
sequenceDiagram
    participant Scheduler
    participant APIServer as API Server
    participant etcd
    Scheduler->>APIServer: Watch unscheduled pods (streaming HTTP watch)
    APIServer-->>Scheduler: Event: Pod created (nodeName unset)
    Scheduler->>APIServer: List/watch nodes (cached locally)
    APIServer-->>Scheduler: Node list with allocatable resources
    Scheduler->>Scheduler: Filter Phase
    Note over Scheduler: Resource reqs, NodeSelector, Affinity, Taints
    Scheduler->>Scheduler: Score Phase (0-100 per node)
    Note over Scheduler: NodeResourcesFit, NodeResourcesBalancedAllocation, PodTopologySpread
    Scheduler->>Scheduler: Select highest-scoring node
    Scheduler->>APIServer: POST pod binding subresource (node: worker-1)
    APIServer->>etcd: Update pod with binding
Filter Phase
Predicate Evaluation

Eliminates nodes that cannot run the pod: insufficient CPU/memory, labels that do not match the pod's node selector, taints the pod does not tolerate, port conflicts, volume zone constraints.

Score Phase
Priority Functions

Scores remaining nodes 0–100 on factors like resource balance, spreading across zones, affinity preferences, and image locality. The highest total score wins.
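A toy version of the filter-then-score flow is sketched below. It is a simplification, not the scheduler's code: the `Node` fields, the single label-selector filter, and the "more free capacity scores higher" formula are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alloc_cpu_m: int           # allocatable CPU in millicores
    alloc_mem_mi: int          # allocatable memory in MiB
    used_cpu_m: int = 0        # CPU already requested by other pods
    used_mem_mi: int = 0
    labels: dict = field(default_factory=dict)


def feasible(node, cpu_m, mem_mi, selector):
    """Filter phase: does the pod fit and do the node labels match?"""
    fits = (node.alloc_cpu_m - node.used_cpu_m >= cpu_m
            and node.alloc_mem_mi - node.used_mem_mi >= mem_mi)
    matches = all(node.labels.get(k) == v for k, v in selector.items())
    return fits and matches


def score(node, cpu_m, mem_mi):
    """Score phase: 0-100, higher when more capacity stays free after placement."""
    cpu_free = (node.alloc_cpu_m - node.used_cpu_m - cpu_m) / node.alloc_cpu_m
    mem_free = (node.alloc_mem_mi - node.used_mem_mi - mem_mi) / node.alloc_mem_mi
    return round(50 * cpu_free + 50 * mem_free)


def schedule(nodes, cpu_m, mem_mi, node_selector=None):
    """Return the name of the best node, or None if no node is feasible."""
    selector = node_selector or {}
    candidates = [n for n in nodes if feasible(n, cpu_m, mem_mi, selector)]
    if not candidates:
        return None  # the pod would stay Pending
    return max(candidates, key=lambda n: score(n, cpu_m, mem_mi)).name
```

Filtering first keeps the scoring functions simple: they never have to handle a node that cannot run the pod at all.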


Phase 03

Kubelet Processing

Once the pod is bound to a node, the kubelet on that node picks it up via its API server watch, validates it locally, and begins orchestrating the container lifecycle.

Kubelet Watch & Validation
sequenceDiagram
    participant kubelet
    participant APIServer as API Server
    kubelet->>APIServer: Watch assigned pods for this node
    APIServer-->>kubelet: Event: Pod assigned to worker-1
    kubelet->>kubelet: Validate pod spec against node
    kubelet->>kubelet: Check resource availability
    kubelet->>APIServer: Update status (phase: Pending, PodScheduled=True)

The kubelet is the node agent. It doesn't create containers directly — it delegates to the Container Runtime Interface (CRI). Think of it as the orchestrator on each node: it manages the full pod lifecycle, health probes, volume mounts, and status reporting.


Phase 04

Image Management

Before any container can start, its image must be available locally. The kubelet checks the local image store and, if the image is missing, has the container runtime pull its layers from the registry.

Image Pull Workflow
sequenceDiagram
    participant kubelet
    participant ImageStore as Image Store
    participant Registry as Container Registry
    kubelet->>ImageStore: Check if image exists locally
    alt Image not found
        kubelet->>Registry: Authenticate with registry
        Note over Registry: docker.io, gcr.io, private registry
        kubelet->>Registry: Pull image layers
        Registry-->>kubelet: Stream compressed layers
        kubelet->>ImageStore: Decompress and store layers
        kubelet->>ImageStore: Create image manifest
    else Image exists
        kubelet->>ImageStore: Use cached image
    end
filesystem
# Content-addressable image storage
/var/lib/containerd/
├── io.containerd.content.v1.content/     # Content-addressable store
│   ├── ingest/                           # Temporary during pull
│   └── blobs/                            # Actual layer data
├── io.containerd.metadata.v1.bolt/       # Metadata database
└── io.containerd.snapshotter.v1.overlayfs/  # Filesystem snapshots
Always
Pull Every Time

Contact the registry on every container start. Ensures the latest image for the tag but adds latency; layers already cached locally are not downloaded again.

IfNotPresent
Pull If Missing

Only pull if not cached locally. Default for images with a tag other than :latest (untagged and :latest images default to Always).

Never
Local Only

Never pull. Fails if image isn't already on the node.
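The three policies, plus the defaulting rule applied when a pod spec omits imagePullPolicy, can be sketched as follows. The function names are hypothetical and the image-reference parsing is deliberately simplified.

```python
def default_pull_policy(image: str) -> str:
    """Policy applied when the pod spec leaves imagePullPolicy unset."""
    if "@" in image:                        # pinned by digest
        return "IfNotPresent"
    last = image.rsplit("/", 1)[-1]         # drop registry/path so a port is not read as a tag
    tag = last.split(":", 1)[1] if ":" in last else "latest"
    return "Always" if tag == "latest" else "IfNotPresent"


def should_pull(policy: str, cached: bool) -> bool:
    """Decide whether the registry must be contacted before starting the container."""
    if policy == "Always":
        return True
    if policy == "IfNotPresent":
        return not cached
    if policy == "Never":
        if not cached:
            raise RuntimeError("image not present locally and pull policy is Never")
        return False
    raise ValueError(f"unknown pull policy: {policy}")
```

For example, `default_pull_policy("nginx")` yields `"Always"` because an untagged image is treated as `:latest`, while `default_pull_policy("nginx:1.25")` yields `"IfNotPresent"`.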


Phase 05

Container Runtime Preparation

The container runtime, driven through the CRI, creates isolated environments using Linux primitives: network namespaces for network isolation, mount namespaces for filesystem isolation, and cgroups for resource limits.

CRI & Namespace/CGroup Setup
sequenceDiagram
    participant kubelet
    participant CRI as Container Runtime
    participant CGroupMgr as CGroup Manager
    participant NetNS as Network NS
    participant MountNS as Mount NS
    kubelet->>CRI: CreateContainer request
    CRI->>CRI: Generate container config
    Note over CRI: Env vars, volume mounts, security context
    CRI->>NetNS: Create network namespace
    CRI->>MountNS: Create mount namespace
    CRI->>CGroupMgr: Create cgroup hierarchy
    CGroupMgr->>CGroupMgr: Set memory limits
    CGroupMgr->>CGroupMgr: Set CPU limits
    CGroupMgr->>CGroupMgr: Set PID limits
    CGroupMgr-->>CRI: Cgroup paths created

CGroup Hierarchy

filesystem
# CGroup hierarchy for a pod (cgroup v1 layout)
/sys/fs/cgroup/
├── memory/kubepods/burstable/pod{uid}/
│   ├── memory.limit_in_bytes      # Memory limit
│   ├── memory.usage_in_bytes      # Current usage
│   └── container{id}/             # Per container
├── cpu/kubepods/burstable/pod{uid}/
│   ├── cpu.cfs_quota_us           # CPU quota
│   ├── cpu.cfs_period_us          # CPU period
│   └── cpu.shares                 # CPU weight
└── pids/kubepods/burstable/pod{uid}/
    └── pids.max                   # Max processes
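The values written into these files derive from the pod's resource spec. The sketch below shows the usual arithmetic for the CPU and memory files, assuming the default 100 ms CFS period; the function names are hypothetical and only the binary suffixes Ki, Mi, and Gi are parsed.

```python
CFS_PERIOD_US = 100_000  # assumed default value of cpu.cfs_period_us (100 ms)


def cpu_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """cpu.cfs_quota_us: CPU time allowed per period for a given CPU limit."""
    return limit_millicores * period_us // 1000


def cpu_shares(request_millicores: int) -> int:
    """cpu.shares: relative weight derived from the CPU request (1 CPU = 1024)."""
    return max(2, request_millicores * 1024 // 1000)


_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def memory_limit_bytes(quantity: str) -> int:
    """memory.limit_in_bytes for a quantity such as "256Mi"."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # already a plain byte count
```

A limit of 500m therefore becomes a quota of 50000 µs per 100000 µs period: the container may use half of one CPU's time before it is throttled.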
Guaranteed
QoS Class

Every container sets CPU and memory requests equal to its limits. Path: /kubepods/pod{uid}/

Burstable
QoS Class

At least one request or limit is set, but the Guaranteed criteria are not met (typically requests < limits). Path: /kubepods/burstable/pod{uid}/

BestEffort
QoS Class

No requests or limits set on any container. Path: /kubepods/besteffort/pod{uid}/
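The classification rules can be expressed as a short function. This is a sketch under simplifying assumptions: each container is a dict with optional `requests` and `limits` maps, only `cpu` and `memory` are considered, and quantities are compared as already-normalized strings.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is not set defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Note that a container that sets only limits still counts toward Guaranteed, because its missing requests default to those limits.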

Component Details

CGroup Manager

Part of kubelet

Path: /sys/fs/cgroup/

Types: systemd, cgroupfs

Controls: CPU, Memory, PID, I/O

Image Store

Container runtime

Path: /var/lib/containerd/

Format: OCI image

Content-addressable

CNI Plugin

Location: /opt/cni/bin/

Config: /etc/cni/net.d/

Types: bridge, calico, flannel

Function: Pod networking

Namespaces

PID, Network, Mount, IPC, UTS

Creation: clone() / unshare() syscalls

Isolation: Process, Net, FS

Managed by runtime


Phase 06

Networking & CNI Setup

The container runtime invokes the CNI plugin on the pod's network namespace. The plugin allocates an IP, sets up a veth pair connecting the pod to the host, and configures routes and DNS.

CNI Network Setup
sequenceDiagram
    participant kubelet
    participant CNI as CNI Plugin
    participant NetNS as Network NS
    kubelet->>CNI: Setup network for pod
    CNI->>CNI: Allocate IP from CIDR pool
    CNI->>NetNS: Configure veth pair
    CNI->>NetNS: Set IP address and routes
    CNI->>CNI: Update IPAM database
    CNI-->>kubelet: Network ready (IP: 10.244.1.5)
network
# Network namespace setup, step by step (equivalent ip commands)
1. Runtime creates netns:  ip netns add {netns-id}
2. Creates veth pair:      ip link add veth0 type veth peer name eth0
3. Moves eth0 to pod:      ip link set eth0 netns {netns-id}
4. Assigns IP:             ip netns exec {netns-id} ip addr add 10.244.1.5/24 dev eth0
5. Brings link up:         ip netns exec {netns-id} ip link set eth0 up
6. Sets routes:            ip netns exec {netns-id} ip route add default via 10.244.1.1
7. Updates IPAM:           Store IP allocation in etcd/file

# Result: Pod network configuration
Pod Network Namespace:
├── lo (loopback):  127.0.0.1/8
└── eth0:           10.244.1.5/24
    ├── Gateway:    10.244.1.1
    ├── DNS:        10.96.0.10 (CoreDNS)
    └── Routes:     default via 10.244.1.1
Key Concept
CNI Plugin Chain

Main plugin creates the interface (bridge, calico, cilium). IPAM plugin handles IP address management (host-local, dhcp). Meta plugins add features like bandwidth shaping and firewall rules.

Each pod gets its own network namespace. Containers in the same pod share the network namespace — they communicate over localhost. Different pods have separate network stacks.
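The IPAM step can be illustrated with a small in-memory allocator in the spirit of the host-local plugin. This is a sketch, not the plugin's code: the class name is hypothetical, state is kept in a dict rather than on disk, and the first usable address is assumed to be the gateway.

```python
import ipaddress


class HostLocalIPAM:
    """Hands out addresses from one node's pod CIDR, lowest free address first."""

    def __init__(self, cidr: str):
        self.network = ipaddress.ip_network(cidr)
        self.gateway = next(self.network.hosts())  # assumption: first host is the gateway
        self.allocated = {}                        # container id -> address

    def allocate(self, container_id: str) -> str:
        if container_id in self.allocated:         # idempotent for repeated calls
            ip = self.allocated[container_id]
            return f"{ip}/{self.network.prefixlen}"
        in_use = set(self.allocated.values())
        for ip in self.network.hosts():
            if ip != self.gateway and ip not in in_use:
                self.allocated[container_id] = ip
                return f"{ip}/{self.network.prefixlen}"
        raise RuntimeError("IP exhaustion: no free addresses in " + str(self.network))

    def release(self, container_id: str) -> None:
        self.allocated.pop(container_id, None)
```

With `HostLocalIPAM("10.244.1.0/24")`, the gateway is 10.244.1.1 and the first pod receives 10.244.1.2/24. Releasing an address makes it available to the next allocation.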


Phase 07

Volume Mounting

Persistent volumes are attached via CSI drivers, and ConfigMaps/Secrets are projected into the mount namespace as files with the correct permissions.

Persistent Volumes
CSI Attachment

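Attachable volumes are first attached to the node through the CSI driver's controller component; the kubelet then calls the driver's node plugin to mount the volume into the container's filesystem. Supports EBS, GCE PD, NFS, Ceph, and more.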

ConfigMaps & Secrets
Projected Volumes

Mounted as files into the mount namespace with configurable file permissions. Secrets are stored in tmpfs (memory-backed) for security.


Phase 08

Container Creation & Start

The container runtime (runc/kata) creates the container process, joins it to all the prepared namespaces and cgroups, applies the security context, and starts the init process.

Container Start Sequence
sequenceDiagram
    participant CRI as Container Runtime
    participant kubelet
    participant APIServer as API Server
    CRI->>CRI: Create container with runc/kata
    CRI->>CRI: Configure namespaces (pid, net, mnt, ipc, uts)
    CRI->>CRI: Apply security context (user, capabilities)
    CRI->>CRI: Join container to cgroups
    CRI->>CRI: Start container init process
    CRI-->>kubelet: Container created and started
    kubelet->>APIServer: Update pod status (phase: Running, startTime)

At this point the pod is Running. The container process is live, network is configured, volumes are mounted. But "Running" does not mean "Ready for traffic" — that requires passing readiness probes. The kubelet now enters the health monitoring phase.


Phase 09 – 10

Health Probes & Monitoring

Kubernetes supports three types of probes. The startup probe runs first and gates the other two; once it succeeds, liveness (restarts unhealthy containers) and readiness (controls traffic routing) run in parallel. Resource monitoring via cgroups runs continuously.

Health Probe Lifecycle
sequenceDiagram
    participant kubelet
    participant CRI as Container Runtime
    participant APIServer as API Server
    Note over kubelet,APIServer: Startup Probe Phase
    alt Startup Probe Configured
        kubelet->>kubelet: DISABLE liveness & readiness probes
        loop Until success or failure threshold
            kubelet->>CRI: Execute startup probe
            alt Success
                CRI-->>kubelet: HTTP 200 / exit 0
                kubelet->>kubelet: ENABLE liveness & readiness probes
            else Failure threshold exceeded
                kubelet->>CRI: SIGTERM → SIGKILL → Restart
                kubelet->>APIServer: Update (restartCount++)
            end
        end
    end
    Note over kubelet,APIServer: Continuous Monitoring (Parallel)
    par Liveness
        loop Every periodSeconds (e.g. 30s)
            kubelet->>CRI: Liveness probe
            alt failureThreshold reached (e.g. 3 failures)
                kubelet->>CRI: Kill & restart container
            end
        end
    and Readiness
        loop Every periodSeconds (e.g. 5s)
            kubelet->>CRI: Readiness probe
            alt Failing
                kubelet->>APIServer: Ready=False → remove from endpoints
            else Passing
                kubelet->>APIServer: Ready=True → add to endpoints
            end
        end
    and Resources
        loop Continuous
            kubelet->>kubelet: Check cgroup memory/CPU/PID
            alt OOM
                CRI-->>kubelet: Container killed by kernel OOM killer (OOMKilled)
            end
        end
    end

Ready ≠ Running. A pod can be Running but not Ready if readiness probes fail. Running means the process is alive. Ready means "accept traffic." The readiness probe controls whether the pod's IP appears in Service endpoints — failing it removes the pod from load balancer rotation without killing it.

Probe Types

Startup
Startup Probe

Gates liveness & readiness. Protects slow-starting containers. Failure restarts the container.

Liveness
Liveness Probe

Detects deadlocks and hangs. Failure kills and restarts the container. Runs every periodSeconds.

Readiness
Readiness Probe

Controls traffic routing. Failure removes pod from Service endpoints. Container keeps running.
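How consecutive probe results turn into a state change can be sketched with a small tracker. The class and field names are hypothetical; the thresholds mirror the failureThreshold and successThreshold probe settings, and the initial state is a parameter rather than a claim about kubelet defaults.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # consecutive failures needed to flip to unhealthy
    success_threshold: int = 1   # consecutive successes needed to flip back
    healthy: bool = True         # assumed initial state for this sketch
    _streak: int = 0             # consecutive results that contradict the current state

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the resulting state."""
        if ok == self.healthy:
            self._streak = 0     # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy
```

A single success between failures resets the streak, so the state only flips after the configured number of consecutive contradicting results. For a liveness probe the flip to unhealthy triggers a restart; for a readiness probe it removes the pod from Service endpoints.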


Phase 11

Pod Status Lifecycle

A pod transitions through well-defined phases. Each phase has sub-conditions that provide granular insight into exactly what's happening.

Pending

Pod accepted but not running. Waiting for scheduling, image pull, or volume attachment.

Running

Bound to a node with all containers created; at least one container is running, starting, or restarting. Health probes actively monitored.

Succeeded

All containers exited with code 0. Typical for Jobs and CronJobs. No restarts.

Failed

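All containers have terminated and at least one ended in failure (exit code ≠ 0 or killed by the system, e.g. OOMKilled) and will not be restarted. Image pull failures leave the pod Pending, and liveness failures trigger a restart when the restart policy allows it.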

Unknown

Pod state cannot be determined — node communication lost, kubelet not responding, or network partition. Pods may still be running. Manual intervention often required.

Pod Conditions

yaml
conditions:
  - type: PodScheduled
    status: "True"
    reason: "PodScheduled"
    message: "Successfully assigned default/my-pod to worker-1"

  - type: Initialized
    status: "True"
    reason: "PodCompleted"
    message: "All init containers completed successfully"

  - type: ContainersReady
    status: "False"
    reason: "ContainersNotReady"
    message: "containers with unready status: [app]"

  - type: Ready
    status: "False"
    reason: "ContainersNotReady"
    message: "containers with unready status: [app]"
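When debugging, reading the conditions in lifecycle order shows where a pod is stuck. The helper below is a hypothetical sketch that takes the conditions list as already-parsed dicts (for example from the pod's status as JSON) and returns the first condition that is not yet True.

```python
CONDITION_ORDER = ["PodScheduled", "Initialized", "ContainersReady", "Ready"]


def first_blocker(conditions):
    """Return (type, message) of the earliest condition that is not True, else None."""
    by_type = {c["type"]: c for c in conditions}
    for cond_type in CONDITION_ORDER:
        cond = by_type.get(cond_type)
        if cond is None:
            return cond_type, "condition not reported yet"
        if cond.get("status") != "True":
            return cond_type, cond.get("message", "")
    return None  # every condition is True: the pod is Ready
```

For the conditions shown above, the helper returns `("ContainersReady", "containers with unready status: [app]")`, which points at the app container's readiness rather than at scheduling or init containers.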

Failure Scenarios

Common Failures & Recovery

Understanding failure modes is critical for debugging production issues. Here are the most common scenarios with their error signatures and recovery behavior.

Failure Scenarios
sequenceDiagram
    participant kubelet
    participant CRI as Container Runtime
    participant Registry
    participant CGroupMgr as CGroup Manager
    participant APIServer as API Server
    Note over kubelet,APIServer: Image Pull Failure
    kubelet->>CRI: Pull private-registry.com/app:latest
    CRI->>Registry: Auth required
    Registry-->>CRI: 401 Unauthorized
    alt Credentials provided
        CRI->>Registry: Retry with credentials
        alt Valid
            Registry-->>CRI: Image layers
        else Invalid
            CRI-->>kubelet: ErrImagePull
            kubelet->>APIServer: Pending (ImagePullBackOff)
        end
    else No credentials
        CRI-->>kubelet: ImagePullBackOff
        Note over kubelet: Backoff: 10s, 20s, 40s, 80s, 160s, 300s max
    end
    Note over kubelet,APIServer: OOM Kill
    CGroupMgr->>CGroupMgr: Memory limit exceeded
    CGroupMgr->>CRI: SIGKILL (OOM)
    CRI-->>kubelet: Terminated (OOMKilled, exit 137)
    kubelet->>CRI: Restart (if policy allows)
    kubelet->>APIServer: restartCount++

Restart Policy & Backoff

yaml
restartPolicy: Always      # Default; the only value Deployments accept
restartPolicy: OnFailure   # Common for Jobs (Jobs accept OnFailure or Never)
restartPolicy: Never       # One-shot containers

# Exponential backoff calculation (n = restarts since the last backoff reset)
delay = min(300, 2^n * 10)
# → 10s, 20s, 40s, 80s, 160s, 300s (max)

# Reset: 10 minutes of successful running resets the backoff delay to 10s
# (the reported restartCount keeps counting)
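The backoff formula above translates directly into code. The function name and parameter names are illustrative; the base of 10 seconds and cap of 300 seconds are the values quoted in this section.

```python
def restart_delay(restarts: int, base: int = 10, cap: int = 300) -> int:
    """Seconds to wait before the next restart, doubling each time up to the cap."""
    return min(cap, base * 2 ** restarts)


# Delays for the first seven restarts
delays = [restart_delay(n) for n in range(7)]
print(delays)  # [10, 20, 40, 80, 160, 300, 300]
```

The sixth value would be 320 without the cap, which is why the sequence flattens at 300 seconds (five minutes).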

Critical Failure Points

Node Resources
Resource Exhaustion

Memory: OOM killer terminates containers. Disk: Image GC + pod eviction. PID: No new processes.

Network
Network Failures

CNI failure: Stuck in ContainerCreating. IP exhaustion: No IPs for new pods. DNS: Service discovery broken.

Runtime
Runtime Issues

containerd crash: All pods affected. Image corruption: Creation fails. Storage driver: Cannot create layers.