The complete under-the-hood sequence — from kubectl apply to a running container, covering API processing, scheduling, cgroups, namespaces, CNI, CRI, and health probes.
Every pod begins its life as a POST request to the API server. Before anything is scheduled, the API server authenticates and authorizes the request, runs it through mutating and then validating admission controllers, and persists the resulting object to etcd.
Admission controllers are plugins that intercept requests to the API server after authentication and authorization but before the object is persisted to etcd. They run in two phases:
Mutating controllers modify the object (e.g., injecting default resource limits, adding sidecar containers). Validating controllers reject invalid objects (e.g., enforcing security policies). This is where tools like Istio inject sidecars and OPA/Gatekeeper enforce policies.
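The two-phase ordering can be sketched in a few lines of Python. This is an illustrative model only, not the API server's implementation: `admit`, `add_default_limits`, `forbid_privileged`, and the `256Mi` default are hypothetical names and values chosen for the example.

```python
import copy


class AdmissionDenied(Exception):
    """Raised when a validating step rejects the object."""


def add_default_limits(pod):
    # Hypothetical mutating step: give every container a default memory limit.
    for c in pod["spec"]["containers"]:
        c.setdefault("resources", {}).setdefault("limits", {}).setdefault("memory", "256Mi")
    return pod


def forbid_privileged(pod):
    # Hypothetical validating step: reject privileged containers.
    for c in pod["spec"]["containers"]:
        if c.get("securityContext", {}).get("privileged"):
            raise AdmissionDenied(f"container {c['name']} is privileged")


def admit(pod, mutators, validators):
    """Run all mutating steps first, then all validating steps."""
    obj = copy.deepcopy(pod)      # leave the caller's object untouched
    for mutate in mutators:
        obj = mutate(obj)
    for validate in validators:
        validate(obj)             # raises AdmissionDenied on rejection
    return obj                    # the object that would be persisted


pod = {"spec": {"containers": [{"name": "app", "image": "nginx:1.27"}]}}
admitted = admit(pod, [add_default_limits], [forbid_privileged])
print(admitted["spec"]["containers"][0]["resources"])
```

Validators see the object after every mutation has been applied, which is why an injected sidecar or default is still subject to policy checks.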
The scheduler watches for unscheduled pods, evaluates every node through a filter-then-score pipeline, and binds the pod to the highest-scoring node.
Eliminates nodes that cannot run the pod: insufficient CPU/memory, missing node selectors, unmatched taints/tolerations, port conflicts, volume zone constraints.
Scores remaining nodes 0–100 on factors like resource balance, spreading across zones, affinity preferences, and image locality. The highest total score wins.
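A minimal sketch of the filter-then-score idea follows. The `Node` fields and the scoring weights (90 points for headroom, 10 for image locality) are assumptions made for illustration; the real scheduler combines many weighted plugins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int          # free CPU in millicores
    free_mem_mi: int         # free memory in MiB
    has_image: bool = False  # image already cached on the node


def feasible(node, cpu_m, mem_mi):
    # Filter phase: drop nodes that cannot fit the pod's requests.
    return node.free_cpu_m >= cpu_m and node.free_mem_mi >= mem_mi


def _headroom(free, requested):
    # Fraction of the free resource that remains after placing the pod.
    return (free - requested) / free if free else 0.0


def score(node, cpu_m, mem_mi):
    # Score phase (0-100): prefer nodes that keep more headroom,
    # plus a small hypothetical bonus for image locality.
    base = 90 * (_headroom(node.free_cpu_m, cpu_m) + _headroom(node.free_mem_mi, mem_mi)) / 2
    return round(base + (10 if node.has_image else 0))


def schedule(nodes, cpu_m, mem_mi):
    candidates = [n for n in nodes if feasible(n, cpu_m, mem_mi)]
    if not candidates:
        return None          # no feasible node: the pod stays Pending
    return max(candidates, key=lambda n: score(n, cpu_m, mem_mi)).name


nodes = [
    Node("worker-a", 500, 1024),
    Node("worker-b", 4000, 8192),
    Node("worker-c", 2000, 4096, has_image=True),
]
print(schedule(nodes, cpu_m=1000, mem_mi=2048))
```

Here `worker-a` is filtered out for lack of CPU, and `worker-b` outscores `worker-c` because its larger headroom outweighs the image-locality bonus.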
Once the pod is bound to a node, the kubelet on that node picks it up via its API server watch, validates it locally, and begins orchestrating the container lifecycle.
The kubelet is the node agent. It doesn't create containers directly — it delegates to the Container Runtime Interface (CRI). Think of it as the orchestrator on each node: it manages the full pod lifecycle, health probes, volume mounts, and status reporting.
Before any container can start, its image must be available locally. The kubelet checks the local image store, and if missing, pulls layers from the registry.
filesystem
# Content-addressable image storage
/var/lib/containerd/
├── io.containerd.content.v1.content/        # Content-addressable store
│   ├── ingest/                              # Temporary during pull
│   └── blobs/                               # Actual layer data
├── io.containerd.metadata.v1.bolt/          # Metadata database
└── io.containerd.snapshotter.v1.overlayfs/  # Filesystem snapshots
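Always: Contact the registry every time a container starts to resolve the image digest; layers already cached are not downloaded again. Ensures the latest image but adds latency. Default for the :latest tag or when no tag is given.
IfNotPresent: Only pull if not cached locally. Default for images with any tag other than :latest.
Never: Never pull. Fails if the image isn't already on the node.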
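The three policies reduce to a small decision function. The sketch below assumes the documented Kubernetes defaulting rules (digest or explicit non-latest tag gives IfNotPresent, otherwise Always); the function names are hypothetical and the image reference parsing is deliberately simplified.

```python
def default_pull_policy(image: str) -> str:
    """Policy applied when the pod spec omits imagePullPolicy (simplified parsing)."""
    if "@" in image:                       # pinned by digest
        return "IfNotPresent"
    last = image.rsplit("/", 1)[-1]        # drop registry/path, which may contain ':port'
    tag = last.split(":", 1)[1] if ":" in last else "latest"
    return "Always" if tag == "latest" else "IfNotPresent"


def must_contact_registry(policy: str, cached: bool) -> bool:
    """Decide whether the kubelet has to reach the registry before starting the container."""
    if policy == "Always":
        return True
    if policy == "IfNotPresent":
        return not cached
    if policy == "Never":
        if not cached:
            # Kubernetes surfaces this case as ErrImageNeverPull.
            raise RuntimeError("image not present and policy is Never")
        return False
    raise ValueError(f"unknown pull policy: {policy}")


for ref in ("nginx", "nginx:1.27", "registry.local:5000/app", "nginx@sha256:abc"):
    print(ref, "->", default_pull_policy(ref))
```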
The CRI creates isolated environments using Linux primitives: network namespaces for network isolation, mount namespaces for filesystem isolation, and cgroups for resource limits.
filesystem
# CGroup hierarchy for a pod (cgroup v1 layout)
/sys/fs/cgroup/
├── memory/kubepods/burstable/pod{uid}/
│   ├── memory.limit_in_bytes   # Memory limit
│   ├── memory.usage_in_bytes   # Current usage
│   └── container{id}/          # Per container
├── cpu/kubepods/burstable/pod{uid}/
│   ├── cpu.cfs_quota_us        # CPU quota
│   ├── cpu.cfs_period_us       # CPU period
│   └── cpu.shares              # CPU weight
└── pids/kubepods/burstable/pod{uid}/
    └── pids.max                # Max processes
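To make the files above concrete, here is a rough sketch of how a container's resource spec could translate into those cgroup v1 values. It assumes the commonly documented conversion (a 100 ms CFS period and 1024 shares per full CPU); `cgroup_values` and the parsing helpers are hypothetical, and only a few quantity suffixes are handled.

```python
CFS_PERIOD_US = 100_000   # assumed default CFS period: 100 ms
SHARES_PER_CPU = 1024     # cpu.shares granted per full CPU
MIN_SHARES = 2            # kernel minimum for cpu.shares


def parse_cpu(quantity: str) -> int:
    """'500m' -> 500 millicores, '2' -> 2000 millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """'128Mi' -> bytes. Only Ki/Mi/Gi and plain byte counts are handled here."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def cgroup_values(cpu_request: str, cpu_limit: str, memory_limit: str) -> dict:
    return {
        # The request becomes a relative weight under CPU contention.
        "cpu.shares": max(MIN_SHARES, parse_cpu(cpu_request) * SHARES_PER_CPU // 1000),
        "cpu.cfs_period_us": CFS_PERIOD_US,
        # The limit becomes a hard cap: microseconds of CPU time per period.
        "cpu.cfs_quota_us": parse_cpu(cpu_limit) * CFS_PERIOD_US // 1000,
        "memory.limit_in_bytes": parse_memory(memory_limit),
    }


print(cgroup_values("250m", "500m", "128Mi"))
```

With these assumptions, a 500m CPU limit means the container may use 50 ms of CPU time in every 100 ms window before it is throttled.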
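Guaranteed: every container sets CPU and memory requests equal to its limits. Path: /kubepods/pod{uid}/
Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed criteria (typically requests < limits). Path: /kubepods/burstable/pod{uid}/
BestEffort: no container sets any requests or limits. Path: /kubepods/besteffort/pod{uid}/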
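The classification behind these three cgroup paths can be sketched as follows. This is a simplified model: `qos_class` is a hypothetical helper, it looks only at CPU and memory, and it compares quantities as plain strings (so "500m" and "0.5" would not be treated as equal).

```python
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' mappings."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests, limits = c.get("requests", {}), c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "128Mi"},
                  "limits": {"cpu": "500m", "memory": "128Mi"}}]))
print(qos_class([{"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}]))
print(qos_class([{}]))
```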
Cgroup manager. Part of kubelet. Path: /sys/fs/cgroup/. Drivers: systemd, cgroupfs. Controls: CPU, Memory, PID, I/O.
Image store. Container runtime. Path: /var/lib/containerd/. Format: OCI image. Content-addressable.
CNI plugins. Location: /opt/cni/bin/. Config: /etc/cni/net.d/. Types: bridge, calico, flannel. Function: Pod networking.
Namespaces. Types: PID, Network, Mount, IPC, UTS. Creation: clone() syscall. Isolation: Process, Net, FS. Managed by runtime.
The container runtime creates the pod's network namespace and then invokes the CNI plugin, which allocates an IP, sets up a veth pair connecting the pod to the host, and configures routes and DNS.
network
# Network namespace creation step by step
1. Runtime creates network namespace:   ip netns add {netns-id}
2. CNI plugin creates veth pair:        ip link add veth0 type veth peer name eth0
3. Moves eth0 to pod:                   ip link set eth0 netns {netns-id}
4. Assigns IP (inside the namespace):   ip netns exec {netns-id} ip addr add 10.244.1.5/24 dev eth0
5. Sets routes (inside the namespace):  ip netns exec {netns-id} ip route add default via 10.244.1.1
6. Updates IPAM:                        Store IP allocation in etcd/file

# Result: Pod network configuration
Pod Network Namespace:
├── lo (loopback): 127.0.0.1/8
└── eth0: 10.244.1.5/24
    ├── Gateway: 10.244.1.1
    ├── DNS: 10.96.0.10 (CoreDNS)
    └── Routes: default via 10.244.1.1
Main plugin creates the interface (bridge, calico, cilium). IPAM plugin handles IP address management (host-local, dhcp). Meta plugins add features like bandwidth shaping and firewall rules.
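The IPAM role can be illustrated with a toy allocator. This is a sketch loosely modelled on the idea of one subnet per node, not the host-local plugin's actual code; `HostLocalIPAM` is a hypothetical class, it keeps state in memory rather than on disk, and it assumes the first usable address is the gateway.

```python
import ipaddress


class HostLocalIPAM:
    """Toy per-node allocator: hands out addresses from a single subnet."""

    def __init__(self, subnet: str):
        self.net = ipaddress.ip_network(subnet)
        self.gateway = next(iter(self.net.hosts()))   # assume first usable address
        self.allocated = {}                           # container id -> address

    def allocate(self, container_id: str):
        if container_id in self.allocated:            # repeated requests are idempotent
            return self.allocated[container_id]
        used = set(self.allocated.values())
        for ip in self.net.hosts():
            if ip != self.gateway and ip not in used:
                self.allocated[container_id] = ip
                return ip
        raise RuntimeError(f"IP exhaustion: no free addresses in {self.net}")

    def release(self, container_id: str):
        self.allocated.pop(container_id, None)


ipam = HostLocalIPAM("10.244.1.0/29")
print(ipam.gateway, ipam.allocate("pod-a"), ipam.allocate("pod-b"))
```

Releasing an address on pod deletion matters: without it, a small node subnet runs out of IPs and new pods cannot get a network.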
Each pod gets its own network namespace. Containers in the same pod share the network namespace — they communicate over localhost. Different pods have separate network stacks.
Persistent volumes are attached via CSI drivers, and ConfigMaps/Secrets are projected into the mount namespace as files with the correct permissions.
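CSI volumes: where the storage type requires it, the volume is first attached to the node through the CSI driver's controller plugin; the kubelet then calls the driver's node plugin to mount it into the container's filesystem. Supports EBS, GCE PD, NFS, Ceph, and more.
ConfigMaps and Secrets: mounted as files into the mount namespace with configurable file permissions. Secret volumes are backed by tmpfs (memory-backed), so their contents are not written to the node's disk.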
The container runtime (runc/kata) creates the container process, joins it to all the prepared namespaces and cgroups, applies the security context, and starts the init process.
At this point the pod is Running. The container process is live, network is configured, volumes are mounted. But "Running" does not mean "Ready for traffic" — that requires passing readiness probes. The kubelet now enters the health monitoring phase.
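Kubernetes supports three types of probes: startup (gates the other two until it succeeds), liveness (restarts unhealthy containers), and readiness (controls traffic routing). Once the startup probe has passed, liveness and readiness probes run in parallel for the rest of the container's life. Resource monitoring via cgroups runs continuously.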
Ready ≠ Running. A pod can be Running but not Ready if readiness probes fail. Running means the process is alive. Ready means "accept traffic." The readiness probe controls whether the pod's IP appears in Service endpoints — failing it removes the pod from load balancer rotation without killing it.
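As a small illustration of that distinction, the sketch below filters a list of pods the way endpoint selection behaves conceptually. The `service_endpoints` helper and the pod dictionaries are hypothetical; real endpoints are maintained by a controller from pod conditions.

```python
def service_endpoints(pods):
    """Return the IPs a Service would route to: Running pods that are also Ready."""
    return [p["ip"] for p in pods if p["phase"] == "Running" and p["ready"]]


pods = [
    {"name": "web-1", "ip": "10.244.1.5", "phase": "Running", "ready": True},
    {"name": "web-2", "ip": "10.244.1.6", "phase": "Running", "ready": False},  # readiness probe failing
    {"name": "web-3", "ip": "10.244.2.7", "phase": "Pending", "ready": False},
]
print(service_endpoints(pods))
```

`web-2` keeps running and can recover on its own; it simply receives no traffic until its readiness probe passes again.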
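Startup probe: Gates liveness & readiness until it succeeds. Protects slow-starting containers. If it never succeeds within its failure threshold, the container is killed and restarted according to restartPolicy.
Liveness probe: Detects deadlocks and hangs. Runs every periodSeconds; after failureThreshold consecutive failures the kubelet kills the container and restarts it according to restartPolicy.
Readiness probe: Controls traffic routing. Failure removes the pod from Service endpoints. Container keeps running.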
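Probe results only change a container's state after enough consecutive outcomes in the same direction. A minimal sketch of that counting logic, assuming the standard `failureThreshold` and `successThreshold` probe fields, is shown below; `ProbeTracker` is a hypothetical class, not kubelet code.

```python
class ProbeTracker:
    """Flips state only after enough consecutive results contradict the current state."""

    def __init__(self, failure_threshold=3, success_threshold=1, initial=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = initial
        self._streak = 0   # consecutive results that disagree with self.healthy

    def record(self, ok: bool) -> bool:
        if ok == self.healthy:
            self._streak = 0                  # agreement resets the streak
        else:
            self._streak += 1
            needed = self.success_threshold if ok else self.failure_threshold
            if self._streak >= needed:
                self.healthy = ok
                self._streak = 0
        return self.healthy


liveness = ProbeTracker(failure_threshold=3)
for result in (False, False, True, False, False, False):
    state = liveness.record(result)
    print(result, "->", "healthy" if state else "restart container")
```

A single success in the middle of a failure run resets the count, so brief hiccups do not trigger a restart or an endpoint removal.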
A pod transitions through well-defined phases. Each phase has sub-conditions that provide granular insight into exactly what's happening.
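Pending: Pod accepted but not running. Waiting for scheduling, image pull, or volume attachment. Image pull failures (ErrImagePull, ImagePullBackOff) keep the pod in this phase.
Running: Bound to a node, with at least one container running, starting, or restarting. Health probes actively monitored.
Succeeded: All containers exited with code 0 and will not be restarted. Typical for Jobs and CronJobs.
Failed: All containers have terminated and at least one ended in failure: a non-zero exit code or a kill by the system such as OOMKilled. A failing liveness probe leads here only when the restart policy does not restart the container.
Unknown: Pod state cannot be determined because node communication is lost, the kubelet is not responding, or there is a network partition. Pods may still be running. Manual intervention often required.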
yaml
conditions:
  - type: PodScheduled
    status: "True"
    reason: "PodScheduled"
    message: "Successfully assigned default/my-pod to worker-1"
  - type: Initialized
    status: "True"
    reason: "PodCompleted"
    message: "All init containers completed successfully"
  - type: ContainersReady
    status: "False"
    reason: "ContainersNotReady"
    message: "containers with unready status: [app]"
  - type: Ready
    status: "False"
    reason: "ContainersNotReady"
    message: "containers with unready status: [app]"
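When debugging, it helps to read the conditions in lifecycle order and stop at the first one that is not true. The sketch below does that for the conditions shown above, represented as plain Python dictionaries; `first_blocker` is a hypothetical helper, not part of any Kubernetes client.

```python
LIFECYCLE_ORDER = ["PodScheduled", "Initialized", "ContainersReady", "Ready"]


def first_blocker(conditions):
    """Return (type, message) of the earliest condition that is not 'True', or None."""
    by_type = {c["type"]: c for c in conditions}
    for cond_type in LIFECYCLE_ORDER:
        cond = by_type.get(cond_type)
        if cond is None:
            return cond_type, "condition not reported"
        if cond["status"] != "True":
            return cond_type, cond.get("message", "")
    return None


conditions = [
    {"type": "PodScheduled", "status": "True",
     "message": "Successfully assigned default/my-pod to worker-1"},
    {"type": "Initialized", "status": "True",
     "message": "All init containers completed successfully"},
    {"type": "ContainersReady", "status": "False",
     "message": "containers with unready status: [app]"},
    {"type": "Ready", "status": "False",
     "message": "containers with unready status: [app]"},
]
print(first_blocker(conditions))
```

For this pod the scheduler and init containers have done their part, so attention should go to the `app` container's readiness.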
Understanding failure modes is critical for debugging production issues. Here are the most common scenarios with their error signatures and recovery behavior.
yaml
restartPolicy: Always      # Pod default; the only value Deployments allow
restartPolicy: OnFailure   # Common for Jobs (Jobs must use OnFailure or Never)
restartPolicy: Never       # One-shot containers

# Exponential backoff calculation
delay = min(300, 2^restartCount * 10)
# → 10s, 20s, 40s, 80s, 160s, 300s (max)
# Reset: 10 minutes of successful running resets the backoff delay
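The backoff formula above is easy to check in code. This sketch implements exactly that expression; `restart_delay` is a hypothetical helper name, and the 10 s base and 300 s cap are the values given in the formula.

```python
def restart_delay(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Seconds to wait before the next restart: base * 2^restart_count, capped."""
    return min(cap, base * 2 ** restart_count)


print([restart_delay(n) for n in range(7)])
```

The sixth restart would compute 320 s, which the cap reduces to 300 s; every restart after that waits the same 5 minutes, which is what a pod in CrashLoopBackOff looks like once it has been failing for a while.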
Memory: OOM killer terminates pods. Disk: Image GC + pod eviction. PID: No new processes.
CNI failure: Stuck in ContainerCreating. IP exhaustion: No IPs for new pods. DNS: Service discovery broken.
containerd crash: All pods affected. Image corruption: Creation fails. Storage driver: Cannot create layers.