
Introduction to Kubernetes
Up until now you’ve learned about Docker containers and how they solve the “works on my machine” problem. But once your projects involve multiple containers running 24/7, new challenges appear, ones Docker alone doesn’t solve.
In this tutorial, you’ll discover why Kubernetes exists and get hands-on experience with its core concepts. We’ll start by understanding a common problem that developers face, then see how Kubernetes solves it.
By the end of this tutorial, you’ll be able to:
- Explain what problems Kubernetes solves and why it exists
- Understand the core components: clusters, nodes, pods, and deployments
- Set up a local Kubernetes environment
- Deploy a simple application and see self-healing in action
- Know when you might choose Kubernetes over Docker alone
Why Does Kubernetes Exist?
Let’s imagine a realistic scenario that shows why you might need more than just Docker.
You’re building a data pipeline with two main components:
- A PostgreSQL database that stores your processed data
- A Python ETL script that runs every hour to process new data
Using Docker, you’ve containerized both components and they work perfectly on your laptop. But now you need to deploy this to a production server where it needs to run reliably 24/7.
Here’s where things get tricky:
What happens if your ETL container crashes? With Docker alone, it just stays crashed until someone manually restarts it. You could configure VM-level monitoring and auto-restart scripts, but now you’re building container management infrastructure yourself.
What if the server fails? You’d need to recreate everything on a new server. Again, you could write scripts to automate this, but you’re essentially rebuilding what container orchestration platforms already provide.
The core issue is that you end up writing custom infrastructure code to handle container failures, scaling, and deployments across multiple machines.
This works fine for simple deployments, but becomes complex when you need:
- Application-level health checks and recovery
- Coordinated deployments across multiple services
- Dynamic scaling based on actual workload metrics
How Kubernetes Helps
Before we get into how Kubernetes helps, it’s important to understand that it doesn’t replace Docker. You still use Docker to build container images. What Kubernetes adds is a way to run, manage, and scale those containers automatically in production.
Kubernetes acts like an intelligent supervisor for your containers. Instead of telling Docker exactly what to do (“run this container”), you tell Kubernetes what you want the end result to look like (“always keep my ETL pipeline running”), and it figures out how to make that happen.
If your ETL container crashes, Kubernetes automatically starts a new one. If the entire server fails, Kubernetes can move your containers to a different server. If you need to handle more data, Kubernetes can run multiple copies of your ETL script in parallel.
The key difference is that Kubernetes shifts you from manual container management to automated container management.
The tradeoff? Kubernetes adds complexity, so for single-machine projects Docker Compose is often simpler. But for systems that need to run reliably over time and scale, the complexity is worth it.
How Kubernetes Thinks
To use Kubernetes effectively, you need to understand how it approaches container management differently than Docker.
When you use Docker directly, you think in imperative terms, meaning that you give specific commands about exactly what should happen:
docker run -d --name my-database -e POSTGRES_PASSWORD=mysecretpassword postgres:13
docker run -d --name my-etl-script -v "$PWD":/app python:3.9 python /app/my-script.py
You’re telling Docker exactly which containers to start, where to start them, and what to call them.
Kubernetes, on the other hand, uses a declarative approach. This means you describe what you want the final state to look like, and Kubernetes figures out how to achieve and maintain that state. For example: “I want a PostgreSQL database to always be running” or “I want my ETL script to run reliably.”
This shift from “do this specific thing” to “maintain this desired state” is fundamental to how Kubernetes operates.
Here’s how Kubernetes maintains your desired state:
- You declare what you want using configuration files or commands
- Kubernetes stores your desired state in its database
- Controllers continuously monitor the actual state vs. desired state
- When they differ, Kubernetes takes action to fix the discrepancy
- This process repeats every few seconds, forever
This means that if something breaks your containers, Kubernetes will automatically detect the problem and fix it without you having to intervene.
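To make this loop concrete, here's a toy version of it in shell. It's purely illustrative (real controllers are programs running inside the cluster, and the app=my-app label is a hypothetical placeholder), but once you have a cluster running (we'll set one up shortly), it shows the compare-and-correct rhythm:
# Toy reconciliation loop: compare desired state vs. actual state
DESIRED=1
while true; do
  # Count Running Pods that match a hypothetical label
  ACTUAL=$(kubectl get pods -l app=my-app --no-headers 2>/dev/null | grep -c " Running ")
  if [ "$ACTUAL" -ne "$DESIRED" ]; then
    echo "Mismatch: desired=$DESIRED, actual=$ACTUAL -> a controller would create or remove Pods here"
  fi
  sleep 5  # then check again, forever
done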
Core Building Blocks
Kubernetes organizes everything using several key concepts. We’ll discuss the foundational building blocks here, and address more nuanced and complex concepts in a later tutorial.
Cluster
A cluster is a group of machines that work together as a single system. Think of it as your pool of computing resources that Kubernetes can use to run your applications. The important thing to understand is that you don’t usually care which specific machine runs your application. Kubernetes handles the placement automatically based on available resources.
Nodes
Nodes are the individual machines (physical or virtual) in your cluster where your containers actually run. You’ll mostly interact with the cluster as a whole rather than individual nodes, but it’s helpful to understand that your containers are ultimately running on these machines.
Note: We’ll cover the details of how nodes work in a later tutorial. For now, just think of them as the computing resources that make up your cluster.
Pods: Kubernetes’ Atomic Unit
Here’s where Kubernetes differs significantly from Docker. While Docker thinks in terms of individual containers, Kubernetes’ smallest deployable unit is called a Pod.
A Pod typically contains:
- At least one container
- Shared networking so containers in the Pod can communicate using localhost
- Shared storage volumes that all containers in the Pod can access
Most of the time, you’ll have one container per Pod, but the Pod abstraction gives Kubernetes a consistent way to manage containers along with their networking and storage needs.
Pods are ephemeral, meaning they come and go. When a Pod fails or gets updated, Kubernetes replaces it with a new one. This is why you rarely work with individual Pods directly in production (we’ll cover how applications communicate with each other in a future tutorial).
Deployments: Managing Pod Lifecycles
Since Pods are ephemeral, you need a way to ensure your application keeps running even when individual Pods fail. This is where Deployments come in.
A Deployment is like a blueprint that tells Kubernetes:
- What container image to use for your application
- How many copies (replicas) you want running
- How to handle updates when you deploy new versions
When you create a Deployment, Kubernetes automatically creates the specified number of Pods. If a Pod crashes or gets deleted, the Deployment immediately creates a replacement. If you want to update your application, the Deployment can perform a rolling update, replacing old Pods one at a time with new versions. This is the key to Kubernetes’ self-healing behavior: Deployments continuously monitor the actual number of running Pods and work to match your desired number.
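To give you a feel for what that blueprint looks like in practice, here's a minimal sketch of a Deployment written as a configuration file and applied straight from the shell. The names are illustrative, and in the hands-on section below we'll use a quicker command-line shortcut instead:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-database
spec:
  replicas: 1                  # how many copies you want running
  selector:
    matchLabels:
      app: my-database
  template:                    # blueprint for each Pod
    metadata:
      labels:
        app: my-database
    spec:
      containers:
      - name: postgres
        image: postgres:13     # what container image to use
        env:
        - name: POSTGRES_PASSWORD
          value: mysecretpassword
EOF
Applying this file tells Kubernetes "make reality match this description," and the Deployment takes it from there.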
Setting Up Your First Cluster
To understand how these concepts work in practice, you’ll need a Kubernetes cluster to experiment with. Let’s set up a local environment and deploy a simple application.
Prerequisites
Before we start, make sure you have Docker Desktop installed and running. Minikube uses Docker as its default driver to create the virtual environment for your Kubernetes cluster.
If you don’t have Docker Desktop yet, download it from docker.com and make sure it’s running before proceeding.
Install Minikube
Minikube creates a local Kubernetes cluster perfect for learning and development. Install it by following the official installation guide for your operating system.
You can verify the installation worked by checking the version:
minikube version
Start Your Cluster
Now you’re ready to start your local Kubernetes cluster:
minikube start
This command downloads a base image (the first time you run it), starts a Docker container (or a VM, depending on your driver) to act as your node, and configures a Kubernetes cluster inside it. The process usually takes a few minutes.
You’ll see output like:
😄 minikube v1.33.1 on Darwin 14.1.2
✨ Using the docker driver based on existing profile
👍 Starting control plane node minikube in cluster minikube
🚜 Pulling base image ...
🔄 Restarting existing docker container for "minikube" ...
🐳 Preparing Kubernetes v1.28.3 on Docker 24.0.7 ...
🔎 Verifying Kubernetes components...
🌟 Enabled addons: storage-provisioner, default-storageclass
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
Set Up kubectl Access
Now that your cluster is running, you can use kubectl to interact with it. We’ll use the version that comes with Minikube rather than installing it separately to ensure compatibility:
minikube kubectl -- version
You should see version information for both the client and server.
While you could type minikube kubectl -- before every command, the standard practice is to create an alias. This mimics how you'll work with kubectl in cloud environments, where you just type kubectl:
alias kubectl="minikube kubectl --"
Why use an alias? In production environments (AWS EKS, Google GKE, etc.), you'll install kubectl separately and use it directly. By practicing with the kubectl command now, you're building the right muscle memory. The alias lets you use standard kubectl syntax while ensuring you're talking to your local Minikube cluster.
Add this alias to your shell profile (.bashrc, .zshrc, etc.) if you want it to persist across terminal sessions.
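For example, if you use zsh, you could persist it like this (swap in ~/.bashrc if you use bash):
# Append the alias to your profile and reload it
echo 'alias kubectl="minikube kubectl --"' >> ~/.zshrc
source ~/.zshrc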
Verify Your Cluster
Let’s make sure everything is working:
kubectl cluster-info
You should see something like:
Kubernetes control plane is running at https://192.168.49.2:8443
CoreDNS is running at https://192.168.49.2:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Now check what’s running in your cluster:
kubectl get nodes
You should see one node (your Minikube VM):
NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   2m    v1.28.3
Perfect! You now have a working Kubernetes cluster.
Deploy Your First Application
Let’s deploy a PostgreSQL database to see Kubernetes in action. We’ll create a Deployment that runs a postgres container. We’ll use PostgreSQL because it’s a common component in data projects, but the steps are the same for any container.
Create the deployment:
kubectl create deployment hello-postgres --image=postgres:13
kubectl set env deployment/hello-postgres POSTGRES_PASSWORD=mysecretpassword
Check what Kubernetes created for you:
kubectl get deployments
You should see:
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
hello-postgres   1/1     1            1           30s
Note: If you see 0/1 in the READY column, that's normal! PostgreSQL needs the POSTGRES_PASSWORD environment variable to start properly. The Deployment automatically restarts the Pod now that the password is set, and you should see it change to 1/1 within a minute.
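Rather than re-running kubectl get deployments until READY shows 1/1, you can also let kubectl wait for you:
# Blocks until the Deployment's Pods are ready
kubectl rollout status deployment/hello-postgres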
Now look at the Pods:
kubectl get pods
You’ll see something like:
NAME                               READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-xyz123   1/1     Running   0          45s
Notice how Kubernetes automatically created a Pod with a generated name. The Deployment is managing this Pod for you.
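If you're curious how the pieces connect, you can select the Pods by the label the Deployment stamped on them (kubectl create deployment labels them app=hello-postgres) or ask for the full story:
# Pods managed by the Deployment, selected by label
kubectl get pods -l app=hello-postgres
# Detailed view: replica counts, Pod template, and recent events
kubectl describe deployment hello-postgres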
Connect to Your Application
Your PostgreSQL database is running inside the cluster. There are two common ways to interact with it:
Option 1: Using kubectl exec (direct container access)
kubectl exec -it deployment/hello-postgres -- psql -U postgres
This connects you directly to a PostgreSQL session inside the container. The -it flags give you an interactive terminal. You can run SQL commands directly:
postgres=# SELECT version();
postgres=# \q
Option 2: Using port forwarding (local connection)
kubectl port-forward deployment/hello-postgres 5432:5432
Leave this running and open a new terminal. Now you can connect using any PostgreSQL client on your local machine as if the database were running locally on port 5432. Press Ctrl+C to stop the port forwarding when you’re done.
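For example, if you have the psql client installed locally (an assumption; any PostgreSQL client will do), you can connect from that second terminal like this:
# Connect through the forwarded port; the password is mysecretpassword
psql -h localhost -p 5432 -U postgres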
Both approaches work well. kubectl exec is faster for quick database tasks, while port forwarding lets you use familiar local tools. Choose whichever feels more natural to you.
Let’s break down what you just accomplished:
- You created a Deployment – This told Kubernetes “I want PostgreSQL running”
- Kubernetes created a Pod – The actual container running postgres
- The Pod got scheduled to your Minikube node (the single machine in your cluster)
- You connected to the database – Either directly with kubectl exec or through port forwarding
You didn’t have to worry about which node to use, how to start the container, or how to configure networking. Kubernetes handled all of that based on your simple deployment command.
Next, we’ll see the real magic: what happens when things go wrong.
The Magic Moment: Self-Healing
You’ve deployed your first application, but you haven’t seen Kubernetes’ most powerful feature yet. Let’s break something on purpose and watch Kubernetes automatically fix it.
Break Something on Purpose
First, let’s see what’s currently running:
kubectl get pods
You should see your PostgreSQL Pod running:
NAME                               READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-xyz123   1/1     Running   0          5m
Now, let’s “accidentally” delete this Pod. In a traditional Docker setup, this would mean your database is gone until someone manually restarts it:
kubectl delete pod hello-postgres-7d8757c6d4-xyz123
Replace hello-postgres-7d8757c6d4-xyz123 with your actual Pod name from the previous command.
You’ll see:
pod "hello-postgres-7d8757c6d4-xyz123" deleted
Watch the Magic Happen
Immediately check your Pods again:
kubectl get pods
You’ll likely see something like this:
NAME                               READY   STATUS    RESTARTS   AGE
hello-postgres-7d8757c6d4-abc789   1/1     Running   0          10s
Notice what happened:
- The Pod name changed – Kubernetes created a completely new Pod
- It’s already running – The replacement happened automatically
- It happened immediately – No human intervention required
If you’re quick enough, you might catch the Pod in ContainerCreating status as Kubernetes spins up the replacement.
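To catch it live next time, run a watch in a second terminal before deleting the Pod:
# Streams Pod changes as they happen; press Ctrl+C to stop
kubectl get pods --watch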
What Just Happened?
This is Kubernetes’ self-healing behavior in action. Here’s the step-by-step process:
- You deleted the Pod – The container stopped running
- The Deployment noticed – It continuously monitors the actual vs desired state
- State mismatch detected – Desired: 1 Pod running, Actual: 0 Pods running
- Deployment took action – It immediately created a new Pod to match the desired state
- Balance restored – Back to 1 Pod running, as specified in the Deployment
This entire process took seconds and required no human intervention.
Test It Again
Let’s verify the database is working in the new Pod:
kubectl exec deployment/hello-postgres -- psql -U postgres -c "SELECT version();"
Perfect! The database is running normally. The new Pod automatically started with the same configuration (PostgreSQL 13, same password) because the Deployment specification didn’t change. One caveat: any data you created lived in the old Pod’s ephemeral storage and is gone; keeping data across Pod restarts requires a Persistent Volume, which we’ll cover in a later tutorial.
What This Means
This demonstrates Kubernetes’ core value: turning manual, error-prone operations into automated, reliable systems. In production, if a server fails at 3 AM, Kubernetes automatically reschedules your application onto a healthy server without human intervention, far faster than spinning up a new VM and recovering by hand.
You experienced the fundamental shift from imperative to declarative management. You didn’t tell Kubernetes HOW to fix the problem – you only specified WHAT you wanted (“keep 1 PostgreSQL Pod running”), and Kubernetes figured out the rest.
Next, we’ll wrap up with essential tools and guidance for your continued Kubernetes journey.
Cleaning Up
When you’re finished experimenting, you can clean up the resources you created:
# Delete the PostgreSQL deployment
kubectl delete deployment hello-postgres
# Stop your Minikube cluster (optional - saves system resources)
minikube stop
# If you want to completely remove the cluster (optional)
minikube delete
The minikube stop command preserves your cluster for future use while freeing up system resources. Use minikube delete only if you want to start completely fresh next time.
Wrap Up and Next Steps
You’ve successfully set up a Kubernetes cluster, deployed an application, and witnessed self-healing in action. You now understand why Kubernetes exists and how it transforms container management from manual tasks into automated systems.
Now you’re ready to explore:
- Services – How applications communicate within clusters
- ConfigMaps and Secrets – Managing configuration and sensitive data
- Persistent Volumes – Handling data that survives Pod restarts
- Advanced cluster management – Multi-node clusters, node pools, and workload scheduling strategies
- Security and access control – Understanding RBAC and IAM concepts
The official Kubernetes documentation is a great resource for diving deeper.
Remember the complexity trade-off: Kubernetes is powerful but adds operational overhead. Choose it when you need high availability, automatic scaling, or multi-server deployments. For simple applications running on a single machine, Docker Compose is often the better choice. Many teams start with Docker Compose and migrate to Kubernetes as their reliability and scaling requirements grow.
Now you have the foundation to make informed decisions about when and how to use Kubernetes in your data projects.