
Getting Started with Docker and PyTorch

8 min read · 1545 words
AI · Technology · Machine Learning

Docker has become an essential tool for machine learning engineers, providing consistent environments across different machines and making it easy to share and deploy models. In this guide, we'll walk through setting up Docker on macOS, creating a PyTorch container, and running a simple training example.

What is Docker?

Docker is a platform that packages applications and their dependencies into lightweight, portable containers. Think of containers as isolated environments that include everything needed to run your application - code, runtime, libraries, and system tools.

Why Use Docker for Machine Learning?

  • Reproducibility: Your code runs the same way on any machine
  • Isolation: Dependencies don't conflict with your system or other projects
  • Portability: Share your environment with teammates or deploy to cloud services
  • Version Control: Track different versions of your ML environment
  • GPU Support: Straightforward access to CUDA and GPU resources (on Linux hosts with NVIDIA GPUs)

Prerequisites

Before we begin, make sure you have:

  • macOS 10.15 or later
  • At least 4GB of RAM (8GB+ recommended)
  • 20GB of free disk space
  • Admin access to your Mac

Step 1: Install Docker Desktop for Mac

Docker Desktop is the easiest way to run Docker on macOS.

Installation Steps

  1. Visit the Docker Desktop download page: https://www.docker.com/products/docker-desktop

  2. Download Docker Desktop for Mac (choose Apple Silicon or Intel based on your Mac)

  3. Open the downloaded .dmg file and drag Docker to your Applications folder

  4. Launch Docker from Applications

  5. Follow the setup wizard and grant necessary permissions

Verify Installation

Open Terminal and run:

docker --version
docker run hello-world

You should see the Docker version and a welcome message confirming the installation.

Step 2: Understanding Docker Concepts

Before we dive into PyTorch, let's understand key Docker concepts:

Images vs Containers

  • Image: A blueprint or template (like a class in programming)
  • Container: A running instance of an image (like an object)
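
A quick way to see the distinction: pull one image and start several short-lived containers from it (python:3.10-slim here, the same base we'll build on later):

docker pull python:3.10-slim
docker run --rm python:3.10-slim python -c "print('container A')"
docker run --rm python:3.10-slim python -c "print('container B')"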

Dockerfile

A text file with instructions to build a Docker image. It defines:

  • Base image to start from
  • Dependencies to install
  • Files to copy
  • Commands to run

Docker Hub

A registry where you can find pre-built images, including official PyTorch images.
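
For example, you can pull and test-drive the official PyTorch image directly. Note that the default tag is large (several GB) and exact tag names change over time, so check the pytorch/pytorch page on Docker Hub:

docker pull pytorch/pytorch
docker run --rm pytorch/pytorch python -c "import torch; print(torch.__version__)"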

Step 3: Create a PyTorch Dockerfile

Let's create a custom Docker image with PyTorch and all necessary dependencies.

Create a new directory for your project:

mkdir pytorch-docker-demo
cd pytorch-docker-demo

Create a file named Dockerfile:

FROM python:3.10-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    torch==2.1.0 \
    torchvision==0.16.0 \
    numpy==1.24.3 \
    matplotlib==3.7.1 \
    scikit-learn==1.3.0

COPY . /app

CMD ["python", "train.py"]

Dockerfile Breakdown

  • FROM python:3.10-slim: Start with a lightweight Python 3.10 image
  • WORKDIR /app: Set working directory inside container
  • RUN apt-get update: Install system dependencies
  • RUN pip install: Install Python packages
  • COPY . /app: Copy your code into the container
  • CMD: Default command to run when container starts

Step 4: Create a Training Script

Create a file named train.py with a simple MNIST classifier:

import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

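# Normalize with MNIST's channel mean (0.1307) and std (0.3081)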
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

print("Downloading MNIST dataset...")
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

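# A small fully connected classifier: 28*28 input pixels -> 128 hidden units -> 10 digit classes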
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        total += target.size(0)
        
        if batch_idx % 100 == 0:
            print(f"Batch {batch_idx}/{len(loader)}, Loss: {loss.item():.4f}")
    
    avg_loss = total_loss / len(loader)
    accuracy = 100. * correct / total
    return avg_loss, accuracy

def test(model, loader, criterion, device):
    model.eval()
    test_loss = 0
    correct = 0
    
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    
    test_loss /= len(loader)
    accuracy = 100. * correct / len(loader.dataset)
    return test_loss, accuracy

print("Starting training...")
num_epochs = 3

for epoch in range(num_epochs):
    start_time = time.time()
    
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = test(model, test_loader, criterion, device)
    
    epoch_time = time.time() - start_time
    
    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")
    print(f"Time: {epoch_time:.2f}s\n")

print("Training completed!")
print(f"Final Test Accuracy: {test_acc:.2f}%")

os.makedirs('models', exist_ok=True)
torch.save(model.state_dict(), 'models/mnist_model.pth')
print("Model saved to models/mnist_model.pth")

Step 5: Build the Docker Image

Now let's build our Docker image:

docker build -t pytorch-mnist .

This command:

  • docker build: Builds an image from a Dockerfile
  • -t pytorch-mnist: Tags the image with name "pytorch-mnist"
  • .: Uses the current directory as build context

The build process will take a few minutes as it downloads dependencies.
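
Tags are also how you version images, which ties back to the reproducibility point earlier; the 0.1 tag below is just an illustrative choice:

docker build -t pytorch-mnist:0.1 .
docker tag pytorch-mnist:0.1 pytorch-mnist:latest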

Verify the Image

List your Docker images:

docker images

You should see pytorch-mnist in the list.

Step 6: Run the Container

Run your training inside a Docker container:

docker run --rm pytorch-mnist

Flags explained:

  • --rm: Automatically remove container when it exits
  • pytorch-mnist: Name of the image to run
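
The CMD in the Dockerfile is only a default: anything placed after the image name overrides it. For example, to check the installed PyTorch version without starting training:

docker run --rm pytorch-mnist python -c "import torch; print(torch.__version__)"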

Run with Volume Mounting

To persist data and models outside the container:

docker run --rm -v $(pwd)/data:/app/data -v $(pwd)/models:/app/models pytorch-mnist

This mounts local directories into the container, so downloaded data and saved models persist.
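
After the run finishes, the dataset and the trained weights should appear on the host (assuming the script saves into models/, as in the version above):

ls data
ls models    # should contain mnist_model.pth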

Step 7: Interactive Development

For development, you might want to run the container interactively:

docker run -it --rm -v $(pwd):/app pytorch-mnist bash

This opens a bash shell inside the container where you can:

  • Run Python scripts manually
  • Install additional packages
  • Debug issues
  • Explore the environment

Inside the container, you can run:

python train.py
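
You can also install extra tools for the session, for example ipython for quick experiments. Anything installed this way disappears when the container exits, so add permanent dependencies to the Dockerfile instead:

pip install ipython
ipython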

Advanced: Docker Compose

For more complex setups, use Docker Compose. Create docker-compose.yml:

version: '3.8'

services:
  pytorch-training:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - PYTHONUNBUFFERED=1
    command: python train.py

Run with:

docker-compose up
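
On current Docker Desktop installations, Compose also ships as the docker compose subcommand. Useful variants include running in the background, tailing logs, and tearing everything down:

docker compose up -d
docker compose logs -f
docker compose down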

Common Docker Commands

Here are essential Docker commands you'll use frequently:

docker ps                              # List running containers
docker ps -a                           # List all containers, including stopped ones
docker images                          # List local images
docker rm <container-id>               # Remove a stopped container
docker rmi <image-id>                  # Remove an image
docker logs <container-id>             # Show a container's output
docker exec -it <container-id> bash    # Open a shell in a running container
docker stop <container-id>             # Stop a running container
docker system prune                    # Clean up unused containers, networks, and dangling images

Troubleshooting

Container Exits Immediately

Check logs:

docker logs <container-id>

Out of Memory

Increase Docker Desktop memory allocation:

  1. Open Docker Desktop
  2. Go to Settings → Resources
  3. Increase Memory limit to 8GB or more
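
You can also cap an individual container's memory so failures surface as a container limit rather than pressure on the whole VM (the 6g value below is just an example):

docker run --rm --memory=6g pytorch-mnist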

Slow Performance on Mac

Docker on Mac uses virtualization, which can be slower than native Linux. Consider:

  • Using named Docker volumes instead of bind mounts (see the example after this list)
  • Increasing allocated resources
  • Using .dockerignore to exclude unnecessary files
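
For example, the MNIST download can live in a named volume, which avoids bind-mount file sharing overhead (mnist-data is an arbitrary name):

docker volume create mnist-data
docker run --rm -v mnist-data:/app/data pytorch-mnist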

Permission Issues

On macOS, Docker Desktop manages permissions itself; if a bind mount fails with a permission error, check that the directory is allowed under Settings → Resources → File Sharing. On a Linux host, add your user to the docker group:

sudo usermod -aG docker $USER

Then log out and back in.

Best Practices

1. Use .dockerignore

Create a .dockerignore file to exclude unnecessary files:

__pycache__
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.git
.gitignore
.DS_Store
*.ipynb_checkpoints
data/
models/
logs/

2. Multi-stage Builds

For production, use multi-stage builds to reduce image size. This example assumes your dependencies are listed in a requirements.txt (shown after the Dockerfile):

FROM python:3.10-slim as builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.10-slim

WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .

ENV PATH=/root/.local/bin:$PATH

CMD ["python", "train.py"]

3. Pin Package Versions

Always specify exact versions in your Dockerfile to ensure reproducibility.

4. Layer Caching

Order Dockerfile commands from least to most frequently changing to maximize cache usage.
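
For example, installing dependencies from requirements.txt before copying the rest of the source means code edits no longer invalidate the slow dependency layer; a minimal sketch of the ordering:

# Rarely-changing steps first: this layer stays cached across code edits
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Frequently-changing steps last: only this layer rebuilds when code changes
COPY . /app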

5. Security

  • Don't run containers as root in production (see the sketch after this list)
  • Scan images for vulnerabilities
  • Use official base images
  • Keep images updated
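
A minimal sketch of dropping root inside the image (the appuser name is arbitrary):

# Create an unprivileged user and run all subsequent commands as that user
RUN useradd --create-home appuser
USER appuser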

Next Steps

Now that you have a working Docker + PyTorch setup, you can:

  1. Add GPU Support: Use the NVIDIA Container Toolkit for GPU acceleration (see the example after this list)
  2. Deploy to Cloud: Push images to Docker Hub or AWS ECR
  3. Orchestration: Learn Kubernetes for managing multiple containers
  4. CI/CD: Integrate Docker into your continuous integration pipeline
  5. Experiment Tracking: Add MLflow or Weights & Biases to your container
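
As a preview of item 1: on a Linux host with NVIDIA drivers and the NVIDIA Container Toolkit installed, a GPU can be exposed like this (it will not work from Docker Desktop on macOS, which has no NVIDIA GPU passthrough):

docker run --rm --gpus all pytorch/pytorch python -c "import torch; print(torch.cuda.is_available())"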

Conclusion

Docker provides a powerful way to containerize your machine learning workflows, ensuring consistency across development, testing, and production environments. By following this guide, you've learned how to:

  • Install Docker Desktop on macOS
  • Create a custom PyTorch Docker image
  • Run training inside a container
  • Use volumes for data persistence
  • Apply best practices for Docker in ML

The example MNIST classifier demonstrates the basics, but the same principles apply to complex deep learning projects. Start containerizing your ML workflows today and enjoy the benefits of reproducible, portable environments!

Happy containerizing! 🐳🔥