Weekend Adventures Cloud Lab
Building an ML Experimentation Cloud Lab - A Weekend Project
"I want AI to build the cloud tooling so I can spend more time experimenting." To paraphrase the now-famous adage, I spent the weekend designing and running an ML experiment that involved the previously daunting side quest of building cloud infrastructure and a runner. It's always the uninteresting plumbing in every project, and very often a big rock in getting things to work. While Claude is not yet a first-class experiment designer (it focuses on the wrong things, rabbit-holes, and struggles with local reasoning), it is pretty great at cloud infrastructure. So we spent the weekend running this experiment while building scale-out infrastructure for my hobby ML lab.
Here are the details.
Everybody Needs a Hobby
In this project I'm experimenting with how fine-tuning distorts the geometry of language model weight spaces. It requires manufacturing about 30 LoRA and DPO adapters, then analyzing their spectral properties. Each adapter is a full training run on a 3B-parameter model.
I typically run small experiments locally, but for this one I had to build an experiment runner to provision and drive my experiments on cloud infrastructure. I started with GCP, arbitrarily; a similar spec is ready for AWS.
Experimentation as Infrastructure
I didn't want the cost of a persistent cloud workstation. I wanted to be able to set up, define inputs (code, config, models), execute on appropriate hardware, collect outputs (adapters, metrics, logs), and tear down back to zero cost.
The principles I built around:
- Declarative config. One file describes the full experiment. GPU type, model manifest, sync paths, sparse checkout scope.
- Disposable instances. VMs are created, used, and deleted in a single invocation. No SSH-ing into long-lived machines.
- Pessimistic cost control. Instances are torn down on failure. If something crashes, the checkpoint is on GCS and we resume on a fresh machine.
- Resumable from storage. Progress lives in cloud storage, not on any particular disk. Resume means "pull the latest checkpoint from GCS and skip completed phases," regardless of which VM runs it.
- Shared infra, isolated experiments. The launcher, bootstrap, image manager, and model cache are shared. Each experiment brings only its config, scripts, and requirements file.
The Config File
Each experiment lives in its own directory with a gcp_project.conf that describes everything: what GPU, what to install, what to run, what to sync back.
GCP_ZONE="us-west4-b"
GCP_MACHINE_TYPE="a2-highgpu-1g"
GCP_GPU_TYPE="nvidia-tesla-a100"
SPARSE_CHECKOUT_PATHS="ml/[project-name]/task-geometry ml/infra"
REQUIREMENTS_FILE="ml/[project-name]/task-geometry/requirements-gcp.txt"
MODELS_FILE="ml/[project-name]/task-geometry/models.txt"
EXPERIMENT_CMD="bash ml/[project-name]/task-geometry/run_experiment_cloud.sh"
SYNC_PATHS=(
"data/results/:results/"
"data/adapters/:adapters/"
)
One command to run:
bash ml/infra/gcp_launch.sh --project-config ml/[project-name]/task-geometry/gcp_project.conf
Adding a new experiment means writing a config, a run script, and a requirements file. The launcher, image manager, and model cache all stay the same.
What Happens When You Run It
The launcher does five things in order.
It creates a GPU instance from a pre-baked custom image. If the image exists, the VM boots in about 30 seconds with Python, PyTorch, CUDA, and the base model already on disk. If there's no image yet, it falls back to a Google Deep Learning VM and runs the full bootstrap from scratch. That takes around 12 minutes.
Then it pulls code. Sparse checkout, so the VM only gets the directories this experiment actually needs. Verifies the model cache. Installs any new pip dependencies.
Then it runs the experiment. The experiment script handles its own phasing internally: layer analysis, adapter manufacturing, spectral features, classification, whatever the experiment needs. Each phase checks whether its outputs already exist and skips if so. More on that in a second.
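The phase-skip check can be sketched as a small wrapper: each phase declares its output artifacts and runs only when one is missing. This is a minimal illustration of the pattern, not the actual script; the function and path names are assumptions.

```python
from pathlib import Path

def run_phase(name, outputs, fn):
    """Run a phase only if any of its declared output artifacts is missing."""
    if all(Path(p).exists() for p in outputs):
        print(f"[{name}] outputs present, skipping")
        return
    fn()
    # Fail loudly if the phase claimed to finish but left artifacts missing.
    missing = [p for p in outputs if not Path(p).exists()]
    if missing:
        raise RuntimeError(f"[{name}] did not produce {missing}")
```

On a resumed run, every already-completed phase short-circuits at the first check, so the script can always be started from phase 0.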
When the experiment finishes (or crashes), a cleanup trap syncs everything to GCS: results, logs, adapters. Another utility downloads a copy of the results to the local machine. Then it deletes the VM. Always. The only durable state is in cloud storage.
Pre-baked Images
Bootstrapping PyTorch + CUDA + transformers + a 3B model from scratch every time would be 12+ minutes of overhead per run. So we pre-bake images.
Claude created an image manager utility that spins up a temp VM, runs the full bootstrap, downloads models, runs health checks (does torch import, does CUDA work, can we load the model), then snapshots the disk as a GCE image. The build VM gets deleted.
bash ml/infra/manage_image.sh --project-config <conf> build # build once
bash ml/infra/manage_image.sh --project-config <conf> teardown # archive to GCS
bash ml/infra/manage_image.sh --project-config <conf> import # restore later
A live image costs about $5/month. The image manager can archive the image to GCS for $0.04/month. I'm still evaluating whether archiving is worthwhile, or whether it's better to just recreate the image after a longer break; rebuilding it once the kinks were ironed out worked pretty well.
Model Caching
Downloading a 3B model from HuggingFace takes 5-10 minutes. Doing that on every fresh VM is a waste, especially when debugging a new experiment. The experiment runner has a three-tier cache lookup.
First it checks local disk, maybe the model is already baked into the image. If not, it checks a GCS bucket where models get cached after first download. Restoring from GCS takes about 60 seconds for a 3B model. Only if both miss does it go to HuggingFace Hub, and then it uploads to the GCS cache for next time.
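The lookup order can be sketched as follows. The transport steps are injected as callables here so the policy itself is easy to show and test; in the real runner they would shell out to gsutil and use huggingface_hub. All names are illustrative.

```python
from pathlib import Path

def fetch_model(model_id, local_dir, gcs_restore, hub_download, gcs_upload):
    """Three-tier model cache: local disk -> GCS bucket -> HuggingFace Hub.
    gcs_restore returns True if it restored the model from the GCS cache."""
    local = Path(local_dir) / model_id.replace("/", "--")
    if local.exists():
        return local, "local"            # tier 1: baked into the image
    if gcs_restore(model_id, local):
        return local, "gcs"              # tier 2: ~60 s for a 3B model
    hub_download(model_id, local)        # tier 3: 5-10 min from the Hub
    gcs_upload(model_id, local)          # seed the GCS cache for next time
    return local, "hub"
```

The upload at the end is what makes the second VM's run fast: the Hub is only ever hit once per model.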
Resume
This was the hardest part to get right and the most satisfying to have working.
Spot instances get preempted. SSH connections drop. Scripts hit OOM. The experiment has to survive all of that. I needed it to be resumable on a completely different VM, not just the one that started the run.
The answer is that all progress lives in cloud storage. When you pass --resume, the launcher creates a fresh VM, pulls the latest state from GCS, and the experiment script checks each phase: if the outputs are already there, skip it. If not, run it and sync the outputs back. The adapter step looks for adapter_model.safetensors. Analysis steps look for result JSONs.
Per-phase sync to GCS after every major step means the worst case from a crash is losing one phase of work, not the whole experiment. In practice I've had runs die mid-way, resumed on a new machine, and they picked up exactly where they left off.
But phase-level skips aren't granular enough. My spectral analysis phase iterates over 30 adapters, each taking 4 minutes on an L4. When the VM died after adapter 18, resume skipped completed phases but re-ran the current phase from scratch. 18 adapters of work, gone. The fix was per-artifact caching within phases. Each adapter's output gets saved to disk as it completes, and checked before re-extracting. The loop went from "all or nothing" to "pick up where I left off, adapter by adapter." The lesson: any loop that takes more than a few minutes per iteration needs its own checkpoint.
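The per-artifact version of the loop can be sketched like this; the function and file names are illustrative, and `extract` stands in for the expensive ~4-minute-per-adapter step.

```python
import json
from pathlib import Path

def analyze_adapters(adapter_dirs, out_dir, extract):
    """Per-artifact checkpointing inside a phase: each adapter's features are
    written to disk the moment they are computed and found again on resume,
    so a crash at adapter 19 re-runs nothing from adapters 1-18."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for d in map(Path, adapter_dirs):
        cache = out_dir / f"{d.name}.json"
        if cache.exists():                       # resume: already analyzed
            results[d.name] = json.loads(cache.read_text())
            continue
        feats = extract(d)                       # the expensive step
        cache.write_text(json.dumps(feats))      # checkpoint immediately
        results[d.name] = feats
    return results
```

Because the cache files live under a synced path, the same check works whether the resume happens on the original VM or a fresh one.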
What It Costs
A100 on-demand is about $4/hour. My 30-adapter experiment finishes in about 5 hours, so roughly $20. Spot would cut that to $8 but I haven't wired up automatic resume-on-preemption yet.
GCS storage for results is pennies a month. Model cache is $0.15/month. Images are $5/month live or $0.04/month archived. When I'm not running experiments, the idle cost is $0. Nothing stays running.
Things That Broke Along the Way
gsutil cp -r dir/ gs://path/ copies the directory itself, so you end up with results/results/. gsutil rsync -r dir/ gs://path/ syncs the contents. This took an embarrassingly long time to figure out; we chased nested artifact paths for a while before seeing the pattern.
The cleanup trap that syncs results before deleting the instance is maybe the most important thing in the whole system. It runs even on set -e failures. Without it, a crash in the last phase would mean losing everything from the phases before it.
The cleanup trap that saves you also kills you. The experiment runs detached via nohup so it survives SSH drops. But when your local SSH session dies, the launcher exits, the cleanup trap fires, and deletes the VM with the experiment still running on it. We lost two runs to this before we figured out what was happening. The fix was a state flag: before experiment launch, cleanup deletes the VM (correct, nothing running). After the experiment is detached, cleanup preserves the VM and prints reconnect instructions instead. The same safety mechanism was the most dangerous part of the system depending on timing.
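The state-flag pattern is easy to show in miniature. The real launcher is bash; this is a Python sketch of the same logic, with the VM operations injected as callables so the branching is visible.

```python
def make_cleanup(delete_vm, print_reconnect):
    """Cleanup behavior keyed on a detach flag: before the experiment is
    launched, tearing down the VM is safe; after it detaches via nohup,
    the VM must survive the launcher exiting."""
    state = {"detached": False}
    def mark_detached():
        state["detached"] = True
    def cleanup():
        if state["detached"]:
            print_reconnect()   # experiment still running on the VM: keep it
        else:
            delete_vm()         # nothing durable yet: safe to tear down
    return mark_detached, cleanup
```

The launcher flips the flag at exactly one point, right after the detached experiment process is confirmed running; everything before that point stays fail-fast.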
Python stdout buffering is invisible until it isn't. When stdout gets piped to a file (as with nohup), Python block-buffers by default. The experiment is running, producing output, but the log file stays empty for hours. You SSH in, tail the log, see nothing, and assume it's stuck. One line at the top of every script fixes it: sys.stdout.reconfigure(line_buffering=True). We now treat this the same as set -euo pipefail in bash. If it's not there, it's a bug.
Algorithmic waste hiding behind infrastructure. A spectral analysis step was taking 3+ hours on an L4. We accepted it being slow for too long and started looking at bigger machines. Turns out the code was computing full SVD decompositions on 3072x3072 matrices when only the top 3 singular vectors were needed. Switching from np.linalg.svd to scipy.sparse.linalg.svds(k=3) brought it down to minutes. Note to self: when a cloud step is slow, check the code before upgrading the GPU.
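The swap looks like this. A smaller matrix is used here to keep the demo quick; the real case is 3072x3072. One wrinkle worth knowing: svds returns singular values in ascending order, so they need reordering to match np.linalg.svd conventions.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))

# Full SVD computes all 512 singular triplets -- O(n^3), wasteful
# when only the top few are used:
#   U, s, Vt = np.linalg.svd(W)

# Truncated SVD computes only the k triplets we actually need.
U, s, Vt = svds(W, k=3)
order = np.argsort(s)[::-1]          # svds returns values ascending
s, U, Vt = s[order], U[:, order], Vt[order, :]
```

At 3072x3072 and 30 adapters the asymptotic difference is the whole ballgame: the truncated solve touches a tiny fraction of the work the full decomposition does.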
How it looks
local machine GCP
───────────────── ────────────────────────────────
gcp_launch.sh ──────────────────► GPU Instance (A100/L4/T4)
│ --project-config │
│ ├─ /opt/ml-venv (Python 3.11 + PyTorch)
│ ├─ /opt/hf-cache (model weights)
│ ├─ /opt/ml-lab (sparse repo checkout)
│ │ └─ run_experiment_cloud.sh
│ │ ├─ Phase 0: restore checkpoint
│ │ ├─ Phase 1-N: run + sync each
│ │ └─ sync final results
│ │
│ ◄──── fetch_results.sh ◄──────── GCS: gs://bucket/experiment/
│ ├─ results/*.json
│ ├─ adapters/*/
│ └─ logs/*.log
│
└─ local: ml/experiment/data/
├─ results/
└─ logs/
manage_image.sh
├─ build: DL VM → bootstrap → health check → snapshot
├─ teardown: export to GCS → delete live image
└─ import: restore from GCS archive
What's Next
Spot instances with automatic resume is the obvious one. The checkpointing is already there, so preemption just means "resume on a new machine." That would cut compute cost 60-70%.
GPU availability is a moving target. A100s stock out without warning. L4s aren't available in every zone. Machine types are zone-specific, g2-standard-8 exists in us-west4-a but not us-west4-b. I keep a resource registry mapping GPUs to zones and machine types, updated after each failed launch. CLI flags on the launcher (--zone, --machine-type, --gpu-type) let me pivot in seconds without editing config files. This came out of an afternoon where I tried three zone/GPU combinations before one worked.
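The registry itself can be as simple as a nested mapping from GPU type to the zone/machine-type pairs known to launch. The entries below are illustrative (taken from the combinations mentioned above), not an availability guarantee.

```python
# Hypothetical resource registry: GPU type -> {zone: machine type}.
# Machine types are zone-specific (g2-standard-8 exists in us-west4-a
# but not us-west4-b), so each entry pins a pair that actually launched.
REGISTRY = {
    "nvidia-tesla-a100": {"us-west4-b": "a2-highgpu-1g"},
    "nvidia-l4": {"us-west4-a": "g2-standard-8"},
}

def launch_candidates(gpu_type):
    """Return (zone, machine_type) pairs to try in order until one sticks."""
    return list(REGISTRY.get(gpu_type, {}).items())
```

Updating the registry after each failed launch turns zone roulette into a shrinking search: combinations that stocked out get demoted, ones that worked get tried first.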
We also want tooling for pulling multiple experiments and generating comparative analysis locally. The results are structured and on GCS, so this should be straightforward. An agent that knows the schema could do most of the work.
- Developed and written with Claude Opus 4.6