AMD ROCm Docker support (mirrors the Blackwell image from #5748) #6230

@LeoBorcherding

What

Adds an AMD ROCm variant of the Docker setup alongside the NVIDIA/Blackwell image, so Unsloth runs in a container on AMD GPUs (RDNA2/3/4 and CDNA/Instinct). It mirrors the structure of the CUDA image so the two stay symmetric. Branch: LeoBorcherding/unsloth@feature/docker-rocm-support

Why / relationship to existing work

  • #5748 (Add Docker build for Blackwell that runs on any NVIDIA GPU host, branch docker-blackwell-build) added the CUDA side. This is the AMD counterpart: same docker/ layout, same build.sh/run.sh entry points, same smoke-test shape.
  • #3324 ([ROCm] add rocm dockerfile, by @billishyahao) was an earlier ROCm Dockerfile attempt that was closed; there is also the dh/recover-3324-rocm-dockerfile branch. This picks that thread back up with a maintained, CI-published image and the AMD-specific gotchas baked in.

What's included

  • docker/Dockerfile.rocm: ROCm torch wheels + the bitsandbytes pre-release wheel that carries the 4-bit decode fix (bitsandbytes <= 0.49.2 produces NaNs at decode on AMD). Installs the [huggingface] extra and uses the SDPA fallback (no xformers on ROCm).
  • docker/Dockerfile.studio-rocm: Studio variant layered on the ROCm base.
  • docker/entrypoint-rocm.sh: preflight (/dev/kfd reachable, rocm-smi, HIP torch, gfx-arch check with HSA_OVERRIDE hints).
  • docker/smoke_test_rocm.py: ROCm smoke test incl. a 5-step LoRA train.
  • docker/test_locally-rocm.sh: end-to-end local build + smoke + notebook check.
  • .github/workflows/docker-publish-rocm.yml: GPU-free amd64 build + publish to Docker Hub.
  • build.sh / run.sh gain a --rocm flag; .dockerignore allowlists the new files.
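For reviewers who want a feel for what the preflight covers, here is a minimal Python sketch of the same kind of checks (device node, rocm-smi, HIP torch). It is not the contents of docker/entrypoint-rocm.sh, which is a shell script; the function names and messages are hypothetical, and the gfx-arch check is left out.

```python
import os
import shutil


def check_device_node(path="/dev/kfd"):
    """Return a problem string if the ROCm compute device node is unusable, else None."""
    if not os.path.exists(path):
        return f"{path} not found: pass --device=/dev/kfd --device=/dev/dri to docker run"
    if not os.access(path, os.R_OK | os.W_OK):
        return f"{path} exists but is not readable/writable by this user"
    return None


def check_tool(name="rocm-smi"):
    """Return a problem string if the tool is not on PATH, else None."""
    if shutil.which(name) is None:
        return f"{name} not found on PATH"
    return None


def check_hip_torch():
    """Return a problem string if torch is missing, not a HIP build, or sees no GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    # torch.version.hip is None on CUDA and CPU-only builds.
    if getattr(torch.version, "hip", None) is None:
        return "torch is installed but is not a ROCm (HIP) build"
    # ROCm builds expose the GPU through the torch.cuda namespace.
    if not torch.cuda.is_available():
        return "HIP torch is installed but no GPU is visible"
    return None


def preflight():
    """Run all checks and return the list of problems (empty means all passed)."""
    results = [check_device_node(), check_tool(), check_hip_torch()]
    return [problem for problem in results if problem is not None]


if __name__ == "__main__":
    for problem in preflight():
        print(f"preflight: {problem}")
```

Returning messages instead of exiting on the first failure lets a container print every problem at once, which tends to shorten the debug loop on a misconfigured host.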

GPU coverage

  • Default build (ROCM_VERSION=6.2.4, torch rocm6.2): RDNA2/RDNA3, CDNA/Instinct.
  • RDNA4 / Strix (gfx1200/1201, gfx1150/1151): ROCM_VERSION=7.2.4 + torch rocm7.2.
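The coverage split above can be sketched as a small lookup from gfx architecture to build settings. This is an illustration only: the `TORCH_ROCM` key and the function name are hypothetical, and the sketch assumes any architecture outside the RDNA4/Strix list uses the default build, which may not hold for unsupported GPUs.

```python
# Architectures the issue lists as needing the ROCm 7.2 build.
RDNA4_STRIX_ARCHES = {"gfx1200", "gfx1201", "gfx1150", "gfx1151"}


def select_build_args(gfx_arch):
    """Map a gfx architecture string to the build settings named in the issue."""
    arch = gfx_arch.strip().lower()
    if arch in RDNA4_STRIX_ARCHES:
        return {"ROCM_VERSION": "7.2.4", "TORCH_ROCM": "rocm7.2"}
    # Assumption: everything else (RDNA2/RDNA3, CDNA/Instinct) uses the default.
    return {"ROCM_VERSION": "6.2.4", "TORCH_ROCM": "rocm6.2"}


if __name__ == "__main__":
    for arch in ("gfx1100", "gfx1201"):
        print(arch, select_build_args(arch))
```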

Validation so far

  • Base RDNA4 image (ROCM_VERSION=7.2.4, torch rocm7.2) builds cleanly on a GPU-free host, verified end to end at build time: torch 2.12.0+rocm7.2, HIP 7.2.53211, bitsandbytes 0.50.0.dev0 imports cleanly, all required packages present.
  • Static checks pass (shell, python, workflow yaml, .dockerignore COPY coverage).
  • Not yet built: the default rocm6.2 image and the Studio image (Dockerfile.studio-rocm).
  • Pending: the GPU smoke test (5-step LoRA) on real AMD hardware. Needs an AMD GPU runner or a cloud instance; the workflow's self-hosted amd-gpu smoke job can also cover it.
  • Strix image
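A build-time check of the kind described above can run without a GPU, since it only reads installed package metadata. The sketch below is one possible shape, not the check used in this branch: the helper names are hypothetical, the package list is a placeholder, and the version comparison looks only at the leading numeric release segment (a simplification that ignores pre-release ordering).

```python
import re
from importlib import metadata


def installed_version(name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Parse the leading numeric release, e.g. '0.50.0.dev0' -> (0, 50, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release in version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def bnb_has_decode_bug(version):
    """True if this bitsandbytes version falls in the affected range (<= 0.49.2)."""
    return release_tuple(version) <= (0, 49, 2)


if __name__ == "__main__":
    # Placeholder list; the issue does not enumerate the required packages.
    for name in ("torch", "bitsandbytes"):
        print(name, installed_version(name))
    bnb = installed_version("bitsandbytes")
    if bnb is not None and bnb_has_decode_bug(bnb):
        print("warning: bitsandbytes <= 0.49.2 produces NaNs at 4-bit decode on AMD")
```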
