AMD ROCm Documentation

Complete guide for setting up and optimizing ManifoldScript on AMD GPUs with ROCm.

Prerequisites

  • AMD GPU with RDNA or CDNA architecture
  • ROCm 5.7 or later
  • Linux distribution (Ubuntu 22.04, RHEL 9, or SUSE 15)
  • AMDGPU open-source driver or AMDGPU-Pro driver
  • At least 8 GB of GPU memory

Installation

# Add AMD repository
sudo apt-get update
sudo apt-get install wget
wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60200-1_all.deb
sudo apt-get install ./amdgpu-install_6.0.60200-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm --no-dkms

# Install ManifoldScript with ROCm support
curl -fsSL https://get.manifoldscript.dev/rocm | bash

# Verify installation
manifoldscript --version
manifoldscript --check-rocm
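
As an additional smoke test, a short HIP program can confirm that the runtime sees the GPU and that it meets the 8 GB memory prerequisite. This is a minimal sketch using only standard HIP API calls (hipGetDeviceCount, hipGetDeviceProperties); compile it with hipcc, which the rocm use case installs.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("No ROCm-visible GPU found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        hipDeviceProp_t props;
        hipGetDeviceProperties(&props, dev);
        // totalGlobalMem is in bytes; compare against the 8 GB minimum.
        double gib = props.totalGlobalMem / (1024.0 * 1024.0 * 1024.0);
        std::printf("GPU %d: %s (%s), %.1f GiB VRAM%s\n",
                    dev, props.name, props.gcnArchName, gib,
                    gib < 8.0 ? " -- below the 8 GB minimum" : "");
    }
    return 0;
}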

ROCm Architecture

ManifoldScript ROCm Architecture

graph TD
    A[ManifoldScript Source] --> B[ROCm Frontend]
    B --> C[HIP Code Generation]
    C --> D[HSACO Assembly]
    D --> E[ISA Binary]
    E --> F[GPU Execution]
    G[Memory Manager] --> H[ROCm Memory]
    H --> I[Fine-Grained Memory]
    I --> J[Compute Units]
    K[Grid Manager] --> L[Work Groups]
    L --> M[Wavefronts]
    M --> N[Work Items]
    N --> F
    O[Multi-GPU] --> P[XGMI Bridge]
    P --> Q[Infinity Fabric]
    Q --> F

    classDef frontend fill:#e1f5fe
    classDef compilation fill:#f3e5f5
    classDef execution fill:#e8f5e9
    classDef memory fill:#fff3e0

    class B,C,D,E compilation
    class F,J execution
    class G,H,I memory
    class K,L,M,N,O,P,Q frontend
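
The lower path of the diagram (Grid Manager → Work Groups → Wavefronts → Work Items) is plain HIP territory. The sketch below is a hand-written HIP kernel, not ManifoldScript output, showing how a flat problem decomposes: each work item handles one element, work items are packed into wavefronts (typically 64 lanes on CDNA, 32 on RDNA), and wavefronts are grouped into work groups.

#include <hip/hip_runtime.h>

// Each work item scales one element of the array.
__global__ void scale(float* data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item id
    if (idx < n) data[idx] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    hipMalloc(&d, n * sizeof(float));
    hipMemset(d, 0, n * sizeof(float));

    dim3 workGroup(256);                              // work-group size
    dim3 grid((n + workGroup.x - 1) / workGroup.x);   // number of work groups
    hipLaunchKernelGGL(scale, grid, workGroup, 0, 0, d, 2.0f, n);
    hipDeviceSynchronize();

    hipFree(d);
    return 0;
}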

Multi-GPU with XGMI

Multi-GPU Configuration with XGMI

graph TB
    subgraph "Host System"
        A[ManifoldScript Runtime] --> B[ROCm Runtime]
        B --> C[GPU 0]
        B --> D[GPU 1]
        B --> E[GPU 2]
    end
    subgraph "GPU 0"
        C --> F[HIP Context]
        F --> G[Memory Pool]
        G --> H[Compute Queue]
    end
    subgraph "GPU 1"
        D --> I[HIP Context]
        I --> J[Memory Pool]
        J --> K[Compute Queue]
    end
    subgraph "GPU 2"
        E --> L[HIP Context]
        L --> M[Memory Pool]
        M --> N[Compute Queue]
    end
    H -->|XGMI| K
    K -->|XGMI| N
    N -->|XGMI| H
    P[XGMI Hub] -->|High Speed| Q[Infinity Fabric]
    Q -->|Low Latency| H
    Q -->|Low Latency| K
    Q -->|Low Latency| N

    classDef host fill:#e3f2fd
    classDef gpu fill:#ffebee
    classDef link fill:#f3e5f5

    class A,B host
    class C,D,E,F,G,H,I,J,K,L,M,N gpu
    class P,Q link
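
At the HIP level, XGMI links are exercised through ordinary peer-to-peer calls; the runtime routes the copy over Infinity Fabric when a direct link exists and falls back to PCIe otherwise. A minimal two-GPU sketch (device IDs 0 and 1 are assumed):

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    hipDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1?
    if (!canAccess) {
        std::printf("No peer link between GPU 0 and GPU 1\n");
        return 1;
    }

    const size_t bytes = 256 << 20;
    float *src = nullptr, *dst = nullptr;

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    hipMalloc(&src, bytes);

    hipSetDevice(1);
    hipDeviceEnablePeerAccess(0, 0);
    hipMalloc(&dst, bytes);

    // Direct GPU-to-GPU copy; carried over XGMI/Infinity Fabric when present.
    hipMemcpyPeer(dst, 1, src, 0, bytes);
    hipDeviceSynchronize();

    hipFree(dst);
    hipSetDevice(0);
    hipFree(src);
    return 0;
}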

Performance Optimization

Memory Optimization

  • Use fine-grained (coherent) memory for frequent, small host-device updates (see the sketch below)
  • Enable coarse-grained memory for large bulk transfers
  • Optimize LDS (Local Data Share) usage
  • Use texture memory for 2D access patterns
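
A minimal sketch of the first three points, using the standard HIP allocation flags (hipHostMallocCoherent, hipHostMallocNonCoherent) and a __shared__ array for LDS; the kernel and sizes are illustrative only:

#include <hip/hip_runtime.h>

// __shared__ arrays live in LDS and are visible to one work group.
__global__ void lds_shift(const float* in, float* out) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                          // whole group has written LDS
    out[i] = tile[(threadIdx.x + 1) % 256];   // neighbor read stays in LDS
}

int main() {
    const size_t bytes = 64 << 20;
    float *fine = nullptr, *coarse = nullptr;

    // Fine-grained (coherent): CPU and GPU observe each other's writes
    // without explicit flushes; suited to frequent small updates.
    hipHostMalloc(&fine, bytes, hipHostMallocCoherent);

    // Coarse-grained (non-coherent): trades coherence for bandwidth;
    // prefer it when staging large one-shot transfers.
    hipHostMalloc(&coarse, bytes, hipHostMallocNonCoherent);

    // ... copy in, launch lds_shift, copy out ...

    hipHostFree(fine);
    hipHostFree(coarse);
    return 0;
}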

Kernel Optimization

  • Maximize wavefront occupancy (see the sketch below)
  • Keep uniform, wavefront-invariant operations on the SALU so the VALU stays free for vector work
  • Enable MFMA for matrix operations (CDNA GPUs; RDNA 3 exposes WMMA instead)
  • Optimize register usage
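
Occupancy and register pressure can be steered from the kernel signature with __launch_bounds__; in HIP the second argument is an occupancy hint (minimum wavefronts per execution unit). The values below are illustrative, and MFMA/WMMA paths are normally reached through the compiler or libraries such as rocBLAS rather than hand-written intrinsics.

#include <hip/hip_runtime.h>

// Bounding the work-group size at 256 threads and hinting at least
// 4 resident wavefronts per execution unit lets the compiler cap
// register usage instead of sacrificing occupancy.
__global__ void __launch_bounds__(256, 4)
saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}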

Code Example

(manifold rocm_ops
  :requirements (:rocm :hip)
  :types tensor - rocm_tensor
  
  :action (gemm_acceleration 
    :parameters (?A ?B ?C - tensor)
    :pre (:rocm (?A :tensor) (?B :tensor))
    :eff (:rocm (?C :tensor)
      (rocm_gemm ?A ?B ?C)
      (rocm_mfma_optimize ?C)
    )
  )
)

# Compile with ROCm optimization
manifoldscript compile --target=rocm --arch=gfx1100 rocm_ops.ms
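
gfx1100 is the RDNA 3 architecture (Radeon RX 7900 series); substitute the architecture string that rocminfo reports for your GPU.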