Documentation

Documentation

NVIDIA CUDA Guide

Complete guide for optimizing ManifoldScript on NVIDIA GPUs with CUDA, including setup, performance tuning, and advanced features.

CUDA Overview

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. ManifoldScript leverages CUDA to provide high-performance tensor operations on NVIDIA GPUs.

Supported Architectures

  • Volta (V100, TITAN V)
  • Turing (RTX 20xx, TITAN RTX)
  • Ampere (RTX 30xx, A100)
  • Ada Lovelace (RTX 40xx)
  • Hopper (H100)

Key Features

  • CUDA Streams for concurrent execution
  • Unified Memory for simplified programming
  • Tensor Cores for AI acceleration
  • NVLink for multi-GPU communication
  • NCCL for collective operations

Installation & Setup

1. Check GPU Compatibility

bash
1# Check CUDA-capable GPU
2nvidia-smi
3
4# Check compute capability
5nvidia-smi --query-gpu=compute_cap --format=csv,noheader
6
7# Expected output: 7.0+ (Volta or newer)

2. Install CUDA Toolkit

bash
1# Ubuntu/Debian
2sudo apt update
3sudo apt install nvidia-cuda-toolkit
4
5# CentOS/RHEL
6sudo yum install cuda-toolkit-12
7
8# Verify installation
9nvcc --version

3. Install ManifoldScript with CUDA

bash
1# Install CUDA-specific version
2curl -fsSL https://get.manifoldscript.dev/cuda | bash
3
4# Set environment variables
5export CUDA_HOME=/usr/local/cuda
6export PATH=$PATH:$CUDA_HOME/bin
7export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
8
9# Verify CUDA support
10manifoldscript --check-gpu --target=cuda

CUDA Memory Architecture

CUDA Memory Hierarchy

graph TB subgraph "CUDA Memory Hierarchy" A[Host Memory] --> B[PCIe] B --> C[Device Memory] C --> D[L2 Cache] D --> E[L1 Cache] E --> F[Shared Memory] F --> G[Registers] C --> H[Constant Memory] C --> I[Texture Memory] J[Unified Memory] --> C J --> A end classDef host fill:#e3f2fd classDef device fill:#ffebee classDef cache fill:#fff3e0 class A host class C,D,E,F,G,H,I device class D,E,F cache

Memory Types & Access Patterns:

Global Memory

Main GPU memory, accessible by all threads

Access: 400-800 cycles

Shared Memory

Fast memory shared within thread blocks

Access: 1-32 cycles

Registers

Fastest memory, private to each thread

Access: 1 cycle

Constant Memory

Read-only memory for constants

Access: 1-64 cycles (cached)

Performance Optimization

1. Thread Configuration

manifoldscript
1# Optimize thread block dimensions
2pragma cuda_block_dim_x = 32
3pragma cuda_block_dim_y = 32
4pragma cuda_block_dim_z = 1
5
6# Optimize grid dimensions
7pragma cuda_grid_dim_x = 32
8pragma cuda_grid_dim_y = 32
9pragma cuda_grid_dim_z = 1
10
11# Example optimal configuration
12tensor A[1024, 1024] = random(1024, 1024)
13tensor B[1024, 1024] = random(1024, 1024)
14tensor C[1024, 1024] = A @ B

2. Memory Coalescing

manifoldscript
1# Coalesced memory access pattern
2# Good: Sequential access
3for i = 0 to 1023:
4 for j = 0 to 1023:
5 C[i, j] = A[i, j] + B[i, j]
6
7# Bad: Strided access
8for i = 0 to 1023:
9 for j = 0 to 1023:
10 C[j, i] = A[j, i] + B[j, i]

3. Shared Memory Usage

manifoldscript
1# Use shared memory for tile-based operations
2pragma use_shared_memory = true
3pragma shared_memory_size = 49152 # 48KB
4
5# Tile-based matrix multiplication
6tile_size = 32
7tensor C[1024, 1024] = tile_matmul(A, B, tile_size)

Tensor Core Acceleration

Enabling Tensor Cores

manifoldscript
1# Enable Tensor Core acceleration
2pragma use_tensor_cores = true
3pragma tensor_core_precision = "mixed" # mixed precision
4
5# Optimize for Tensor Core dimensions
6# Must be multiples of 8 for Volta, 16 for Turing/Ampere
7pragma tile_m = 256
8pragma tile_n = 128
9pragma tile_k = 64
10
11# Example with Tensor Cores
12tensor A[1024, 1024] = random(1024, 1024)
13tensor B[1024, 1024] = random(1024, 1024)
14tensor C[1024, 1024] = A @ B # Uses Tensor Cores

Mixed Precision Training

manifoldscript
1# Mixed precision configuration
2pragma precision = "fp16" # Half precision computation
3pragma accumulation = "fp32" # FP32 accumulation
4
5# Automatic mixed precision
6pragma amp_enabled = true
7pragma amp_loss_scale = 1024.0
8
9# Neural network layer with Tensor Cores
10tensor weights[1024, 1024] = random(1024, 1024)
11tensor inputs[64, 1024] = random(64, 1024)
12tensor outputs = inputs @ weights # Tensor Core accelerated

Multi-GPU Configuration

NVLink & NCCL Setup

bash
1# Check NVLink topology
2nvidia-smi topo -m
3
4# Install NCCL
5sudo apt install libnccl2 libnccl-dev
6
7# Verify NCCL installation
8ldconfig -p | grep nccl

Multi-GPU Programming

manifoldscript
1# Multi-GPU configuration
2pragma num_gpus = 4
3pragma gpu_ids = [0, 1, 2, 3]
4
5# Data parallelism
6tensor data[4096, 1024] = random(4096, 1024)
7tensor weights[1024, 512] = random(1024, 512)
8
9# Distribute across GPUs
10tensor results = parallel_matmul(data, weights, strategy="data_parallel")
11
12# Model parallelism
13tensor layer1[1024, 2048] = random(1024, 2048)
14tensor layer2[2048, 1024] = random(2048, 1024)
15tensor outputs = parallel_forward(layer1, layer2, strategy="model_parallel")

Profiling & Debugging

NVIDIA Nsight Tools

bash
1# Profile with Nsight Systems
2nsys profile manifoldscript compile program.ms
3
4# Profile with Nsight Compute
5ncu --set full manifoldscript compile program.ms
6
7# Generate timeline
8nsys stats report.qdrep
9
10# Analyze memory usage
11nsys stats --report cuda_gpu_mem_time_sum report.qdrep

Built-in Profiling

manifoldscript
1# Enable profiling
2pragma profile_enabled = true
3pragma profile_output = "profile.json"
4
5# Profile specific operations
6pragma profile_kernel = "matmul"
7pragma profile_memory = true
8pragma profile_timing = true
9
10# Example profiled program
11tensor A[2048, 2048] = random(2048, 2048)
12tensor B[2048, 2048] = random(2048, 2048)
13tensor C = A @ B # Profiled operation

Performance Benchmarks

Matrix Multiplication Performance

GPU ModelFP32 TFLOPSFP16 TFLOPSMemory Bandwidth
RTX 409082.6165.21,008 GB/s
A10019.5312.01,555 GB/s
TITAN V15.0120.0653 GB/s

Troubleshooting

CUDA Out of Memory

Error: CUDA out of memory

bash
1# Check GPU memory usage
2nvidia-smi
3
4# Reduce batch size or tensor dimensions
5pragma cuda_memory_limit = 0.8 # Use 80% of available memory
6
7# Enable memory pooling
8pragma cuda_memory_pool = true

CUDA Driver Issues

Error: CUDA driver version is insufficient

bash
1# Update NVIDIA drivers
2sudo apt install nvidia-driver-525
3
4# Restart system
5sudo reboot
6
7# Verify driver compatibility
8nvidia-smi
9nvcc --version

Compilation Errors

Error: PTX compilation failed

manifoldscript
1# Specify compute capability
2pragma cuda_arch = "75" # Turing architecture
3pragma cuda_arch = "80" # Ampere architecture
4
5# Enable debugging
6pragma cuda_debug = true
7pragma cuda_verbose = true