NVIDIA CUDA Guide
Complete guide for optimizing ManifoldScript on NVIDIA GPUs with CUDA, including setup, performance tuning, and advanced features.
CUDA Overview
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. ManifoldScript leverages CUDA to provide high-performance tensor operations on NVIDIA GPUs.
Supported Architectures
- Volta (V100, TITAN V)
- Turing (RTX 20xx, TITAN RTX)
- Ampere (RTX 30xx, A100)
- Ada Lovelace (RTX 40xx)
- Hopper (H100)
Key Features
- CUDA Streams for concurrent execution
- Unified Memory for simplified programming
- Tensor Cores for AI acceleration
- NVLink for multi-GPU communication
- NCCL for collective operations
Installation & Setup
1. Check GPU Compatibility
bash
# Check CUDA-capable GPU
nvidia-smi

# Check compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Expected output: 7.0+ (Volta or newer)
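The compute capability string printed by the query above can be checked against the supported architecture list. The following is a minimal Python sketch of that check, assuming NVIDIA's standard capability numbering; the helper name `classify_compute_cap` is hypothetical and not part of ManifoldScript.

```python
# Hypothetical helper, not part of ManifoldScript: interprets the string
# printed by `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.

# (major, minor) compute capability -> architecture name (NVIDIA numbering).
ARCHITECTURES = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}

MIN_SUPPORTED = (7, 0)  # Volta or newer, per the supported list in this guide


def classify_compute_cap(text):
    """Return (is_supported, architecture_name_or_None) for a string like '8.6'."""
    major, minor = (int(part) for part in text.strip().split("."))
    cap = (major, minor)
    return cap >= MIN_SUPPORTED, ARCHITECTURES.get(cap)


if __name__ == "__main__":
    for sample in ["7.0", "8.6", "6.1"]:
        print(sample, classify_compute_cap(sample))
```

A capability below 7.0 (for example a Pascal card reporting 6.1) comes back as unsupported, which matches the "7.0+" expectation stated above.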
2. Install CUDA Toolkit
bash
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-cuda-toolkit

# CentOS/RHEL
sudo yum install cuda-toolkit-12

# Verify installation
nvcc --version
3. Install ManifoldScript with CUDA
bash
# Install CUDA-specific version
curl -fsSL https://get.manifoldscript.dev/cuda | bash

# Set environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

# Verify CUDA support
manifoldscript --check-gpu --target=cuda
CUDA Memory Architecture
CUDA Memory Hierarchy
graph TB
subgraph "CUDA Memory Hierarchy"
A[Host Memory] --> B[PCIe]
B --> C[Device Memory]
C --> D[L2 Cache]
D --> E[L1 Cache]
E --> F[Shared Memory]
F --> G[Registers]
C --> H[Constant Memory]
C --> I[Texture Memory]
J[Unified Memory] --> C
J --> A
end
classDef host fill:#e3f2fd
classDef device fill:#ffebee
classDef cache fill:#fff3e0
class A host
class C,D,E,F,G,H,I device
class D,E,F cache
Memory Types & Access Patterns:
Global Memory
Main GPU memory, accessible by all threads
Access: 400-800 cycles
Shared Memory
Fast memory shared within thread blocks
Access: 1-32 cycles
Registers
Fastest memory, private to each thread
Access: 1 cycle
Constant Memory
Read-only memory for constants
Access: 1-64 cycles (cached)
Performance Optimization
1. Thread Configuration
manifoldscript
# Optimize thread block dimensions
pragma cuda_block_dim_x = 32
pragma cuda_block_dim_y = 32
pragma cuda_block_dim_z = 1

# Optimize grid dimensions
pragma cuda_grid_dim_x = 32
pragma cuda_grid_dim_y = 32
pragma cuda_grid_dim_z = 1

# Example optimal configuration
tensor A[1024, 1024] = random(1024, 1024)
tensor B[1024, 1024] = random(1024, 1024)
tensor C[1024, 1024] = A @ B
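The grid values in the configuration above follow from the tensor shape and the block shape. As a rough Python sketch, assuming one thread per output element (an assumption; the guide does not state how ManifoldScript maps threads to elements), the grid size is the ceiling of the extent divided by the block size on each axis. The function name `grid_dims` is illustrative only.

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # per-block thread limit on current NVIDIA GPUs


def grid_dims(shape, block):
    """Blocks needed per axis so that grid * block covers every element."""
    threads = math.prod(block)
    if threads > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}"
        )
    # Ceiling division: leftover elements still need a (partial) block.
    return tuple(-(-extent // size) for extent, size in zip(shape, block))


if __name__ == "__main__":
    print(grid_dims((1024, 1024), (32, 32)))  # (32, 32)
    print(grid_dims((1000, 1000), (32, 32)))  # (32, 32)
```

For the 1024 x 1024 example this reproduces the 32 x 32 grid used above. A 32 x 32 block already uses all 1,024 threads allowed per block, so larger tensors need a larger grid rather than larger blocks.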
2. Memory Coalescing
manifoldscript
# Coalesced memory access pattern
# Good: Sequential access
for i = 0 to 1023:
    for j = 0 to 1023:
        C[i, j] = A[i, j] + B[i, j]

# Bad: Strided access
for i = 0 to 1023:
    for j = 0 to 1023:
        C[j, i] = A[j, i] + B[j, i]
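To see why the second loop is slower, it helps to count how many memory segments one warp touches in each pattern. The Python sketch below is a simplified model, assuming row-major storage, 4-byte elements, 32 threads per warp, and 128-byte memory transactions; none of these are confirmed ManifoldScript internals, and `segments_touched` is a hypothetical helper.

```python
def segments_touched(indices, row_length, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments hit by one warp.

    indices: one (row, col) pair per thread in the warp.
    """
    segments = set()
    for row, col in indices:
        byte_offset = (row * row_length + col) * elem_bytes  # row-major layout
        segments.add(byte_offset // segment_bytes)
    return len(segments)


WARP = 32
N = 1024  # row length of the 1024 x 1024 tensors above

# A[i, j] with j varying across threads: neighbours are 4 bytes apart.
sequential = [(0, j) for j in range(WARP)]
# A[j, i] with j varying across threads: neighbours are a full row apart.
strided = [(j, 0) for j in range(WARP)]

print(segments_touched(sequential, N))  # 1
print(segments_touched(strided, N))     # 32
```

Under this model the sequential pattern is served by a single transaction per warp, while the strided pattern needs 32, one per thread.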
3. Shared Memory Usage
manifoldscript
# Use shared memory for tile-based operations
pragma use_shared_memory = true
pragma shared_memory_size = 49152 # 48KB

# Tile-based matrix multiplication
tile_size = 32
tensor C[1024, 1024] = tile_matmul(A, B, tile_size)
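Whether a given `tile_size` fits in the configured shared memory is simple arithmetic. The Python sketch below assumes that a tiled multiply stages one tile of A and one tile of B per thread block and that elements are 4-byte floats; how `tile_matmul` actually allocates shared memory is not documented here, so treat the helper names and the model as illustrative.

```python
def tile_bytes(tile_size, elem_bytes=4, tiles_per_block=2):
    """Shared memory needed to stage the tiles (assumed: one of A, one of B)."""
    return tiles_per_block * tile_size * tile_size * elem_bytes


def largest_tile(budget_bytes, elem_bytes=4, tiles_per_block=2):
    """Largest square tile edge whose staged tiles fit in the budget."""
    tile = 0
    while tile_bytes(tile + 1, elem_bytes, tiles_per_block) <= budget_bytes:
        tile += 1
    return tile


if __name__ == "__main__":
    budget = 49152  # shared_memory_size from the pragma above (48 KB)
    print(tile_bytes(32))        # 8192
    print(largest_tile(budget))  # 78
```

With these assumptions a 32 x 32 tile uses 8 KB of the 48 KB budget. The budget alone would allow an edge of up to 78, but a tile edge of 32 also keeps one thread per tile element within the 1,024 threads-per-block limit.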
Tensor Core Acceleration
Enabling Tensor Cores
manifoldscript
# Enable Tensor Core acceleration
pragma use_tensor_cores = true
pragma tensor_core_precision = "mixed" # mixed precision

# Optimize for Tensor Core dimensions
# Must be multiples of 8 for Volta, 16 for Turing/Ampere
pragma tile_m = 256
pragma tile_n = 128
pragma tile_k = 64

# Example with Tensor Cores
tensor A[1024, 1024] = random(1024, 1024)
tensor B[1024, 1024] = random(1024, 1024)
tensor C[1024, 1024] = A @ B # Uses Tensor Cores
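When a tensor dimension is not already a multiple of the required size, one common approach is to pad it up to the next multiple. The Python sketch below applies the multiples quoted in the comment above (8 for Volta, 16 for Turing and Ampere); the helper names are hypothetical, and whether ManifoldScript pads automatically is not stated in this guide.

```python
# Required multiples taken from the comment in the example above.
REQUIRED_MULTIPLE = {"volta": 8, "turing": 16, "ampere": 16}


def pad_to_multiple(extent, multiple):
    """Round extent up to the next multiple (unchanged if already aligned)."""
    return -(-extent // multiple) * multiple


def tensor_core_shape(shape, arch):
    """Pad every dimension of shape to the multiple required for arch."""
    multiple = REQUIRED_MULTIPLE[arch]
    return tuple(pad_to_multiple(extent, multiple) for extent in shape)


if __name__ == "__main__":
    print(tensor_core_shape((1024, 1024), "ampere"))  # (1024, 1024)
    print(tensor_core_shape((1000, 500), "volta"))    # (1000, 504)
    print(tensor_core_shape((1000, 500), "turing"))   # (1008, 512)
```

The 1024 x 1024 tensors in the example are already aligned for every listed architecture, so no padding would be needed there.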
Mixed Precision Training
manifoldscript
# Mixed precision configuration
pragma precision = "fp16" # Half precision computation
pragma accumulation = "fp32" # FP32 accumulation

# Automatic mixed precision
pragma amp_enabled = true
pragma amp_loss_scale = 1024.0

# Neural network layer with Tensor Cores
tensor weights[1024, 1024] = random(1024, 1024)
tensor inputs[64, 1024] = random(64, 1024)
tensor outputs = inputs @ weights # Tensor Core accelerated
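The loss scale exists because very small gradients underflow to zero in FP16. The Python sketch below demonstrates the effect with the standard library's half-precision packing (`struct` format `"e"`), using the same scale of 1024 as the pragma above. It illustrates the general loss-scaling idea only, not ManifoldScript's AMP implementation.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


LOSS_SCALE = 1024.0  # same value as amp_loss_scale above

gradient = 1e-8                           # too small for FP16
unscaled = to_fp16(gradient)              # underflows to 0.0
scaled = to_fp16(gradient * LOSS_SCALE)   # survives as a small FP16 value
recovered = scaled / LOSS_SCALE           # unscale in higher precision

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```

Without scaling the gradient is lost entirely. With scaling it is recovered to within about 1% after dividing by the scale in higher precision.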
Multi-GPU Configuration
NVLink & NCCL Setup
bash
# Check NVLink topology
nvidia-smi topo -m

# Install NCCL
sudo apt install libnccl2 libnccl-dev

# Verify NCCL installation
ldconfig -p | grep nccl
Multi-GPU Programming
manifoldscript
# Multi-GPU configuration
pragma num_gpus = 4
pragma gpu_ids = [0, 1, 2, 3]

# Data parallelism
tensor data[4096, 1024] = random(4096, 1024)
tensor weights[1024, 512] = random(1024, 512)

# Distribute across GPUs
tensor results = parallel_matmul(data, weights, strategy="data_parallel")

# Model parallelism
tensor layer1[1024, 2048] = random(1024, 2048)
tensor layer2[2048, 1024] = random(2048, 1024)
tensor outputs = parallel_forward(layer1, layer2, strategy="model_parallel")
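In data parallelism, each GPU typically receives a contiguous slice of the input rows and a full copy of the weights. The Python sketch below shows one plausible way to compute those row ranges. It is an assumption about how a `data_parallel` strategy might partition `data`, not a description of what `parallel_matmul` does internally, and `shard_rows` is a hypothetical helper.

```python
def shard_rows(num_rows, num_gpus):
    """Split num_rows into num_gpus contiguous (start, stop) ranges.

    Any remainder is spread over the first GPUs, one extra row each.
    """
    base, extra = divmod(num_rows, num_gpus)
    shards = []
    start = 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


if __name__ == "__main__":
    # The 4096-row `data` tensor across the four GPUs configured above.
    print(shard_rows(4096, 4))
    # [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```

Each GPU would then multiply its 1024 x 1024 slice by the shared 1024 x 512 weights, and the partial results are concatenated along the row axis.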
Profiling & Debugging
NVIDIA Nsight Tools
bash
# Profile with Nsight Systems (writes report.nsys-rep; older releases write report.qdrep)
nsys profile -o report manifoldscript compile program.ms

# Profile with Nsight Compute
ncu --set full manifoldscript compile program.ms

# Print summary statistics for the collected report
nsys stats report.nsys-rep

# Analyze memory usage
nsys stats --report cuda_gpu_mem_time_sum report.nsys-rep
Built-in Profiling
manifoldscript
# Enable profiling
pragma profile_enabled = true
pragma profile_output = "profile.json"

# Profile specific operations
pragma profile_kernel = "matmul"
pragma profile_memory = true
pragma profile_timing = true

# Example profiled program
tensor A[2048, 2048] = random(2048, 2048)
tensor B[2048, 2048] = random(2048, 2048)
tensor C = A @ B # Profiled operation
Performance Benchmarks
Matrix Multiplication Performance
GPU Model | FP32 TFLOPS | FP16 TFLOPS | Memory Bandwidth |
---|---|---|---|
RTX 4090 | 82.6 | 165.2 | 1,008 GB/s |
A100 | 19.5 | 312.0 | 1,555 GB/s |
TITAN V | 15.0 | 120.0 | 653 GB/s |
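The TFLOPS figures in the table can be turned into a lower bound on matrix multiplication time, which is useful when reading profiler output. The Python sketch below uses the standard count of 2·M·N·K floating-point operations for an M x K by K x N product. The function names are illustrative, and real kernels will not reach the listed peak rates.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) product."""
    return 2 * m * n * k  # one multiply and one add per inner-product term


def min_time_ms(flops, peak_tflops):
    """Best-case runtime in milliseconds at the given peak throughput."""
    return flops / (peak_tflops * 1e12) * 1e3


def achieved_tflops(flops, elapsed_ms):
    """Throughput actually reached, given a measured runtime."""
    return flops / (elapsed_ms * 1e-3) / 1e12


if __name__ == "__main__":
    flops = matmul_flops(2048, 2048, 2048)
    print(flops)                      # 17179869184
    print(min_time_ms(flops, 312.0))  # about 0.055 ms at the A100 FP16 rate
    print(min_time_ms(flops, 19.5))   # about 0.88 ms at the A100 FP32 rate
```

Comparing a measured kernel time against this bound, or passing it to `achieved_tflops`, gives a quick sense of how far a run is from the hardware peak.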
Troubleshooting
CUDA Out of Memory
Error: CUDA out of memory
bash
# Check GPU memory usage
nvidia-smi
manifoldscript
# Reduce batch size or tensor dimensions, or cap the memory ManifoldScript may use
pragma cuda_memory_limit = 0.8 # Use 80% of available memory

# Enable memory pooling
pragma cuda_memory_pool = true
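Before changing tensor dimensions, it can help to estimate how much memory a program's tensors need. The Python sketch below sums tensor sizes and compares the total against a fraction of device memory. It assumes 4-byte elements and that `cuda_memory_limit` is a fraction of total device memory, and it ignores workspace and allocator overhead, so treat the result as a rough lower bound. The helper names are hypothetical.

```python
import math


def tensor_bytes(shape, elem_bytes=4):
    """Bytes needed to store one dense tensor of the given shape."""
    return math.prod(shape) * elem_bytes


def fits(shapes, total_bytes, limit_fraction=0.8, elem_bytes=4):
    """Return (fits_within_limit, bytes_needed) for a list of tensor shapes."""
    needed = sum(tensor_bytes(shape, elem_bytes) for shape in shapes)
    return needed <= total_bytes * limit_fraction, needed


if __name__ == "__main__":
    gib = 1024 ** 3
    # Three 1024 x 1024 FP32 tensors (A, B, C) on a 24 GiB card.
    print(fits([(1024, 1024)] * 3, 24 * gib))    # (True, 12582912)
    # Three 32768 x 32768 FP32 tensors on an 8 GiB card.
    print(fits([(32768, 32768)] * 3, 8 * gib))   # (False, 12884901888)
```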
CUDA Driver Issues
Error: CUDA driver version is insufficient
bash
# Update NVIDIA drivers
sudo apt install nvidia-driver-525

# Restart system
sudo reboot

# Verify driver compatibility
nvidia-smi
nvcc --version
Compilation Errors
Error: PTX compilation failed
manifoldscript
# Specify compute capability (keep only the line that matches your GPU)
pragma cuda_arch = "75" # Turing architecture
pragma cuda_arch = "80" # Ampere architecture

# Enable debugging
pragma cuda_debug = true
pragma cuda_verbose = true
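The `cuda_arch` values above appear to be the compute capability with the dot removed ("7.5" becomes "75"). Assuming that convention holds for other architectures, the Python sketch below derives the pragma line from the string reported by `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`. The helper names are hypothetical.

```python
def cuda_arch_value(compute_cap):
    """Convert a capability string such as '7.5' to an arch value such as '75'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def cuda_arch_pragma(compute_cap):
    """Build the pragma line for the given compute capability."""
    return f'pragma cuda_arch = "{cuda_arch_value(compute_cap)}"'


if __name__ == "__main__":
    print(cuda_arch_pragma("7.5"))  # pragma cuda_arch = "75"
    print(cuda_arch_pragma("8.0"))  # pragma cuda_arch = "80"
```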