ManifoldScript Docs

Documentation

NVIDIA CUDA Guide

A guide to optimizing ManifoldScript on NVIDIA GPUs with CUDA, covering setup, performance tuning, and advanced features.

CUDA Overview

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. ManifoldScript leverages CUDA to provide high-performance tensor operations on NVIDIA GPUs.

Supported Architectures

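Volta (V100, TITAN V), compute capability 7.0
Turing (RTX 20xx, TITAN RTX), compute capability 7.5
Ampere (RTX 30xx, A100), compute capability 8.6 and 8.0 respectively
Ada Lovelace (RTX 40xx), compute capability 8.9
Hopper (H100), compute capability 9.0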

Key Features

CUDA Streams for concurrent execution
Unified Memory for simplified programming
Tensor Cores for AI acceleration
NVLink for multi-GPU communication
NCCL for collective operations

Installation & Setup

1. Check GPU Compatibility

bash

# Check for a CUDA-capable GPU
nvidia-smi

# Check compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Expected output: 7.0 or higher (Volta or newer)
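The check above can be automated. The following is a minimal sketch, not part of ManifoldScript: the helper names are hypothetical, and the capability-to-architecture mapping is an assumption based on the Supported Architectures list. It takes one line of `nvidia-smi` output as a string, so it runs without a GPU.

```python
# Hypothetical helpers; the mapping mirrors the Supported Architectures list.
ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}
MIN_CAPABILITY = (7, 0)  # "7.0+ (Volta or newer)"


def parse_capability(text):
    """Parse one line of nvidia-smi output such as '8.6' into (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def describe(text):
    """Return (architecture name, supported?) for a compute capability string."""
    capability = parse_capability(text)
    name = ARCH_BY_CAPABILITY.get(capability, "unknown")
    return name, capability >= MIN_CAPABILITY


if __name__ == "__main__":
    for line in ["7.0", "8.6", "6.1"]:
        print(line, describe(line))
```

Comparing tuples rather than floats avoids misordering values such as 8.10 and 8.9 if two-digit minor versions ever appear.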

2. Install CUDA Toolkit

bash

# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-cuda-toolkit

# CentOS/RHEL (requires NVIDIA's CUDA repository to be enabled)
sudo yum install cuda-toolkit-12

# Verify installation
nvcc --version

3. Install ManifoldScript with CUDA

bash

# Install the CUDA-enabled build
curl -fsSL https://get.manifoldscript.dev/cuda | bash

# Set environment variables (prepend so this toolkit takes precedence)
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Verify CUDA support
manifoldscript --check-gpu --target=cuda

CUDA Memory Architecture

CUDA Memory Hierarchy

graph TB
    subgraph "CUDA Memory Hierarchy"
        A[Host Memory] --> B[PCIe]
        B --> C[Device Memory]
        C --> D[L2 Cache]
        D --> E[L1 Cache]
        E --> F[Shared Memory]
        F --> G[Registers]
        C --> H[Constant Memory]
        C --> I[Texture Memory]
        J[Unified Memory] --> C
        J --> A
    end
    classDef host fill:#e3f2fd
    classDef device fill:#ffebee
    classDef cache fill:#fff3e0
    class A host
    class C,D,E,F,G,H,I device
    class D,E,F cache

Memory Types & Access Patterns:

Global Memory

Main GPU memory, accessible by all threads

Access: 400-800 cycles

Shared Memory

Fast memory shared within thread blocks

Access: 1-32 cycles

Registers

Fastest memory, private to each thread

Access: 1 cycle

Constant Memory

Read-only memory for constants

Access: 1-64 cycles (cached)
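To see why the hierarchy matters, the latencies above can be turned into a rough cost estimate. This is a back-of-the-envelope sketch only: it uses the upper-bound cycle counts listed above, the function names are hypothetical, and it ignores the latency hiding that GPUs achieve by switching between warps.

```python
# Upper-bound latencies from the list above, in cycles.
GLOBAL_CYCLES = 800
SHARED_CYCLES = 32


def direct_cost(reads):
    """Cycles if the same value is read from global memory every time."""
    return reads * GLOBAL_CYCLES


def staged_cost(reads):
    """Cycles if the value is loaded once from global memory into shared
    memory and every subsequent read is served from shared memory."""
    return GLOBAL_CYCLES + reads * SHARED_CYCLES


if __name__ == "__main__":
    print(direct_cost(32), staged_cost(32))
```

Under these assumptions, a value reused 32 times costs 25,600 cycles from global memory but 1,824 cycles when staged through shared memory.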

Performance Optimization

1. Thread Configuration

manifoldscript

# Thread block dimensions: 32 x 32 x 1 = 1024 threads, the CUDA per-block maximum
pragma cuda_block_dim_x = 32
pragma cuda_block_dim_y = 32
pragma cuda_block_dim_z = 1

# Grid dimensions: 32 x 32 blocks of 32 x 32 threads cover a 1024 x 1024 output
pragma cuda_grid_dim_x = 32
pragma cuda_grid_dim_y = 32
pragma cuda_grid_dim_z = 1

# Example workload matching this configuration
tensor A[1024, 1024] = random(1024, 1024)
tensor B[1024, 1024] = random(1024, 1024)
tensor C[1024, 1024] = A @ B
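The grid dimensions in the example follow from the tensor shape and the block dimensions by ceiling division. A minimal sketch of that arithmetic, assuming one thread per output element (the helper name `grid_dims` is hypothetical, not a ManifoldScript API):

```python
MAX_THREADS_PER_BLOCK = 1024  # CUDA limit for a single thread block


def grid_dims(shape, block):
    """Return (grid_x, grid_y) so the blocks cover a (rows, cols) tensor.

    x indexes columns and y indexes rows, one thread per element.
    """
    rows, cols = shape
    block_x, block_y = block
    if block_x * block_y > MAX_THREADS_PER_BLOCK:
        raise ValueError("block exceeds 1024 threads")
    # Ceiling division: a partial block is still needed for leftover elements.
    return (-(-cols // block_x), -(-rows // block_y))


if __name__ == "__main__":
    print(grid_dims((1024, 1024), (32, 32)))
    print(grid_dims((1000, 1500), (32, 32)))
```

When a dimension is not a multiple of the block size, the last block in that direction is only partly used, so kernels normally bounds-check each thread's index.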

2. Memory Coalescing

manifoldscript

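# Coalesced memory access pattern (assuming row-major layout)
# Good: the inner loop walks the last index, so consecutive accesses are adjacent in memory
for i = 0 to 1023:
    for j = 0 to 1023:
        C[i, j] = A[i, j] + B[i, j]

# Bad: the inner loop walks the first index, so consecutive accesses are a full row apart
for i = 0 to 1023:
    for j = 0 to 1023:
        C[j, i] = A[j, i] + B[j, i]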
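The cost difference between the two patterns can be made concrete by counting how many 128-byte memory segments one warp of 32 threads touches. The sketch below assumes row-major FP32 tensors with 1024 columns, a base address aligned to 128 bytes, and one thread per inner-loop iteration; the function name is hypothetical.

```python
def segments_touched(indices, cols=1024, elem_bytes=4, segment_bytes=128):
    """Count distinct 128-byte segments hit by (row, col) accesses
    into a row-major array with `cols` columns."""
    return len({(row * cols + col) * elem_bytes // segment_bytes
                for row, col in indices})


warp = range(32)
coalesced = [(0, lane) for lane in warp]  # lanes read A[i, j] with j varying
strided = [(lane, 0) for lane in warp]    # lanes read A[j, i] with j varying

if __name__ == "__main__":
    print(segments_touched(coalesced), segments_touched(strided))
```

The coalesced warp is served by a single 128-byte segment, while the strided warp touches 32 separate segments for the same amount of useful data.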

3. Shared Memory Usage

manifoldscript

# Use shared memory for tile-based operations
pragma use_shared_memory = true
pragma shared_memory_size = 49152  # 48 KiB

# Tile-based matrix multiplication (A and B are 1024 x 1024 tensors defined earlier)
tile_size = 32
tensor C[1024, 1024] = tile_matmul(A, B, tile_size)
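Two quick calculations show how the tile size relates to the shared memory budget and to global memory traffic. This is a sketch under stated assumptions: FP32 elements, one tile of A and one tile of B resident per block, and a textbook tiled matmul rather than whatever `tile_matmul` does internally. The helper names are hypothetical.

```python
SHARED_BUDGET = 49152  # bytes, matching shared_memory_size above


def tile_bytes(tile_size, elem_bytes=4, tiles=2):
    """Shared memory needed to hold `tiles` square tiles of the given size."""
    return tiles * tile_size * tile_size * elem_bytes


def global_loads(n, tile_size):
    """Element loads from global memory for an n x n matmul when each
    loaded tile element is reused tile_size times (tile_size=1: no tiling)."""
    return 2 * n ** 3 // tile_size


if __name__ == "__main__":
    print(tile_bytes(32), tile_bytes(32) <= SHARED_BUDGET)
    print(global_loads(1024, 1), global_loads(1024, 32))
```

With 32 x 32 tiles the two buffers use 8 KiB of the 48 KiB budget, and global memory loads drop by a factor of 32 compared with the untiled case.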

Tensor Core Acceleration

Enabling Tensor Cores

manifoldscript

# Enable Tensor Core acceleration
pragma use_tensor_cores = true
pragma tensor_core_precision = "mixed"  # mixed precision

# Optimize for Tensor Core dimensions
# Must be multiples of 8 for Volta, 16 for Turing/Ampere
pragma tile_m = 256
pragma tile_n = 128
pragma tile_k = 64

# Example with Tensor Cores
tensor A[1024, 1024] = random(1024, 1024)
tensor B[1024, 1024] = random(1024, 1024)
tensor C[1024, 1024] = A @ B  # Uses Tensor Cores
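When a tensor dimension does not meet the multiple-of-8 or multiple-of-16 rule, a common workaround is to pad it up to the next multiple. The sketch below takes the multiples from the comment in the example above as given; the helper names and the dictionary are hypothetical, and whether ManifoldScript pads automatically is not stated here.

```python
# Multiples as stated in the example above (assumed, not verified per GPU).
REQUIRED_MULTIPLE = {"volta": 8, "turing": 16, "ampere": 16}


def pad_to_multiple(dim, multiple):
    """Round dim up to the nearest multiple."""
    return -(-dim // multiple) * multiple


def aligned_shape(shape, arch):
    """Return the padded shape that satisfies the rule for `arch`."""
    multiple = REQUIRED_MULTIPLE[arch]
    return tuple(pad_to_multiple(dim, multiple) for dim in shape)


if __name__ == "__main__":
    print(aligned_shape((1000, 1000), "volta"))
    print(aligned_shape((1000, 1000), "ampere"))
```

A 1000 x 1000 tensor already satisfies the multiple-of-8 rule but would need padding to 1008 x 1008 under the multiple-of-16 rule.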

Mixed Precision Training

manifoldscript

# Mixed precision configuration
pragma precision = "fp16"  # Half precision computation
pragma accumulation = "fp32"  # FP32 accumulation

# Automatic mixed precision
pragma amp_enabled = true
pragma amp_loss_scale = 1024.0

# Neural network layer with Tensor Cores
tensor weights[1024, 1024] = random(1024, 1024)
tensor inputs[64, 1024] = random(64, 1024)
tensor outputs = inputs @ weights  # Tensor Core accelerated
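The purpose of the loss scale is to keep small gradients from underflowing to zero in FP16. The sketch below illustrates the effect using Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision. It shows the general technique only; the gradient value is made up for illustration, and how ManifoldScript applies `amp_loss_scale` internally is an assumption.

```python
import struct

LOSS_SCALE = 1024.0  # same value as amp_loss_scale above


def to_fp16(value):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


grad = 1e-8  # illustrative gradient, below the smallest FP16 subnormal

# Without scaling, the gradient is lost when stored in FP16.
unscaled = to_fp16(grad)

# With scaling, the value survives FP16 storage and is unscaled in
# higher precision afterwards.
scaled = to_fp16(grad * LOSS_SCALE) / LOSS_SCALE

if __name__ == "__main__":
    print(unscaled, scaled)
```

The unscaled gradient becomes exactly zero, while the scaled path recovers a value close to the original.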

Multi-GPU Configuration

NVLink & NCCL Setup

bash

# Check NVLink topology
nvidia-smi topo -m

# Install NCCL
sudo apt install libnccl2 libnccl-dev

# Verify NCCL installation
ldconfig -p | grep nccl

Multi-GPU Programming

manifoldscript

# Multi-GPU configuration
pragma num_gpus = 4
pragma gpu_ids = [0, 1, 2, 3]

# Data parallelism
tensor data[4096, 1024] = random(4096, 1024)
tensor weights[1024, 512] = random(1024, 512)

# Distribute across GPUs
tensor results = parallel_matmul(data, weights, strategy="data_parallel")

# Model parallelism
tensor layer1[1024, 2048] = random(1024, 2048)
tensor layer2[2048, 1024] = random(2048, 1024)
tensor outputs = parallel_forward(layer1, layer2, strategy="model_parallel")
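In data parallelism, the rows of the input are split across GPUs while every GPU holds a full copy of the weights. A minimal sketch of one way the row ranges could be assigned, assuming contiguous shards of near-equal size (the function `shard_rows` is hypothetical and does not describe how `parallel_matmul` actually partitions data):

```python
def shard_rows(num_rows, gpu_ids):
    """Map each GPU id to a half-open (start, stop) row range.

    Leftover rows are given one each to the first GPUs so that shard
    sizes differ by at most one row.
    """
    base, extra = divmod(num_rows, len(gpu_ids))
    shards = {}
    start = 0
    for rank, gpu in enumerate(gpu_ids):
        stop = start + base + (1 if rank < extra else 0)
        shards[gpu] = (start, stop)
        start = stop
    return shards


if __name__ == "__main__":
    print(shard_rows(4096, [0, 1, 2, 3]))
```

For the 4096-row `data` tensor and four GPUs above, each GPU would receive 1024 rows.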

Profiling & Debugging

NVIDIA Nsight Tools

bash

# Profile with Nsight Systems (writes report.nsys-rep; older releases write report.qdrep)
nsys profile -o report manifoldscript compile program.ms

# Profile with Nsight Compute
ncu --set full manifoldscript compile program.ms

# Summarize the collected report
nsys stats report.nsys-rep

# Analyze GPU memory operations
nsys stats --report cuda_gpu_mem_time_sum report.nsys-rep

Built-in Profiling

manifoldscript

# Enable profiling
pragma profile_enabled = true
pragma profile_output = "profile.json"

# Profile specific operations
pragma profile_kernel = "matmul"
pragma profile_memory = true
pragma profile_timing = true

# Example profiled program
tensor A[2048, 2048] = random(2048, 2048)
tensor B[2048, 2048] = random(2048, 2048)
tensor C = A @ B  # Profiled operation

Performance Benchmarks

Matrix Multiplication Performance

GPU Model	Peak FP32 (TFLOPS)	Peak FP16 Tensor Core (TFLOPS)	Memory Bandwidth
RTX 4090	82.6	165.2	1,008 GB/s
A100	19.5	312.0	1,555 GB/s
TITAN V	15.0	120.0	653 GB/s
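The table can be used to estimate a lower bound on matrix multiplication time. The sketch below assumes the table's figures are peak throughput and that a matmul of shape (m, k) by (k, n) costs 2·m·n·k floating-point operations; real kernels run below peak, so measured times will be longer. The names are hypothetical.

```python
# TFLOPS figures copied from the table above.
PEAK_TFLOPS = {
    "RTX 4090": {"fp32": 82.6, "fp16": 165.2},
    "A100": {"fp32": 19.5, "fp16": 312.0},
    "TITAN V": {"fp32": 15.0, "fp16": 120.0},
}


def matmul_flops(m, n, k):
    """One multiply and one add per inner-product term."""
    return 2 * m * n * k


def min_time_us(m, n, k, gpu, precision):
    """Lower bound on runtime in microseconds at the listed throughput."""
    flops_per_second = PEAK_TFLOPS[gpu][precision] * 1e12
    return matmul_flops(m, n, k) / flops_per_second * 1e6


if __name__ == "__main__":
    for gpu in PEAK_TFLOPS:
        print(gpu, round(min_time_us(2048, 2048, 2048, gpu, "fp16"), 1))
```

For a 2048 x 2048 x 2048 multiplication this gives roughly 55 microseconds on an A100 in FP16, which is a useful sanity check against profiler output.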

Troubleshooting

CUDA Out of Memory

Error: CUDA out of memory

bash

# Check GPU memory usage
nvidia-smi

manifoldscript

# Cap usage at 80% of available memory; also consider reducing batch size or tensor dimensions
pragma cuda_memory_limit = 0.8

# Enable memory pooling
pragma cuda_memory_pool = true
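Before changing limits, it can help to estimate how much memory the tensors themselves need. A rough sketch, assuming dense tensors and counting only tensor storage (no workspace, activations, or allocator overhead); the helper names are hypothetical:

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def tensor_bytes(shape, dtype="fp32"):
    """Storage for one dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]


def fits(shapes, total_bytes, limit=0.8, dtype="fp32"):
    """Return (fits?, bytes needed) against `limit` of total GPU memory."""
    needed = sum(tensor_bytes(shape, dtype) for shape in shapes)
    return needed <= limit * total_bytes, needed


if __name__ == "__main__":
    gib = 2 ** 30
    print(fits([(2048, 2048)] * 3, 24 * gib))
    print(fits([(65536, 65536)], 16 * gib))
```

Three 2048 x 2048 FP32 tensors need 48 MiB, far below any limit, whereas a single 65536 x 65536 FP32 tensor needs 16 GiB and would not fit within 80% of a 16 GiB GPU.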

CUDA Driver Issues

Error: CUDA driver version is insufficient

bash

# Update NVIDIA drivers
sudo apt install nvidia-driver-525

# Restart system
sudo reboot

# Verify driver compatibility
nvidia-smi      # driver version and the highest CUDA version it supports
nvcc --version  # installed CUDA toolkit version

Compilation Errors

Error: PTX compilation failed

manifoldscript

# Specify the compute capability that matches your GPU (set only one)
pragma cuda_arch = "75"  # Turing architecture
# pragma cuda_arch = "80"  # Ampere architecture

# Enable debugging
pragma cuda_debug = true
pragma cuda_verbose = true
